Datasets
- Social Spammer
- Description:
- This anonymized dataset was collected from the Tagged.com social network website. It contains 5.6 million users and 858 million links between them. Each user has 4 features and is manually labeled as "spammer" or "not spammer". Each link represents an action between two users and includes a timestamp and a type. The network contains 7 anonymized types of links. The original task on the dataset is to identify (i.e., classify) the spammer users based on their relational and non-relational features.
- Download link:
- Related Papers:
- Drug-Target Interaction
- Description:
- This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network.
- Download link:
- Related papers:
- Stance Classification
- Description:
- This dataset contains threads of short user posts on debate topics from multiple online forums. The well-studied forums are 4Forums.com (on average 340 users per topic and 19 posts per user) and CreateDebate.org (on average 310 users per topic and 4 posts per user). The key tasks are classifying the users’ stances towards discussion topics and classifying the polarity of replies between users. 4Forums.com has crowd-sourced annotations with high inter-annotator agreement for the stance of each user in each topic and for dis/agreement between users who reply to one another. CreateDebate.org supports self-labeling of stance and dis/agreement, but at the level of each individual post rather than each user.
- Download link:
- Related papers:
- CiteSeer for Document Classification
- The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words. The README file in the dataset provides more details.
- Download link:
- Related papers:
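The 0/1-valued word vector described for the CiteSeer document classification dataset can be illustrated with a small sketch. The dictionary and document below are toy values invented for illustration, not taken from the dataset, and the function name `binary_word_vector` is hypothetical. The actual file layout is documented in the dataset's README.

```python
def binary_word_vector(doc_words, dictionary):
    """Return a 0/1 list: 1 if the dictionary word occurs in the document, else 0."""
    present = set(doc_words)
    return [1 if word in present else 0 for word in dictionary]


# Toy dictionary for illustration (the real CiteSeer dictionary has 3703 words).
dictionary = ["graph", "learning", "network", "protein"]

# Toy document; repeated words still map to a single 1 (presence, not count).
doc = ["network", "learning", "learning"]

vector = binary_word_vector(doc, dictionary)
print(vector)  # [0, 1, 1, 0]
```

The vector has one entry per dictionary word, so every publication is represented with the same length regardless of how long its text is.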
- CiteSeer for Entity Resolution
- The CiteSeer dataset contains 1504 machine learning documents with 2892 author references to 165 author entities. The only attribute information available for this dataset is the author name. The full last name is always given; in some cases the author’s full first and middle names are given, and in others only the initials.
- Download link:
- Related papers:
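Because the only attribute is the author name, and first and middle names appear sometimes in full and sometimes as initials, a resolver has to decide when two references could denote the same person. A minimal sketch of one such compatibility check follows. The function `names_compatible`, the `(first, middle, last)` tuple layout, and the example names are assumptions made for illustration; they are not part of the dataset or of any published method.

```python
def _part_matches(a, b):
    """Two name parts match if either is missing, if one is an initial
    that agrees with the other's first letter, or if they are equal."""
    a, b = a.lower().rstrip("."), b.lower().rstrip(".")
    if not a or not b:
        return True
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def names_compatible(ref1, ref2):
    """Each ref is a (first, middle, last) tuple; last names must match exactly."""
    first1, middle1, last1 = ref1
    first2, middle2, last2 = ref2
    if last1.lower() != last2.lower():
        return False
    return _part_matches(first1, first2) and _part_matches(middle1, middle2)


# Hypothetical references, not drawn from the dataset.
print(names_compatible(("J.", "", "Smith"), ("John", "A.", "Smith")))  # True
print(names_compatible(("Jane", "", "Smith"), ("John", "", "Smith")))  # False
```

Compatible names are only candidates for the same entity: two different authors can share a last name and initials, which is why relational evidence such as co-authorship is typically needed on top of a check like this.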
- Cora
- The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words. The README file in the dataset provides more details.
- Download Link:
- Related Papers:
- ArXiv
- The arXiv dataset describes high energy physics publications. It was originally used in KDD Cup 2003. It contains 29555 papers with 58515 references to 9200 authors. As with CiteSeer for Entity Resolution above, the only attribute information available for this dataset is the author name, with the same variations in form (full names in some cases, initials in others).
- Download Link:
- Related Papers:
- PubMed Diabetes
- The PubMed Diabetes dataset consists of 19717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
- Download Link:
- Related Papers:
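For readers unfamiliar with TF/IDF weighting, the sketch below computes one common variant: term frequency (count divided by document length) multiplied by the natural log of the number of documents over the document frequency. The exact weighting used to build the PubMed Diabetes vectors is not stated here, so this variant is an assumption; the toy documents, dictionary, and the function name `tfidf_vectors` are likewise invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs, dictionary):
    """Return one TF-IDF weighted vector per document, ordered by `dictionary`."""
    n_docs = len(docs)

    # Document frequency: number of documents containing each word.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        row = []
        for word in dictionary:
            if counts[word] == 0:
                row.append(0.0)
            else:
                tf = counts[word] / total
                idf = math.log(n_docs / df[word])
                row.append(tf * idf)
        vectors.append(row)
    return vectors


# Toy corpus for illustration (the real dictionary has 500 words).
docs = [
    ["insulin", "glucose", "insulin"],
    ["glucose", "diet"],
    ["diet", "exercise"],
]
dictionary = ["insulin", "glucose", "diet", "exercise"]

for row in tfidf_vectors(docs, dictionary):
    print([round(weight, 3) for weight in row])
```

Unlike the 0/1 vectors used by the other citation datasets, these weights are real-valued: a word that is frequent in one document but rare across the corpus receives a higher weight.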
- WebKB
- The WebKB dataset consists of 877 web pages classified into one of five classes. The link network between the pages consists of 1608 links. Each page in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1703 unique words. The README file in the dataset provides more details.
- Download Link:
- Related Papers:
- Terrorists
- This dataset contains information about terrorists and their relationships. It was designed for experiments on classifying the relationships among terrorists. The dataset contains 851 relationships, each described by a 0/1-valued vector of attributes where each entry indicates the absence/presence of a feature. There are a total of 1224 distinct features. Each relationship can be assigned one or more of four possible labels, making this dataset suitable for multi-label classification tasks. The README file provides more details.
- Download Link:
- Related Papers:
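Since each relationship in the Terrorists dataset can carry one or more of four labels, a multi-label target is usually encoded as a 0/1 indicator vector with one position per label. The sketch below shows that encoding. The label names are placeholders (the actual label names are given in the dataset's README), and the function `to_indicator` and the example relationships are hypothetical.

```python
# Placeholder label names; the real names are listed in the dataset's README.
LABELS = ["label_a", "label_b", "label_c", "label_d"]


def to_indicator(label_set, labels=LABELS):
    """Encode a set of labels as a 0/1 vector with one entry per known label."""
    unknown = set(label_set) - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in label_set else 0 for name in labels]


# Hypothetical relationships: the first has one label, the second has two.
relationships = [{"label_a"}, {"label_b", "label_d"}]

targets = [to_indicator(label_set) for label_set in relationships]
print(targets)  # [[1, 0, 0, 0], [0, 1, 0, 1]]
```

With this encoding, a multi-label task can be treated as four binary classification problems, one per column, or handled jointly by a model that predicts all four entries at once.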
- Terrorist Attacks
- This dataset consists of 1293 terrorist attacks, each assigned one of 6 labels indicating the type of the attack. Each attack is described by a 0/1-valued vector of attributes whose entries indicate the absence/presence of a feature. There are a total of 106 distinct features. The files in the dataset can be used to create two distinct graphs. The README file in the dataset provides more details.
- Download Link:
- Related Papers: