Network datasets

  1. Pajek Datasets
    When publishing results obtained using this data set, the original authors should be cited. In addition, this collection should be cited as:
    Vladimir Batagelj and Andrej Mrvar (2006): Pajek datasets. <url:>
  2. Newman's Network data
  3. Stanford Large Network Dataset Collection
    The datasets available on the website were mostly collected (scraped) for the purposes of our research.
    Please cite:
  4. Social Network Dataset
  5. DBLP
    The DBLP Computer Science Bibliography
  6. Tweet social graphs
    Twitter follower graph
  7. Online Social Network data
    Flickr (users, links, groups, group membership), LiveJournal, Orkut, YouTube (users, links, groups, group membership)
  8. Datamob / Datasets / social networks
  9. Extracted DBLP Dataset
    This dataset contains more than 180,000 papers, each with title, authors, year, venue, and topics. There are 25 topics in total, identified by SVM classifiers. It was used in our papers "Which Topic will You Follow?" (ECML-PKDD2012) and "Towards topic following in heterogeneous information networks" (ASONAM2015).

Recommendation Datasets

  1. Weibo tag and followship dataset and Douban movie/user tag and rating dataset
    These two datasets were used in the experiments on Weibo followship recommendation and Douban movie recommendation (in our papers at ASONAM2018, ICDM2018, and DASFAA2019).
  2. Prostate cancer
  3. Cross-domain recommendation dataset of Diabetes (used in our papers in CIKM2015 and ICDM2015)
  4. Douban Movie dataset (used in our papers in CIKM2015 and ICDM2015)
  5. Human assessment survey for Douban Movie recommendation (50 samples used in our papers in CIKM2015 and ICDM2015)

Entity Resolution

  1. arXiv hep-th: KDD Cup 2003 publication dataset, hep-th portion of arXiv
  2. CiteSeer: collection of research publications
  3. Cora: a citation dataset from RIDDLE data repository
  4. Cora: a citation dataset from Andrew McCallum's data repository
  5. DBLP: collection of bibliographic entries
  6. DMOZ ontology: a large downloadable ontology
  7. Enron Email Dataset: a dataset of Enron emails
  8. FEBRL Database: Freely Extensible Biomedical Record Linkage
  9. Freedb CD Dataset: Info on various CDs
  10. IMDb: collection of movie-related entries
  11. PubMed/MEDLINE: over 20 million bibliographic entries for biomedical literature
  12. RIDDLE Repository: various data cleaning-related datasets
  13. SPOKE Challenge: (registration is required) collection of labeled webpages for SPOKE Challenge
  14. Stanford Movie Dataset: collection of movie-related entries
  15. UC Irvine Machine Learning Repository: collection of various ML datasets
  16. UIS Database Generator: generates synthetic names and addresses by injecting errors into clean records
  17. U.S. Census Names: frequently occurring first names and surnames from the 1990 Census
  18. Web Disambiguation: collection of labeled webpages used by Bekkerman and McCallum in WWW'05
  19. WEPS Corpus: collection of labeled webpages used by Artiles, Gonzalo, and Verdejo in SIGIR'05
  20. Wiktionary: downloadable free-content multilingual dictionary

List of attachments

| Kind | Attachment Name | Size | Version | Date Modified | Author | Change note |
|---|---|---|---|---|---|---|
| | 25topics.txt | 0.6 kB | 1 | 07-Oct-2014 22:29 | yangdeqing | |
| | Data_Diabetes.rar | 85,956.2 kB | 1 | 10-Nov-2014 14:06 | yangdeqing | |
| | Douban_br.sql | 38,930.3 kB | 1 | 09-Jan-2018 15:26 | yangdeqing | |
| zip | | 43.4 kB | 1 | 02-Nov-2015 00:14 | yangdeqing | |
| | dblp.rar | 6,779.8 kB | 1 | 12-Mar-2013 19:48 | fd_yangdq | DBLP dataset (with topics) |
| zip | | 76,593.6 kB | 1 | 03-Oct-2017 07:18 | yangdeqing | 5k users and 42k movies of Douban |
| | movie2.txt | 785.9 kB | 1 | 02-Apr-2015 09:19 | yangdeqing | Douban Movies |
| zip | | 145,489.4 kB | 1 | 03-Oct-2017 07:24 | yangdeqing | Weibo user tags and followships |
This page was last changed on 19-Jan-2019 23:06 by yangdeqing