Network datasets

  1. Pajek Datasets
    When publishing results obtained using this dataset, the original authors should be cited. In addition, this collection should be cited as:
    Vladimir Batagelj and Andrej Mrvar (2006): Pajek datasets. <url: http://vlado.fmf.uni-lj.si/pub/networks/data/>
  2. Newman's Network data
  3. Stanford Large Network Dataset Collection
    The datasets available on the website were mostly collected (scraped) for the purposes of our research.
    Please cite: http://snap.stanford.edu/data/
  4. Social Network Dataset
  5. DBLP
    The DBLP Computer Science Bibliography
  6. Tweet social graphs
    Twitter follower graph
  7. Online Social Network data
    Flickr (users, links, groups, group membership), LiveJournal, Orkut, YouTube (users, links, groups, group membership)
  8. Datamob / Datasets / social networks
  9. Extracted DBLP Dataset
    For this dataset, we extracted more than 180,000 papers with title, authors, year, venue, and topics. There are 25 topics in total, identified by SVM classifiers. It was used in our papers "Which Topic Will You Follow?" (ECML-PKDD 2012) and "Towards Topic Following in Heterogeneous Information Networks" (ASONAM 2015).

Data Mining & Recommendation Datasets

  1. Prostate cancer
  2. Cross-domain recommendation dataset of Diabetes (used in our papers in CIKM2015 and ICDM2015)
  3. Douban Movie dataset (used in our papers in CIKM2015 and ICDM2015)
  4. Human assessment survey for Douban Movie recommendation (50 samples used in our papers in CIKM2015 and ICDM2015)

Entity Resolution

  1. arXiv hep-th: KDD Cup 2003 publication dataset, hep-th portion of arXiv
  2. CiteSeer: collection of research publications
  3. Cora: a citation dataset from RIDDLE data repository
  4. Cora: a citation dataset from Andrew McCallum's data repository
  5. DBLP: collection of bibliographic entries
  6. DMOZ ontology: a large downloadable ontology
  7. Enron Email Dataset: a dataset of Enron emails
  8. FEBRL Database: Freely Extensible Biomedical Record Linkage
  9. Freedb CD Dataset: Info on various CDs
  10. IMDb: collection of movie-related entries
  11. PubMed/MEDLINE: over 20 million bibliographic entries for biomedical literature
  12. RIDDLE Repository: various data cleaning-related datasets
  13. SPOKE Challenge: (registration is required) collection of labeled webpages for SPOKE Challenge
  14. Stanford Movie Dataset: collection of movie-related entries
  15. UC Irvine Machine Learning Repository: collection of various ML datasets
  16. UIS Database Generator: generates synthetic names and addresses by injecting errors into clean records
  17. U.S. Census Names: frequently occurring first names and surnames from the 1990 Census
  18. Web Disambiguation: collection of labeled webpages used by Bekkerman and McCallum in WWW'05
  19. WEPS Corpus: collection of labeled webpages used by Artiles, Gonzalo, and Verdejo in SIGIR'05
  20. Wiktionary: downloadable free-content multilingual dictionary
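
Item 16 above describes generating synthetic test data by injecting errors into clean records. A minimal sketch of that general idea follows. It is not the UIS Database Generator's actual algorithm: the function names, the three edit operations, and the `error_rate` parameter are all assumptions made for illustration.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def inject_typo(text, rng):
    """Apply one random character-level edit: delete, substitute, or transpose.

    Hypothetical helper; real generators typically use richer error models.
    """
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    op = rng.choice(["delete", "substitute", "transpose"])
    if op == "delete":
        return text[:i] + text[i + 1:]
    if op == "substitute":
        return text[:i] + rng.choice(ALPHABET) + text[i + 1:]
    # transpose: swap the characters at positions i and i + 1
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def make_noisy_copies(record, n_copies, error_rate, seed=0):
    """Return n_copies of record, corrupting each field with probability error_rate."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    copies = []
    for _ in range(n_copies):
        noisy = {}
        for field, value in record.items():
            if rng.random() < error_rate:
                noisy[field] = inject_typo(value, rng)
            else:
                noisy[field] = value
        copies.append(noisy)
    return copies

clean = {"name": "john smith", "address": "12 main street"}
duplicates = make_noisy_copies(clean, n_copies=3, error_rate=0.5)
for dup in duplicates:
    print(dup)
```

Because the clean record is known, every noisy copy carries a ground-truth label for free, which is what makes this kind of synthetic data convenient for evaluating entity resolution methods.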

Attachments

Type  Name                         Size         Version  Modified           Author      Change note
txt   25topics.txt                 0.6 kB       1        07-Oct-2014 22:29  yangdeqing
rar   Data_Diabetes.rar            85,956.2 kB  1        10-Nov-2014 14:06  yangdeqing
zip   HumanAssess_DoubanMovie.zip  43.4 kB      1        02-Nov-2015 00:14  yangdeqing
rar   dblp.rar                     6,779.8 kB   1        12-Mar-2013 19:48  fd_yangdq   DBLP dataset (with topics)
txt   movie2.txt                   785.9 kB     1        02-Apr-2015 09:19  yangdeqing  Douban Movies
This page was last modified by yangdeqing on 02-Nov-2015 00:17.