Real Datasets


The data come from the article by Monica Chagoyen et al. [LINK]. The data set was reconstructed from the information and sources given in the article. It is a matrix containing the number of occurrences of words in the context of genes. Genes were selected from the SGD database (Saccharomyces cerevisiae genome), and each gene is associated with one of eight broad biological processes, each described by a Gene Ontology (GO) term:

  • cell cycle (GO:0007049),
  • cell wall organization and biogenesis (GO:0007047),
  • DNA metabolism (GO:0006259),
  • lipid metabolism (GO:0006629),
  • protein biosynthesis (GO:0006412),
  • response to stress (GO:0006950),
  • signal transduction (GO:0007165),
  • transport (GO:0006810).

All genes were annotated by experts with a total of 7080 articles, with at least one article per gene. We downloaded all documents listed in the article from the PubMed database. A single document is constructed by concatenating the titles and abstracts of the articles associated with a gene. After removing very frequent terms (appearing in more than 80% of genes) and very rare terms (appearing in fewer than 4% of genes), we obtained 3031 words. Term frequencies were weighted by the IDF (inverse document frequency) measure.
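The filtering and weighting steps described above can be sketched roughly as follows. This is only an illustration on a toy matrix, not the code used to build the data set: the function name `filter_and_weight`, the inclusive handling of the 4% and 80% thresholds, and the IDF variant log(N / df) are all assumptions, since the description does not say which exact formula was applied.

```python
import numpy as np

def filter_and_weight(counts, min_frac=0.04, max_frac=0.80):
    """Filter terms by document frequency, then apply IDF weighting.

    counts: (n_genes, n_terms) array of raw term counts, one row per
    gene-document. Hypothetical helper, not part of the original data set.
    """
    counts = np.asarray(counts, dtype=float)
    n_genes = counts.shape[0]
    # Document frequency: number of gene-documents in which each term occurs.
    df = (counts > 0).sum(axis=0)
    frac = df / n_genes
    # Keep terms that are neither too rare nor too frequent.
    keep = (frac >= min_frac) & (frac <= max_frac)
    # Assumed IDF variant: log(N / df). Other variants exist.
    idf = np.log(n_genes / df[keep])
    return counts[:, keep] * idf, keep

# Toy example: 5 gene-documents, 4 terms.
toy = [
    [2, 1, 0, 0],
    [1, 0, 3, 0],
    [4, 0, 0, 0],
    [1, 2, 0, 0],
    [3, 0, 0, 0],
]
weighted, keep = filter_and_weight(toy)
print(keep)            # which terms survived the frequency filter
print(weighted.shape)  # (genes, kept terms)
```

In the toy matrix the first term occurs in every document and the last term in none, so both are dropped; only the two middle terms are kept and reweighted.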

.VMATRIX


The data, taken from the publication by Rosenwald A, et al. [LINK], are a gene expression matrix consisting of 240 microarray experiments. Each experiment comes from a different leukemia patient. Each patient in the test group has one of six subtypes of leukemia. A data set defined this way could be expected to consist of six biclusters, each associated with a different subtype of the disease.

.ARFF

.VMATRIX


The data, taken from the publication by Yeoh EJ, et al. [LINK], are a gene expression matrix consisting of 233 microarray experiments.

.VMATRIX


The data, taken from the publication by van 't Veer LJ, et al. [LINK], are a gene expression matrix consisting of 98 microarray experiments.

.VMATRIX

