DREEM
------------------------------------------------------------------------------------------------------------------------
1. Disease-related marker gene collection and release:
DREEM collects all currently available Disease-RElatEd Marker (DREEM) genes of the human gut microbiome from whole-genome sequencing (WGS) data released in GenBank and EMBL. Short reads totaling 18.63T from 1,729 samples were processed with a unified procedure that combines established bioinformatics tools with statistical analysis. In total, 1,953,046 DREEM genes were collected as the universal DREEM data set, which covers six diseases.
In addition, the DREEM genes associated with each disease were extracted to form six disease-specific data sets (DREEM_T2D, DREEM_Obesity, DREEM_Crohn, DREEM_UC, DREEM_LC and DREEM_Athero). A further 5,100 DREEM genes were found to be shared by five of the diseases (T2D, Crohn's disease, liver cirrhosis, obesity and UC) and were organized as a separate data set (Core-DREEM genes). Altogether, 8 data sets are available on the webpage for browsing and downloading.
Each gene ID consists of a serial number in DREEM, followed by the short-read source series in GenBank or EMBL, disease categories, integrity description, species classification and functional annotation. For complete genes, i.e. those with intact 3' and 5' ends, the length is also recorded in the ID. Every entry in DREEM consists of a gene ID followed by the nucleotide sequence.
------------------------------------------------------------------------------------------------------------------------
2. Data illustration:
1) Rarefaction curves of Significant Genes among samples from each publication.
The number of samples and the number of Significant Genes were both normalized by their maximum value (Supplementary Table 1). Because the AUC values (Supplementary Table 6) exceeded 0.85, the number of samples is considered sufficient to capture almost all potential Significant Genes and DREEM genes related to the six diseases.
2) Distribution of DREEM gene length in four integrity categories.
All DREEM genes are classified into four groups by integrity. The length distribution of each group is shown as a boxplot, which marks the median, the mean and the normal range of the data. The average length of complete genes is 964 bp. The average length of all DREEM genes is 795 bp.
3) Number of DREEM genes shared between multiple data sets of DREEM.
U, C, O, L and T stand for the DREEM_UC, DREEM_Crohn, DREEM_Obesity, DREEM_LC and DREEM_T2D data sets, respectively. There are 5,100 Core-DREEM genes, which are shared by all five of these data sets. Most DREEM genes are unique to one data set. Nevertheless, DREEM_UC and DREEM_Crohn share more genes than any other pair of data sets, suggesting a stronger association between the two types of inflammatory bowel disease (IBD).
4) Taxonomic annotation of DREEM genes (at phylum level).
S3A. Taxonomic annotation of the integrated DREEM genes.
S3B. Taxonomic annotation of Core-DREEM genes.
The area of each pie slice shows the proportion of the corresponding phylum. The phyla Proteobacteria, Firmicutes and Bacteroidetes dominate the gut microbial community.
5) Functional annotation via BLAST against the COG database (e-value 10^-5).
S4A. Functional annotation of integrated DREEM genes. Most DREEM genes are functionally classified into metabolism or into cellular processes and signaling.
S4B. Functional annotation of core genes.
4,255 (83%) of the 5,100 genes are assigned a function; of these, about 41.1% are involved in metabolism, 23.1% in cellular processes and signaling, and 27.8% in information storage and processing.
6) Metabolic network of Core-DREEM genes.
The metabolic network of Core-DREEM genes gives insight into the complexity of the connections among DREEM genes belonging to different COG categories, which are shown as bubbles in the figure (N.A. denotes the set of Core-DREEM genes without any matched COG category). The height of each peripheral column shows the activity of the corresponding COG category, while the width of each line reflects the degree of interaction between the two categories it connects. The network suggests that the membrane-transport-related categories, i.e. amino acid transport and metabolism (E), inorganic ion transport and metabolism (P), carbohydrate transport and metabolism (G), and nucleotide and lipid transport and metabolism (F, I), are the most complex groups and are the most densely connected with each other.
------------------------------------------------------------------------------------------------------------------------
3. Search page:
For Core-DREEM genes, every item is manually linked to its best hit in NCBI according to sequence similarity.
------------------------------------------------------------------------------------------------------------------------
4. Potential Users:
As the first released database focusing on DREEM genes of the gut microbiome, DREEM offers a broad and detailed view of the microbial genetic diversity associated with the covered diseases. We hope that DREEM can serve as a reference catalogue for future studies of the pathophysiological role of the gut microbiome in human health. Moreover, we expect that DREEM, together with knowledge of the intestinal microbiota, will support the development of new diagnostic and therapeutic strategies for host diseases.
Core-DREEM genes have shown potential for designing microbiota-targeted biomarkers, which may become a powerful tool for disease detection and treatment. Nevertheless, the increasingly rapid expansion of related research will continue to challenge our capacity to compile the latest sequenced samples for other relevant diseases, such as depression and IBS. We will keep updating the database so that it stays current as new disease-related gut metagenomic studies are published.
We plan to extend DREEM with significant new functionality. One valuable direction of ongoing development is to establish disease prediction models that help identify relevant diseases based on the DREEM genes. Because DREEM provides references for designing animal experimental models to explore whether these effects interact in ways that influence outcome, drug targets may also be recommended over time.
------------------------------------------------------------------------------------------------------------------------
If you have any questions, please contact us at any time.
Email: xucmpku@163.com
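------------------------------------------------------------------------------------------------------------------------
Illustration of the rarefaction check in section 2, item 1): the sketch below shows one plausible way to obtain a normalized AUC for a rarefaction curve. It assumes that both axes are divided by their maximum value, that the curve starts at the origin, and that the area is computed with the trapezoidal rule. The exact method used for DREEM is not stated here, so this is a reading of the description rather than the original procedure. The function name and the example numbers are hypothetical.

```python
def normalized_rarefaction_auc(sample_counts, gene_counts):
    """Area under a rarefaction curve after scaling both axes to [0, 1].

    Assumption: trapezoidal rule on a curve anchored at (0, 0).
    """
    if len(sample_counts) != len(gene_counts) or len(sample_counts) < 2:
        raise ValueError("need at least two paired (samples, genes) points")

    max_samples = max(sample_counts)
    max_genes = max(gene_counts)
    if max_samples <= 0 or max_genes <= 0:
        raise ValueError("maximum values must be positive")

    # Normalize each axis by its own maximum value.
    points = sorted(
        (s / max_samples, g / max_genes)
        for s, g in zip(sample_counts, gene_counts)
    )

    # Assumption: zero samples yield zero genes, so anchor the curve at the origin.
    if points[0][0] > 0:
        points.insert(0, (0.0, 0.0))

    # Trapezoidal rule over consecutive points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    # Hypothetical counts, not taken from the DREEM supplementary tables.
    samples = [10, 20, 40, 80, 160]
    genes = [5200, 7400, 8900, 9700, 10000]
    auc = normalized_rarefaction_auc(samples, genes)
    print(f"normalized AUC = {auc:.4f}, exceeds 0.85: {auc > 0.85}")
```

A curve that saturates early gives a value close to 1, while a curve that is still rising linearly gives about 0.5, which is why a high normalized AUC is read as a sign that adding samples would reveal few further genes.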