Tools and Databases
DMA
DMA (Diabetes Mellitus & Microbiota Associations) contains host-microbe associations identified by cross-measurement association analyses of multi-omics datasets stratified by stool type, an important clinical parameter identified in our recently published work. The dataset reveals heterogeneity in host-microbe associations and will help researchers study stratified microbial effects within the T2DM patient population.
ABVFCD
The ABVFCD database contains 190,539 virulence gene sequences from 1,942 strains of Acinetobacter baumannii collected from 25 provinces in mainland China. By classifying these virulence gene sequences, we aim to support research on the spatiotemporal evolution of Acinetobacter baumannii virulence genes. The current classification information in the database includes the nationwide spatiotemporal distribution of virulence genes and the clinical status of these genes. In the future, we will incorporate classification information for potential unknown pathogenic factors predicted using host-bacteria and protein-protein interaction algorithms.
IPEV
IPEV applies a CNN to distinguish prokaryotic and eukaryotic viruses in virome data. It is built on Python 3.8.6 and TensorFlow 2.3.1. IPEV calculates a set of scores that reflect the probability that each input sequence fragment is a prokaryotic or eukaryotic viral sequence. Through parallelism and algorithmic optimization, IPEV returns results quickly.
DREEM
We constructed DREEM, a database of disease-related marker genes in the human gut microbiome, by comprehensively retrieving all currently available data resources, comprising 18.63T of WGS data from 1,729 samples, and applying state-of-the-art bioinformatics tools and well-designed statistical analysis. DREEM contains a total of 1,953,046 genes covering six diseases, classified into six groups, one per disease. In addition, 5,100 Core-DREEM genes are defined as a common set shared by the diseases.
InteMAP
We developed a pipeline named InteMAP (Integrated Metagenome Assembly Pipeline for short reads) that integrates individual assemblers whose strengths complement one another in assembling metagenomic sequences. By comparing InteMAP with individual assemblers on both synthetic and real NGS metagenomic data, we showed that InteMAP produces better assemblies, with a longer total contig length, higher contiguity, and more genes than the individual assemblers.
LncADeep
We propose LncADeep, an ab initio lncRNA identification and functional annotation tool. LncADeep outperforms state-of-the-art tools in predicting lncRNAs and lncRNA-protein interactions, and can automatically provide informative functional annotations for lncRNAs.
MetaComp
MetaComp can process all types of meta-omics data, including metagenomics, metatranscriptomics, metaproteomics and metabolomics data.
MAP
MAP is a de novo assembly approach, and its implementation, based on an improved Overlap/Layout/Consensus (OLC) strategy combined with several special algorithms. MAP uses mate-pair information, which makes it more applicable to the shotgun DNA reads (recommended length > 200 bp) currently widely used in metagenome projects. Extensive tests on simulated data show that MAP can be superior to both Celera and Phrap for typical longer reads from Sanger sequencing, and has an evident advantage over Celera, Newbler, and the newest Genovo for typical shorter reads from 454 sequencing.
MetaTISA
MetaTISA is a tool that aims to improve the translation initiation site (TIS) prediction of current gene finders for metagenomes. The method employs a two-step strategy: it first clusters metagenomic fragments into phylogenetic groups and then predicts TISs independently for each group in an unsupervised manner. As evaluated on experimentally verified TISs, MetaTISA greatly improves the TIS prediction accuracy of current gene finders.
MED2.1
MED2.1 is a non-supervised prokaryotic gene prediction method that integrates MED2.0 and TriTISA, an iterative self-learning translation initiation site (TIS) prediction algorithm. As an update of MED2.0, MED2.1 replaces the previous TIS model with TriTISA, which improves the prediction accuracy for both 3' and 5' ends.
MetaGUN
MetaGUN is a novel gene prediction protocol for metagenomic fragments based on a machine learning approach, the SVM. It gives accurate predictions for both the 3' and 5' ends of genes on fragments of various lengths. In particular, it makes the most reliable predictions among current metagenomic gene finders. Application to two human gut microbiome samples indicates that MetaGUN tends to predict more potential novel genes than other current metagenomic gene finders.
MID
MID identified previously unknown MIs from the 1KGP that overlap with genes and regulatory elements in the human genome. We also identified MIs in cancer cell lines from the Cancer Cell Line Encyclopedia (CCLE). Our tool is therefore expected to be useful for the study of MIs as a type of genetic variant in the human genome.
PROPER
PROPER is a stand-alone, cross-platform tool for predicting operons and prokaryotic transcription units, with visualization of the results.
ProTISA
ProTISA is intended to collect confirmed translation initiation sites (TISs) for prokaryotic genomes. As of Oct 2008, it includes data for 728 genomes (676 Bacteria and 52 Archaea) with more than 700,000 confirmed TISs. The confirmed data have supporting evidence from different sources, including experimental records in the public protein database Swiss-Prot, literature, conserved domain search and sequence alignment among orthologous genes. Combined with predictions from the state-of-the-art TIS predictors MED-Start/MED-StartPlus and TriTISA and with annotations of potential regulatory signals, the database can serve as a refined annotation resource for the public database RefSeq.
SigmaPromoter
A novel promoter prediction method based on a multiple sigma factors model for bacterial genomes. Though non-housekeeping sigma subunits in bacteria play an important role in extracellular stimuli response, currently no accurate high-throughput techniques are able to annotate the type of sigma promoter utilized by a transcript. In this work, we present a novel method, which pro...
TriTISA1
TriTISA is a TIS post-processor that refines the annotation or prediction of translation initiation sites (TISs) produced by an existing system for microbial genomes. The current version provides options for post-processing genome annotations from public databases such as GenBank and RefSeq, and gene predictions from widely used gene finders such as GeneMark and Glimmer.
PPR-Meta
PPR-Meta is designed to identify metagenomic sequences as phages, chromosomes or plasmids. The program calculates three scores reflecting the likelihood that each input fragment is a phage, chromosome or plasmid. PPR-Meta can run either on a virtual machine or on a physical host. For non-computer professionals, we recommend running the virtual machine version of PPR-Meta on a local PC, so that users do not need to install any dependency packages. If a GPU is available, you can also choose the physical host version, which is automatically accelerated by the GPU and is better suited to large-scale data.
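As a rough illustration of how the three per-fragment scores might be turned into a single label, the sketch below picks the category with the highest score. This decision rule, the function name `assign_category`, and the dictionary layout are assumptions made for illustration; they are not taken from PPR-Meta itself, whose output format and labeling rule may differ.

```python
from typing import Dict

# Hypothetical category names; PPR-Meta's actual output columns may be named differently.
CATEGORIES = ("phage", "chromosome", "plasmid")


def assign_category(scores: Dict[str, float]) -> str:
    """Return the category with the highest score (assumed decision rule)."""
    missing = [name for name in CATEGORIES if name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    # Iterate in a fixed order so that ties are resolved deterministically.
    return max(CATEGORIES, key=lambda name: scores[name])


if __name__ == "__main__":
    example = {"phage": 0.72, "chromosome": 0.20, "plasmid": 0.08}
    print(assign_category(example))
```

Taking the maximum is the simplest possible rule; a real analysis might also require the top score to exceed a minimum value before accepting the label.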
PlasGUN
PlasGUN is a gene prediction tool for plasmid metagenomic short reads that uses deep learning. PlasGUN takes a short-read file in "fasta" format as input and outputs a tabular file containing the coordinates of the predicted ORFs. The software is suitable for metagenomic data in which plasmid DNA has been enriched by either an experimental or a computational approach. PlasGUN performs better on plasmid short-read data than traditional tools, which are designed primarily for chromosomal short reads. Tests also showed that PlasGUN can identify more potential novel genes than other gene prediction tools, which is important for plasmid studies.
DeePhage
DeePhage is designed to identify virome sequences as temperate phage-derived or virulent phage-derived fragments. The program calculates a score between 0 and 1 for each input fragment. A sequence with a score higher than 0.5 is regarded as a virulent phage-derived fragment, and a sequence with a score lower than 0.5 is regarded as a temperate phage-derived fragment. DeePhage can run either on a virtual machine or on a physical host. For non-computer professionals, we recommend running the virtual machine version of DeePhage on a local PC, so that users do not need to install any dependency packages. If a GPU is available, you can also choose the physical host version, which is automatically accelerated by the GPU and is better suited to large-scale data.
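The 0.5 cutoff described above can be written as a small helper. This is only a sketch of the stated rule, not DeePhage's own code: the function name `classify_fragment` is hypothetical, and because the description does not say how a score of exactly 0.5 is handled, the sketch labels that case "undetermined" as an assumption.

```python
def classify_fragment(score: float, cutoff: float = 0.5) -> str:
    """Label a fragment from its score using the cutoff stated in the description."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score > cutoff:
        return "virulent"
    if score < cutoff:
        return "temperate"
    # Assumption: the description leaves a score equal to the cutoff unspecified.
    return "undetermined"


if __name__ == "__main__":
    for value in (0.91, 0.12, 0.5):
        print(value, classify_fragment(value))
```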
LightCUD
LightCUD is a validated, high-performance program based on a machine-learning algorithm (LightGBM). LightCUD is implemented in Python and packaged with embedded customized databases so that it can be used without installation. With WGS data or 16S rRNA sequencing data from gut samples as input, LightCUD can discriminate IBD from healthy controls with high accuracy and further identify the specific form of IBD.
VirGenFunD
Viruses are an important component of the human gut microbiota and play a key role in maintaining gut function. However, the role of viruses has been neglected in the many microbiome-based studies that have mainly focused on bacteria. As a result, the role of viruses in the microbial community remains incompletely understood, especially in the context of disease. VirGenFunD (gut Viral Genes and Functional classification Database) is a database of disease-related viral genes with multi-level annotations. Built with a well-designed pipeline for precise viral gene identification, the current version of VirGenFunD contains 3,351,765 viral gene sequences detected in metagenomic datasets of five diseases and the corresponding healthy controls: irritable bowel syndrome (IBS), type 2 diabetes (T2D), Crohn’s disease (CD), colorectal cancer (CRC) and liver cirrhosis (LC). We manually classified viral gene functions into 16 categories to facilitate systematic view of function of virus...
HoPhage
HoPhage (Host of Phage) is a computational tool that integrates two modules, one using deep learning and the other a Markov chain model, to identify the host of a given phage fragment from metagenome or metavirome data at the genus level. HoPhage demonstrates superior performance on short fragments across a wide candidate host range at every taxonomic level when tested on a benchmark dataset of artificial phage contigs and on real virome data.
SoDpipe
SoDpipe is an integrated pipeline for automatically analyzing redundant genes and their regulation in prokaryotes. SoDpipe provides a framework to support genomic surveillance of occurrence, gene expression and adaptive evolution from the perspective of gene redundancy in prokaryotes. SoDpipe takes both genome assemblies and whole genome sequencing data as input and performs the entire analysis automatically through a single command-line instruction. It generates a detailed report of the duplicated gene clusters, types of translation initiation signals (SD-like, TA-like, Atypical, or no signal), signal motifs, and the start site of the signal. The pipeline can potentially be extended by adding new rules, and it substantially reduces the effort of issuing commands.