General description |
Computational gene finding is receiving increasing attention in metagenomic sequencing projects. Accurate prediction of both the 5'-end and the 3'-end of a gene is important for the experimental identification of purified protein products, as well as for the prediction of promoters and operons. However, almost all current algorithms fail to locate the translation initiation sites (TISs) of protein-coding genes accurately; they usually perform poorly on experimentally verified gene data. To address this problem, we provide MetaTISA, a tool that post-processes the predictions of metagenomic gene finders with the aim of improving TIS prediction. At present it supports MetaGene (Noguchi et al., 2006, 2008), and support for other gene finders is planned for later releases. To facilitate the study of the diversity of translation mechanisms among different environments, we also provide a MATLAB script that visualizes the positional weight matrix generated for each binned cluster.
|
Access |
Online Prediction Download stand-alone MetaTISA (Linux/Windows).
|
Related resources |
TiCo: a webserver for TIS prediction in complete microbial genomes
ProTISA: a database for TIS annotation for microbial genomes
|
Please Cite |
Gang-Qing Hu, Jiang-Tao Guo, Yong-Chu Liu, and Huaiqiu Zhu. MetaTISA: Metagenomic Translation Initiation Site Annotator for improving gene start prediction. Bioinformatics 2009,25(14):1843-1845. |
References |
Hu, G.-Q., Zheng, X.-B., Zhu, H. and She, Z.S. (2009) Prediction of translation initiation site for microbial genomes with TriTISA. Bioinformatics, 25(1): 123-125.
Hoff, K.J., Tech, M., Lingner, T., Daniel, R., Morgenstern, B. and Meinicke, P. (2008) Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics, 9: 217.
Noguchi, H., Taniguchi, T. and Itoh, T. (2008) MetaGeneAnnotator: Detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res., 15: 387-396.
Noguchi, H., Park, J. and Takagi, T. (2006) MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res., 34: 5623-5630.
Sandberg, R., Winberg, G., Branden, C.-I., Kaske, A., Ernberg, I. and Coster, J. (2001) Capturing whole-genome characteristics in short sequences using a Naive Bayesian classifier. Genome Res., 11: 1404-1409.
|
|
Algorithm Description
MetaTISA is designed to post-process the output of an existing metagenomic gene annotation pipeline, with the aim of improving the accuracy of its translation initiation site (TIS) predictions. It takes as input a set of shotgun fragments and a set of CDS annotations, and outputs refined initiation sites.
The method predicts TISs in two steps. First, it uses binning techniques to classify all input fragments. Fragments binned into the same clade are assumed to have a close phylogenetic origin and hence to share a similar translation initiation mechanism. Then a modified version of the recently developed tool TriTISA predicts the TISs for each clade of fragments in an unsupervised, iterative manner.
Binning assigns an anonymous sequence fragment to a phylogenetic group. Published binning methods include BLAST, k-mer (Sandberg et al., 2001), SOM (Abe et al., 2003), PhyloPythia (McHardy et al., 2007) and TETRA (Teeling et al., 2004). At this stage we have implemented the k-mer method, and we plan to include other methods in the future.
The k-mer method is supervised and employs a naive Bayesian classifier for sequence binning. The training data are the k-mer frequencies compiled for each phylogenetic clade. Given a sequence fragment, the method calculates a likelihood score for each clade from the pre-compiled frequencies and assigns the fragment to the clade with the highest score (Sandberg et al., 2001). We prepared the training sets from completely sequenced genomes, choosing one genome per genus to reduce redundancy (the genus list). Each genome represents a phylogenetic clade, so fragments are binned at the genus level.
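The k-mer binning idea can be illustrated with a minimal Python sketch. This is not the MetaTISA implementation: the function names, the add-one pseudocount and the toy "genomes" are assumptions made for illustration, and k is kept small so the example runs instantly.

```python
import math
from collections import Counter

def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping any k-mer with a non-ACGT letter."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

def train_clade(genome, k, pseudocount=1.0):
    """Compile smoothed log k-mer frequencies for one clade (assumed smoothing)."""
    counts = kmer_counts(genome, k)
    total = sum(counts.values()) + pseudocount * 4 ** k
    log_freq = {kmer: math.log((c + pseudocount) / total) for kmer, c in counts.items()}
    log_unseen = math.log(pseudocount / total)  # score for k-mers absent from training
    return log_freq, log_unseen

def bin_fragment(fragment, models, k):
    """Assign the fragment to the clade with the highest log-likelihood score."""
    frag_counts = kmer_counts(fragment, k)
    best_clade, best_score = None, -math.inf
    for clade, (log_freq, log_unseen) in models.items():
        score = sum(n * log_freq.get(kmer, log_unseen) for kmer, n in frag_counts.items())
        if score > best_score:
            best_clade, best_score = clade, score
    return best_clade

# Toy training data standing in for one genome per genus (hypothetical).
k = 3
models = {
    "at_rich": train_clade("ATATATTTAATATAT" * 20, k),
    "gc_rich": train_clade("GCGCGGCCGCGC" * 20, k),
}
print(bin_fragment("ATATTTAATA", models, k))
```

Working in log space avoids numerical underflow when many k-mer probabilities are multiplied for a long fragment.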
We recently proposed an unsupervised method for TIS prediction in microbial genomes (Hu et al., 2009). The method classifies all TIS candidates into three categories: true TISs, false TISs upstream of the true TIS (in the non-coding region), and false TISs downstream of the true TIS (in the coding region). The sequence features around the TISs of each category are characterized by a non-homogeneous Markov model. The three models are trained by an iterative self-learning procedure. At each step, the Markov models are combined by a Bayesian methodology that assigns three posterior probabilities to each candidate TIS: the probability that it is a true TIS (Pt), that it lies in the non-coding region (Pnc), and that it lies in the coding region (Pc). TriTISA predicts the candidate with the highest Pt score as the TIS of a gene. The updated annotation constitutes the training set for the next iteration. To further improve prediction accuracy, TriTISA employs a cascade of Markov models of different orders: it first uses a 0th-order Markov model for initial refinement and then moves to higher-order (1st and 2nd) Markov models in later iterations. Tests on simulated data and experimentally verified data show that TriTISA produces more accurate and robust predictions than the state of the art (Hu et al., 2009).
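The Bayesian combination step can be sketched as follows. This is an illustration of Bayes' rule over three categories, not TriTISA's code: the function names, the dictionary keys and the example log-likelihoods and priors are all hypothetical, and the per-category log-likelihoods are assumed to come from the three Markov models.

```python
import math

CATEGORIES = ("Pt", "Pnc", "Pc")  # true TIS, non-coding (upstream), coding (downstream)

def posteriors(loglik, prior):
    """Combine per-category log-likelihoods with priors into posterior probabilities."""
    joint = {c: loglik[c] + math.log(prior[c]) for c in CATEGORIES}
    shift = max(joint.values())  # subtract the maximum for numerical stability
    norm = sum(math.exp(v - shift) for v in joint.values())
    return {c: math.exp(v - shift) / norm for c, v in joint.items()}

def pick_tis(candidates, prior):
    """Return the position of the candidate with the highest Pt posterior.

    candidates: list of (position, loglik_dict) pairs for one gene.
    """
    return max(candidates, key=lambda cand: posteriors(cand[1], prior)["Pt"])[0]

# Hypothetical numbers for two candidate starts of one gene.
prior = {"Pt": 0.2, "Pnc": 0.3, "Pc": 0.5}
candidates = [
    (12, {"Pt": -20.0, "Pnc": -18.0, "Pc": -25.0}),
    (45, {"Pt": -15.0, "Pnc": -22.0, "Pc": -19.0}),
]
print(pick_tis(candidates, prior))
```

In an iterative scheme of the kind described above, the chosen positions would become the training annotation for the next round.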
Here we modified the TriTISA algorithm to post-process annotations for binned metagenomic fragments. CDSs from fragments in the same bin are assumed to share similar translation initiation machinery, and the sequence pattern of each set of TISs is assumed to be homogeneous across the clade. These assumptions allow the parameters to be trained in the same way as for a single genome. CDSs are extended to the 5'-most start candidate before post-processing, and CDSs that are complete at their 5'-ends are used for parameter training. With the converged parameters, TriTISA calculates three scores for each candidate TIS: Pt, Pnc and Pc. For CDSs that are complete at their 5'-ends, the start codon is predicted as the candidate start with the highest Pt score. For CDSs that are incomplete at their 5'-ends, we need to estimate whether the start codon is missing; in other words, does the 5'-most start-codon-like triplet belong to the coding region? We estimate the distribution of Pc from the training set, which readily gives a threshold for deciding whether a candidate lies in the coding region (at the 95% confidence level). For CDSs that are estimated to contain their start codons, we predict the TIS by the same procedure used for the training CDSs.
![](/__local/D/1C/42/40C5FFD8D78C73740937C7D0F27_98BB17F5_D0E4.png?e=.png)
Figure 1 Program flow chart of MetaTISA
Performance Evaluation
Because experimentally verified TISs are lacking in metagenome projects, the only reliable way to evaluate prediction performance is to simulate a metagenome from artificial shotgun sequences of complete microbial genomes. The validity of the k-mer method and of the TriTISA method has been documented previously (Sandberg et al., 2001; Hu et al., 2009). Here we tested their combined effect on TIS prediction for metagenomes, using shotgun sequences simulated from 95 randomly selected genomes plus 5 genomes for which experimentally verified TISs are available (Hu et al., 2009). Two sets of simulations were created with different fragment lengths: L = 700 bps and L = 400 bps. We selected genes with experimentally verified TISs from the five genomes as benchmarks. Since many of their start codons are absent from the fragments, we measure accuracy by sensitivity (sn), TP/(TP+FN), and specificity (sp), TN/(TN+FP), where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. We demonstrate the performance of MetaTISA when processing the output of the newest version of MetaGene, namely MetaGeneAnnotator or MGA (Noguchi et al., 2008) (Tables 1 and 2). Similar improvements are obtained for Neural Net (Hoff et al., 2008) (data not shown).
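As a small worked illustration of the two accuracy measures, the Python sketch below computes sn and sp from raw counts. The function names and the counts in the example are hypothetical and are not taken from the evaluation.

```python
def sensitivity(tp, fn):
    """sn = TP / (TP + FN): fraction of real positives that were recovered."""
    if tp + fn == 0:
        raise ValueError("sensitivity is undefined without any positive cases")
    return tp / (tp + fn)

def specificity(tn, fp):
    """sp = TN / (TN + FP): fraction of real negatives that were rejected."""
    if tn + fp == 0:
        raise ValueError("specificity is undefined without any negative cases")
    return tn / (tn + fp)

# Hypothetical counts, reported as percentages like the tables in this section.
print(f"sn = {100 * sensitivity(tp=90, fn=10):.2f}")
print(f"sp = {100 * specificity(tn=995, fp=5):.2f}")
```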
Table 1. Accuracies calculated according to the RefSeq whole genome annotation

| Genomes | # (700 bps) | MGA sn/sp (700 bps) | MTS sn/sp (700 bps) | # (400 bps) | MGA sn/sp (400 bps) | MTS sn/sp (400 bps) |
|---|---|---|---|---|---|---|
| A. pernix | 1309 | 60.25/98.58 | 80.29/99.04 | 980 | 60.98/98.74 | 79.45/98.80 |
| Synechocystis sp. | 2196 | 78.08/99.34 | 78.14/99.18 | 1608 | 80.20/99.42 | 79.11/99.07 |
| E. coli | 3195 | 87.65/99.70 | 91.27/99.65 | 2455 | 86.97/99.71 | 90.74/99.48 |
| N. pharaonis | 1978 | 83.78/99.21 | 88.50/99.20 | 1502 | 83.55/99.28 | 87.74/98.91 |
| H. salinarum | 1626 | 79.64/99.34 | 88.45/99.46 | 1234 | 78.18/99.39 | 88.02/99.30 |
| Weighted average | - | 80.12/99.33 | 86.10/99.35 | - | 80.24/99.40 | 85.90/99.17 |

MGA: MetaGeneAnnotator; MTS: MetaTISA. Column groups correspond to fragment lengths of 700 bps and 400 bps. Accuracies are calculated over 5 simulation replicates. Figures for the other 95 genomes are included here (Fragment length = 700, Fragment length = 400), but their accuracies should be interpreted with caution because the RefSeq annotation of TISs is not of high quality (Hu et al., 2008).
Table 2. Accuracies calculated according to experimentally verified TISs

| Genomes | # (700 bps) | MGA sn/sp (700 bps) | MTS sn/sp (700 bps) | # (400 bps) | MGA sn/sp (400 bps) | MTS sn/sp (400 bps) |
|---|---|---|---|---|---|---|
| A. pernix | 103 | 64.42/98.71 | 94.52/99.61 | 78 | 60.54/98.70 | 92.29/99.25 |
| Synechocystis sp. | 92 | 83.53/99.25 | 81.33/98.93 | 75 | 82.20/99.29 | 82.56/98.85 |
| E. coli | 733 | 89.56/99.76 | 93.77/99.72 | 562 | 87.06/99.75 | 93.44/99.60 |
| N. pharaonis | 248 | 91.27/99.55 | 97.04/99.58 | 184 | 90.74/99.53 | 95.86/99.11 |
| H. salinarum | 428 | 85.72/99.54 | 96.04/99.71 | 337 | 82.43/99.51 | 94.65/99.48 |
| Weighted average | - | 86.84/99.57 | 94.21/99.64 | - | 84.37/99.56 | 93.40/99.43 |

MGA: MetaGeneAnnotator; MTS: MetaTISA. Column groups correspond to fragment lengths of 700 bps and 400 bps. Accuracies are calculated over 5 simulation replicates.
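The "Weighted average" row of Table 2 can be reproduced with a short Python sketch, under the assumption that the weights are the per-genome counts in the "#" column. The page does not state the weighting explicitly, but the MGA sensitivity at 700 bps computed this way matches the reported value.

```python
# (genome, count from the "#" column, MGA sn at fragment length 700) from Table 2.
ROWS = [
    ("A. pernix", 103, 64.42),
    ("Synechocystis sp.", 92, 83.53),
    ("E. coli", 733, 89.56),
    ("N. pharaonis", 248, 91.27),
    ("H. salinarum", 428, 85.72),
]

def weighted_average(rows):
    """Average the per-genome values, weighting each by its count."""
    total = sum(count for _, count, _ in rows)
    return sum(count * value for _, count, value in rows) / total

print(round(weighted_average(ROWS), 2))  # 86.84, as in the Weighted average row
```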
References |
Abe, T., Kanaya, S., Kinouchi, M., Ichiba, Y., Kozuki, T. and Ikemura, T. (2003) Informatics for unveiling hidden genome signatures. Genome Res., 13: 693-702.
Hu, G.-Q., Zheng, X.-B., Ju, L.-N., Zhu, H. and She, Z.S. (2008) Computational evaluation of TIS annotation for prokaryotic genomes. BMC Bioinformatics, 9: 160.
Hu, G.-Q., Zheng, X.-B., Zhu, H. and She, Z.S. (2009) Prediction of translation initiation site for microbial genomes with TriTISA. Bioinformatics, 25(1): 123-125.
Hoff, K.J., Tech, M., Lingner, T., Daniel, R., Morgenstern, B. and Meinicke, P. (2008) Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics, 9: 217.
McHardy, A.C., Martin, H.G., Tsirigos, A., Hugenholtz, P. and Rigoutsos, I. (2007) Accurate phylogenetic classification of variable-length DNA fragments. Nature Methods, 4: 63-72.
Noguchi, H., Taniguchi, T. and Itoh, T. (2008) MetaGeneAnnotator: Detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res., 15: 387-396.
Noguchi, H., Park, J. and Takagi, T. (2006) MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res., 34: 5623-5630.
Sandberg, R., Winberg, G., Branden, C.-I., Kaske, A., Ernberg, I. and Coster, J. (2001) Capturing whole-genome characteristics in short sequences using a Naive Bayesian classifier. Genome Res., 11: 1404-1409.
Teeling, H., Waldmann, J., Lombardot, T., Bauer, M. and Glöckner, F.O. (2004) TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 5: 163.
|
Download stand-alone MetaTISA |
CDSFormatConverter
We provide tools that convert the CDS output format of a gene finder into the format accepted by MetaTISA.
Linux Release of MetaTISA
The MetaTISA package contains the following items:
Source code of MetaTISA
Default settings file of MetaTISA
Pre-calculated TIS parameters in folder "parameter" (for groups whose samples are insufficient for good self-training of parameters)
k-mer bin model file and taxonomic map file in folder "bin_model"
Example file in folder "example"
Readme file
Download: MetaTISA (tarball)
Windows Release of MetaTISA
The MetaTISA package contains the following items:
Executable file of MetaTISA
Default settings file of MetaTISA
Pre-calculated TIS parameters in folder "parameter" (for groups whose samples are insufficient for good self-training of parameters)
k-mer bin model file and taxonomic map file in folder "bin_model"
Example file in folder "example"
Readme file
Download: MetaTISA (zip); source code for Windows (zip)
Visualization of PWM
The PWM visualization package contains the following items:
PWM Visualization (tarball) |
Input and Output formats |
MetaTISA Input Format
CDS Annotation Format
Sequence Format
MetaTISA Output Format
MetaTISA Input Format
CDS Annotation Format
One accepted input format for CDS annotation is the output of MetaGene prediction. The other is our own MED format: a sequence fragment id beginning with ">", followed by the coordinates of all CDSs in the fragment, one CDS per line. The coordinates of a CDS define its nucleotide region; translation of the CDS into an amino acid sequence starts from the first position for the positive strand and from the second position for the negative strand. To help users post-process results from other gene prediction tools, we provide several tools that convert a gene finder's output format to the MED format (converting formats). Example (MED format):
>NC_000913_1
2 109 -
124 699 -
>NC_000913_2
3 305 +
539 700 +
>NC_000913_3
1 411 -
442 699 -
>NC_000913_4
2 685 -
>NC_000913_5
1 699 +
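For readers who want to load MED-format annotations in their own scripts, the following Python sketch parses the layout shown in the example above. It is an illustrative reader, not part of MetaTISA; the function name and the returned data structure are assumptions.

```python
def parse_med(text):
    """Parse MED-format text into {fragment_id: [(start, end, strand), ...]}."""
    annotations = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("header line without a fragment id")
            current = fields[0]
            annotations[current] = []
        else:
            if current is None:
                raise ValueError("CDS line found before any '>' header")
            start, end, strand = line.split()
            if strand not in ("+", "-"):
                raise ValueError(f"unexpected strand symbol: {strand!r}")
            annotations[current].append((int(start), int(end), strand))
    return annotations

example = """>NC_000913_1
2 109 -
124 699 -
>NC_000913_4
2 685 -
"""
for fragment_id, cds_list in parse_med(example).items():
    print(fragment_id, cds_list)
```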
Sequence Format
The metagenome sequence is in FASTA format. The first line of each record is recognized as the unique id of the sequence fragment. Example:
>mgutLn1_U_BL_aaa09a05_b1 Mouse Gut Community PT3 : mgutLn1_U_BL_aaa09a05_b1
AAATCTCGCCCTGTGGTGGATTCCTTTTCCCATTGCCCGATCTTATTTTT
ATCTTCCAAAAATGACGAACTGGACAAGATCCTGGGTCTTTCGGTGGGCG
GGGATGATTACGTGGCAAAGCCGTTCAGCCCGAAGGAGATCGCGTATCGG
GTCAAGGCGCAGCTCCGGCGGGCCGCGTATCAGCAAGACCCGTCGGAGGA
GGAGCTCATAAAAACAGGGGAATTGGAAATTGACGTGGAGGGCTGCAGGG
TCACAAAAGGCGGCAGCCCCATAGAACTGACCGCGCGGGAATTTGAAATC
CTGCGGTATCTGGCGGAAAATCAAGGCCGGGTCATCAGCCGCGAACGCTT
ATATGAAACCATCTGGGGCGAGGACAGCTTCGGGTGCGACAATACGGTCA
TGGTGCATATCCGGCATCTGCGTGAAAAAATAGAGGACGATCCCGCGGCG
CCCCGATACATCATCACGATGAAAGGATTAGGCTATAAGCTGGTGGACCC
TTATGAAGAATAAAAGCGATCTCAATCTGTTTTTTCGTTCGTTCGGCATT
GTCGTGATTGTGATCTTCGCGGCCATTGCAGCGGGGATATGCCTGTTTTA
TTATGTGTTCGCGATTCCGGCGCGGGAGGGACTCAGCCTGGCCTCATGGC
CAGACGTGTATACAGACAATTTTTCCCTTCAGCTTGAAGAAGAACAGGGA
GAGCTTAAAGTAAAAGAATTCGGGATTGAAGATCTGGACCGGTATGGCTT
ATGGCTGCAGGTGATCGATGAAACGGGACAGGAGTTTTTTCACACAATAA
GCCGGAGACCTGTCCCAACAGCTATACGGCCTCGAGCTTTTGGCATTCGG
GTACGAACGTTTA
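A FASTA file of this kind can be read with a few lines of Python, as sketched below. The sketch assumes that the id is the first whitespace-delimited token of the header line; whether MetaTISA uses the token or the whole header line as the id is not stated here, so treat that choice as an assumption. The function name is hypothetical.

```python
def read_fasta(lines):
    """Read FASTA records from an iterable of lines into {seq_id: sequence}."""
    chunks = {}
    seq_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the id is the first token after ">".
            seq_id = line[1:].split()[0]
            if seq_id in chunks:
                raise ValueError(f"duplicate sequence id: {seq_id}")
            chunks[seq_id] = []
        elif seq_id is not None:
            chunks[seq_id].append(line.upper())
    return {name: "".join(parts) for name, parts in chunks.items()}

example = [
    ">mgutLn1_U_BL_aaa09a05_b1 Mouse Gut Community PT3 : mgutLn1_U_BL_aaa09a05_b1",
    "AAATCTCGCCCTGTGGTGGATTCCTTTTCCCATTGCCCGATCTTATTTTT",
    "GTACGAACGTTTA",
]
records = read_fasta(example)
print({name: len(seq) for name, seq in records.items()})
```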
MetaTISA Output Format
MED Format
The MED format gives a sequence id and the corresponding CDS annotation, as described above. View the example.
GFF Format
The output in GFF (General Feature Format) follows the specification of the Sanger Institute: <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
Example:
##gff-version 3
##MetaTISA
##metagenome sequence name: NC_000913_45
NC_000913_45 MetaTISA CDS 3 395 . - .
##gff-version 3
##MetaTISA
##metagenome sequence name: NC_000913_255
NC_000913_255 MetaTISA CDS 2 298 . - .
NC_000913_255 MetaTISA CDS 320 616 . - .
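The relationship between the MED and GFF outputs can be sketched in Python as below. The sketch mirrors the field order and header comments of the example above; the function name is hypothetical, and the use of tab characters as the field separator is an assumption, since the separator cannot be recovered from the example as displayed.

```python
def cds_to_gff_lines(seq_id, cds_list, source="MetaTISA"):
    """Format the CDSs of one fragment as GFF-style lines.

    cds_list: iterable of (start, end, strand) tuples, as in the MED format.
    """
    lines = [
        "##gff-version 3",
        "##MetaTISA",
        f"##metagenome sequence name: {seq_id}",
    ]
    for start, end, strand in cds_list:
        # seqname, source, feature, start, end, score, strand, frame
        fields = [seq_id, source, "CDS", str(start), str(end), ".", strand, "."]
        lines.append("\t".join(fields))  # tab separator is an assumption
    return lines

for line in cds_to_gff_lines("NC_000913_255", [(2, 298, "-"), (320, 616, "-")]):
    print(line)
```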
|
Configuration parameters |
Sequence region around start codon
Max order of Markov model
Minimal # of CDS in training
Binning Method
Start and Stop Codon
Sequence region around start codon
These parameters specify the region of the sequence around the start codon that is used to calculate the position weight matrix:
Max order of Markov model
When training parameters, TriTISA uses a cascade of Markov models from lower to higher order. By default, three models are cascaded, namely the 0th, 1st and 2nd order. This option sets the maximum order of the Markov model. Default: 2
Minimal # of CDS in training
Because our method begins with a 0th-order Markov model, it needs to estimate only 4 parameters at each nucleotide position plus 3 prior probabilities. However, if the training set of a binned group is too small, we suggest using parameters pre-trained from sequenced genomes for that group. Tests on simulated data from the EcoGene database show that 200 samples are sufficient for a good estimate of the parameters. This is the default value that MetaTISA uses to decide whether to self-train the parameters or to use pre-trained ones. Default: 200
Binning Method
At this stage we have implemented the k-mer method for binning. Tests on simulated data showed that accuracy is slightly higher for k > 9, but a smaller k greatly reduces memory use, computation time and download time. Default: 9-mer
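The parameter count mentioned above (4 values per nucleotide position) and the 200-sample rule can be made concrete with a Python sketch. It estimates position-specific nucleotide frequencies from aligned windows around start codons and falls back to pre-trained values when a bin has too few CDSs. The function names, the add-one pseudocount and the toy windows are assumptions for illustration, not MetaTISA's actual training code.

```python
def position_frequencies(windows, pseudocount=1.0):
    """Estimate 4 nucleotide frequencies per position from equal-length windows."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    matrix = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for window in windows:
            base = window[pos].upper()
            if base in counts:  # ambiguous letters such as N are ignored
                counts[base] += 1
        total = sum(counts.values())
        matrix.append({base: c / total for base, c in counts.items()})
    return matrix

def choose_parameters(windows, pretrained, min_cds=200):
    """Self-train when the bin has at least min_cds windows, else use pre-trained."""
    if len(windows) >= min_cds:
        return position_frequencies(windows)
    return pretrained

toy_windows = ["AC", "AC", "AG"]
print(position_frequencies(toy_windows)[0]["A"])  # (3 + 1) / (3 + 4) = 4/7
```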
Start and Stop Codon
Start codons: ATG, CTG, GTG, TTG;
Stop codons: TAA, TGA and TAG;
TGA is not set as stop codon in: Mycoplasma, Acholeplasma, Aster, Onion and Ureaplasma.
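The codon sets above determine which triplets count as candidate starts. The Python sketch below lists in-frame candidate start positions for a forward-strand CDS, after extending the reading frame upstream until an in-frame stop codon or the fragment edge is reached. It is a simplified illustration with 0-based coordinates and a hypothetical function name; reverse-strand handling is omitted, and passing a reduced stop set (for example without TGA for Mycoplasma) is left to the caller.

```python
START_CODONS = frozenset({"ATG", "CTG", "GTG", "TTG"})
STOP_CODONS = frozenset({"TAA", "TGA", "TAG"})

def candidate_starts(seq, cds_start, cds_end, stop_codons=STOP_CODONS):
    """List in-frame candidate start positions (0-based) for a forward-strand CDS.

    cds_start: 0-based index of the first base of the annotated CDS.
    cds_end: exclusive end index; the last three bases are the stop codon.
    """
    seq = seq.upper()
    pos = cds_start
    # Walk upstream codon by codon until an in-frame stop or the fragment edge.
    while pos - 3 >= 0 and seq[pos - 3:pos] not in stop_codons:
        pos -= 3
    # Scan every in-frame triplet up to, but excluding, the stop codon.
    return [i for i in range(pos, cds_end - 3, 3) if seq[i:i + 3] in START_CODONS]

# TAA | ATG AAA GTG CCC TAA : annotated start at GTG (index 9).
print(candidate_starts("TAAATGAAAGTGCCCTAA", 9, 18))  # [3, 9]
```

If the upstream walk stops at the fragment edge rather than at a stop codon, the true start may lie outside the fragment, which is the incomplete 5'-end case.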
|
©2008 MetaTISA