General description

Computational gene finding is receiving increasing attention in metagenomic sequencing projects. Accurate prediction of both the 5' end and the 3' end of a gene is important for the experimental identification of purified protein products, as well as for the prediction of promoters and operons. However, almost all current algorithms fail to locate the translation initiation sites (TISs) of protein-coding genes accurately; they usually perform poorly on experimentally verified gene data. To address this problem, we provide MetaTISA, a tool that post-processes the predictions of metagenomic gene finders with the aim of improving TIS prediction.

At present, it supports MetaGene (Noguchi et al., 2006, 2008); support for other gene finders is planned for later releases.

To facilitate the study of the diversity of translation mechanisms among different environments, we provide a MATLAB script that visualizes the position weight matrix generated for each binned cluster.

Access

Online Prediction

Download stand-alone MetaTISA (Linux/Windows).

Related resources

TiCo: a web server for TIS prediction in complete microbial genomes

ProTISA: a database of TIS annotation for microbial genomes

Please Cite
Gang-Qing Hu, Jiang-Tao Guo, Yong-Chu Liu, and Huaiqiu Zhu. MetaTISA: Metagenomic Translation Initiation Site Annotator for improving gene start prediction. Bioinformatics 2009, 25(14): 1843-1845.

References

  1. Hu, G.-Q., Zheng, X.-B., Zhu, H. and She, Z.S. (2009) Prediction of translation initiation site for microbial genomes with TriTISA. Bioinformatics, 25(1): 123-125.

  2. Hoff, K.J., Tech, M., Lingner, T., Daniel, R., Morgenstern, B. and Meinicke, P. (2008) Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics, 9: 217.

  3. Noguchi, H., Taniguchi, T. and Itoh, T. (2008) MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res., 15: 387-396.

  4. Noguchi, H., Park, J. and Takagi, T. (2006) MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res., 34: 5623-5630.

  5. Sandberg, R., Winberg, G., Branden, C.-I., Kaske, A., Ernberg, I. and Coster, J. (2001) Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier. Genome Res., 11: 1404-1409.

Algorithm Description



MetaTISA is designed to post-process the output of an existing metagenomic gene annotation pipeline, with the aim of improving its prediction accuracy for translation initiation sites (TISs). It takes as input a set of shotgun fragments and a set of CDS annotations, and outputs refined initiation sites.

The method predicts TISs in two steps. First, it uses binning techniques to classify all input fragments. Fragments binned into the same clade are assumed to have closely related phylogenetic origins, and hence to share a similar mechanism of translation initiation. Then a modified version of the recently developed tool TriTISA predicts the TISs for each clade of fragments in an unsupervised, iterative manner.

   

  • Binning Method

Binning assigns an anonymous sequence fragment to a phylogenetic group. Published binning methods include BLAST, k-mer (Sandberg et al., 2001), SOM (Abe et al., 2003), PhyloPythia (McHardy et al., 2007) and TETRA (Teeling et al., 2004), among others. At this stage, we have implemented the k-mer method, and we plan to include other methods in the future.

The k-mer method is supervised and employs a naive Bayesian classifier for sequence binning. The training data are the k-mer frequencies compiled for each phylogenetic clade. Given a sequence fragment, the k-mer method calculates a likelihood score for each clade from the pre-compiled frequencies, and assigns the fragment to the clade with the highest likelihood score (Sandberg et al., 2001). We prepared the training sets from completely sequenced genomes, choosing one genome per genus to reduce redundancy (the genus list). Each genome represents a phylogenetic clade, so fragments are binned at the genus level.
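The scoring rule described above can be illustrated with a minimal sketch. It is not the MetaTISA implementation: the function names, the pseudocount smoothing and the toy training sequences are assumptions made for illustration, and equal priors are assumed for all clades.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train_clade(genome, k, pseudocount=1.0):
    """Compile smoothed log-frequencies of k-mers for one clade.

    The pseudocount is an assumption of this sketch; it keeps unseen
    k-mers from receiving zero probability.
    """
    counts = kmer_counts(genome, k)
    total = sum(counts.values()) + pseudocount * 4 ** k
    log_freq = {kmer: math.log((n + pseudocount) / total)
                for kmer, n in counts.items()}
    unseen = math.log(pseudocount / total)
    return log_freq, unseen


def bin_fragment(fragment, models, k):
    """Return the clade whose model gives the highest log-likelihood."""
    frag_counts = kmer_counts(fragment, k)
    best_clade, best_score = None, -math.inf
    for clade, (log_freq, unseen) in models.items():
        score = sum(n * log_freq.get(kmer, unseen)
                    for kmer, n in frag_counts.items())
        if score > best_score:
            best_clade, best_score = clade, score
    return best_clade


# Toy "genomes" standing in for one reference genome per genus.
models = {
    "AT_rich": train_clade("ATATATTTAAATATTATAAATTTATATAAT" * 20, k=3),
    "GC_rich": train_clade("GCGCCGGGCCGCGGCGCCCGGGCGCGCCGG" * 20, k=3),
}
print(bin_fragment("TTATATAAATAT", models, k=3))  # AT_rich
```

In practice k would be much larger (the default here is 9) and the models would be loaded from pre-compiled files rather than trained on the fly.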

  • TIS Prediction

We recently proposed an unsupervised method, TriTISA, for TIS prediction in microbial genomes (Hu et al., 2009). The method classifies all TIS candidates into three categories: true TISs, false TISs upstream of the true TIS (in the noncoding region), and false TISs downstream of the true TIS (in the coding region). The sequence features around the TISs of each category are characterized by a non-homogeneous Markov model. The three models are trained by an iterative self-learning procedure. At each step, the Markov models are combined in a Bayesian framework that assigns three posterior probabilities to each candidate TIS: the probability that it is a true TIS (Pt), that it lies in the noncoding region (Pnc), and that it lies in the coding region (Pc). TriTISA predicts the candidate with the highest Pt score as the TIS of a gene. The updated annotation constitutes the training set for the next iteration. To further improve prediction accuracy, TriTISA employs a cascade of Markov models of different orders: it first uses a 0th-order Markov model for initial refinement, and then moves to higher-order (1st and 2nd) Markov models in later iterations. Tests on simulated data and on experimentally verified data show that TriTISA produces more accurate and robust predictions than other state-of-the-art methods (Hu et al., 2009).

Here, we modified the TriTISA algorithm to post-process annotations for binned metagenomic fragments. CDSs from fragments in the same bin are assumed to share similar translation initiation machinery, so that the sequence pattern of each set of TISs is homogeneous across the clade. These assumptions allow the parameters to be trained in the same way as for a single genome. CDSs are extended to the 5'-most candidate start before post-processing, and CDSs that are complete at their 5' ends are used for parameter training. With the converged parameters, TriTISA calculates three scores for each candidate TIS: Pt, Pnc and Pc. For CDSs that are complete at their 5' ends, the start codon is predicted as the candidate start with the highest Pt score. For CDSs that are incomplete at their 5' ends, we need to estimate whether the start codon is missing from the fragment; in other words, does the 5'-most start-codon-like triplet belong to the coding region? We estimate the distribution of Pc from the training set, which readily gives a threshold for deciding whether a candidate lies in the coding region (at the 95% confidence level). For CDSs that are estimated to contain their start codons, we predict the TIS by the same procedure as for the training CDSs.
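The Bayesian combination step can be sketched as follows. This is a simplified illustration under stated assumptions, not the TriTISA code: it uses only a 0th-order, position-specific model, and the category names, toy probabilities and priors are made up for the example.

```python
import math


def log_likelihood(window, model):
    """Log-likelihood of a window under a 0th-order position-specific model.

    model[i] maps each nucleotide to its probability at position i.
    """
    return sum(math.log(model[i][base]) for i, base in enumerate(window))


def posteriors(window, models, priors):
    """Combine per-category likelihoods with priors via Bayes' rule."""
    logs = {cat: math.log(priors[cat]) + log_likelihood(window, models[cat])
            for cat in models}
    top = max(logs.values())  # subtract the maximum for numerical stability
    weights = {cat: math.exp(v - top) for cat, v in logs.items()}
    norm = sum(weights.values())
    return {cat: w / norm for cat, w in weights.items()}


# Toy two-position models (illustrative numbers only).
uniform = {b: 0.25 for b in "ACGT"}
models = {
    "true": [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
             {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}],
    "noncoding": [uniform, uniform],
    "coding": [uniform, uniform],
}
priors = {"true": 1 / 3, "noncoding": 1 / 3, "coding": 1 / 3}

scores = posteriors("AG", models, priors)
print(max(scores, key=scores.get))  # true
```

The three returned values play the roles of Pt, Pnc and Pc; for a gene, the candidate with the highest Pt would be selected.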
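The 5' extension step might look like the sketch below for a forward-strand CDS. The function name, the 0-based coordinates and the way completeness is reported are assumptions made for illustration; reverse-strand CDSs would first need to be reverse-complemented.

```python
START_CODONS = {"ATG", "CTG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TGA", "TAG"}


def extend_to_5prime_most(seq, start):
    """Return (new_start, complete) for a forward-strand CDS.

    start is the 0-based index of the currently annotated start codon.
    The walk moves upstream in frame until it meets a stop codon or the
    fragment edge, remembering the most upstream start-codon-like triplet.
    complete is False when the fragment edge is reached without meeting
    an in-frame stop codon, i.e. the true start may lie off the fragment.
    """
    best = start
    pos = start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3].upper()
        if codon in STOP_CODONS:
            return best, True
        if codon in START_CODONS:
            best = pos
        pos -= 3
    return best, False


print(extend_to_5prime_most("TAAATGAAAGTGCCCTAA", 9))  # (3, True)
print(extend_to_5prime_most("CATGAAAGTGCCC", 7))       # (1, False)
```

CDSs reported as incomplete are the ones for which the Pc threshold described above would be consulted.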

Figure 1 Program flow chart of MetaTISA



Performance Evaluation




Because experimentally verified TISs are lacking in metagenome projects, the only way to evaluate prediction performance reliably is to simulate a metagenome from artificial shotgun sequences of complete microbial genomes. The validity of the k-mer method and of the TriTISA method has been documented previously (Sandberg et al., 2001; Hu et al., 2009). Here we tested their combined effect on TIS prediction for metagenomes, using shotgun sequences simulated from 95 randomly selected genomes plus 5 genomes for which experimentally verified TISs are available (Hu et al., 2009). Two simulation sets were created with different fragment lengths: L = 700 bps and L = 400 bps. We selected genes with experimentally verified TISs from the five genomes as benchmarks. Since many of their start codons are absent from the fragments, we measure accuracy by sensitivity (sn), TP/(TP+FN), and specificity (sp), TN/(TN+FP), where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. We demonstrate the performance of MetaTISA when used to process the output of the newest version of MetaGene, namely MetaGeneAnnotator or MGA (Noguchi et al., 2008) (Tables 1 and 2). Similar improvements are obtained for Neural Net (Hoff et al., 2008) (data not shown).
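The two accuracy measures can be written out as a short sketch. The counts in the example are hypothetical and serve only to show the arithmetic.

```python
def sensitivity(tp, fn):
    """sn = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """sp = TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical counts, not taken from the evaluation.
print(f"sn = {100 * sensitivity(tp=90, fn=10):.2f}%")
print(f"sp = {100 * specificity(tn=990, fp=10):.2f}%")
```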

Table 1. Accuracies calculated according to the RefSeq whole genome annotation

                    Fragment length = 700 bps            Fragment length = 400 bps
Genomes             #     MGA sn/sp    MTS sn/sp         #     MGA sn/sp    MTS sn/sp
A. pernix           1309  60.25/98.58  80.29/99.04       980   60.98/98.74  79.45/98.80
Synechocystis sp.   2196  78.08/99.34  78.14/99.18       1608  80.20/99.42  79.11/99.07
E. coli             3195  87.65/99.70  91.27/99.65       2455  86.97/99.71  90.74/99.48
N. pharaonis        1978  83.78/99.21  88.50/99.20       1502  83.55/99.28  87.74/98.91
H. salinarum        1626  79.64/99.34  88.45/99.46       1234  78.18/99.39  88.02/99.30
Weighted average    -     80.12/99.33  86.10/99.35       -     80.24/99.40  85.90/99.17
MGA: MetaGeneAnnotator; MTS: MetaTISA; sn/sp: sensitivity/specificity. Accuracies are calculated over 5 simulation replicates. Figures for the other 95 genomes are included here (fragment length = 700 bps; fragment length = 400 bps), but these accuracies should be interpreted with caution because RefSeq TIS annotation is not of high quality (Hu et al., 2008).

Table 2. Accuracies calculated according to experimentally verified TISs

                    Fragment length = 700 bps            Fragment length = 400 bps
Genomes             #     MGA sn/sp    MTS sn/sp         #     MGA sn/sp    MTS sn/sp
A. pernix           103   64.42/98.71  94.52/99.61       78    60.54/98.70  92.29/99.25
Synechocystis sp.   92    83.53/99.25  81.33/98.93       75    82.20/99.29  82.56/98.85
E. coli             733   89.56/99.76  93.77/99.72       562   87.06/99.75  93.44/99.60
N. pharaonis        248   91.27/99.55  97.04/99.58       184   90.74/99.53  95.86/99.11
H. salinarum        428   85.72/99.54  96.04/99.71       337   82.43/99.51  94.65/99.48
Weighted average    -     86.84/99.57  94.21/99.64       -     84.37/99.56  93.40/99.43
MGA: MetaGeneAnnotator; MTS: MetaTISA; sn/sp: sensitivity/specificity. Accuracies are calculated over 5 simulation replicates.
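The "Weighted average" row appears to weight each genome by its count in the "#" column. This is an assumption, since the weighting is not stated, but the sketch below reproduces the reported MGA sensitivity at 700 bps in Table 2. Other cells may differ in the last digit because the per-genome values are already rounded.

```python
def weighted_average(values, weights):
    """Average of values weighted by the given counts."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Table 2, fragment length = 700 bps, MGA sensitivity per genome.
counts = [103, 92, 733, 248, 428]
mga_sn = [64.42, 83.53, 89.56, 91.27, 85.72]

print(round(weighted_average(mga_sn, counts), 2))  # 86.84
```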


References
  1. Abe, T., Kanaya, S., Kinouchi, M., Ichiba, Y., Kozuki, T. and Ikemura, T. (2003) Informatics for unveiling hidden genome signatures. Genome Res., 13: 693-702.

  2. Hu, G.-Q., Zheng, X.-B., Ju, L.-N., Zhu, H. and She, Z.S. (2008) Computational evaluation of TIS annotation for prokaryotic genomes. BMC Bioinformatics, 9: 160.

  3. Hu, G.-Q., Zheng, X.-B., Zhu, H. and She, Z.S. (2009) Prediction of translation initiation site for microbial genomes with TriTISA. Bioinformatics, 25(1): 123-125.

  4. Hoff, K.J., Tech, M., Lingner, T., Daniel, R., Morgenstern, B. and Meinicke, P. (2008) Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics, 9: 217.

  5. McHardy, A.C., Martin, H.G., Tsirigos, A., Hugenholtz, P. and Rigoutsos, I. (2007) Accurate phylogenetic classification of variable-length DNA fragments. Nature Methods, 4: 63-72.

  6. Noguchi, H., Taniguchi, T. and Itoh, T. (2008) MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res., 15: 387-396.

  7. Noguchi, H., Park, J. and Takagi, T. (2006) MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res., 34: 5623-5630.

  8. Sandberg, R., Winberg, G., Branden, C.-I., Kaske, A., Ernberg, I. and Coster, J. (2001) Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier. Genome Res., 11: 1404-1409.

  9. Teeling, H., Waldmann, J., Lombardot, T., Bauer, M. and Glockner, F.O. (2004) TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 5: 163.

Download stand-alone MetaTISA
 
  • CDSFormatConverter

  • Linux Release of MetaTISA

  • Windows Release of MetaTISA

  • Visualization of PWM


CDSFormatConverter

We provide tools that convert the CDS output of a gene finder into the format accepted by MetaTISA.


Linux Release of MetaTISA

The MetaTISA package contains the following items:

  • Source code of MetaTISA

  • Default settings file of MetaTISA

  • Pre-calculated TIS parameters in folder "parameter" (for groups with too few samples for good self-training of parameters)

  • k-mer bin model file and taxonomic map file in folder "bin_model"

  • Example file in folder "example"

  • Readme file

MetaTISA (tarball)



Windows Release of MetaTISA

The MetaTISA package contains the following items:

  • Executable file of MetaTISA

  • Default settings file of MetaTISA

  • Pre-calculated TIS parameters in folder "parameter" (for groups with too few samples for good self-training of parameters)

  • k-mer bin model file and taxonomic map file in folder "bin_model"

  • Example file in folder "example"

  • Readme file

MetaTISA (zip)

Download the Windows source code here (zip)


Visualization of PWM

The PWM visualization package contains the following items:
  • MATLAB script to visualize the PWM around the TIS

  • Visualization of pre-calculated PWM

  • Readme file

PWM Visualization (tarball)

Input and Output formats

MetaTISA Input Format

  • CDS Annotation Format

  • Sequence Format

MetaTISA Output Format                
  • MED Format

  • GFF Format


       

MetaTISA Input Format

CDS Annotation Format

One accepted input format for CDS annotation is the output of MetaGene prediction. The other accepted format is our own MED format: a sequence fragment ID beginning with ">", followed by the coordinates of all CDSs in the fragment, one CDS per line. The coordinates of a CDS define its nucleotide region; translation of the CDS into an amino acid sequence starts from the first position for the positive strand and from the second position for the negative strand. To help users post-process results from other gene prediction tools, we also provide several tools that convert gene-finder output to the MED format (converting formats).

Example (MED format):

>NC_000913_1
	  2     109   -
	  124     699   -
>NC_000913_2
	  3     305   +
	   539    700   +
>NC_000913_3
	  1     411   -
	   442    699   -
>NC_000913_4
	  2     685   -
>NC_000913_5
	  1     699   +
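A reader for the MED format shown above could be sketched as follows. The function names and the returned data structure are illustrative assumptions; only the layout (an ID line starting with ">", then one CDS per line with two coordinates and a strand) is taken from the description.

```python
def parse_med(lines):
    """Parse MED-format lines into {fragment_id: [(left, right, strand), ...]}."""
    annotations = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            annotations[current] = []
        else:
            if current is None:
                raise ValueError("CDS line found before any fragment ID")
            left, right, strand = line.split()[:3]
            annotations[current].append((int(left), int(right), strand))
    return annotations


def translation_start(left, right, strand):
    """Position where translation begins: first coordinate on '+', second on '-'."""
    return left if strand == "+" else right


example = """>NC_000913_1
\t  2     109   -
\t  124     699   -
>NC_000913_2
\t  3     305   +
\t   539    700   +
""".splitlines()

cds = parse_med(example)
print(cds["NC_000913_2"])                    # [(3, 305, '+'), (539, 700, '+')]
print(translation_start(*cds["NC_000913_1"][0]))  # 109
```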


Sequence Format

The metagenome sequences are in FASTA format. The first line of each record is recognized as the unique ID of the sequence fragment.

Example:

>mgutLn1_U_BL_aaa09a05_b1 Mouse Gut Community PT3 : mgutLn1_U_BL_aaa09a05_b1
AAATCTCGCCCTGTGGTGGATTCCTTTTCCCATTGCCCGATCTTATTTTT
ATCTTCCAAAAATGACGAACTGGACAAGATCCTGGGTCTTTCGGTGGGCG
GGGATGATTACGTGGCAAAGCCGTTCAGCCCGAAGGAGATCGCGTATCGG
GTCAAGGCGCAGCTCCGGCGGGCCGCGTATCAGCAAGACCCGTCGGAGGA
GGAGCTCATAAAAACAGGGGAATTGGAAATTGACGTGGAGGGCTGCAGGG
TCACAAAAGGCGGCAGCCCCATAGAACTGACCGCGCGGGAATTTGAAATC
CTGCGGTATCTGGCGGAAAATCAAGGCCGGGTCATCAGCCGCGAACGCTT
ATATGAAACCATCTGGGGCGAGGACAGCTTCGGGTGCGACAATACGGTCA
TGGTGCATATCCGGCATCTGCGTGAAAAAATAGAGGACGATCCCGCGGCG
CCCCGATACATCATCACGATGAAAGGATTAGGCTATAAGCTGGTGGACCC
TTATGAAGAATAAAAGCGATCTCAATCTGTTTTTTCGTTCGTTCGGCATT
GTCGTGATTGTGATCTTCGCGGCCATTGCAGCGGGGATATGCCTGTTTTA
TTATGTGTTCGCGATTCCGGCGCGGGAGGGACTCAGCCTGGCCTCATGGC
CAGACGTGTATACAGACAATTTTTCCCTTCAGCTTGAAGAAGAACAGGGA
GAGCTTAAAGTAAAAGAATTCGGGATTGAAGATCTGGACCGGTATGGCTT
ATGGCTGCAGGTGATCGATGAAACGGGACAGGAGTTTTTTCACACAATAA
GCCGGAGACCTGTCCCAACAGCTATACGGCCTCGAGCTTTTGGCATTCGG
GTACGAACGTTTA
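For quick inspection of such input, a minimal FASTA reader might look like this. Taking the ID as the text up to the first whitespace on the header line is an assumption of this sketch; the exact rule MetaTISA applies to the first line is not specified here.

```python
def read_fasta(lines):
    """Read FASTA lines into {sequence_id: sequence}.

    Assumption: the ID is the header text up to the first whitespace.
    """
    records = {}
    seq_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records[seq_id] = "".join(chunks)
            header = line[1:].split()
            seq_id = header[0] if header else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        records[seq_id] = "".join(chunks)
    return records


example = [
    ">mgutLn1_U_BL_aaa09a05_b1 Mouse Gut Community PT3 : mgutLn1_U_BL_aaa09a05_b1",
    "AAATCTCGCCCTGTGGTGGATTCCTTTTCCCATTGCCCGATCTTATTTTT",
    "ATCTTCCAAAAATGACGAACTGGACAAGATCCTGGGTCTTTCGGTGGGCG",
]
records = read_fasta(example)
print(len(records["mgutLn1_U_BL_aaa09a05_b1"]))  # 100
```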




MetaTISA Output Format

MED Format

MED format gives a sequence ID and the corresponding CDS annotation, as described above. View the example.

GFF Format

The output in GFF (General Feature Format) follows the specifications of the Sanger Institute:

<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]


Example:

##gff-version  3
##MetaTISA
##metagenome sequence name: NC_000913_45
NC_000913_45   MetaTISA        CDS     3       395     .       -       .
##gff-version  3
##MetaTISA
##metagenome sequence name: NC_000913_255
NC_000913_255  MetaTISA        CDS     2       298     .       -       .
NC_000913_255  MetaTISA        CDS     320     616     .       -       .
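The GFF output shown above can be read back with a few lines of code. This sketch splits on any whitespace and keeps only the eight leading columns, which matches the example but is an assumption; files with attribute columns that contain spaces would need tab-based splitting instead. The class and function names are illustrative.

```python
from typing import NamedTuple


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str


def parse_gff(lines):
    """Parse feature lines, skipping blank lines and '#' comment lines."""
    features = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 columns: {line!r}")
        seqname, source, feature, start, end, score, strand, frame = fields[:8]
        features.append(GffFeature(seqname, source, feature,
                                   int(start), int(end), score, strand, frame))
    return features


example = [
    "##gff-version  3",
    "##MetaTISA",
    "##metagenome sequence name: NC_000913_255",
    "NC_000913_255  MetaTISA        CDS     2       298     .       -       .",
    "NC_000913_255  MetaTISA        CDS     320     616     .       -       .",
]
features = parse_gff(example)
print([(f.start, f.end, f.strand) for f in features])
# [(2, 298, '-'), (320, 616, '-')]
```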



Configuration parameters

       
  • Sequence region around start codon

  • Max order of Markov model

  • Minimal # of CDS in training

  • Binning Method

  • Start and Stop Codon


       

Sequence region around start codon

These parameters specify the region of sequence around the start codon that is used to calculate the position weight matrix:

  • upstream
    Number of nucleotides upstream of the start codon
    Default: 50 nucleotides

  • downstream
    Number of nucleotides downstream of the start codon
    Default: 15 nucleotides
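How these two settings could define a window and a position weight matrix is sketched below. Two points are assumptions of the sketch rather than documented behaviour: the window includes the start codon itself, and "downstream" is counted from the base after the start codon. The pseudocount is likewise only illustrative.

```python
from collections import Counter


def start_window(seq, start, upstream=50, downstream=15):
    """Window around a forward-strand start codon at 0-based index start.

    Returns None when the window does not fit inside the fragment.
    """
    left = start - upstream
    right = start + 3 + downstream
    if left < 0 or right > len(seq):
        return None
    return seq[left:right].upper()


def position_weight_matrix(windows, pseudocount=1.0):
    """Per-position nucleotide frequencies over equal-length windows."""
    length = len(windows[0])
    total = len(windows) + 4 * pseudocount
    pwm = []
    for i in range(length):
        counts = Counter(w[i] for w in windows)
        pwm.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return pwm


print(start_window("CCATGAAA", 2, upstream=2, downstream=1))  # CCATGA
pwm = position_weight_matrix(["ACGT", "ACGA"])
print(pwm[0]["A"])  # 0.5
```

With the default settings each window would span 50 + 3 + 15 = 68 nucleotides under these assumptions.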




Max order of Markov model

When training parameters, TriTISA uses a cascade of Markov models from lower to higher order. By default, three models are cascaded, namely 0th, 1st and 2nd order. This option sets the maximum order of the Markov model.

Default: 2
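To make the role of the order concrete, the sketch below estimates a position-specific (non-homogeneous) Markov model of a given order from a set of aligned windows, and lists the orders a cascade would visit. It is an illustration under assumptions (pseudocount smoothing, shorter contexts near the left edge), and it omits the iterative self-training loop entirely.

```python
from collections import defaultdict


def train_position_model(windows, order, pseudocount=1.0):
    """Estimate P(base at i | previous `order` bases) separately for each position."""
    length = len(windows[0])
    model = []
    for i in range(length):
        k = min(order, i)  # fewer context bases are available near the left edge
        counts = defaultdict(lambda: defaultdict(float))
        for w in windows:
            counts[w[i - k:i]][w[i]] += 1
        table = {}
        for context, c in counts.items():
            total = sum(c.values()) + 4 * pseudocount
            table[context] = {b: (c[b] + pseudocount) / total for b in "ACGT"}
        model.append(table)
    return model


def cascade_orders(max_order=2):
    """Orders visited by the cascade, from lowest to highest."""
    return list(range(max_order + 1))


windows = ["AC", "AC", "AG"]
for order in cascade_orders(1):
    model = train_position_model(windows, order)
    print(order, sorted(model[1]))
```

At order 0 every position has a single empty context (four probabilities per position); at order 1 the second position is conditioned on the preceding base.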




Minimal # of CDS in training

Since our method begins with a 0th-order Markov model, it has as few as 4 parameters at each nucleotide position, plus 3 prior probabilities, to be estimated. If the training set of a binned group is nevertheless too small, we suggest using parameters pre-trained on sequenced genomes for this group. Tests on simulated data from the EcoGene database show that 200 samples are sufficient for a good estimate of the parameters. This is the default value used by MetaTISA to decide whether to self-train the parameters or to use pre-trained parameters.

Default: 200
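The decision this threshold controls can be sketched as below. The function name, the return labels and the bin sizes are hypothetical, and treating exactly 200 CDSs as sufficient is an assumption.

```python
def choose_parameter_source(n_training_cds, min_cds=200):
    """Decide between self-training and pre-trained parameters for one bin."""
    return "self-train" if n_training_cds >= min_cds else "pre-trained"


# Hypothetical numbers of 5'-complete CDSs per binned group.
bin_sizes = {"Escherichia": 1250, "Aeropyrum": 85, "Halobacterium": 200}
plan = {genus: choose_parameter_source(n) for genus, n in bin_sizes.items()}
print(plan)
```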



Binning Method

At this stage, we have implemented the k-mer method for binning. Tests on simulated data showed that the accuracy is slightly higher for k > 9, but a smaller k greatly reduces memory use, computation time and download time.

Default: 9-mer
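The cost of a larger k follows from the number of distinct k-mers, which is 4^k per clade. The sketch below only counts table entries; the actual memory footprint depends on how the model files store them, which is not described here.

```python
def kmer_table_size(k):
    """Number of distinct DNA k-mers, i.e. frequency entries per clade."""
    return 4 ** k


for k in (8, 9, 10, 11):
    print(k, kmer_table_size(k))
# 9 -> 262144 entries; each extra base multiplies the table by 4
```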




Start and Stop Codon

  • Start codons: ATG, CTG, GTG, TTG;

  • Stop codons: TAA, TGA and TAG;

  • TGA is not set as stop codon in: Mycoplasma, Acholeplasma, Aster, Onion and Ureaplasma.
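The codon settings above can be expressed as a small lookup. The names are illustrative, and the group labels are copied as listed; how MetaTISA matches a bin to these groups internally is not specified here.

```python
START_CODONS = ("ATG", "CTG", "GTG", "TTG")
DEFAULT_STOP_CODONS = frozenset({"TAA", "TGA", "TAG"})
TGA_NOT_STOP = frozenset({"Mycoplasma", "Acholeplasma", "Aster", "Onion", "Ureaplasma"})


def stop_codons_for(group):
    """Stop codons for a binned group; TGA is dropped for the listed groups."""
    if group in TGA_NOT_STOP:
        return DEFAULT_STOP_CODONS - {"TGA"}
    return DEFAULT_STOP_CODONS


print(sorted(stop_codons_for("Mycoplasma")))   # ['TAA', 'TAG']
print(sorted(stop_codons_for("Escherichia")))  # ['TAA', 'TAG', 'TGA']
```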



Contact us


©2008 MetaTISA