MetaTISA-ZhuLab

General description

Computational gene finding is receiving increasing attention in metagenomic sequencing projects. Accurate prediction of both the 5'-end and the 3'-end of a gene is important for the experimental identification of purified protein products, as well as for the prediction of promoters and operons. However, almost all current algorithms locate the translation initiation sites (TISs) of protein-coding genes inaccurately; they usually perform poorly on experimentally verified gene data. To address this problem, we provide MetaTISA, a tool that post-processes the predictions of metagenomic gene finders with the aim of improving TIS prediction.

At present, MetaTISA provides an option for MetaGene (Noguchi et al., 2006, 2008); support for other gene finders is planned for later releases.

To facilitate the study of the diversity of translation mechanisms among different environments, we provide a MATLAB script that visualizes the positional weight matrix generated for each binned cluster.

Access

Online Prediction

Download stand-alone MetaTISA (Linux/Windows).

Related resources

TiCo: a webserver for TIS prediction in complete microbial genomes

ProTISA: a database of TIS annotations for microbial genomes

Please Cite

Gang-Qing Hu, Jiang-Tao Guo, Yong-Chu Liu, and Huaiqiu Zhu. MetaTISA: Metagenomic Translation Initiation Site Annotator for improving gene start prediction. Bioinformatics 2009, 25(14): 1843-1845.

References

Hu, G.-Q., Zheng, X.-B., Zhu, H. and She, Z.S. (2009) Prediction of translation initiation site for microbial genomes with TriTISA. Bioinformatics, 25(1): 123-125.
Hoff, K.J., Tech, M., Lingner, T., Daniel, R., Morgenstern, B. and Meinicke, P. (2008) Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics, 9: 217.
Noguchi, H., Taniguchi, T. and Itoh, T. (2008) MetaGeneAnnotator: Detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res., 15: 387-396.
Noguchi, H., Park, J. and Takagi, T. (2006) MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res., 34: 5623-5630.
Sandberg, R., Winberg, G., Branden, C.-I., Kaske, A., Ernberg, I. and Coster, J. (2001) Capturing whole-genome characteristics in short sequences using a Naive Bayesian classifier. Genome Res., 11: 1404-1409.

Algorithm Description

MetaTISA is designed to post-process the output of an existing metagenomic gene annotation pipeline, with the aim of improving the accuracy of its translation initiation site (TIS) predictions. It takes as input a set of shotgun fragments and a set of CDS annotations, and outputs refined initiation sites.

The method predicts TISs in two steps. First, it uses binning techniques to classify all input fragments. Fragments binned into the same clade are assumed to have a close phylogenetic origin and hence to share a similar translation initiation mechanism. Then a modified version of the recently developed tool TriTISA predicts the TISs for each clade of fragments in an unsupervised, iterative manner.

Binning Method

Binning assigns an anonymous sequence fragment to a phylogenetic group. Published binning methods include BLAST, k-mer (Sandberg et al., 2001), SOM (Abe et al., 2003), PhyloPythia (McHardy et al., 2007), and TETRA (Teeling et al., 2004). At this stage we have implemented the k-mer method, and we plan to include other methods in the future.

The k-mer method is supervised and employs a naive Bayesian classifier for sequence binning. The training data are the k-mer frequencies compiled for each phylogenetic clade. Given a sequence fragment, the method calculates a likelihood score for each clade from the pre-compiled frequencies and assigns the fragment to the clade with the highest score (Sandberg et al., 2001). We prepared the training sets from completely sequenced genomes and chose one genome per genus to reduce redundancy (the genus list). Each genome represents a phylogenetic clade, so fragments are binned at the genus level.
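The classification rule described above can be illustrated with a minimal sketch. It is not the MetaTISA implementation: the function names, the pseudocount smoothing, the choice of k and the toy training sequences are all assumptions made for illustration, and the real tool reads its frequencies from the pre-compiled bin model files.

```python
import math
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping k-mers that consist only of A, C, G and T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def train_clade(genome, k, pseudocount=1.0):
    """Return the log-frequency of every possible k-mer for one clade.

    The pseudocount (an assumption of this sketch) keeps unseen k-mers
    from receiving zero probability.
    """
    counts = kmer_counts(genome, k)
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts.values()) + pseudocount * len(kmers)
    return {m: math.log((counts[m] + pseudocount) / total) for m in kmers}


def bin_fragment(fragment, models, k):
    """Assign a fragment to the clade with the highest log-likelihood."""
    counts = kmer_counts(fragment, k)
    scores = {
        clade: sum(n * logp[m] for m, n in counts.items())
        for clade, logp in models.items()
    }
    return max(scores, key=scores.get), scores


# Toy training data standing in for one genome per genus (hypothetical).
models = {
    "at_rich_genus": train_clade("ATATATTTAAATATTA" * 10, k=3),
    "gc_rich_genus": train_clade("GCGCGGCCGCGGGCCC" * 10, k=3),
}
best, scores = bin_fragment("TTAAATATAT", models, k=3)
print(best)
```

Working with log-frequencies turns the product of k-mer probabilities into a sum, which avoids numerical underflow on fragments that are several hundred bases long.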

TIS Prediction

We recently proposed an unsupervised method for TIS prediction in microbial genomes (Hu et al., 2009). The method classifies all TIS candidates into three categories: true TISs, false TISs upstream of the true TIS (in the noncoding region), and false TISs downstream of the true TIS (in the coding region). The sequence features around the TISs of each category are characterized by a non-homogeneous Markov model. The three models are trained by an iterative self-learning procedure. At each step, the Markov models are combined by a Bayesian methodology that assigns three posterior probabilities to each candidate TIS: the probability that it is a true TIS (P_t), that it lies in the noncoding region (P_nc), and that it lies in the coding region (P_c). TriTISA predicts the candidate with the highest P_t as the TIS of a gene. The updated annotation constitutes the training set for the next iteration. To further improve prediction accuracy, TriTISA employs a cascade of Markov models of different orders: it first uses a 0th-order Markov model for the initial refinements and then moves to higher-order (1st and 2nd) Markov models in later iterations. Tests on simulated data and on experimentally verified data show that TriTISA produces more accurate and robust predictions than the state-of-the-art methods (Hu et al., 2009).
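The Bayesian combination step can be sketched as follows. This is a simplified illustration under stated assumptions, not TriTISA's code: the log-likelihood values stand in for the scores that the three Markov models would assign to the sequence window around each candidate, and the uniform prior and the function names are hypothetical.

```python
import math

CATEGORIES = ("t", "nc", "c")  # true TIS, noncoding region, coding region


def posteriors(loglik, prior):
    """Apply Bayes' rule to obtain P_t, P_nc and P_c for one candidate."""
    joint = {c: loglik[c] + math.log(prior[c]) for c in CATEGORIES}
    peak = max(joint.values())  # subtract the maximum for numerical stability
    norm = sum(math.exp(v - peak) for v in joint.values())
    return {c: math.exp(v - peak) / norm for c, v in joint.items()}


def pick_tis(candidates, prior):
    """Return the position of the candidate with the highest P_t."""
    scored = [(pos, posteriors(loglik, prior)) for pos, loglik in candidates]
    return max(scored, key=lambda item: item[1]["t"])[0]


# Hypothetical log-likelihoods for three candidate starts of one gene.
prior = {"t": 1 / 3, "nc": 1 / 3, "c": 1 / 3}
candidates = [
    (10, {"t": -20.0, "nc": -18.0, "c": -25.0}),
    (40, {"t": -15.0, "nc": -22.0, "c": -21.0}),
    (70, {"t": -24.0, "nc": -26.0, "c": -19.0}),
]
print(pick_tis(candidates, prior))
```

In the iterative procedure described above, the selected starts would then be used to re-estimate the three models before the next round.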

Here, we modified the TriTISA algorithm to post-process annotations of binned metagenomic fragments. CDSs from fragments in the same bin are assumed to share similar translation initiation machinery, and the sequence pattern of each set of TISs is assumed to be homogeneous across the clade. These assumptions allow the parameters to be trained in the same way as for a single genome. CDSs are extended to the 5'-most candidate start before post-processing, and CDSs that are complete at their 5'-ends are used for parameter training. With the converged parameters, TriTISA calculates three scores for each candidate TIS: P_t, P_nc and P_c. For CDSs that are complete at their 5'-ends, the start codon is predicted as the candidate start with the highest P_t. For CDSs that are incomplete at their 5'-ends, we need to estimate whether the start codon is missing; in other words, does the 5'-most start-codon-like triplet belong to the coding region? We estimate the distribution of P_c from the training set, which readily yields a threshold for deciding whether a candidate lies in the coding region (at a 95% confidence level). For CDSs that are estimated to contain their start codons, we predict the TIS by the same procedure used for the training CDSs.

Figure 1 Program flow chart of MetaTISA


Performance Evaluation

Because experimentally verified TISs are lacking in metagenome projects, the only way to reliably evaluate prediction performance is to simulate a metagenome from artificial shotgun sequences drawn from complete microbial genomes. The validity of the k-mer method and of the TriTISA method has been documented previously (Sandberg et al., 2001; Hu et al., 2009). Here we tested their combined effect on TIS prediction for metagenomes, using shotgun sequences simulated from 95 randomly selected genomes plus 5 genomes for which experimentally verified TISs are available (Hu et al., 2009). Two sets of simulations were created with different fragment lengths: L = 700 bps and L = 400 bps. We selected genes with experimentally verified TISs from the five genomes as benchmarks. Since many of their start codons are absent from the fragments, we measure accuracy by sensitivity, sn = TP/(TP+FN), and specificity, sp = TN/(TN+FP), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. We demonstrate the performance of MetaTISA on the output of the newest version of MetaGene, namely MetaGeneAnnotator or MGA (Noguchi et al., 2008) (Tables 1 and 2). Similar improvements are obtained for Neural Net (Hoff et al., 2008) (data not shown).

Table 1. Accuracies calculated according to the RefSeq whole genome annotation

	Fragment length = 700 bps			Fragment length = 400 bps
Genomes	#	MGA_sn/sp	MTS_sn/sp	#	MGA_sn/sp	MTS_sn/sp
A. pernix	1309	60.25/98.58	80.29/99.04	980	60.98/98.74	79.45/98.80
Synechocystis sp.	2196	78.08/99.34	78.14/99.18	1608	80.20/99.42	79.11/99.07
E. coli	3195	87.65/99.70	91.27/99.65	2455	86.97/99.71	90.74/99.48
N. pharaonis	1978	83.78/99.21	88.50/99.20	1502	83.55/99.28	87.74/98.91
H. salinarum	1626	79.64/99.34	88.45/99.46	1234	78.18/99.39	88.02/99.30
Weighted average	-	80.12/99.33	86.10/99.35	-	80.24/99.40	85.90/99.17
MGA: MetaGeneAnnotator; MTS: MetaTISA. Accuracies are calculated over 5 simulation replicates. Figures for the other 95 genomes are included here (Fragment length = 700, Fragment length = 400), but the accuracies should be interpreted with caution because the RefSeq annotation of TISs is not of high quality (Hu et al., 2008).

Table 2. Accuracies calculated according to experimentally verified TISs

	Fragment length = 700 bps			Fragment length = 400 bps
Genomes	#	MGA_sn/sp	MTS_sn/sp	#	MGA_sn/sp	MTS_sn/sp
A. pernix	103	64.42/98.71	94.52/99.61	78	60.54/98.70	92.29/99.25
Synechocystis sp.	92	83.53/99.25	81.33/98.93	75	82.20/99.29	82.56/98.85
E. coli	733	89.56/99.76	93.77/99.72	562	87.06/99.75	93.44/99.60
N. pharaonis	248	91.27/99.55	97.04/99.58	184	90.74/99.53	95.86/99.11
H. salinarum	428	85.72/99.54	96.04/99.71	337	82.43/99.51	94.65/99.48
Weighted average	-	86.84/99.57	94.21/99.64	-	84.37/99.56	93.40/99.43
MGA: MetaGeneAnnotator; MTS: MetaTISA. Accuracies are calculated over 5 simulation replicates.
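
For readers who want to recompute the measures, the sketch below implements the sensitivity and specificity definitions given above and a weighted average over genomes. It assumes that the "Weighted average" rows weight each genome by its "#" column; under that assumption, the MGA sensitivity column of Table 2 at 700 bps reproduces the reported 86.84. The function names are illustrative only.

```python
def sensitivity(tp, fn):
    """sn = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """sp = TN / (TN + FP)."""
    return tn / (tn + fp)


def weighted_average(values, weights):
    """Average of values weighted by the per-genome counts."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Table 2, fragment length = 700 bps: "#" column and MGA sensitivity (%).
counts = [103, 92, 733, 248, 428]
mga_sn = [64.42, 83.53, 89.56, 91.27, 85.72]
print(round(weighted_average(mga_sn, counts), 2))
```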


References

Abe, T., Kanaya, S., Kinouchi, M., Ichiba, Y., Kozuki, T. and Ikemura, T. (2003) Informatics for unveiling hidden genome signatures. Genome Res., 13: 693-702.
Hu, G.-Q., Zheng, X.-B., Ju, L.-N., Zhu, H. and She, Z.S. (2008) Computational evaluation of TIS annotation for prokaryotic genomes. BMC Bioinformatics, 9: 160.
Hu, G.-Q., Zheng, X.-B., Zhu, H. and She, Z.S. (2009) Prediction of translation initiation site for microbial genomes with TriTISA. Bioinformatics, 25(1): 123-125.
Hoff, K.J., Tech, M., Lingner, T., Daniel, R., Morgenstern, B. and Meinicke, P. (2008) Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics, 9: 217.
McHardy, A.C., Martin, H.G., Tsirigos, A., Hugenholtz, P. and Rigoutsos, I. (2007) Accurate phylogenetic classification of variable-length DNA fragments. Nature Methods, 4: 63-72.
Noguchi, H., Taniguchi, T. and Itoh, T. (2008) MetaGeneAnnotator: Detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res., 15: 387-396.
Noguchi, H., Park, J. and Takagi, T. (2006) MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res., 34: 5623-5630.
Sandberg, R., Winberg, G., Branden, C.-I., Kaske, A., Ernberg, I. and Coster, J. (2001) Capturing whole-genome characteristics in short sequences using a Naive Bayesian classifier. Genome Res., 11: 1404-1409.
Teeling, H., Waldmann, J., Lombardot, T., Bauer, M. and Glockner, F.O. (2004) TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 5: 163.

Download stand-alone MetaTISA

CDSFormatConverter
Linux Release of MetaTISA
Windows Release of MetaTISA
Visualization of PWM

CDSFormatConverter

We provide tools that convert the CDS output format of a gene finder into the format accepted by MetaTISA.

MetaGeneAnnotator to MED: mga2med (Windows version)
Download source code here (tarball)


Linux Release of MetaTISA

The MetaTISA package contains the following items:

Source code of MetaTISA
Default settings file of MetaTISA
Pre-calculated TIS parameters in folder "parameter" (for groups with too few samples for good self-training of the parameters)
k-mer bin model file and taxonomic map file in folder "bin_model"
Example file in folder "example"
Readme file

MetaTISA (tarball)


Windows Release of MetaTISA

The MetaTISA package contains the following items:

Executable file of MetaTISA
Default settings file of MetaTISA
Pre-calculated TIS parameters in folder "parameter" (for groups with too few samples for good self-training of the parameters)
k-mer bin model file and taxonomic map file in folder "bin_model"
Example file in folder "example"
Readme file

MetaTISA (zip)

Download the source code of the Windows release here (zip)


Visualization of PWM

The PWM visualization package contains the following items:

MATLAB script to visualize the PWM around the TIS
Visualization of pre-calculated PWM
Readme file

PWM Visualization (tarball)


Input and Output formats

MetaTISA Input Format

CDS Annotation Format
Sequence Format

MetaTISA Output Format

MED Format
GFF Format

MetaTISA Input Format

CDS Annotation Format

Two input formats are accepted for CDS annotation. One is the output of a MetaGene prediction. The other is our own MED format: a sequence fragment id beginning with ">", followed by the coordinates of all CDSs in the fragment, one CDS per line. The coordinates of a CDS define its nucleotide region; translation of the CDS into an amino acid sequence starts at the first position for the positive strand and at the second position for the negative strand. To help users post-process results from other gene prediction tools, we provide several tools that convert the output format of a gene finder into the MED format (converting formats).

Example (MED format):

>NC_000913_1
	  2     109   -
	  124     699   -
>NC_000913_2
	  3     305   +
	   539    700   +
>NC_000913_3
	  1     411   -
	   442    699   -
>NC_000913_4
	  2     685   -
>NC_000913_5
	  1     699   +
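
As an illustration of the layout shown in the example, the following sketch reads MED-format text into a dictionary. It assumes whitespace-separated columns in the order left coordinate, right coordinate, strand, as in the example above; the function names are hypothetical and are not part of MetaTISA.

```python
def parse_med(text):
    """Parse MED-format text into {fragment_id: [(left, right, strand), ...]}."""
    annotations = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            annotations[current] = []
        else:
            if current is None:
                raise ValueError("CDS line found before any '>' header")
            left, right, strand = line.split()[:3]
            annotations[current].append((int(left), int(right), strand))
    return annotations


def translation_start(cds):
    """Position where translation begins: left for '+', right for '-'."""
    left, right, strand = cds
    return left if strand == "+" else right


example = """>NC_000913_1
\t  2     109   -
\t  124     699   -
>NC_000913_2
\t  3     305   +
\t   539    700   +
"""
records = parse_med(example)
print(records["NC_000913_2"])
```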

Sequence Format

The metagenome sequences are in FASTA format. The first line of each record (beginning with ">") is taken as the unique id of the sequence fragment.

Example:

>mgutLn1_U_BL_aaa09a05_b1 Mouse Gut Community PT3 : mgutLn1_U_BL_aaa09a05_b1
AAATCTCGCCCTGTGGTGGATTCCTTTTCCCATTGCCCGATCTTATTTTT
ATCTTCCAAAAATGACGAACTGGACAAGATCCTGGGTCTTTCGGTGGGCG
GGGATGATTACGTGGCAAAGCCGTTCAGCCCGAAGGAGATCGCGTATCGG
GTCAAGGCGCAGCTCCGGCGGGCCGCGTATCAGCAAGACCCGTCGGAGGA
GGAGCTCATAAAAACAGGGGAATTGGAAATTGACGTGGAGGGCTGCAGGG
TCACAAAAGGCGGCAGCCCCATAGAACTGACCGCGCGGGAATTTGAAATC
CTGCGGTATCTGGCGGAAAATCAAGGCCGGGTCATCAGCCGCGAACGCTT
ATATGAAACCATCTGGGGCGAGGACAGCTTCGGGTGCGACAATACGGTCA
TGGTGCATATCCGGCATCTGCGTGAAAAAATAGAGGACGATCCCGCGGCG
CCCCGATACATCATCACGATGAAAGGATTAGGCTATAAGCTGGTGGACCC
TTATGAAGAATAAAAGCGATCTCAATCTGTTTTTTCGTTCGTTCGGCATT
GTCGTGATTGTGATCTTCGCGGCCATTGCAGCGGGGATATGCCTGTTTTA
TTATGTGTTCGCGATTCCGGCGCGGGAGGGACTCAGCCTGGCCTCATGGC
CAGACGTGTATACAGACAATTTTTCCCTTCAGCTTGAAGAAGAACAGGGA
GAGCTTAAAGTAAAAGAATTCGGGATTGAAGATCTGGACCGGTATGGCTT
ATGGCTGCAGGTGATCGATGAAACGGGACAGGAGTTTTTTCACACAATAA
GCCGGAGACCTGTCCCAACAGCTATACGGCCTCGAGCTTTTGGCATTCGG
GTACGAACGTTTA
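
A minimal reader for such a file might look like the sketch below. It makes one assumption that the description above does not settle: the id is taken to be the first whitespace-delimited token of the header line, so that it can be matched against the ids in the CDS annotation. The function name is hypothetical.

```python
def read_fasta(text):
    """Read FASTA text into {fragment_id: sequence}.

    Assumption of this sketch: the id is the first token after '>'.
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header")
            current = header[0]
            if current in records:
                raise ValueError("duplicate fragment id: " + current)
            records[current] = []
        elif current is None:
            raise ValueError("sequence line found before any '>' header")
        else:
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


example = ">frag1 Mouse Gut Community\nAAATCTCG\ncctgtggt\n>frag2\nGGAT\n"
sequences = read_fasta(example)
print(sequences["frag1"])
```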


MetaTISA Output Format

MED Format

The MED output format gives a sequence id and the corresponding CDS annotations, as described above. View the example.

GFF Format

The GFF (General Feature Format) output follows the specification of the Sanger Institute:

<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

Example:

##gff-version  3
##MetaTISA
##metagenome sequence name: NC_000913_45
NC_000913_45   MetaTISA        CDS     3       395     .       -       .
##gff-version  3
##MetaTISA
##metagenome sequence name: NC_000913_255
NC_000913_255  MetaTISA        CDS     2       298     .       -       .
NC_000913_255  MetaTISA        CDS     320     616     .       -       .
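
Output of this form can be read back with a few lines of code. The sketch below skips the "##" comment lines and splits the remaining lines on whitespace, which assumes that sequence names contain no spaces; the function name and the dictionary keys are illustrative choices, not part of MetaTISA.

```python
def parse_gff(text):
    """Parse GFF feature lines into a list of dictionaries."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '##' comment lines
        fields = line.split()
        seqname, source, feature, start, end, score, strand, frame = fields[:8]
        records.append({
            "seqname": seqname,
            "source": source,
            "feature": feature,
            "start": int(start),
            "end": int(end),
            "score": score,
            "strand": strand,
            "frame": frame,
        })
    return records


example = """##gff-version  3
##MetaTISA
##metagenome sequence name: NC_000913_255
NC_000913_255  MetaTISA        CDS     2       298     .       -       .
NC_000913_255  MetaTISA        CDS     320     616     .       -       .
"""
for record in parse_gff(example):
    print(record["seqname"], record["start"], record["end"], record["strand"])
```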


Configuration parameters

Sequence region around start codon
Max order of Markov model
Minimal # of CDS in training
Binning Method
Start and Stop Codon

Sequence region around start codon

These parameters specify the region of the sequence around the start codon that is used to calculate the position weight matrix:

upstream
# of nucleotides upstream of the start codon
Default: 50 nucleotides
downstream
# of nucleotides downstream of the start codon
Default: 15 nucleotides
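
To make the two parameters concrete, the sketch below cuts the window around a start codon out of a fragment. It rests on assumptions that are not specified above: coordinates are 1-based as in the MED format, the downstream nucleotides are counted after the three-nucleotide start codon, and candidates whose window runs off the fragment are skipped. The function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def start_window(seq, left, right, strand, upstream=50, downstream=15):
    """Return upstream + start codon + downstream bases, or None if too short."""
    if strand == "-":
        seq = reverse_complement(seq)
        start = len(seq) - right + 1  # 1-based start on the reversed strand
    else:
        start = left
    begin = start - 1 - upstream
    end = start - 1 + 3 + downstream
    if begin < 0 or end > len(seq):
        return None
    return seq[begin:end]


# Toy fragment: 50 upstream bases, ATG, 15 downstream bases, 10 extra bases.
fragment = "A" * 50 + "ATG" + "C" * 15 + "G" * 10
window = start_window(fragment, 51, 78, "+")
print(len(window))
```

With the default settings each window spans 50 + 3 + 15 = 68 nucleotides under these assumptions.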