
MED2.1 is a non-supervised prokaryotic gene prediction method that integrates MED2.0^[1] with TriTISA^[2], an iterative self-learning translation initiation site (TIS) prediction algorithm. As an update of MED2.0, MED2.1 replaces the previous TIS model with TriTISA, which improves the prediction accuracy for both the 3' and 5' ends.

Details are given in Result: the sections Performance on verified genes and Comparisons with GenBank annotations compare the accuracy of MED2.1, MED2.0 and other gene prediction algorithms.

[1] Zhu, H.Q. et al. (2007) MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes. BMC Bioinformatics, 8, 97.
[2] Hu, G.Q. et al. (2009) Prediction of translation initiation site for microbial genomes with TriTISA. Bioinformatics, 25, 123-125.


The basic steps of MED2.1 are as follows:

Determination of the initial root coding and non-coding ORFs. All possible ORFs longer than 90 bp are extracted from both strands and their coding potential is evaluated. The MED method described in MED2.0 is used to determine the root coding and non-coding ORFs; the root coding ORFs typically cover more than 60% of all genes with a reliability above 99%. These seed sequences serve as a reliable learning set for the subsequent steps.
Determination of the remaining ORFs. TIS refinement is first applied to the remaining ORFs using TriTISA. EDP scores and TIS scores are then integrated with a Fisher discriminant to identify protein-coding ORFs.
Resolving overlapping coding ORFs. To reduce false positives among the detected coding ORFs, all ORFs shorter than 100 are excluded. A strategy for resolving overlapping coding ORFs is then applied.
Final TIS refinement. Finally, the TriTISA model is applied again to improve the TIS prediction of all coding ORFs.
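
The ORF extraction in the first step can be illustrated with a minimal Python sketch. It is not the MED2.1 implementation (which is written in C++): the function names, the start codon set (ATG, GTG, TTG), the choice of the most upstream start codon per stop codon, and counting the stop codon in the ORF length are all assumptions made for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed start codon set, not taken from MED2.1
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_on_strand(seq, min_len=90):
    """Yield (start, end) 0-based half-open spans, one per in-frame stop codon.

    For each stop codon the most upstream in-frame start codon is used, and
    only ORFs strictly longer than min_len (stop codon included) are kept.
    """
    for frame in range(3):
        first_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if first_start is not None and i + 3 - first_start > min_len:
                    yield first_start, i + 3
                first_start = None
            elif codon in STARTS and first_start is None:
                first_start = i


def extract_orfs(seq, min_len=90):
    """Return (start, end, strand) for ORFs on both strands, in forward coordinates."""
    seq = seq.upper()
    n = len(seq)
    orfs = [(s, e, "+") for s, e in orfs_on_strand(seq, min_len)]
    for s, e in orfs_on_strand(reverse_complement(seq), min_len):
        orfs.append((n - e, n - s, "-"))  # map back to forward-strand coordinates
    return orfs
```

Each returned tuple is a candidate ORF; scoring its coding potential is a separate step that this sketch does not attempt.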

The strategy of TriTISA can be summarized as follows:

TriTISA classifies all TIS candidates into three categories: true TISs, false TISs upstream of the true TISs, and false TISs downstream of the true TISs. The sequence features around the TISs of each category are characterized by a non-homogeneous Markov model. The three models are trained by an iterative self-learning procedure. At each step, the Markov models are combined by a Bayesian methodology, which assigns three posterior probabilities to each candidate TIS: the probability that the TIS is a true TIS (Pt), that it is from a non-coding region (Pnc), and that it is from a coding region (Pc). TriTISA predicts the candidate with the highest Pt score as the TIS of a gene. The updated annotation constitutes the training set for the next iteration. To further improve the prediction accuracy, TriTISA employs a cascade of Markov models of different orders: it first uses a 0th-order Markov model for the initial refinements, and then moves to higher-order (1st and 2nd) Markov models in the later iterations.
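
The Bayesian combination of the three models can be sketched as follows. This is an illustrative Python sketch rather than TriTISA's code: it covers only the 0th-order case, where a non-homogeneous model reduces to one nucleotide distribution per window position, and the data layout, class keys ('t', 'nc', 'c') and function names are assumptions.

```python
import math


def log_likelihood_0th(window, model):
    """Log-likelihood of a window under a 0th-order non-homogeneous model.

    model is a list with one dict per position mapping nucleotide -> probability.
    """
    return sum(math.log(model[i][base]) for i, base in enumerate(window))


def posteriors(log_likelihoods, priors):
    """Apply Bayes' rule over the three classes; both arguments are dicts
    keyed by 't' (true TIS), 'nc' (non-coding) and 'c' (coding)."""
    joint = {k: log_likelihoods[k] + math.log(priors[k]) for k in priors}
    top = max(joint.values())  # subtract the maximum for numerical stability
    weights = {k: math.exp(v - top) for k, v in joint.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def pick_tis(candidates, models, priors):
    """Return (position, Pt) of the candidate with the highest Pt.

    candidates is a list of (position, window) pairs for one gene.
    """
    best_pos, best_pt = None, -1.0
    for pos, window in candidates:
        ll = {k: log_likelihood_0th(window, models[k]) for k in models}
        pt = posteriors(ll, priors)["t"]
        if pt > best_pt:
            best_pos, best_pt = pos, pt
    return best_pos, best_pt
```

In an iterative setting, the positions returned by `pick_tis` would be used to re-estimate the three models before the next pass; that retraining loop is omitted here.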

Result

Performance on verified genes

Comparisons with GenBank annotations


Performance on verified genes

To evaluate the performance of MED2.1, we use datasets of genes confirmed by N-terminal amino acid sequencing or other experimental evidence. Tab. 1 compares the performance of MED2.1, MED2.0, Glimmer3.02 and GeneMarkS. Predictions for Glimmer3.02 and GeneMarkS were downloaded from NCBI.

Tab. 1: Comparison of 5' and 3' end predictions by MED2.1, MED2.0, Glimmer3.02 and GeneMarkS on a set of reliable test sets.

		3' end match (%)				5' and 3' end match (%)
Test set	Genes	MED2.1	MED2.0	Glimmer3.02	GeneMarkS	MED2.1	MED2.0	Glimmer3.02	GeneMarkS
solfgene	56	100	100	100	100	87.5	89.3	87.5	85.7
B.sub	4176	98.5	98.9	97.6	98.9	88.1	86.1	82.4	86.1
Syne	125	98.4	96.8	99.2	99.2	88.8	81.6	76.8	86.1
E.coli	883	99.2	98.6	99.2	99.4	94.2	91.2	90.5	93.3
A.per	131	100	100	99.2	99.2	96.9	88.5	92.4	95.4
N.pha	321	98.4	98.1	99.7	100	96.6	82.6	94.7	96.6
Mtub66	66	97	95.5	97	98.5	89.4	87.9	80.3	80.3
Psaer107	107	98.1	97.2	95.3	93.5	94.3	93.5	90.6	85
H.sal	522	98	96.6	99.8	99.2	93.5	75.7	86.6	92.2

References for datasets: A.per (Yamazaki et al., 2006); E.coli (Rudd, 2000); H.sal and N.pha (Aivaliotis et al., 2007); Syne (Sazuka et al., 1999). The data sets Psaer107, Mtub66 and SolfGene contain TISs confirmed by N-terminal amino acid sequencing or inferred from experimental evidence.


Comparisons with GenBank annotations

To illustrate the prediction accuracy of MED2.1, we compare its predictions with the GenBank annotation for the 3' end match. Two independent quantities, Sn (sensitivity) and Sp (specificity), are defined to evaluate the performance of a gene finder at the gene level: Sn = TP/(TP+FN) and Sp = TP/(TP+FP), where TP, FP and FN are the numbers of true positives, false positives and false negatives, respectively.
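
As a worked illustration of these two definitions, the following Python sketch computes Sn and Sp for a 3' end match. Representing each gene as a (strand, 3' end coordinate) pair and the function name are assumptions made for the example, not part of MED2.1.

```python
def sensitivity_specificity(predicted_ends, annotated_ends):
    """Return (Sn, Sp) where a prediction counts as a true positive if its
    (strand, 3' end) pair also appears in the annotation."""
    predicted = set(predicted_ends)
    annotated = set(annotated_ends)
    tp = len(predicted & annotated)   # predicted and annotated
    fp = len(predicted - annotated)   # predicted but not annotated
    fn = len(annotated - predicted)   # annotated but missed
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp
```

For example, with four annotated genes and five predictions of which three match, TP = 3, FN = 1 and FP = 2, so Sn = 3/4 and Sp = 3/5.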

Tab. 2: Accuracy comparison of MED2.1, MED2.0, Glimmer3.02 and GeneMarkS.


Since the GenBank annotation is not fully accurate, a further evaluation is performed on function-known genes, which have more reliable annotations. Function-known genes are defined as those listed in the GenBank annotation whose product descriptions contain none of the keywords "-like", "conserved", "hypothetical", "homolog", "probable", "possible", "predicted", "putative", "similarity" and "unknown". We present the comparison against function-known genes for the 3' end match and for the match of both ends.
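
The keyword rule for function-known genes can be sketched in Python as below. Case-insensitive substring matching and treating an empty product description as not function-known are assumptions; the text above only lists the keywords.

```python
EXCLUDED_KEYWORDS = (
    "-like", "conserved", "hypothetical", "homolog", "probable",
    "possible", "predicted", "putative", "similarity", "unknown",
)


def is_function_known(product):
    """Return True if a product description contains none of the excluded
    keywords (case-insensitive substring match, an assumed convention)."""
    text = product.strip().lower()
    if not text:
        return False  # assumption: a missing description is not function-known
    return not any(keyword in text for keyword in EXCLUDED_KEYWORDS)
```

Applying `is_function_known` to the product field of each annotated gene would yield the reduced reference set used for the comparison.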


Tab. 3: Accuracy comparison of MED2.1, MED2.0, Glimmer3.02 and GeneMarkS.

Welcome to Download MED 2.1

MED 2.1 is written in C++.

MED 2.1 User Guide

Input file: *.fna in FastA format containing genomic sequence.
Output file: the first three columns of *tis_predictions.txt show the predicted CDSs.
For Windows users: First ensure that MED2.exe, TRITisa.exe, settings, EDPCenters.txt and EDPCentersGC.txt are saved in the same directory. Open the command prompt ("c:\windows\system32\cmd.exe", pre-installed with Windows) and change to the directory where you saved the program. Type "MED2.exe fna_file" to start the program, for example: "MED2.exe c:\tmp\Bsub.fna". Output files are saved in the same directory as the input file.
For Linux/Unix users: Run the script file named "buid" to install the program. All resulting executable files are generated in the "bin" directory. Then run the program as follows: 1) ensure that MED2, TRITisa.exe, settings, EDPCenters.txt and EDPCentersGC.txt are saved in the same directory, say "bin"; 2) move your genomic file in FastA format to that directory; 3) type "./MED fna_file" in the console. Predictions are saved in the same directory.


File formats description

Sequence files (*.fna)


Here we adopt the FASTA style.

>gi|40068520|gb|AE017199.1| Nanoarchaeum equitans Kin4-M, complete genome
TCTCGCAGAGTTCTTTTTTGTATTAACAAACCCAAAACCCATAGAATTTAATGAACCCAAACCGCAATCG
TACAAAAATTTGTAAAATTCTCTTTCTTCTTTGTCTAATTTTCTATAAACATTTAACTCTTTCCATAATG
TGCCTATATATACTGCTTCCCCTCTGTTAATTCTTATTCTTATTGATACTGTTTTATAGAAAGTAAAACC
TTCGAATATTGCTTCTTCAAAATAAAAGTTCTTCCCACAGAAATTATTATATTTTCTTAAGCTTTGCTCT
TTTAGGTCTCGCAACCAATTAGAAAAAGCGCTAAAACTCTTATTTCTAAAATCGTAATACCTATCTTTTA
CATCTATAATTTTCCATCCATCTTTGATTTGGTTTAAACTAATCTCTATGGGTTCTTCTAAAATGGGCTC
TTTCTTTTTTATTAGATTATACTTTTTCAATTCTTCCAATCGCTTAACAAATACCTTATAGTATTTGTCC
CCTTTTAATATTACTATTTTGCCTTCCCTTAATACTATAGGAGTTATAGTAACCCAAGGAAATCTTAATT
TTGGATCGAATTTTTTTGTTTTCCTTAATTGGAATTGAGCCAAACCAATAGTTATTATATCTAAATCTTT
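
For readers who want to inspect such a file before running the program, a minimal FASTA reader can be sketched in Python. It is not part of the MED 2.1 package; the function name and the choice to upper-case the sequence are assumptions.

```python
def read_fasta(lines):
    """Parse FASTA text given as an iterable of lines.

    Returns a list of (header, sequence) pairs, where header is the text
    after '>' and sequence is the concatenation of the following lines.
    """
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Because the function accepts any iterable of lines, an open file handle for a *.fna file can be passed to it directly.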
