Datasets and Results
Download datasets (tarball, 12 MB)
Post processors

| Data | Nr | GLIMMER | GS-finder | MED-Start | RBSfinder | TICO |
|---|---|---|---|---|---|---|
| EcoGene | 854 | 63.2 | 90.3 | 92.0 | 81.9 | 94.2 |
| Bsub | 1248 | 61.3 | 87.9 | 89.2 | 78.5 | 89.4 |
| Bsub | 58 | 69.0 | 94.8 | 94.8 | 82.8 | 91.4 |
| Bsub-short | 123 | 53.7 | 75.6 | 79.7 | 72.4 | 78.9 |
| Bsub-short | 72 | 54.2 | 83.3 | 86.1 | 75.0 | 84.7 |
| Bsub-short | 51 | 47.1 | 82.4 | 86.3 | 70.6 | 84.3 |
| P. aeruginosa | 3281 | 58.7 | 83.6 | 3.6 | 67.7 | 85.2 |
| R. solanacearum chr | 3440 | 51.5 | 71.4 | 5.0 | 56.8 | 74.9 |
| R. solanacearum plasm. | 1676 | 48.9 | 66.2 | 6.0 | 55.5 | 70.1 |
| B. pseudomallei chr1 | 3399 | 53.2 | 64.3 | 5.5 | 53.3 | 69.6 |
| B. pseudomallei chr2 | 2329 | 48.9 | 67.5 | 4.7 | 52.1 | 67.0 |
For all post processors, the predictions of GLIMMER 2.02 served as the initial annotation of the coding regions.
To evaluate performance, we compared the results of the post processors with the most reliable annotations available for the genomes of E. coli and B. subtilis. For E. coli (NC_000913), this is a set of 854 genes from the EcoGene database (Rudd, 2000) whose N-termini were verified by protein sequencing.
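The comparison against a verified annotation can be sketched as follows. This is a minimal illustration, not the evaluation script behind the table: the keying of genes by strand and stop position, the rule that only genes present in both sets are scored, and all names and coordinates are assumptions made for the example.

```python
from typing import Dict, Tuple

# A gene is keyed by (strand, stop position); the value is its start position.
# This keying is an assumption for illustration: a post processor relocates
# the start of a gene but leaves its stop codon fixed.
GeneKey = Tuple[str, int]


def start_accuracy(reference: Dict[GeneKey, int],
                   predicted: Dict[GeneKey, int]) -> float:
    """Percentage of reference genes whose predicted start matches exactly.

    Only reference genes that also appear in the predictions (same strand
    and stop) are counted, since the start of a missed gene cannot be scored.
    """
    shared = [key for key in reference if key in predicted]
    if not shared:
        raise ValueError("no reference gene was found among the predictions")
    correct = sum(1 for key in shared if reference[key] == predicted[key])
    return 100.0 * correct / len(shared)


# Hypothetical toy data, not taken from any of the datasets above.
reference = {("+", 1200): 300, ("-", 5000): 5900, ("+", 9000): 8100}
predicted = {("+", 1200): 300, ("-", 5000): 5960, ("+", 9000): 8100}
print(round(start_accuracy(reference, predicted), 1))
```

Two of the three toy genes have matching starts, so the function reports two thirds as a percentage.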
For B. subtilis (NC_000964), we used four datasets for the performance comparison. The first set covers all 1248 non-y genes of the GenBank annotation. These genes are said to be experimentally characterized and to have individually verified start sites (Yada et al., 2001; Hannenhalli et al., 1999). The second set contains 58 genes confirmed by homology to the closely related organism B. halodurans. Not all start sites of the 58 set have been verified experimentally. To evaluate the performance of TICO on short genes, we used a different set (Bsub-short) consisting of three subsets. The subsets contain 123, 72 and 51 genes shorter than 300 nucleotides (Besemer et al., 2001). The set of 123 genes contains annotations that were confirmed by at least one significant BLAST homology (E-value < 10^(-4)). The set of 72 genes contains those with at least two significant homologies, and the set of 51 genes those with at least ten.
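The three Bsub-short subsets are nested: each one raises the required number of significant homologies. A minimal sketch of that selection, assuming each gene record already carries its length and its count of significant BLAST hits (the `Gene` record, its field names, and the toy entries are hypothetical):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gene:
    name: str
    length_nt: int         # gene length in nucleotides
    significant_hits: int  # number of BLAST hits below the E-value cutoff


def short_gene_subset(genes: List[Gene], min_hits: int,
                      max_length_nt: int = 300) -> List[Gene]:
    """Keep genes shorter than max_length_nt with at least min_hits hits."""
    return [g for g in genes
            if g.length_nt < max_length_nt and g.significant_hits >= min_hits]


# Hypothetical toy records, not taken from the B. subtilis annotation.
genes = [
    Gene("geneA", 240, 1),
    Gene("geneB", 180, 3),
    Gene("geneC", 270, 12),
    Gene("geneD", 450, 15),  # too long, excluded from every subset
    Gene("geneE", 210, 0),   # no significant homology, excluded
]

at_least_one = short_gene_subset(genes, min_hits=1)
at_least_two = short_gene_subset(genes, min_hits=2)
at_least_ten = short_gene_subset(genes, min_hits=10)
print([g.name for g in at_least_one])
print([g.name for g in at_least_two])
print([g.name for g in at_least_ten])
```

Because the thresholds only tighten, every gene in a stricter subset also appears in the looser ones.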
Note that most of the starts in the B. subtilis datasets are not experimentally confirmed. Also note that the rates on the small B. subtilis datasets of 58, 123, 72 and 51 annotated genes are statistically less representative than the rates on the larger dataset.
The verified datasets described above were downloaded from the Center of Theoretical Biology (CTB) at Peking University.
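To give a feeling for how much the dataset size matters, the sketch below computes a 95% Wilson score interval for an accuracy rate. The Wilson interval is our choice for illustration and is not part of the original evaluation. The counts of correct starts (53 of 58 and 1116 of 1248) are back-calculated from the TICO percentages in the table and are therefore assumptions.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    z2 = z * z
    denom = 1.0 + z2 / total
    center = (p + z2 / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z2 / (4.0 * total * total)) / denom
    return center - half, center + half


# Counts reconstructed from the table's percentages (assumed, see above).
small_lo, small_hi = wilson_interval(53, 58)      # Bsub, 58 genes
large_lo, large_hi = wilson_interval(1116, 1248)  # Bsub, 1248 genes

print(f"58 genes:   {100 * small_lo:.1f}% to {100 * small_hi:.1f}%")
print(f"1248 genes: {100 * large_lo:.1f}% to {100 * large_hi:.1f}%")
```

Under these assumptions the interval for the 58-gene set is several times wider than the one for the 1248-gene set, so differences of a few percentage points between tools on the small sets should be read with caution.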
The performance of TICO on high-G+C genomes has been evaluated on Pseudomonas aeruginosa (GC: 66.6%) (Stover et al., 2000), Ralstonia solanacearum (GC: 67.04% and 66.86%) (Salanoubat et al., 2002) and Burkholderia pseudomallei (GC: 67.7% and 68.56%) (Holden et al., 2004). The Pseudomonas data were downloaded from the Pseudomonas aeruginosa Community Annotation Project (PseudoCAP). The Ralstonia and Burkholderia data were downloaded from NCBI.
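For reference, the G+C percentage of a sequence can be computed as below. This is a minimal sketch; skipping ambiguous symbols such as N is an assumption of the example, and the test sequence is made up.

```python
def gc_content(sequence: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / counted


# Made-up sequence for illustration; N is ignored in the denominator.
print(round(gc_content("GGCCATNN"), 2))
```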
Visualization
View an example of the visualization produced with the tool WeightsVis.
Literature
- Besemer, J., Lomsadze, A., and Borodovsky, M. (2001) GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res., 29, 2607-2618.
- Delcher, A. L., et al. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res., 27, 4636-4641.
- Hannenhalli, S. S., Hayes, W. S., Hatzigeorgiou, A. G., and Fickett, J. W. (1999) Bacterial start site prediction. Nucleic Acids Res., 27, 3577-3582.
- Holden, M. T., et al. (2004) Genomic plasticity of the causative agent of melioidosis, Burkholderia pseudomallei. Proc. Natl. Acad. Sci. U.S.A., 101, 14240-14245.
- Ou, H.-Y., Guo, F.-B., and Zhang, C.-T. (2004) GS-Finder: a program to find bacterial gene start sites with a self-training method. The International Journal of Biochemistry & Cell Biology, 36, 535-544.
- Rudd, K. E. (2000) EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Res., 28, 60-64.
- Salanoubat, M., et al. (2002) Genome sequence of the plant pathogen Ralstonia solanacearum. Nature, 415, 497-502.
- Stover, C. K., et al. (2000) Complete genome sequence of Pseudomonas aeruginosa PAO1, an opportunistic pathogen. Nature, 406, 959-964.
- Suzek, B. E., Ermolaeva, M. D., Schreiber, M., and Salzberg, S. L. (2001) A probabilistic method for identifying start codons in bacterial genomes. Bioinformatics, 17, 1123-1130.
- Yada, T., Totoki, Y., Takagi, T., and Nakai, K. (2001) A novel bacterial gene-finding system with improved accuracy in locating start codons. DNA Res., 8, 97-106.
- Zhu, H.-Q., et al. (2004) Accuracy improvement for identifying translation initiation sites in microbial genomes. Bioinformatics, 20, 3308-3317.