Identifying and correcting misannotated open reading frame (ORF) start sites is important for a variety of experimental and analytical purposes, ranging from cloning to inference of operon structure. The genome of the H37Rv reference strain of Mycobacterium tuberculosis (Mtb) was annotated when it was first sequenced nearly 15 years ago. While this annotation has served the TB research community well as a standard of reference for over a decade, it has been demonstrated experimentally that the actual start sites of an estimated 5-10% of ORFs differ from the annotation. In this paper, we present a comprehensive bioinformatic analysis of all 3989 ORFs in the M. tuberculosis H37Rv genome. Our method combines information from comparative analysis (alignment to start sites of orthologs in other Actinobacteria), sequence conservation, "protein likeness", putative ribosome binding sites, and other data to identify translational start sites. These features are combined in a linear model that is trained on a dataset of known start sites verified by mass spectrometry and that achieves a cross-validated accuracy of 94%. The method can be viewed as an augmentation of Hidden Markov Model-based tools such as Glimmer and GeneMark: it incorporates more information than just the raw genomic sequence when deciding which position is the legitimate translational start site for each ORF. Using this analysis, we identify 269 genes that most likely need to be re-annotated, and identify the best alternative translational start site for each. These revised ORF definitions could be used in the reannotation of the H37Rv genome, as well as to prioritize genes for experimental start-site validation.
- Translational Start Sites
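
To make the idea of scoring alternative start sites concrete, the following is a minimal illustrative sketch, not the model described above. It assumes a forward-strand ORF, the usual bacterial start codons (ATG, GTG, TTG), and a linear score over per-candidate features. The function names, feature names, weights, and toy sequence are all hypothetical; in the actual method the weights would be learned from start sites verified by mass spectrometry.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_starts(seq, annotated_start):
    """List in-frame start codons that share the annotated ORF's stop codon.

    `seq` is a forward-strand DNA string and `annotated_start` is the
    0-based index of the annotated start codon (hypothetical interface).
    """
    # Walk upstream codon by codon until an in-frame stop (or the
    # beginning of the sequence) bounds the longest possible ORF.
    lo = annotated_start
    while lo - 3 >= 0 and seq[lo - 3:lo] not in STOP_CODONS:
        lo -= 3
    # Walk downstream, collecting start codons until the ORF's stop codon.
    starts = []
    pos = lo
    while pos + 3 <= len(seq):
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            starts.append(pos)
        pos += 3
    return starts


def linear_score(features, weights, bias=0.0):
    """Weighted sum of feature values for one candidate start site."""
    return bias + sum(weights[name] * value for name, value in features.items())


def best_start(feature_table, weights, bias=0.0):
    """Return the candidate position with the highest linear score."""
    return max(feature_table,
               key=lambda pos: linear_score(feature_table[pos], weights, bias))


# Toy example: all numbers below are made up for illustration only.
seq = "TAAGTGAAAATGCCCTTGGGGTAA"
candidates = candidate_starts(seq, annotated_start=9)

weights = {"ortholog_alignment": 2.0, "conservation": 1.0, "rbs": 0.5}
feature_table = {
    3: {"ortholog_alignment": 1.0, "conservation": 0.8, "rbs": 1.0},
    9: {"ortholog_alignment": 0.0, "conservation": 0.5, "rbs": 0.0},
    15: {"ortholog_alignment": 0.0, "conservation": 0.2, "rbs": 1.0},
}
chosen = best_start(feature_table, weights)
```

In this toy case the annotated start at position 9 is outscored by the upstream GTG at position 3, which is the kind of outcome that would flag a gene for re-annotation. A real feature table would be computed from ortholog alignments, conservation profiles, and ribosome binding site predictions rather than entered by hand.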