Centre for Modeling and Simulation
Savitribai Phule Pune University
All Models Are False, Some Are Useful

Technical Report CMS-TR-20121224


Title One size does not fit all: On how Markov model order dictates performance of genomic sequence analyses
Author/s Leelavati Narlikar
Centre for Modeling and Simulation, Savitribai Phule Pune University, Pune 411 007 India

and
Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune 411 008


Nidhi Mehta
Centre for Modeling and Simulation, Savitribai Phule Pune University, Pune 411 007 India


Sanjeev Galande
National Centre for Cell Science, Savitribai Phule Pune University Campus, Pune 411 007
and
Indian Institute of Science Education and Research, Pune 411 021


Mihir Arjunwadkar
Centre for Modeling and Simulation, Savitribai Phule Pune University, Pune 411 007 India
and
National Centre for Radio Astrophysics, Savitribai Phule Pune University Campus, Pune 411 007, India
Abstract Their structural simplicity and ability to capture serial correlations make Markov models a popular modeling choice in several genomic analyses, such as identification of motifs, genes and regulatory elements. A critical, yet relatively unexplored, issue is the determination of the order of the Markov model. Most biological applications use a predetermined order for all data sets indiscriminately. Here, we show that the performance of such applications varies widely with the order. To identify the ‘optimal’ order, we investigated two model selection criteria: the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). The BIC optimal order delivers the best performance for mammalian phylogeny reconstruction and motif discovery. Importantly, this order is different from the orders typically used by many tools, suggesting that a simple additional step determining this order can significantly improve results. Further, we describe a novel classification approach based on BIC optimal Markov models to predict functionality of tissue-specific promoters. Our classifier discriminates between promoters active across 12 different tissues with remarkable accuracy, yielding 3 times the precision expected by chance. Application to the metagenomics problem of identifying the taxon from a short DNA fragment yields accuracies at least as high as the more complex mainstream methodologies, while retaining conceptual and computational simplicity.
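As an illustration of the order-selection step the abstract refers to, the following is a minimal Python sketch, not the authors' implementation. It fits fixed-order Markov models to a DNA string by maximum likelihood and scores each order with the standard definitions AIC = 2p - 2 ln L and BIC = p ln n - 2 ln L. The function names, the parameter count p = 3 * 4^k for order k, and the choice to condition every order on the first max_order symbols (so that all orders are scored on the same n observations) are assumptions made for this sketch; the report itself may handle these details differently.

```python
import math
from collections import Counter

def markov_log_likelihood(seq, order, start):
    """Maximised conditional log-likelihood of seq[start:] under an
    order-`order` Markov model (requires start >= order)."""
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(start, len(seq)):
        context = seq[i - order:i]          # previous `order` symbols
        context_counts[context] += 1
        transition_counts[(context, seq[i])] += 1
    # MLE of each transition probability is count / context count.
    return sum(n * math.log(n / context_counts[ctx])
               for (ctx, _), n in transition_counts.items())

def n_free_parameters(order, alphabet_size=4):
    """Free parameters: (alphabet_size - 1) per context (assumed count)."""
    return (alphabet_size - 1) * alphabet_size ** order

def score_orders(seq, max_order):
    """Return {order: {"loglik", "aic", "bic"}} for orders 0..max_order."""
    n = len(seq) - max_order                # same sample size for all orders
    scores = {}
    for k in range(max_order + 1):
        loglik = markov_log_likelihood(seq, k, start=max_order)
        p = n_free_parameters(k)
        scores[k] = {
            "loglik": loglik,
            "aic": 2 * p - 2 * loglik,
            "bic": p * math.log(n) - 2 * loglik,
        }
    return scores

def best_order(seq, max_order, criterion="bic"):
    """Order with the smallest value of the chosen criterion."""
    scores = score_orders(seq, max_order)
    return min(scores, key=lambda k: scores[k][criterion])

if __name__ == "__main__":
    toy = "ACGT" * 50                       # toy sequence, not real data
    print(best_order(toy, max_order=3, criterion="bic"))
```

On real data, max_order would be chosen so that 4^k contexts remain well populated; since ln n exceeds 2 once n is 8 or more, BIC penalises each additional parameter more heavily than AIC and so never selects a higher order than AIC does on the same sequence.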
Citing This Document Leelavati Narlikar, Nidhi Mehta, Sanjeev Galande, and Mihir Arjunwadkar, One size does not fit all: On how Markov model order dictates performance of genomic sequence analyses. Technical Report CMS-TR-20121224 of the Centre for Modeling and Simulation, Savitribai Phule Pune University, Pune 411007, India (2012); available at http://cms.unipune.ernet.in/reports/.
Notes, Published Reference, Etc. Published as Nucleic Acids Research 41(3), 1416-1426 (2012).
Contact mihir AT cms.unipune.ernet.in