Adaptive Inference with Applications to Bioinformatics
Over the past few decades, statistical inference has provided important set of tools for the study of biological sequences as well as protein structures. Hidden Markov models are commonly used to characterize functional units in DNA and RNA sequences, and higher order models have more recently been applied to study three-dimensional protein structure. For both of these applications, the maximum a priori (MAP) configuration defines the maximum likelihood set of states of the system, and can be computed using a well-known dynamic programming algorithm. We have developed methods for ``adaptive inference'' that improve the performance of sequence and protein structure analysis in computational biology. In particular, problems in sequence and structure analysis incorporate small (and perhaps frequent) changes to the model over time, for example, due to mutations in the sequence, new experimental measurements, or changes to the environment of a protein. In such a setting, it is necessary to repeatedly update an already computed MAP configuration when changes are made to the underlying model. Since small changes to the model often require updating only a small fraction of the MAP configuration, we give a framework for adaptive inference that preprocesses the input in a way that allows for efficient updates and recomputation. In particular, our algorithm is within a logarithmic factor of the optimal and is asymptotically never slower than recomputing from scratch. That is, if a modification to the model requires $m$ updates to the MAP configuration of $n$ random variables, then our algorithm requires $O(m\log (n/m) )$ time; recomputing from scratch requires $O(n)$ time. To demonstrate the practical utility of our approach, we show that our framework for inference can provide significant speedups for adaptive versions of CpG island detection and protein sidechain packing.