Predicting the secondary structure of an RNA sequence with fast speed and high accuracy has been a long standing challenge in computational biology. It is an important problem because knowing structures reveals crucial information about the RNA’s function, which is useful in many applications. Being able to rapidly determine the structure is extremely useful given the overwhelming pace of increase in genomic data (about 10^21 base-pairs per year) and given the small percentage of sequences that have experimentally determined structure. While experimental assays still constitute the most reliable way to determine such structures, they are prohibitively costly, slow, and difficult, and therefore computational prediction provides an attractive alternative.
However, existing prediction algorithms scale poorly with longer sequences, running in O(n^3) time to predict nesting structures or O(n^4)∼O(n^6) time for pseudoknots (n is the RNA sequence length and could be as large as thousands). Interestingly, these prediction algorithms are mostly borrowed from computational linguistics, where predicting the most likely syntactic structure of a sentence is closely analogous to finding the lowest energy structure of an RNA. We therefore borrow well-known linear-time dynamic programming algorithms from computational linguistics (developed by the speaker) to predict RNA secondary structures, which results in orders of magnitude faster predictions without loss of accuracy.