Longest Common Substring in Haskell
Rosalind is a website resembling Project Euler but for bioinformatics problems instead. On Rosalind, there is a problem that asks you to find the longest common substring between $k$ sequences of length $n$. I am aware of three different approaches to this problem:
- With dynamic programming, taking $O(n^k)$ time.
- By building a suffix tree, taking $O(nk)$ time.
- By conducting a binary search of the possible substring lengths. Substrings are detected in $O(nk)$ time on average by computing rolling hashes, similar to the approach used in Rabin–Karp string search. Takes $O(nk \log n)$ time in total.
I'm solving most of the problems on Rosalind with Haskell, as a programming exercise. I don't like dynamic programming or trees, so I chose to implement the third approach.
Let's get started with the code. Firstly, we need to import some standard fare:
{-# LANGUAGE DataKinds #-}
import Data.Char (ord)
import Data.List (elemIndex, groupBy)
import Data.Maybe (fromJust)
import Data.Mod
import Data.Set qualified as S
Then, we write a function that returns the rolling hash of a string where $k$ denotes the sliding window size. For example, if we have the string "hello" with $k=3$ then the first/starting hash is, with characters in ASCII,
$$H_0 = 104a^2 + 101a^1 + 108a^0$$
and the next hash is $H_1 = a(H_0 - 104a^2) + 108$ where we update the hash by striping off the first character (the "h") then shift the entire polynomial up by multiplying by $a$ before pushing a new character onto the newly created space on the right end, which in this case is the next "l" in "hello" producing the hash $H_1 = 101a^2 + 108a^1 + 108a^0$. This update proceduresmakes the algorithm much faster when $k$ is large compared to recomputing the hash from scratch for each sliding window.
In Haskell, we produce $a^0, \dotsc, a^{k-1}$ in a list called as
and convert the input string s
to a list of numbers called contents
. We compute the startingHash
and the middle
variable, which stores the first and last character of each sliding window. Finally, we write a function to update a hash given the first and last characters of a sliding window. Then, a simple scanl
gets the job done.
type ModP = Mod 2147483647 -- 2^31 - 1
a = 257 :: ModP
rollingKarp :: Int -> String -> [ModP]
rollingKarp k s = scanl update startingHash middle
where
as = reverse $ take k $ iterate (a *) 1
contents = map (fromIntegral . ord) s
middle = zip contents (drop k contents)
startingHash = sum $ zipWith (*) contents as
update ac (x, y) = (ac - x * head as) * a + y
Using this hash function, we can find common substrings of a certain length by taking a hash that is present in all strings. The function below assumes that there are no hash collisions, which suffices for the one-time challenge the website presents, but obviously does not ensure a correct solution. You should check manually that the substrings indicated by the hash are actually equal, but I was too lazy to implement that.
commonSubstringOfLength :: Int -> [String] -> Maybe String
commonSubstringOfLength n xs = hashToSubstring <$> candidateMatch
where
hashes = map (rollingKarp n) xs
candidateMatch =
let s = map S.fromList hashes
in S.lookupMin $ foldl S.intersection (head s) s
hashToSubstring h =
let i = fromJust $ elemIndex h (head hashes)
in take n $ drop i (head xs)
Now, we can implement a binary search to search for the maximum substring. If a common substring of length $n$ is found then we know that either the greatest common substring is of length $n$ or its greater than length $n$, so we search higher. Otherwise, if there is no common substring, we know that there can't be a common substring greater than length $n$, so we search lower.
search lo hi xs
| hi - lo == 1 = fromJust result
| otherwise =
case result of
Just x -> search (mid + 1) hi xs
Nothing -> search lo mid xs
where
mid = (lo + hi) `div` 2
result = commonSubstringOfLength mid xs
To solve the challenge presented by Rosalind, we just read in the FASTA file and run the binary search on the DNA strands.
fasta s = map (\x -> (tail (head x), concat (tail x))) d
where
d = groupBy (\a b -> head a /= '>' || head b /= '>') (lines s)
main = do
s <- readFile "/home/kjc/Downloads/rosalind_lcsm.txt"
let dna = map snd $ fasta s
putStrLn $ search 0 (maximum (map length dna) - 1) dna
And we are done!