see_chrom_get_distance.c
This file contains functions for calculating similarity between two regions (by comparing their markov models).
For efficiency, things seem slightly complex at first glance here, but it's straightforward.
What happens when you call computedistances_go()
and give it two chromosomes to look for similarities between:
1) For each region in the first chromosome, open the markov model for the region
(we cached it earlier in the .markovs file on disk).
2) Expand this region into an 'expanded' structure. that's a really quick mapping from
dna sequence (encoded as an int) to a set of integer counts that say how often
that sequence was encountered. same structure we used earlier when counting.
3) Allocate an array of "flags" that we'll use to check off when a sequence is seen.
just like for the expanded structure, we can quickly access any flag by directly indexing on
the sequence-encoded-as-integer.
4) For each region in the other chromosome, open the markov model for the region
(we cached it earlier in the .markovs file on disk).
5) Clear all the flags.
6) We don't need to fully expand the second data, we can leave it in a list. walk through every
item in the list (items contain a sequence encoded as an int and a set of integer counts
that say how often that sequence was encountered).
Use the sequence-encoded-as-integer to look up the other counts for that sequence in the 'expanded' structure from step 2.
We can now compute the euclidean distance between these sets of integers (as if these were vectors in a really-large dimensional space). store the tuple (firstregionlocation, secondregionlocation, similarity score) in memory. Use the sequence-encoded-as-integer to set the corresponding flag to 1.
7) Iterate through the flags to see places where the flag was not set to 1. These are sequences that were seen in the first region but not in the second. Add these distance scores, treated as if the other vector were all zeros.
8) When we have scores for all the pairs of regions, save our tuples to the database.
currently saving to the database is done under a mutex, but a future improvement would be to
make it use an optimistic strategy, and simply retry if the db returned ERR_DB_LOCK.
When we have all scores for all chromosomes everywhere, create an index
so we can more efficiently sort by similarity score.
Stamp the database as complete, so that future runs can use it.