see_chrom_util.c illustrated
Storing a nucleotide sequence as an integer:
For efficient memory usage, we can store a sequence into an integer. There are 4 bases, so it takes 2 bits to describe a base, so a standard 32bit integer can represent a sequence of 16 bases. In our case, our sequences are 9 bases long, which we store in the lowest 18 bits of an integer.
This insert-and-truncate operation is exactly what is needed when building a markov model.
There are only two data structures that merit more description (and see the image below).
Recall that we keep a count of each DNA sequence seen in a region. When we come across a DNA sequence, how do we map that to the right integer to increment?
counts_expanded_in_memory_t
is an "expanded" memory bitmap that maps a sequence to the integers for counting. It doesn't have the overhead of hashing, and also in my experiments, the data is not sparse at all -- nearly every 9-base sequence is seen, and so hashtables lose their advantages. It's a really quick mapping from dna sequence (encoded as an int) to a set of integer counts that say how often that sequence was encountered.
counts_condensed_in_list_t
stores the same data as the expanded structure, but in a list, each entry in the list is labeled with the sequence. Because it doesn't need to store every possible sequence, it uses fewer resources for sparse data, such as after we keep only the data for the most-frequently-seen 16000 sequences.