see_chrom_gen_model.c
This file has functions for generating markov models and persisting them to disk.
We cache markov models in a binary format for efficiency. They are read later
by see_chrom_get_distance.
Typical usage:
1) call sv_markov_gen_open()
   this creates an expanded counting structure and
   sets up a binary .markovs file to cache the model.
2) call sv_markov_gen_got() for each nucleotide seen
   this builds up a nucleotide sequence
   and increments the matching slot in the counting structure.
   once a full region is done, we call sv_markov_gen_finish_region(), which:
     copies the expanded counting structure into a condensed one so we can sort it
     (each condensed item is labeled with its dna sequence, unlike expanded),
     sorts it by count,
     then writes the num_buckets_kept (default=160000) most frequent items to our output file,
     which effectively rounds the counts of sequences we didn't see often to zero.
3) call sv_markov_gen_finish()
   this stamps the binary .markov file with a header saying it is done.
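
The counting and condensing steps above can be illustrated with a small, self-contained sketch. It is not the code in this file: every demo_ name, the context length DEMO_K, and the choice to restart the window on a non-ACGT character are assumptions made for illustration. The expanded structure is a flat array indexed by the base-4 value of the last DEMO_K bases; the condensed structure adds a sequence label to each count so the entries can be sorted and truncated.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_K 3                               /* hypothetical context length */
#define DEMO_NUM_SLOTS (1u << (2 * DEMO_K))    /* 4^K possible sequences */

/* Condensed entry: carries its own label, so it still makes sense after sorting. */
typedef struct {
    char     seq[DEMO_K + 1];
    uint32_t count;
} demo_bucket;

/* Map a nucleotide to 0..3, or -1 for anything else (for example 'N'). */
static int demo_base_code(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default:            return -1;
    }
}

/* Expanded counting: the slot index is the base-4 value of the last K bases. */
static void demo_count(const char *dna, uint32_t *expanded) {
    uint32_t index = 0;
    int filled = 0;                            /* valid bases currently in the window */
    for (const char *p = dna; *p != '\0'; ++p) {
        int code = demo_base_code(*p);
        if (code < 0) { filled = 0; index = 0; continue; }   /* assumption: restart */
        index = ((index << 2) | (uint32_t)code) & (DEMO_NUM_SLOTS - 1);
        if (filled < DEMO_K) ++filled;
        if (filled == DEMO_K) ++expanded[index];
    }
}

/* Condense: decode each slot index back into its sequence label. */
static void demo_condense(const uint32_t *expanded, demo_bucket *out) {
    static const char bases[] = "ACGT";
    for (uint32_t i = 0; i < DEMO_NUM_SLOTS; ++i) {
        for (int j = 0; j < DEMO_K; ++j) {
            int shift = 2 * (DEMO_K - 1 - j);
            out[i].seq[j] = bases[(i >> shift) & 3u];
        }
        out[i].seq[DEMO_K] = '\0';
        out[i].count = expanded[i];
    }
}

/* Highest count first; ties broken by label so the order is deterministic. */
static int demo_by_count_desc(const void *a, const void *b) {
    const demo_bucket *x = a, *y = b;
    if (x->count != y->count) return (x->count < y->count) ? 1 : -1;
    return strcmp(x->seq, y->seq);
}

/* Count, condense, sort, and copy the num_kept most frequent buckets to `kept`.
   Returns the number of buckets actually copied. */
static size_t demo_top_buckets(const char *dna, demo_bucket *kept, size_t num_kept) {
    uint32_t expanded[DEMO_NUM_SLOTS] = {0};
    demo_bucket condensed[DEMO_NUM_SLOTS];
    assert(dna != NULL && kept != NULL);
    demo_count(dna, expanded);
    demo_condense(expanded, condensed);
    qsort(condensed, DEMO_NUM_SLOTS, sizeof condensed[0], demo_by_count_desc);
    if (num_kept > DEMO_NUM_SLOTS) num_kept = DEMO_NUM_SLOTS;
    memcpy(kept, condensed, num_kept * sizeof condensed[0]);
    return num_kept;
}
```

For the input "ACGACGACGTT" with num_kept = 2, this sketch keeps ACG (seen 3 times) and CGA (seen 2 times). Everything below the cutoff is simply absent from the output, which is the "rounds to zero" effect described above.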
We'll ignore the few remaining nucleotides at the end of a chromosome,
since there might not be enough for a good markov model. We plan to address this
by splitting regions at better markers, like the starts and ends of genes,
rather than at arbitrary cutoffs of a million bases.
One great thing about persisting the markov models like this is that we re-use them
in future runs, so future investigations will be faster. You can also terminate the program
halfway through and not lose all your progress.
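
One way a "done" stamp can make a cache safe to re-use is sketched below. This is a minimal illustration, not the actual .markov file format: the demo_header layout, the "DEMOMKV" magic string, and all demo_ function names are hypothetical. The idea is to write a placeholder header first, append the model data, and only at the very end seek back and mark the header as done, so a reader can tell a complete file from one left behind by an interrupted run. The sketch covers only the whole-file stamp, not how partial progress is resumed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical header layout; the real file format is not shown here. */
typedef struct {
    char     magic[8];   /* "DEMOMKV" plus its terminating NUL */
    uint32_t is_done;    /* 0 while writing, 1 once the file is complete */
} demo_header;

/* (Re)write the header at the start of the file. Returns 0 on success. */
static int demo_write_header(FILE *f, uint32_t is_done) {
    demo_header h;
    memset(&h, 0, sizeof h);
    memcpy(h.magic, "DEMOMKV", 8);
    h.is_done = is_done;
    if (fseek(f, 0L, SEEK_SET) != 0) return -1;
    return fwrite(&h, sizeof h, 1, f) == 1 ? 0 : -1;
}

/* Start a cache file: placeholder header first, model data follows it. */
static FILE *demo_cache_open(const char *path) {
    FILE *f;
    assert(path != NULL);
    f = fopen(path, "wb+");
    if (f == NULL) return NULL;
    if (demo_write_header(f, 0) != 0) { fclose(f); return NULL; }
    return f;
}

/* Finish: go back, stamp the header as done, and close the file. */
static int demo_cache_finish(FILE *f) {
    int rc = demo_write_header(f, 1);
    if (fclose(f) != 0) rc = -1;
    return rc;
}

/* Returns 1 if a complete cache exists at `path`, otherwise 0. */
static int demo_cache_is_complete(const char *path) {
    demo_header h;
    size_t got;
    FILE *f;
    assert(path != NULL);
    f = fopen(path, "rb");
    if (f == NULL) return 0;
    got = fread(&h, sizeof h, 1, f);
    fclose(f);
    return got == 1 && memcmp(h.magic, "DEMOMKV", 8) == 0 && h.is_done == 1;
}
```

A later run could call demo_cache_is_complete() before doing any work and skip generation when it returns 1. A file whose header still says is_done == 0 is treated as unfinished.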