see_chrom_gen_model.c

This file has functions for generating markov models and persisting them to disk.

We cache markov models in a binary format for efficiency. They'll be read later,
by see_chrom_get_distance.

Typical usage:

1) call sv_markov_gen_open()
    this creates an expanded counting structure
    and sets up a binary .markovs file to cache the model.
2) call sv_markov_gen_got() for each nucleotide seen
    this builds up a nucleotide sequence,
    and increments the correct place in the counting structure
    once a full region has been read, sv_markov_gen_finish_region() is called, which:
        copies the expanded counting structure into a condensed one so we can sort it.
        (condensed labels each item with its dna sequence, unlike expanded)

        sorts it.

        writes the num_buckets_kept (default=160000) most frequent entries to our output file,
        which effectively rounds the counts of sequences we didn't see often to zero.

3) call sv_markov_gen_finish()
    stamps the binary .markov file with a header saying it is done.
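
The count / condense / sort / keep-the-top-N flow in step 2 can be sketched roughly
as below. This is only an illustration, not the real implementation: every name in it
(expanded_t, condensed_t, sketch_got, sketch_finish_region, K) is hypothetical, it
assumes the model counts fixed-length sequences of K bases, and it uses a tiny K=3
where the real code presumably uses something much larger. Skipping zero counts and
restarting the window on non-ACGT characters are also assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; not the real sv_markov_gen_* API. */
enum { K = 3, NUM_KMERS = 1 << (2 * K) };   /* 4^K possible sequences */

typedef struct {
    uint32_t counts[NUM_KMERS]; /* expanded: the index is the 2-bit-packed sequence */
    uint32_t window;            /* last K bases, 2 bits each */
    int      filled;            /* valid bases currently in the window */
} expanded_t;

typedef struct {
    char     seq[K + 1];        /* condensed: each item carries its own label */
    uint32_t count;
} condensed_t;

/* Map a nucleotide to 0..3, or -1 for anything else (e.g. 'N'). */
static int base_code(char c)
{
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return -1;
    }
}

/* Feed one nucleotide; bump a counter once the window holds K bases. */
static void sketch_got(expanded_t *e, char c)
{
    int code = base_code(c);
    assert(e != NULL);
    if (code < 0) { e->filled = 0; e->window = 0; return; } /* restart the window */
    e->window = ((e->window << 2) | (uint32_t)code) & (uint32_t)(NUM_KMERS - 1);
    if (e->filled < K) e->filled++;
    if (e->filled == K) e->counts[e->window]++;
}

/* Order by count, highest first; break ties by sequence so the order is fixed. */
static int by_count_desc(const void *a, const void *b)
{
    const condensed_t *x = a, *y = b;
    if (x->count != y->count) return (x->count < y->count) ? 1 : -1;
    return strcmp(x->seq, y->seq);
}

/* Condense, sort, and copy at most `keep` entries to `out`. Returns how many. */
static size_t sketch_finish_region(const expanded_t *e, condensed_t *out, size_t keep)
{
    static const char letters[] = "ACGT";
    condensed_t all[NUM_KMERS];
    size_t n = 0;
    assert(e != NULL && out != NULL);
    for (uint32_t i = 0; i < (uint32_t)NUM_KMERS; i++) {
        if (e->counts[i] == 0) continue;          /* never seen: nothing to label */
        for (int j = 0; j < K; j++)               /* unpack the index into letters */
            all[n].seq[j] = letters[(i >> (2 * (K - 1 - j))) & 3u];
        all[n].seq[K] = '\0';
        all[n].count = e->counts[i];
        n++;
    }
    qsort(all, n, sizeof all[0], by_count_desc);
    if (n > keep) n = keep;                       /* everything past `keep` is dropped */
    memcpy(out, all, n * sizeof all[0]);
    return n;
}
```

A caller would zero an expanded_t, pass each base to sketch_got(), and then call
sketch_finish_region() with a buffer of `keep` entries. The expanded array needs no
labels because the index itself encodes the sequence; the labels only become necessary
once sorting destroys that ordering.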

We'll ignore the few remaining nucleotides at the end of a chromosome,
since there might not be enough for a good markov model. We plan to address this
by splitting regions at better markers, like the start and end of genes,
rather than at arbitrary cutoffs of a million bases.

One great thing about persisting the markov models like this is that we'll re-use them
in future runs. So, future investigations will be faster. And you can terminate the program
halfway through and not lose all your progress.
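
One way a "done" stamp in the header can support this kind of re-use is sketched
below: a later run reads the header first and can tell a finished cache file from an
interrupted one. This is a guess at the general idea, not the actual file format. The
header layout, the "MKV1" tag, and the sketch_cache_* function names are all invented
for illustration, and how the real code resumes a partial file is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical header layout; the real binary format may differ. */
typedef struct {
    char     magic[4];   /* "MKV1", an invented tag */
    uint32_t done;       /* 0 while writing, 1 once finished */
} sketch_header_t;

/* Fill in a header with the invented tag and the given done flag. */
static void sketch_header_init(sketch_header_t *h, uint32_t done)
{
    memset(h, 0, sizeof *h);
    memcpy(h->magic, "MKV1", 4);
    h->done = done;
}

/* Start a cache file: write a header that says "not done yet". */
static int sketch_cache_begin(FILE *f)
{
    sketch_header_t h;
    assert(f != NULL);
    sketch_header_init(&h, 0);
    rewind(f);
    return fwrite(&h, sizeof h, 1, f) == 1 ? 0 : -1;
}

/* Finish: seek back to the start and overwrite the header as done. */
static int sketch_cache_finish(FILE *f)
{
    sketch_header_t h;
    assert(f != NULL);
    sketch_header_init(&h, 1);
    if (fseek(f, 0L, SEEK_SET) != 0) return -1;
    if (fwrite(&h, sizeof h, 1, f) != 1) return -1;
    return fflush(f) == 0 ? 0 : -1;
}

/* Returns 1 if the file starts with a valid header marked done, else 0. */
static int sketch_cache_is_complete(FILE *f)
{
    sketch_header_t h;
    assert(f != NULL);
    if (fseek(f, 0L, SEEK_SET) != 0) return 0;
    if (fread(&h, sizeof h, 1, f) != 1) return 0;
    return memcmp(h.magic, "MKV1", 4) == 0 && h.done == 1;
}
```

With a layout like this, a run that finds sketch_cache_is_complete() returning 1 could
skip regeneration entirely, while a 0 would signal that the file was left unfinished.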