Development process
- Goals:
- Enables inquiry/creativity
- Helps the user analyze the human genome, provides avenues to explore
- Correctness, via the following:
- Refactor methods and structs until the logic is legible
- Automated tests, both low-level tests and integration tests
- Attention paid to boundaries - prevent defects in corner scenarios
- Consistency aids in reducing defects. I designed my objects to be used similarly: different structs have analogous methods, and I use consistent terminology.
- Try to be concise, but don't use clever/confusing tricks to save typing. Balance function length against the ability to follow program logic. Divide the implementation into independent modules, but don't create too many abstraction layers.
- Have clearly defined behavior on error
- I structured the code such that unhandled errors become compile warnings (unused return value).
- In my case, a failed "check_result" will write a log entry and safely bubble upwards.
- Add assertions for defense in depth
- Performance / resource usage
- Important to support creativity: if turnaround time is excessive, it is hard to experiment
- In the modern/cloud world, be mindful of metrics like server cost-of-goods-sold
- Good performance can be useful if porting to platforms like web browser, mobile
A few steps I took to help support the concepts above:
- Invested in writing solid infrastructure (logging, error handling, util methods).
- For example, it's as easy as typing `check_win32` to check win32 errors, and it even automatically shows a string translation. The check system makes error-handling easier and memory leaks more visible (every `open()` is paired with a `close()`).
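A minimal sketch of what such a check system can look like. The names (`result_t`, `check_win32_impl`) and the exact behavior are hypothetical, reconstructed from the description above; the unused-result attribute is what turns an ignored error into a compile warning.

```c
#include <stdbool.h>
#include <stdio.h>
#include <windows.h>

typedef struct result_t {
    bool ok;
    unsigned long code;
} result_t;

/* Callers must inspect the result, or the compiler warns
   (gcc/clang; MSVC's SAL annotation _Check_return_ is similar). */
__attribute__((warn_unused_result))
static result_t check_win32_impl(BOOL api_succeeded, const char *expr,
                                 const char *file, int line)
{
    result_t res = { true, 0 };
    if (!api_succeeded) {
        char translation[256] = "";
        res.ok = false;
        res.code = GetLastError();
        /* automatically show a string translation of the error code */
        FormatMessageA(FORMAT_MESSAGE_FROM_SYSTEM, NULL, res.code, 0,
                       translation, sizeof(translation), NULL);
        fprintf(stderr, "%s failed (%lu: %s) at %s:%d\n",
                expr, res.code, translation, file, line);
    }
    return res; /* the failure is logged here and bubbles safely upward */
}

#define check_win32(expr) check_win32_impl((expr), #expr, __FILE__, __LINE__)
```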
- Avoid global state.
- Parameters should cause focused changes.
- For example, in an earlier version many methods took a boolean `is_testing` that tweaked some settings. I realized that this wasn't good design, and instead added more-precise controls for what the test needed to manipulate.
- It's a bad sign if an object can be used in two different ways or a method has two different 'modes'
- Put limits on potentially infinite loops so they stop if they run an unreasonable number of times
- Use compiler warnings / treat warnings as errors / use tools like valgrind.
- An enum is often better than boolean params
- Even annotating can be ambiguous, e.g. `do_foo(false /* skip the first entry */)`
- Does that mean we are or aren't skipping the first entry?
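A short sketch of the enum alternative (all names hypothetical):

```c
/* With a boolean, do_foo(false) forces the reader to guess.
   An enum makes the call site self-documenting. */
typedef enum foo_entries {
    foo_entries_include_first,
    foo_entries_skip_first
} foo_entries;

static void do_foo(foo_entries entries)
{
    (void)entries; /* ... */
}

static void example_caller(void)
{
    do_foo(foo_entries_skip_first); /* unambiguous at the call site */
}
```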
- Handle Windows newlines; don't assume little-endian byte order without checking
- Don't assume files are valid without checking the version, a marked-as-complete flag, and that they were created with the same params
- Strictly order dependencies (no cycles). Use headers to chain from the highest to the lowest level, one step at a time.
- Use callbacks for controlled dependency injection
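A minimal sketch of that callback style, with hypothetical names: the low-level module reports results through a function pointer plus a context struct, so it never needs to include a higher-level header.

```c
#include <stdio.h>

/* Low-level module: iterates sequences and reports each one through a
   callback, without knowing anything about the caller. */
typedef void (*on_sequence_fn)(void *context, const char *sequence);

static void read_sequences(on_sequence_fn callback, void *context)
{
    static const char *samples[] = { "ACGT", "GGCA" };
    for (int i = 0; i < 2; i++)
        callback(context, samples[i]);
}

/* Higher-level module: injects its own behavior and state. */
typedef struct counter_t { int seen; } counter_t;

static void count_sequence(void *context, const char *sequence)
{
    counter_t *counter = (counter_t *)context;
    counter->seen++;
    printf("saw %s\n", sequence);
}

int main(void)
{
    counter_t counter = { 0 };
    read_sequences(count_sequence, &counter);
    printf("total: %d\n", counter.seen);
    return 0;
}
```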
Process:
- Drew concepts with pencil and paper, sketched what the final visualization might look like. Wanted to use markov models to look for patterns in sequences, then visualize by drawing connections between regions in the human genome.
- Wrote Python scripts to explore human genome data and to learn.
- Other ideas: if most regions have similar models, perhaps sequencing errors could be detected if the data read has significant distance from the expected markov model.
- Online, found that many others have used markov models to analyze DNA. Maybe I can give my project a 3D visualization so that it's at least partially unique.
- Sequences are apparently quite dense. It's common for nearly all possible 9-base sequences to be seen. My Python script showed that after reading 1 billion bases, 261959 of the possible 262144 9-base sequences were seen at least once. This led to a decision to use a memory bitmap instead of a hashmap, as a hashmap would have fewer advantages for non-sparse data.
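A sketch of that kind of dense layout (not necessarily the exact structure used in the project): with 2 bits per base, a 9-base sequence packs into an index from 0 to 262143, so a flat array replaces a hashmap.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { sequence_length = 9, table_size = 1 << (2 * sequence_length) }; /* 4^9 = 262144 */

/* Map a base to 2 bits; returns -1 for anything that isn't A/C/G/T. */
static int base_to_bits(char base)
{
    switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default: return -1;
    }
}

/* Count every 9-base window in `bases` directly into a flat array. */
static void count_sequences(const char *bases, size_t len, uint32_t *counts)
{
    for (size_t i = 0; i + sequence_length <= len; i++) {
        uint32_t index = 0;
        int valid = 1;
        for (size_t j = 0; j < sequence_length; j++) {
            int bits = base_to_bits(bases[i + j]);
            if (bits < 0) { valid = 0; break; } /* skip windows containing N */
            index = (index << 2) | (uint32_t)bits;
        }
        if (valid)
            counts[index]++;
    }
}

int main(void)
{
    uint32_t *counts = calloc(table_size, sizeof *counts);
    if (!counts) return 1;
    const char *region = "ACGTACGTACGTACGT";
    count_sequences(region, strlen(region), counts);
    free(counts);
    return 0;
}
```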
- Decided not to normalize the weight of frequencies when comparing one markov-model 'vector' with another. E.g. instead of comparing the vector <0.21, 0.24, 0.35, 0.20> weighted to sum to 1, I just use the raw counts <84, 96, 140, 80>, with no normalization to a specific sum. Normalizing makes all points seem very distant because rarely seen sequences have just as much of an effect as common ones. Instead I use raw counts, which is still a fair comparison since each region is the same length in bases. My metric now, in essence, just reflects how often each sequence appears.
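For illustration, a distance over raw counts for a single 4-way count vector (the real model compares many such counts per region; names are hypothetical):

```c
#include <math.h>
#include <stdio.h>

/* Euclidean distance between two regions' raw counts,
   e.g. <84, 96, 140, 80> vs <90, 91, 150, 70>, with no normalization. */
static double region_distance(const unsigned a[4], const unsigned b[4])
{
    double sum = 0.0;
    for (int base = 0; base < 4; base++) {
        double diff = (double)a[base] - (double)b[base];
        sum += diff * diff;
    }
    return sqrt(sum);
}

int main(void)
{
    unsigned region1[4] = { 84, 96, 140, 80 };
    unsigned region2[4] = { 90, 91, 150, 70 };
    printf("distance = %f\n", region_distance(region1, region2));
    return 0;
}
```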
- Learned to read .fa format through experimentation in Python scripts
- Realized it'd be hard to look for start/stop codons because of group-of-3 alignment. mRNA would be more aligned, but I'm not sure if there are cases that change the alignment.
- A simple script to count occurrences of bases took 370 seconds in Python and 70 seconds in C on my old laptop. The .fa format isn't a complex format that would be unpleasant to parse in C. Computing vector differences is a cpu-heavy problem that would be relatively slow in Python. Finally, Python extensions can easily be written in C, so a hybrid-language approach in the future (performance-critical work done in C) would be straightforward. So I chose C as the development language.
- Saw that the `vpython` module I used to use can be used online as "glowscript". (Even though a circular diagram would probably be just as informative for showing connections in the genome, it's fun to have something 3D and interactive.)
- Ported experiments to C
- Used my open source `glacial_backup` as a basis; it had cross-platform utils and filesystem wrappers I had invested much time in.
- Divided the problem into 3 steps: 1) markov models for each region, 2) compare regions, 3) draw the visualization
- Used structs as semantic units. Even if a struct only has a simple list as its state, I made it a separate struct, an object-based approach more similar to C++.
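A sketch of that style with hypothetical names: even a thin wrapper around a list gets its own struct and paired open/close functions.

```c
#include <stdlib.h>

/* A semantic unit of its own, even though the state is just a list. */
typedef struct region_scores {
    double *scores;
    size_t count;
    size_t capacity;
} region_scores;

static region_scores *region_scores_open(size_t capacity)
{
    region_scores *self = calloc(1, sizeof *self);
    if (!self) return NULL;
    self->scores = calloc(capacity, sizeof *self->scores);
    if (!self->scores) { free(self); return NULL; }
    self->capacity = capacity;
    return self;
}

static void region_scores_close(region_scores *self)
{
    if (!self) return;
    free(self->scores);
    free(self);
}
```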
- Was doing a lot of typing to write automated tests, so temporarily gave up on freeing memory for small objects like strings (only in automated tests, not product code). The TEST_OPEN macro indicates that it is ok if the object is leaked, and it can only be used in test code.
- Alternated between writing high-level code and lower-level building blocks. Similarly, alternated between making wide swathes of changes and then, hours later, correcting the syntax and typos.
- It's tough to keep terminology consistent as the project evolves. I think I consolidated most of the redundant terms together. One remaining inconsistency is "similarity score"/"distance" which refer to the same concept in the code.
- Speedbumps
- Can't store all sequence/count information for each region; it would use too much disk space. I could possibly not cache it at all and recompute it every time, but that would mean the program would run on the order of hours. Decided to keep the largest 16000 entries in the vector and treat the others as if they were 0.
- Ran into tradeoffs between performance and configurability. The case I thought the hardest about: in test code I wanted a markov sequence length of 2; in product code, 9. The code runs faster if it uses a compile-time constant (immediate instructions, and bit manipulation/shifts by a constant can be optimized). I can't use C++ template metaprogramming in C, xmacros are inelegant, and if there's separate code for product and test it is untenable to keep them synchronized. If test code used a sequence length of 9 it would be hard to write tests and hard to visually see that the test examples are correct. Because performance was still important to me, I made the value of 9 non-configurable by the outside world. For a case where the sequence length is either 2 or 9, instead of using a traditional integer passed in as a parameter, which would be slower (although more configurable by the end user), I can put a branch in key parts of the code and use either the 2 constant or the 9 constant. It still duplicates some logic, but the slowdown is just a branch, which branch prediction will alleviate. I then thought of an even better workaround: for the inner loop where performance matters the most, I'll have the sequence length passed in as a parameter, but make the function inline, so that it can still potentially be recognized as a constant. Compiler directives like `__forceinline` are useful for this.
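A sketch of the inline approach just described (names hypothetical): the helper takes the sequence length as a parameter, but since it is forced inline, the literal 2 or 9 at each call site can still be folded into constants.

```c
#include <stdint.h>

#if defined(_MSC_VER)
#define force_inline __forceinline
#else
#define force_inline inline __attribute__((always_inline))
#endif

/* Inner-loop helper: the shift/mask amounts depend on sequence_length,
   and when the caller passes a literal 2 or 9 the compiler can fold them. */
static force_inline uint32_t next_index(uint32_t index, int base_bits,
                                        uint32_t sequence_length)
{
    uint32_t mask = (1u << (2 * sequence_length)) - 1;
    return ((index << 2) | (uint32_t)base_bits) & mask;
}

/* Product code always uses 9; test code always uses 2. */
static uint32_t next_index_product(uint32_t index, int base_bits)
{
    return next_index(index, base_bits, 9);
}

static uint32_t next_index_test(uint32_t index, int base_bits)
{
    return next_index(index, base_bits, 2);
}
```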
- C might not have been the best choice of language, since the goal is for this to be a one-week project. Concise code is possible, but only after a lot of time spent refactoring and polishing. Maintaining header files slows down development speed. Having to pass around context structs to callbacks and threads, as a poor imitation of closures, is tiresome. I decided, though, to complete what I had started.
- Wanted to have more deterministic sorting, especially so that automated tests do not fail on different platforms, so went through some hoops to have stable sorting. Similarly, used a 3-line custom PRNG since `rand()`/`srand()` behavior can differ across platforms and I want determinism.
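A sketch of such a generator following Park and Miller's "minimal standard" (credited at the end of this page); the exact signature here is illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Park-Miller minimal standard generator: state = state * 16807 mod (2^31 - 1).
   Deterministic across platforms, unlike rand()/srand(). Seed must be nonzero. */
static uint32_t simple_rng_next(uint32_t *state)
{
    uint64_t product = (uint64_t)(*state) * 16807u;
    *state = (uint32_t)(product % 2147483647u);
    return *state;
}

int main(void)
{
    uint32_t state = 1;
    for (int i = 0; i < 3; i++)
        printf("%u\n", simple_rng_next(&state));
    return 0;
}
```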
- When I first created the visualization, it showed all connections simultaneously, by design. I expected fewer clusters of connected regions, but instead I saw a thorny, impenetrable snarl of widespread connections. Fortunately, I soon realized that I could highlight just the connections to/from one chromosome to clarify the picture. Glowscript allows interaction, but instead of taking the time to build a user interface, I took a shortcut and just made the visualization move on its own, showing each chromosome for about 5 seconds. This also saves the step of having to teach the user how to change which chromosome is selected.
- Improved performance (a few examples):
- first wrote a pinning test to have more confidence that changes are not regressions
- added multi-threading (only for cpu-intensive operations); see the sketch below
- create sqlite index only after rows have been inserted
- consider order of members in structs and memory alignment
- used 'expanded' data structure (mentioned elsewhere) instead of hashmap. also very useful in comparing distances, since it can stay stationary (see description elsewhere).
- used benchmarks; cache markov data on disk rather than holding GBs in memory.
- I unrolled a few loops based on benchmarks. It's reasonable to spell out a few key calculations 4 times, one for each base (when getting euclidean distance), since it's only a few tokens and not even one redundant line, and is also covered in tests to protect against typos.
- stayed close to the metal: not much pointer indirection, no virtual methods, encouraged inlining, made very few calls to the os or external libraries, and kept memory access contiguous.
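A sketch of spreading the cpu-intensive step across threads with tinycthread (which mirrors the C11 threads API); the slicing scheme and names are hypothetical.

```c
#include <stdio.h>
#include "tinycthread.h"

/* Each thread builds markov counts for its own slice of regions. */
typedef struct work_slice {
    int first_region;
    int last_region;
    long processed;
} work_slice;

static int process_slice(void *arg)
{
    work_slice *slice = (work_slice *)arg;
    for (int i = slice->first_region; i < slice->last_region; i++) {
        /* ... build the markov model for region i ... */
        slice->processed++;
    }
    return 0;
}

int main(void)
{
    enum { thread_count = 4, region_count = 1000 };
    thrd_t threads[thread_count];
    work_slice slices[thread_count];
    for (int t = 0; t < thread_count; t++) {
        slices[t].first_region = t * region_count / thread_count;
        slices[t].last_region = (t + 1) * region_count / thread_count;
        slices[t].processed = 0;
        thrd_create(&threads[t], process_slice, &slices[t]);
    }
    for (int t = 0; t < thread_count; t++)
        thrd_join(threads[t], NULL);
    for (int t = 0; t < thread_count; t++)
        printf("thread %d processed %ld regions\n", t, slices[t].processed);
    return 0;
}
```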
- My final step was to refine the automated tests as much as is reasonable for a one-week project. I added a high-level test with full .fa input files confirming that it can detect similar regions. Future effort can continue my work breaking down the implementation into isolated, testable units.
Used the following libraries:
- Bstring, from Paul Hsieh
- SQLite, from SQLite team
- tinycthread, from Marcus Geelnard, Evan Nemerson
- glowscript, from David Scherer and Bruce Sherwood
- os_clock_gettime(), from Asain Kujovic
- simple_rng_next(), Park and Miller, October 1988 issue of CACM