Further ideas
- Look at intra-chromosome similarity too.
- Connect the results to function. Ideally, delimit regions by start/stop codons rather than by arbitrary divisions. Read the literature on non-coding regions, and think of a good way to handle N (unknown or ambiguous) bases. Study chromosomes like 21 that seem to have few similarities with any others.
- I repeatedly see an interesting distribution of similarity scores. The scores are more uniform than I would have expected, except for a handful of very similar regions, and the last 1/5th of the scores are very far apart. Relatedly, it is interesting that the similar regions are so widely spread across the chromosomes, even when taking only the top 1000 most similar regions.
- I should get a more intuitive understanding of how the similarity metric changes when I keep only the top 16000 sequences.
- Currently, for maximum performance, the reader does not support input files with more than one > comment line. A better design would be this: if an unexpected comment is seen, fall back to a slower file-reader module that handles more .fa features.
- It might be interesting to also look for similarity with mitochondrial DNA.
- Add more interactivity to the visualization, such as manually selecting a chromosome.
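
The fallback-reader idea in the list above could look roughly like the following. This is only a minimal Python sketch under assumptions, not the project's actual reader: the names `read_single_record_fast`, `read_multi_record_slow`, `read_fasta`, and `UnexpectedHeader` are hypothetical, and the input is assumed to be a seekable text handle.

```python
import io


class UnexpectedHeader(Exception):
    """Raised by the fast path when a second '>' line shows up (hypothetical name)."""


def read_single_record_fast(handle):
    """Fast path: assumes exactly one '>' line, and that it is the first line."""
    header = handle.readline().rstrip("\n")
    if not header.startswith(">"):
        raise ValueError("expected a '>' line at the top of the file")
    parts = []
    for line in handle:
        if line.startswith(">"):
            # More than one record: signal the caller to use the slow path.
            raise UnexpectedHeader(line.rstrip("\n"))
        parts.append(line.strip())
    return {header[1:]: "".join(parts)}


def read_multi_record_slow(handle):
    """Slow path: tolerates any number of '>' lines and blank lines."""
    records = {}
    name = None
    parts = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(parts)
            name = line[1:]
            parts = []
        elif name is not None:
            parts.append(line)
    if name is not None:
        records[name] = "".join(parts)
    return records


def read_fasta(handle):
    """Try the fast reader first; rewind and fall back if it gives up."""
    try:
        return read_single_record_fast(handle)
    except UnexpectedHeader:
        handle.seek(0)  # assumes a seekable handle
        return read_multi_record_slow(handle)


if __name__ == "__main__":
    print(read_fasta(io.StringIO(">a\nAC\n>b\nGT\n")))
```

The fast path keeps its per-line work minimal and pays for the fallback only when a second > line actually appears, at the cost of re-reading the file from the start in that case.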