Next post: Clipcircle

SimpleSourceIndexing

Full IDE environments like Code::Blocks and Visual Studio often will "index" your source code pre-emptively, so that searches are faster. They will keep track of where function and class definitons are, so that when you right-click a symbol and say "go to definition", the result feels almost instantaneous. The index is a noticeable speed improvement when working in a large project with thousands of source files.

I prefer to use a light weight code editor, but I liked this feature. Searching online, it didn't look like anyone had written a plugin like this for SciTE, the code editor I use. Also, I felt in the mood for writing something fast in C.

So, I wrote "SSIP", the simple source indexing processor. Now, if you are working in a C or C++ codebase (even a large one with many thousands of files), you can

  • click on a function or method and press F12 to go to its definition
  • Shift-F12 to go to the definition of a function/method/class
  • Ctrl-F12 to list references to a function/method/class

My primary goal is speed -- I want this to be useful for large codebases. I've claimed this is "fast". What do I mean?

Benchmarks on a 2010 Intel 2.67 GHz machine
Creating initial index, starting with no database present

Entire source of audacity-src-1.2.6
987 source files.
9.351Mb of data indexed.
Execution time: 3.753s
Speed: 263 files/s

Entire source of php-5.3.9
1584 source files.
41.677Mb of data indexed.
Execution time: 6.297s
Speed: 251 files/s

Looks good to me! :) I think I can safely say that most of the projects I work on have less than 40Mb of source code. Remember that these times are for the slowest initial index creation; subsequent runs only look at files that have been changed and run in milliseconds. And searching against this index takes only tens of milliseconds.

On GitHub I have a more full explanation, but here are a few of the more interesting choices along the way:

  • I chose to write in C, eschewing C++ features like classes and exceptions
  • Used prepared queries in SQLite for better performance.
  • Benchmarked a multi-threaded version, but it was not significantly faster because problem is bounded by SQLite's insert speed.
  • Because reading through a single file is fast, I ended up not storing only the filename and not position within the file, the position within the file could be found later.
  • Because I am searching whole words, not partial words, I only needed to store the numerical hash.
  • In fact, by using a 24-bit hash, I could allocate a block of 2^24 bytes of memory and use it as a simple and extremely fast hash table.

ssip's command line syntax:

sssip -s word        (Search for a word)
sssip -start         (Re-build index)
sssip -noindex word	 (Search for word without using index)
sssip -noindexnowhole word (Search for partial word without index)

The source code I've posted on GitHub here, released under the GPLv3.

Update 2016: my new project SciTE-with-Python has searching that is fast enough for all my needs currently. In SciTE-with-Python, F12 Searches for selected term (definition), Shift+F12 Searches for selected term (all references), and Ctrl+F12 Searches for selected term (declaration). Also, it works with Python in addition to C and C++.