Next post: Manually storing files on amazon glacier

A Little Parser

Writing a little parser from scratch

For the early prototypes of my ViperCard project, I used a little parser like this to parse and evaluate expressions. I had always thought of parsers as intense computer-science theory, so it was a good feeling to be able to fight through some small obstacles and write a parser of my own.

Going in, I knew about BNF form for writing parsing rules. I hadn’t read “the dragon book” and only had a vague understanding of PEG vs LL(k). I had the weapon of experience writing recursive code, though: I knew how to use recursion to break down a problem into smaller subproblems. I could look up the rest as needed as I worked.

The approach is to work from left to right and “consume” tokens (popping them from the start of the array) if there is a match. Each doesMatch() method returns null if there is no match, or the array of remaining tokens and the current result tree if there is a match. For example, if I have a parsing rule A --> B + B, I could write a method doesMatchA() that checks if the expression starts with an A by writing:

  • see if we can consume a B by calling doesMatchB()
  • if this does not return a match, return null.
  • otherwise, look at the remaining tokens, if the next token is a +,
  • see if we can consume another B by calling doesMatchB() on remaining tokens
  • if this does not return a match, return null.
  • otherwise, make a results array containing two elements for the results arrays that were returned by doesMatchB earlier.
  • return two things from the method - the results array and the list of remaining tokens.

I can then write doesMatchB() in the same style, until I implement a base case rule like NumLiteral --> numLiteral which just consumes a numeric literal token (any input number like “12” or “123”) or returns null if there isn’t a numeric literal there.

I first wrote an extremely simple “lexer” that turns an input string like "12 + 13" into a list like ["12", "+", "13"].

My code for “parsing” (turning the list into a tree, if it is validly formed):

My code for “evaluating” (recursively collapsing the tree down to a single result):

The demo shows that the order of operations is followed (multiplication before addition). I also support ^ for raising to a power, periods like 0.25 for floating point values, and arbitrarily complex parentheses.

I think the code is pretty intuitive. I could have used reduce when evaluating, but the code wouldn’t have been as readable. The main downside for this approach is that in pathological cases, it can be slow, because there is potentially exponential time complexity due to the backtracking. Parsers with the PEG design don’t have this issue. Also, although there might be slight performance advantages of a hand-written parser, I would recommend using a third party tool like Chevrotain in a real-world project.

A few of the “obstacles” I mentioned: 1) I made sure that unbounded recursion can’t occur because every recursive call either truncates the input or calls downward into a rule “below”. A naive rule like A --> A + B would infinitely call itself (this problem is known as “left recursion”, and there are known was of rewriting grammar rules to avoid it). 2) I refactored/made the approach more elegant by using immutability where possible. For example, to pop the first element of an array, instead of actually modifying the array, I create an entirely new list without the first element, via the slice() method. Conceptually this makes it easier to ‘try’ calling different methods on the array and not have to worry about restoring state if the method call didn’t succeed - because the method is only getting a copy. 3) At first my code failed to support 1 + 2 + 3 because my rule was written A --> B + B. I needed to change the rule to support A --> B + B + B + B... as well.

I know I’m not explaining everything that well, so I recommend reading other tutorials - but maybe my code can be inspiring to make writing a parser feel less intimidating.