The initial working implementation was actually an order of magnitude slower. This turned out to be because I was making a dumb excess copy of a string for every token in the input program, then destroying it immediately; by simply operating on a subset of the original input string, I eliminated this wastage and dropped things back to sane levels.
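For the curious, here is a minimal sketch of the "don't copy every token" idea. It is written in Python purely for illustration and is not the actual lexer code; the names `Token`, `tokenize`, and `token_equals` are made up for this example, and the whitespace-only splitting rule is an assumption that keeps it short. Each token is recorded as a pair of offsets into the original input instead of a freshly allocated string.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    start: int  # offset of the token's first character in the source
    end: int    # offset one past the token's last character


def tokenize(source: str) -> List[Token]:
    """Split on whitespace, recording offsets instead of copying substrings."""
    tokens = []
    i = 0
    n = len(source)
    while i < n:
        if source[i].isspace():
            i += 1
            continue
        start = i
        while i < n and not source[i].isspace():
            i += 1
        tokens.append(Token(start, i))
    return tokens


def token_equals(source: str, tok: Token, text: str) -> bool:
    """Compare a token against `text` in place, without slicing the source."""
    same_length = (tok.end - tok.start) == len(text)
    # startswith() with explicit bounds checks the range source[start:end]
    # directly instead of materializing a substring first.
    return same_length and source.startswith(text, tok.start, tok.end)
```

A token's text can still be materialized with `source[tok.start:tok.end]` when something genuinely needs its own copy, but the hot path never has to.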
Unfortunately, even after that fix, the lexer still slowed the parser down by a few milliseconds on the 20KB test program. Eventually I tracked that down as well - turns out I was doing some redundant lookahead trying to be clever, when the lexer had already obviated the need for byte-level lookahead entirely. Removing the pointless lookahead improved parse times accordingly. The test case edged down to around 17.5ms, which is about 1ms faster than without the lexer.
Of course, I'm actually doing this on a 2MB input file and not the 20KB original test case, because that's the only way to get enough data to make profiling runs worthwhile. So in reality, parses are in the 1.7 second range.
The upside is that the backtracking done on the byte level was masking a lot of inefficiencies at the higher grammar level, mainly to do with optional chunks of code. For example, an "if" statement may be followed by any number of "elseif" statements and an optional trailing "else". Expressing this naively in the grammar is really slow, because it involves a lot of backtracking: "ok, I have an if... now what's next? Uh oh, what's next isn't an else! So fail that, and try again to just match the if by itself..." and so on.
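To make the if/elseif/else case concrete, here is a rough sketch of the non-backtracking shape, again in Python and again purely illustrative. This is a hand-written routine, not the real grammar or parser code; `Cursor` and `parse_if_chain` are hypothetical names, and conditions and bodies are squashed down to single tokens to keep it short. The point is that the "if" gets matched exactly once, and each optional piece is decided by peeking at one token rather than by failing a longer rule and starting over.

```python
from typing import List, Optional, Tuple


class Cursor:
    """A read position over a list of already-lexed tokens."""

    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Optional[str]:
        """Return the next token without consuming it, or None at end of input."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self) -> str:
        """Consume and return the next token."""
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return tok

    def expect(self, text: str) -> None:
        """Consume the next token, which must equal `text`."""
        tok = self.take()
        if tok != text:
            raise SyntaxError(f"expected {text!r}, got {tok!r}")


def parse_if_chain(cur: Cursor) -> Tuple[List[Tuple[str, str]], Optional[str]]:
    """Parse: if COND BODY (elseif COND BODY)* (else BODY)?

    The cursor only ever moves forward; nothing is re-parsed.
    """
    cur.expect("if")
    cond = cur.take()
    body = cur.take()
    branches = [(cond, body)]

    # Zero or more elseif clauses, each chosen by a one-token peek.
    while cur.peek() == "elseif":
        cur.take()
        cond = cur.take()
        body = cur.take()
        branches.append((cond, body))

    # Optional trailing else, also chosen by a one-token peek.
    else_body = None
    if cur.peek() == "else":
        cur.take()
        else_body = cur.take()

    return branches, else_body
```

The naive version tries "if plus elseif plus else" as one alternative, fails, and then re-matches the same "if" under a shorter alternative; here a bare "if" costs exactly two failed peeks and no rewinding.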
I'm culling these dumb inefficiencies one at a time, and so far things are looking good. As of this writing, I'm down to 1.65 seconds on the 2MB file. (The 2MB file is roughly 100 times the size of the 20KB test, so that scales down to about 16.5ms. Note that due to the constant overhead of spinning up the parsing system, the actual runtime on a real 20KB input is closer to 18ms than 16.5ms, but the gains on larger inputs are definitely worth it.)
And now, more profiling! Yayy!