50x faster parsing of timeseries
or: just what you need, and nothing else
For the last year and a half at work, I’ve been optimizing the performance of our in-house dashboarding system for timeseries data — an equivalent of Grafana, if you’re familiar. Today I’d like to tell the story of one optimization I made, and how it has changed the way that I think about writing software.
But first! Since we’re talking about performance optimization I feel obligated to get up on my soapbox and say: Don’t try to optimize your code until you’ve looked at a profiler! Performance optimization is a sharp tool. Don’t play with it when the lights are out. Okay, down off my soapbox now; let me show you some cool stuff.
My goal with this work was to improve the load time of our dashboards. That’s the time from when you hit enter in the browser’s URL bar until all the graphs on the page have all their data drawn to the screen.
After doing some profiling, I’d discovered that parsing the timeseries data we were getting from the backends took 40% of the total page load time. In one case that parsing was taking 20ms per chart, or ~1000ms for the whole page. The total amount of time would vary depending on the size of the dashboard and amount of data being loaded, but we’ll use those numbers as a reference for later improvements.
Depending on your background and your standards for your tools, that might sound like a lot of time or like nothing at all. How should we feel about 20ms? To quote Dick Sites:
Someone walks into my office and says “My program is too slow.” After a pause, I ask “How slow should it be?”
Let’s do the math. One of our charts is showing 50 timeseries, each of which has 60 points. Each point consists of an 8-byte floating point value and an 8-byte timestamp. In total, then, a chart’s data must be at least 50*60*16B ~= 47KiB. If our wire format were perfectly arranged such that we could just move the data from the response into an array, we would expect to be limited by memory bandwidth. The famous Latency Numbers Every Programmer Should Know by Jeff Dean and Peter Norvig quote 1MB of sequential memory reads as 250µs, so we’re looking at a best case of 250µs / 1MB * 47KiB ~= 10µs. Our real parsing takes 20,000µs instead! That gives us 2000x room for improvement.
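Spelled out as code, with the same numbers as above (nothing new here, just the arithmetic laid out):

// Back-of-envelope for one chart: 50 series × 60 points, 16B per point.
const payloadBytes = 50 * 60 * 16;                 // 48,000 B ≈ 47 KiB
const usPerMegabyte = 250;                         // sequential read, per Dean/Norvig
const bestCaseUs = payloadBytes / (1024 * 1024) * usPerMegabyte;  // ≈ 11 µs
const measuredUs = 20 * 1000;                      // what the profiler showed
const headroom = measuredUs / bestCaseUs;          // ≈ 2,000×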
So what the heck are we trying to parse, then? A look at the network tab shows that a single point is encoded as this lovely JSON blob:
[{
  "3": 3.141592653589793116,
  "1": [{
    "1": "1745311200"
  }]
}]

Some digging around in the codebase turns up this protobuf message definition:
message Point {
  Timestamp timestamp = 1;
  string string_value = 2;
  double float_value = 3;
}

message Timestamp {
  int64 seconds = 1;
  int32 nanos = 2;
}

Another (internal) tool called JSPB then produces JSON serializers from these messages. Parsing is handled by a JSON.parse(), followed by the JSPB-generated parser walking the JSON tree to produce an object tree that looks more like the proto messages, followed by our code copying the data once more into a third format. There’s a lot to dislike here!
The point is 53B (minified), instead of the optimal 16B.
Each message is encoded as an array with an object inside of it.
There are two nested messages, instead of just one flat one.
The int64 seconds field is encoded as a string. This is unfortunately necessary if you’re putting it in JSON, because the JS number type is a 64-bit float, which can only represent up to 53-bit integers accurately, and JSON doesn’t support bigints. (A quick demonstration follows this list.)
The floating point value is encoded in 20 bytes of ASCII decimal, because JSON, instead of 8 bytes of binary.
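To make the int64 point concrete, here’s the precision cliff in a JS console (the exact value is just an illustration):

// JS numbers are IEEE-754 doubles: integers are only exact up to 2^53 - 1.
Number.MAX_SAFE_INTEGER;       // 9007199254740991
Number("9007199254740993");    // 9007199254740992 (silently off by one)
// ...so a full int64 has to travel through JSON as a string.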
We could keep ragging on it, but at this point it’s clear: we can do better!
Let’s try the simplest possible encoding. As described above each point is an 8B timestamp and an 8B floating point value. We’ll encode these directly on the wire back-to-back in little endian, so that our “parser” can just read them directly. In the header we’ll put a 2B version number for the format so that we can upgrade it later, a 2B point type to choose between double- or string-valued points, and a 4B count of the number of points in the message. Together that looks like this:
+------------+------------+-------------------------+
| 2B version | 2B type    | 4B num points           |
+------------+------------+-------------------------+
| 8B timestamp                                      |
| 8B float value                                    |
| ... N times                                       |
+---------------------------------------------------+

We’ll also support a version with string-valued points, where the 8B float value is replaced with a 4B string length and 4B string offset into a data block that lives at the end of the message. Including error checking and comments, a JS parser that handles both versions is about 60 lines of code.
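The real parser isn’t reproduced here, but a minimal sketch of the double-valued path, assuming the layout above (the function name, field offsets aside from the layout, and the type enum values are made up), looks roughly like this:

// Minimal sketch, not the production parser: double-valued points only,
// little-endian, `buf` is an ArrayBuffer containing exactly one message.
function parsePoints(buf) {
  const view = new DataView(buf);
  const version = view.getUint16(0, /* littleEndian= */ true);
  const type = view.getUint16(2, true);        // say, 0 = double, 1 = string
  const numPoints = view.getUint32(4, true);
  if (version !== 1 || type !== 0) throw new Error('unsupported message');

  const points = new Array(numPoints);
  let offset = 8;
  for (let i = 0; i < numPoints; i++) {
    const timestamp = view.getBigInt64(offset, true);  // 8B timestamp
    const value = view.getFloat64(offset + 8, true);   // 8B float value
    points[i] = { timestamp, value };
    offset += 16;
  }
  return points;
}

The string-valued variant would read a 4B length and 4B offset here instead, and slice the string out of the data block at the end of the message.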
So how did we do? A run in prod shows this is 10x faster when both are compiled and minified, and 30x faster when running in a debug build. Great!
…but we can do better.
We’re encoding an 8B timestamp for every point. Our timeseries backend guarantees that the data it produces is evenly spaced in time, as long as there are no gaps. We can take advantage of this by encoding only the timestamp of the first point, and the time between consecutive points. We can indicate a gap by inserting points with a value of NaN. Our new format looks like this:
+------------+------------+-------------------------+
| 2B version | 2B type    | 4B num points           |
+------------+------------+-------------------------+
| 8B start timestamp                                |
| 8B time between points                            |
+---------------------------------------------------+
| 8B float value                                    |
| ... N times                                       |
+---------------------------------------------------+

This packed format is more efficient for data without gaps. The sparse format is better for data with a lot of gaps. We can get the best of both worlds by having the serializer check on the fly which version would be faster to decode, and encoding to the faster-to-parse format.
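For completeness, a similarly hedged sketch of decoding the packed layout, with the timestamps reconstructed from the start and step (again, the names are invented):

// Sketch of decoding the packed (evenly spaced) format above.
// Header: 2B version | 2B type | 4B num points | 8B start | 8B step.
function parsePackedPoints(buf) {
  const view = new DataView(buf);
  const numPoints = view.getUint32(4, /* littleEndian= */ true);
  const start = view.getBigInt64(8, true);      // timestamp of the first point
  const step = view.getBigInt64(16, true);      // time between consecutive points

  const points = new Array(numPoints);
  for (let i = 0; i < numPoints; i++) {
    points[i] = {
      timestamp: start + step * BigInt(i),      // synthesized, never on the wire
      value: view.getFloat64(24 + i * 8, true)  // gaps show up here as NaN
    };
  }
  return points;
}

The serializer’s “check on the fly” can then be as simple as counting how many NaN gap points the packed form would need and picking whichever encoding comes out smaller.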
Measuring again, this is now 16x faster than the original when both are compiled and minified, and 50x faster in debug builds.
There’s certainly more we could do to improve parsing: NaN-boxing a number of gap points; run-length-encoding for timeseries with constant values; adjusting the rest of our frontend code so that we can reduce the number of allocations for our list of points; etc. But we should stop here.
Our original goal was not to optimize parsing, but to optimize the page load. Now that we’ve improved parsing by 16x in prod, it’s no longer a big fraction of the total page load time. We’re at the point of diminishing returns, and our optimization effort is better spent elsewhere in the system.
So let’s celebrate!
Our original page that took 1000ms to parse now takes just 62ms.
Parsing the points now takes less time than parsing the rest of the metadata about the timeseries.
Where parsing took 40% of the page load before, it now takes just 4%.
Why is the new format faster? Are there insights that we could apply more generally?
I have some hypotheses as to why. Each of these is a true statement about the code that would affect the performance, but I’m not sure what the effect size is in each case.
The new code copies the data just once, instead of three times.
The new code does just one allocation per point instead of 7+ (again, it would be great to dodge this allocation too).
The new code is MUCH easier to branch predict.
The old code has to do JSON parsing plus a tree walk. The JSON parser has at its core a switch statement that checks what the next token is. That’s inherently unpredictable for the CPU!
The new code in contrast should give <= 3 branch prediction failures for the whole timeseries, outside of allocations. Those would be when checking whether to parse string values or float values, and at the start and end of the per-point for loop.
The new wire format is smaller, and so may have better cache locality.
Those are good technical reasons. But I think there’s a more general reason why the new code performs better: It does exactly what we need, and nothing more.
Protobuf and JSPB are great tools. They solve a general, hard problem well:
they auto-generate serializing and parsing code
for complex, deeply nested messages
where the schema changes often
and the serializer and parser might be using different schema versions.
But each of these features has a cost, and in this case we just don’t need them.
Writing serializing and parsing code once didn’t cost that much: just ~80 lines of code;
points are simple, and don’t nest;
the schema hasn’t changed in 4-5 years;
and given that cadence, we can manage release of new versions manually.
More broadly, it’s important to recognize that convergence, i.e. applying a common tool to your problem, is a double-edged sword. It has some great, well-written-about benefits:
many tools will support the common thing
there’s a team working just on that thing
etc.
But it has serious downsides that are underappreciated in my corner of the tech world:
it rarely does exactly what you need, and adapting it has a cost
it often does something else too that you still have to pay for
you have to learn how to use it
changing its behavior is hard, because you can’t afford to break other users
and your mileage may vary, but often the team working on the tool won’t fix your bugs and may not accept your patches because you’re just one user, likely using it in a strange way they didn’t predict.
Neither approach is a perfect fit for every situation. Like most things in software, buy versus build is a choice with tradeoffs that you need to evaluate critically yourself. But please don’t assume that because someone else wrote it, it’s the best fit for your problem. Don’t be afraid to make something new.


