Protobufs With Go
Faster and smaller - two important words when dealing with data.
I have been meaning to try Google’s Protocol Buffers (protobuf) with Go for quite a while. Structured data that’s smaller when serialized and faster to load, plus code generation - what’s not to like? I tried out protobufs on some quake data and was surprised by how much smaller and faster the protobuf version was: 35 times smaller and 180 times faster to unmarshal than the original data.
The code for my test is available at https://github.com/gclitheroe/exp
The source XML file is a SeisComPML (XML) event file. It’s data for this quake. The same data is available in QuakeML format here. The QuakeML format is created by transforming the SeisComPML on the fly, and I’m interested in speed, so for this experiment I’ve started from the SeisComPML.
SeisComPML represents data for the entire process of locating an earthquake. I’m interested in displaying only part of this information. I’ve modeled the information I want in the file protobuf/quake/quake.proto. I’ll call this format Quake to differentiate it from the SeisComPML. Deciding which information I wanted and how to structure it was by far the most time consuming task.
From the quake.proto file I can use the protobuf compiler with Go support to generate Go code. Code for other languages including Java, Objective-C, and Python can be generated from the same quake.proto file.
Compiling the Go code for the quake protobuf looks like:
protoc --proto_path=protobuf/quake/ --go_out=quake protobuf/quake/quake.proto
I can then add some funcs to unmarshal the SeisComPML, remap it to my Quake protobuf and save it to disk. I’ve also output XML and JSON versions of the Quake file for comparison. There are tests to generate the files:
go test ./quake ./seiscompml07
ok github.com/gclitheroe/exp/quake 0.041s
ok github.com/gclitheroe/exp/seiscompml07 0.050s
Size (bytes) | File Name | Format |
---|---|---|
495917 | seiscompml07/etc/2015p768477.xml | SeisComPML (XML) |
113830 | quake/etc/2015p768477.xml | Quake (XML) |
99615 | quake/etc/2015p768477.json | Quake (JSON) |
14181 | quake/etc/2015p768477.pb | Quake (protobuf) |
There is a significant drop in file size going from the SeisComPML to my Quake format as XML. This is not surprising as I’ve omitted most of the entity mapping (publicIDs) and creation information as well as some amplitude information from the original SeisComPML. The protobuf Quake file is 35 times smaller than the corresponding SeisComPML file. This drop in size will lead to a large improvement in disk I/O and network transfer times.
There are benchmark tests that unmarshal SeisComPML and the Quake files. The benchmarks unmarshal data from byte slices to avoid any bias from I/O. Unmarshalling the Quake protobuf is over 180 times faster than unmarshalling the complete SeisComPML: 0.163473 ms per operation versus 30.269773 ms. The Quake protobuf is also faster to unmarshal than the corresponding XML or JSON files. There is a Go benchmark test:
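To make the size comparison concrete, here is a small sketch that works out how many times smaller each file is than the SeisComPML source. The byte counts are copied from the table above; the `file` type and `ratio` func are names I’ve made up for this illustration.

```go
package main

import "fmt"

// file holds one row of the size table above.
type file struct {
	format string
	size   int // bytes
}

// files lists the sizes reported in the table, largest first.
var files = []file{
	{"SeisComPML (XML)", 495917},
	{"Quake (XML)", 113830},
	{"Quake (JSON)", 99615},
	{"Quake (protobuf)", 14181},
}

// ratio returns how many times smaller other is than base.
func ratio(base, other int) float64 {
	return float64(base) / float64(other)
}

func main() {
	base := files[0].size
	for _, f := range files[1:] {
		fmt.Printf("%-17s %6d bytes, %.1f times smaller than SeisComPML\n",
			f.format, f.size, ratio(base, f.size))
	}
}
```

This gives roughly 4.4, 5.0, and 35.0 for the Quake XML, JSON, and protobuf files respectively.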
go test -bench=. ./quake ./seiscompml07
ns/op | File Name | Format |
---|---|---|
30269773 | seiscompml07/etc/2015p768477.xml | SeisComPML (XML) |
8545983 | quake/etc/2015p768477.xml | Quake (XML) |
1800593 | quake/etc/2015p768477.json | Quake (JSON) |
163473 | quake/etc/2015p768477.pb | Quake (protobuf) |
It’s not really surprising that binary data is smaller and faster to work with than XML. I was a little surprised by how much faster the protobuf is. I’m also stoked with how little effort it took to make this gain. Coupled with the easy code generation, protobufs look like an approach worth investigating further.