Protocol Buffers taken to their Simplest Form

Posted on 2020-04-12

In this post I am going to design a data serialization format heavily based on Protocol Buffers (protobufs). The main difference is that I am going to try to simplify the encoding and explore the tradeoffs.

What are Protocol Buffers?

Protocol buffers are a number of things. Primarily they is a binary encoding format for structured data, however they also consist of a language to define the schema, a compiler to generate serializers and deserializers in multiple languages as well as a couple other related features.

In this post I will only discuss the binary encoding, all of these other tools could be adjusted to use this binary format if desired (although the wins are unlikely to be worth the migration cost).

What are the features of Protocol Buffer Encoding?

The encoding of protocol buffers is already quite simple. Its key features are:

Relatively compact encoding.
Fast to encode and decode.
Forwards and backwards compatibility with different message versions.

Notable missing features are:

No support for reading unknown messages. (The schema is required to decode)
No support for directly serializing a primitive value. If you want to do so you must wrap it in a message.

One interesting observation is that the compatibility requirements and lack of support for messages of unknown schema are somewhat at odds. One could think of different versions of messages as messages with a partially unknown schema. However what Protocol Buffers guarantee is that you can read the message fields present in the union of the two versions. So fields that the reader doesn’t know about will be ignored, and fields that the writer didn’t know about will not be present. (It is up to the reader to handle this gracefully).

One other interesting feature is that the reader can preserve unknown fields. Imagine that you have an authentication proxy in front of a protocol buffer based service. Preserving unknown fields means that an old version of the authentication proxy would be able to read and modify known fields of the request, without dropping unknown information before proxying the request to the backend. (Of course this could also be dangerous if the new fields are security relevant!)

We are going to preserve all of these features (and amplify the downsides) in our new encoding.

How are Protocol Buffers Encoded?

There is good reference documentation available. But I will give a quick overview here.

As mentioned above Protocol Buffers only supports serializing messages. As such the top level entity in the encoding is the message. A protobuf message is a list of fields. In protobuf each field is assigned an integer “tag”. This tag is used in the serialization (not the field name). This makes encoding quick and compact.

So for a message such as:

message Question {
  optional string text = 1;
  optional int32 points = 2;
}

It will be encoded as:

Field 1: data
Field 2: data

The field header is formatted as a varint of value $tag << 3 | $type. For text the tag is 1 and the type is 0b010. So the encoded value will be 0b00001010. For points the tag is 0b10 and the type is 0 so the header will be 0b00010000. Note that if the tag is greater than 15 then the header will be multiple bytes long.

The type value is the “wire type” of the field and can be found in this table.

After the field header is the field data based on the wire type of the field. For string that is just the length as a varint, then the utf8 bytes, and for int32 it is a varint.

For this example message:

text: "What is your quest?"
points: 1

It will be encoded as follows:

00000000: 0a13 5768 6174 2069 7320 796f 7572 2071  ..What is your q
00000010: 7565 7374 3f10 01                        uest?..

Encoding Format Design

The key design point of the encoding format is that you can skip over unknown messages. This is why we have a “wire type” in our schema-required format. This provides the decoder just enough information to work out the length of unknown fields, then it can ignore it or save it for when the message is re-encoded.

My Simplification.

My idea was to remove the wire type. Instead we will just encode the length of the value into the header.

Pros:

Simpler, there is only one case.
More compact, less cases means less entropy, we can put some of the length into the header.

Cons:

Messages are even less readable without a schema.
No possibility for special-case optimizations.

My format

The field header is 0bTttttLll where the ts are the tag and the ls are the length of the data. In order to support tags and lengths greater than fit directly into the header these are considered the beginning of a varint. If the T field is set then there will be the remaining varint bytes directly after. If the L byte is set the rest of the length varint will follow (after the tag varint if any).

This leads to a very similar encoded message:

00000000: 0b04 5768 6174 2069 7320 796f 7572 2071  ..What is your q
00000010: 7565 7374 3f11 01                        uest?..

Comparison

Let’s compare the differences of various field types encoded.

Previously “length-delimited” Fields

(now all of our fields are “length-delimited”)

Our encoding for “length-delimited” messages such as strings, bytes and embedded messages is very similar. The primary difference is that we embed 2 bits of length inside the header byte. This means that certain data lengths will be a byte shorter. Most notably this includes 0-3 byte content which is likely common (empty messages and strings).

Integer Fields

The more flexible header allows us to avoid using varints to encode integers. This should provide some minor encoding and decoding performance benefit. It also avoids the need for ZigZag encoding and makes negative non-signed integer types much smaller. However the size benefit for regular positive integers is not completely clear, for smaller numbers our new encoding is smaller due to the two “bonus” length bits in the header, however for some larger integers our encoding is larger because we allocate a whole byte of length bits as soon as we exceed the two in the header were-as the varint encoding allocates the length bits as it needs them.

Size to Encode Integer Field

This shows the number of bytes to encode the field (includes the header for a field with a small tag).

Integer Encoding Size Table

This has the raw data. Note that the rows are not evenly spaced in any way, they just represent the values where the encoding size changes for one or both formats.

Threshold	Expression	Ours	Protobuf
0	2⁰ - 1	1	2
127	2⁷ - 1	2	2
255	2⁸ - 1	2	3
16383	2¹⁴ - 1	3	3
65535	2¹⁶ - 1	3	4
2097151	2²¹ - 1	4	4
16777215	2²⁴ - 1	4	5
268435455	2²⁸ - 1	6	5
4294967295	2³² - 1	6	6
34359738367	2³⁵ - 1	7	6
1099511627775	2⁴⁰ - 1	7	7
4398046511103	2⁴² - 1	8	7
281474976710655	2⁴⁸ - 1	8	8
562949953421311	2⁴⁹ - 1	9	8
72057594037927935	2⁵⁶ - 1	9	9
72057594037927935	2⁵⁶ - 1	9	10
9223372036854775807	2⁶³ - 1	10	10
18446744073709551615	2⁶⁴ - 1	10	11

You can see that our new encoding wins at small sizes because putting the length inside the header allows us to use the full 8 bits of the first byte where Protobuf can only use 7. However at 2²⁴ our encoding spills over to a dedicated length byte putting us behind Protobuf. However our more efficient non-varint format catches up at 2⁵⁶ and then is more efficient for all larger numbers.

Fixed-width Fields

This is one place where our new encoding falls short of protobuf. Protobuf has a dedicated wire type for 64 bit values. This is used both for its fixed 32 and 64 bit integer types as well as its float and double types. Our encoding requires a 2 bytes header for both 32 and 64 bit values whereas protobuf can fit them in a single byte.

It is hard to discuss the impact of this, it depends on what you are using the fields for. If you frequently pass floating-point numbers around or randomly distributed integers your messages will take more space on the wire. It would take some real-world profiling to identify the impact for your application.

Possible Tweaks

I chose a very simple 5/3 split of the header byte for tag and length. Depending on your use case it may make sense to use a different value. The split allows tags up to 15 to be “inline” and lengths up to 3. If you usually have small messages (or the higher tag fields are rare) it may be beneficial to give more bits to the length. However this is almost impossible to guess without profiling.

Another tweak is using a different form of varint for the header values. For example instead of using the top bit to mean “this integer continues” you could reserve a couple of the top values for fixed numbers of subsequent bytes. For example using 4 bits for the length, 15 means an 8 byte length follows, 14 means a 4 byte length follows, 13 2 bytes, 12 1 byte and 11 or lower is an inline length. Notably this scheme allows you to store 64 bit values without an additional length byte (solving the fixed-length problem above for another tradeoff).