Nested Encoding Efficiency
Posted
It’s relatively common that you want to send data over a restricted channel. For example a URL path can’t contain a ?
or a null byte. A common solution is using some encoding format to convert your data into an acceptable string.
Base64
Base64 is a very common format. It can convert arbitrary binary data into an alphabet of 64 characters (plus an extra character for padding if desired).
b64_encode(['a', '?', 0, 'b', ' ', 0xFF]) == "YT8AYiD/"
With Base64 the output is always longer than the input (other than the empty string). In fact the output length is very easy to predict.
# Assumes no padding.
def b64_length(in: bytes) -> int:
complete_chunks = len(in) // 3
remainder = len(in) % 3
complete_encoded = complete_chunks * 4
remainder_encoded = 0 if remainder == 0 else remainder + 1
return complete_encoded + remainder_encoded
The remainder is at most 3 bytes, so we can say that as the input data gets bigger the ratio approaches 4/3. Therefore, encoding data as Base64 will make it more or less 1/3 larger.
Base64 takes a very simple approach to encoding, it takes 3 input bytes and maps that to 4 output bytes. Rinse and repeat until you are done (with a little special logic for any remainder that doesn’t fit into a 3 byte chunk). This is simple, quick and predictable, but the output looks like nonsense cat
becomes Y2F0
and hello
becomes aGVsbG8
. What if you want human-readable strings to remain human-readable?
C String Escaping
C-style string escaping is very simple. You have some set of “allowed” characters which are input literally, and anything else needs to be “escaped”.
c_encode(['a', '?', 0, 'b', ' ', 0xFF]) == "a?\x00b \xFF"
Any non-allowed bytes are replaced by a backslash-x (\x
) then the hex representation of the byte. In practice most implementations have shortcuts for common characters. For example \\
for a backslash and \0
for a null byte. It is easy enough to read and write by hand.
So what is the efficiency of this encoding? Well… it depends. If you encode only allowed characters it has a ratio of 1, if you encode only characters with shortcuts it is 2, but if you encode only disallowed characters with no shortcuts it is 4! Unlike Base64 the encoding efficiency is data-dependant.
But what I find interesting is the scaling of nested encoding. For example imagine that I want to encode a single null byte then encode the result, then encode the result again…
\0
(2 bytes)\\0
(3 bytes)\\\\0
(5 bytes)\\\\\\\\0
(9 bytes)\\\\\\\\\\\\\\\\0
(17 bytes)\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\0
(33 bytes)
Repeated encoding tends toward doubling in size! That doesn’t seem optimal. You would hope this case is rare but if I had $1 for every time I saw JSON encoded inside a JSON string I could buy myself something nice.
Percent Encoding
Percent Encoding is most commonly used in URLs. It is similar to C string escaping except that it doesn’t have a shortcut for the metacharacter itself.
percent_encode(['a', '?', 0, 'b', ' ', 0xFF]) == "a?%00b %FF"
percent_encode(['%']) == "%25"
It is also data-dependant. Its size ratio varies between 1 and 3. However, it doesn’t have as big of a problem with nested encoding.
%00
(3 bytes)%2500
(5 bytes)%252500
(7 bytes)%25252500
(9 bytes)%2525252500
(11 bytes)%252525252500
(13 bytes)
Since the encoding doesn’t add additional characters that need escaping the growth on every iteration the growth is a constant factor based on the number of characters in the initial string that require encoding instead of doubling each time.
Can we do better?
It is easy to make a minor improvement. In C-style encoding using \.
instead of \\
to represent \
would avoid the problem. Percent Encoding could be slightly optimized by adding shortcuts for common characters (for example %p
instead of %25
) at the cost of complexity and performance.
Metacharacter switching
One idea is a variable metacharacter. For example imagine encoding "'\
as \"\'\\
. Imagine that we could then do \1\"\'\\
to say that the metacharacter is now 1
instead of \
. This would avoid adding characters to the rest of the string, in this example it already saves one character.
The main downside is that you need to pick one of the following evils:
- Give up streaming support.
- Allow the metacharacter switch to occur anywhere in the stream.
- Accept that you may not pick the best metacharacter.
Not terrible but not great and not pretty.
Encoding Level
Another idea is declaring the “encoding level” at the start of the message. Upon encountering this the decoder just decrements the counter and passes though the data unchanged.
For example imagine a modification of base64.
Y2F0
=1=Y2F0
=2=Y2F0
=3=Y2F0
This has basically the same problems as the first approach and requires you to add another character to your alphabet or require that the nesting level is always declared (which adds overhead for non-nested encodings).
Research?
I tried looking around for a bit but couldn’t find any other discussion on this. I’m probably using the wrong keywords. If you are aware of any encoding systems that have considered this problem please let me know.