Dictionary Compression

A certain 5-letter word game has gotten really popular recently. A peek at the source shows an interesting choice for implementing the dictionary of allowed guesses and the answer list.

  1. The list of secret words is just sitting in the code; the word is picked based on the user’s local date.
  2. The dictionary of allowed guesses is a plain list in the JS.
  3. The answers are just another, separate list.

The first point is clever. It means that they don’t need to update the backend every day (or have logic on the backend) and it automatically switches at midnight for each user, not some random time of day based on the server’s midnight.

The other two surprised me. I would have expected at least some simple domain-specific compression. Also, I would have expected the answers list to just be pointers into the dictionary, rather than a separate list (which requires the validation code to check both lists every time you submit a word).

I thought to myself how I would implement this, and from there started wondering about how to optimize it. A post about compressing the dictionary to fit on a Game Boy then added fuel to the fire, and eventually I cracked. I spent a fun day trying different things, and here are my findings. All the code backing this study is open source and can be viewed here.

Note: For this analysis I am using this archive of the original Wordle. The dictionary and answers have changed over time. I don’t expect any changes to make much difference to this analysis. For all general-purpose compression I am using zstd since it is a very high quality compressor.

Requirements

My requirements are pretty simple. I’ll assume a browser target, and I’m going to see how best to get the data there. I’m going to assume I can send binary data, and I’ll also approximate decoder size to ensure that I am not just moving data into the decoder. However, in practice, this wasn’t an issue for any of my solutions.

Measuring code size is very difficult due to WASM overhead and Rust error-handling overhead. To normalize for this I take the difference between the smallest implementation (bitmap at 1347 B) and the same implementation built with no possibility of panic (234 B), and subtract that difference from every measurement. This means that the code sizes seen in this post will be 1113 B less than the actual compressed size. This should roughly account for the overhead of the Rust panic infrastructure. I think this makes sense because unless you are implementing just this one function in WASM you will have to pay this cost anyway, and it isn’t storing any domain-specific data. I also didn’t feel like implementing each one with unsafe to avoid the chance of panic due to slice bounds checking. This is not very scientific and mostly makes sure that no implementation results in a huge code size.

I also want to ensure that the encoding is feasible for interactive use. I’ll target 50 ms to leave lots of room in the 100 ms threshold for other work. In practice this isn’t an issue for any of my solutions, by a couple of orders of magnitude. All times in this article were taken by measuring the average time to check all possible valid or invalid words in a random order on an AMD 5950X. This is not particularly scientific because the branch predictor will quickly learn to always say yes or no, and the timings were only run once, but since I am so far under my time requirement it doesn’t seem important.

Baseline

First, let’s see how the simplest options work. The game has 12 972 valid words in total: 2315 answers and 10 657 additional guessable words. In the game these are JSON encoded in two separate lists. Let’s see how big they are. For “both” I took the best-case scenario of simply concatenating the two lists with not even a newline between them.

Data          Uncompressed   zstd default (3)   zstd max (19)
answers       18.088 KiB     7.604 KiB          6.266 KiB
dictionary    83.259 KiB     31.739 KiB         18.070 KiB
both          101.346 KiB    39.985 KiB         24.379 KiB

Let’s be honest. Less than 25 KiB for your app data is not much these days. But let’s see how much better we can do.

Single List

First, let’s try putting the words into a single list. This should allow for better compression because all the words are in sorted order, rather than the randomized order of the answers list. (Notice that the answers only compressed down to ⅓ of their size whereas the full dictionary compressed down to almost ⅕. I assume that a lot of this is the unpredictable ordering.) At the same time, I’m also dropping the JSON encoding and just concatenating all the 5-letter words. This saves three characters per entry (the two quotes and the comma), a ⅜ savings, although these characters likely compress very well.

As for lookup, I’m doing a binary search. It isn’t necessary to hit my performance target but significantly speeds up my exhaustive testing. Plus the added code size is tiny.
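
As a rough sketch (not the code from the repo), the lookup over the concatenated list might look like the following, where dict is just the sorted 5-letter words back to back:

fn contains_concat(dict: &[u8], word: &[u8; 5]) -> bool {
	let mut lo = 0;
	let mut hi = dict.len() / 5; // number of words
	while lo < hi {
		let mid = (lo + hi) / 2;
		// Entry i is the 5 bytes starting at i * 5.
		let entry = &dict[mid * 5..mid * 5 + 5];
		match entry.cmp(&word[..]) {
			std::cmp::Ordering::Less => lo = mid + 1,
			std::cmp::Ordering::Greater => hi = mid,
			std::cmp::Ordering::Equal => return true,
		}
	}
	false
}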

Results

While our size dropped significantly from removing the JSON, the compression became less effective and we ended up saving less than 2%. If you add 2 bytes per answer to regain the answer list you basically break even.

More Efficient Encoding

The first obvious thing is that all the words only use letters in the a-z range, so we don’t need to spend a full byte per character. We only need log₂(26) ≈ 4.7 bits per character, or 23.5 bits per word. I’ll ignore the half bit and use 24 bits, or 3 bytes per word.

The logic is quite simple. Instead of storing the word itself you store the index of the word in the list of possible words. For example aaaaa is 0 and zzzzz is 26^5 − 1. The code is below:

fn index(word: [u8; 5]) -> u32 {
	let mut i: u32 = 0;
	for c in word {
		debug_assert!((b'a'..=b'z').contains(&c));
		i *= 26;
		i += (c - b'a') as u32;
	}
	i
}

Our dictionary is now a list of 3-byte word indexes.
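
A minimal sketch of the packing and lookup, reusing the index function from above. The big-endian byte order is my choice, picked so the packed entries sort the same way as the words and the binary search keeps working:

fn encode(word: [u8; 5]) -> [u8; 3] {
	// Big-endian so that byte-wise ordering matches word ordering.
	let i = index(word);
	[(i >> 16) as u8, (i >> 8) as u8, i as u8]
}

fn contains_packed(packed: &[u8], word: [u8; 5]) -> bool {
	let target = encode(word);
	let mut lo = 0;
	let mut hi = packed.len() / 3;
	while lo < hi {
		let mid = (lo + hi) / 2;
		let entry = &packed[mid * 3..mid * 3 + 3];
		match entry.cmp(&target[..]) {
			std::cmp::Ordering::Less => lo = mid + 1,
			std::cmp::Ordering::Greater => hi = mid,
			std::cmp::Ordering::Equal => return true,
		}
	}
	false
}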

Results

We get the expected reduction to ⅗ of the previous size, however zstd completely fails to find any patterns! The compressed size is actually larger than before, and at the default setting zstd doesn’t compress it at all. This is surprising because the shared prefixes still produce repetition at a 3-byte stride, but I guess zstd doesn’t notice that. gzip also struggles, but zopfli does manage to crunch it down to 34.314 KiB. Still significantly larger than letting the compressor see the ASCII directly.

More Efficient Encoding with Index

Instead of storing each word in 3 bytes we can move the first 2 characters into an index and pack the remaining 3 characters into 2 bytes. This saves us 1 byte per word but takes up space for the index.

Luckily the most common 2-letter prefix “co” only has 220 words, so a single count byte per prefix is enough, and the index is 26^2 = 676 bytes.
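
Here is a rough sketch of a lookup against this layout. The exact format (a 676-byte table of per-prefix word counts followed by one big-endian 2-byte suffix per word) is my assumption, not necessarily what the repo does:

fn contains_prefixed(data: &[u8], word: [u8; 5]) -> bool {
	let prefix = (word[0] - b'a') as usize * 26 + (word[1] - b'a') as usize;
	let suffix = ((word[2] - b'a') as u16) * 26 * 26
		+ ((word[3] - b'a') as u16) * 26
		+ (word[4] - b'a') as u16;

	// The index stores one count per prefix; summing the earlier counts
	// gives the offset of this prefix's block of suffixes.
	let counts = &data[..26 * 26];
	let start: usize = counts[..prefix].iter().map(|&c| c as usize).sum();
	let len = counts[prefix] as usize;

	// 26^3 = 17 576 possible suffixes, so each fits in 2 bytes.
	let suffixes = &data[26 * 26..];
	suffixes[start * 2..(start + len) * 2]
		.chunks_exact(2)
		.any(|s| u16::from_be_bytes([s[0], s[1]]) == suffix)
}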

Results

This is much smaller than before, and the compression seems to find more patterns, but lookup is significantly slower. There are two main reasons.

  1. We sum the per-prefix counts in the index to find each prefix’s offset. This is good for space but bad for speed. It could be fixed with a pre-processing pass that resolves the offsets before doing any lookups.
  2. I didn’t bother implementing a binary search over the suffix values. I didn’t think it would help for a maximum of 220 entries, but maybe it would.

Bitmap

What if, instead of storing the indexes, we stored one bit for every possible word: 1 if it is an allowed word and 0 otherwise? This requires 26^5 bits (about 1.4 MiB) and should give very fast lookups.

We can use the same logic to convert the word into an index and then just check that bit in the table.
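
The lookup then reduces to a single bit test, something like the sketch below (which end of the byte holds bit 0 is arbitrary; I assume least-significant-bit first here), reusing the index function from above:

fn contains_bitmap(bitmap: &[u8], word: [u8; 5]) -> bool {
	let i = index(word) as usize;
	// Bit i lives in byte i / 8; test it with a mask.
	(bitmap[i / 8] & (1u8 << (i % 8))) != 0
}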

Results

This method is incredibly fast; it would be a great option if you were hosting the data on a server. The general-purpose compression also does a great job, making this option quite small on the wire.

Delta-encoding

Another option is that instead of storing the words themselves, we just store the change from the previous word. This takes advantage of the fact that in the sorted list the first character is almost always the same as the previous word’s, the second character is usually the same, and so on. The way I decided to do this was by storing the difference between word indexes, which works well because we can enumerate the possible words. Other options include storing a count of the characters that don’t match, followed by the replacement characters.

Unfortunately, the distance between words varies wildly, from adjacent words like “charr”, “chars” and “chart” to distant words like “xerus” and “xoana”, which are 164 068 possible words apart! Storing that gap takes log₂(164 068) ≈ 17.3 bits, so a fixed-width encoding would need 18 bits per gap. However, since most gaps are much smaller it makes the most sense to use a variable-length encoding.

In this case, I use a simple varint encoding where a set top bit means that more bytes follow. In this scheme, 3 bytes give us 21 usable bits, which is large enough for our largest gap.
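
A sketch of the decode-and-scan lookup, again reusing the index function from above. The exact varint layout (low 7 bits per byte, continuation flag in the top bit, first delta measured from zero) is my assumption:

fn contains_delta(deltas: &[u8], word: [u8; 5]) -> bool {
	let target = index(word);
	let mut current: u32 = 0;
	let mut bytes = deltas.iter().copied();
	loop {
		// Decode one varint gap.
		let mut gap: u32 = 0;
		let mut shift = 0;
		loop {
			let b = match bytes.next() {
				Some(b) => b,
				None => return false, // ran out of words
			};
			gap |= ((b & 0x7f) as u32) << shift;
			if b & 0x80 == 0 {
				break;
			}
			shift += 7;
		}
		current += gap;
		if current == target {
			return true;
		}
		if current > target {
			return false; // the list is sorted, so we passed it
		}
	}
}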

Results

This solution is slow because it sequentially scans the words until it finds a match: we can’t seek in the variable-length encoding, and we need to sum all previous deltas to find the current index. However, at 10 µs it is still well within our target time. This could also be improved easily by building a 2 or even 3 character prefix index on the client, or the client could just construct a bitmap.
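
For example, turning the delta stream into a bitmap once on the client (using the same assumed varint format as above) could look roughly like this, after which every lookup is a single bit test:

fn build_bitmap(deltas: &[u8]) -> Vec<u8> {
	// One bit per possible 5-letter word.
	let mut bitmap = vec![0u8; (26usize.pow(5) + 7) / 8];
	let mut current: u32 = 0;
	let mut gap: u32 = 0;
	let mut shift = 0;
	for &b in deltas {
		gap |= ((b & 0x7f) as u32) << shift;
		if b & 0x80 != 0 {
			shift += 7;
			continue;
		}
		// End of this varint: advance to the next word and set its bit.
		current += gap;
		let i = current as usize;
		bitmap[i / 8] |= 1u8 << (i % 8);
		gap = 0;
		shift = 0;
	}
	bitmap
}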

Delta-encoding with Index

This was an idea that in retrospect doesn’t make much sense. It is basically the same as the previous idea except that I inserted a jump table for the first letter at the front. This speeds up lookup because you don’t need to scan every earlier word. I also thought it would be smaller because the first letter no longer needs to be included in the list, but since I was already delta encoding it makes no difference: the distance between “taken” and “tools” is the same as the distance between “aken” and “ools”.
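
A guess at what such a lookup could look like; the table layout here (27 offsets into the delta stream, the last one being an end sentinel, with deltas inside each letter taken over the 4-letter suffix index) is my invention rather than the actual format:

fn contains_jump(table: &[u16; 27], deltas: &[u8], word: [u8; 5]) -> bool {
	let letter = (word[0] - b'a') as usize;
	let (start, end) = (table[letter] as usize, table[letter + 1] as usize);
	// Within one letter, deltas are over the 4-letter suffix index.
	let target = (word[1] - b'a') as u32 * 26u32.pow(3)
		+ (word[2] - b'a') as u32 * 26u32.pow(2)
		+ (word[3] - b'a') as u32 * 26
		+ (word[4] - b'a') as u32;
	let mut current: u32 = 0;
	let mut gap: u32 = 0;
	let mut shift = 0;
	for &b in &deltas[start..end] {
		gap |= ((b & 0x7f) as u32) << shift;
		if b & 0x80 != 0 {
			shift += 7;
			continue;
		}
		current += gap;
		if current == target {
			return true;
		}
		gap = 0;
		shift = 0;
	}
	false
}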

Results

This is insignificantly larger, both before and after compression.

The speed-up is significant, about 20x faster. This makes sense because instead of scanning half of all words on average, you only need to scan half of the words with the same first letter. I’ll assume this isn’t exactly 26x due to unequal letter distribution.

Bloom Filter

This is the reason I started the project: I thought I would be able to over-fit a Bloom filter to the problem. However, doing the math showed that this was unlikely. I experimented anyway and confirmed my suspicion: to get no false positives I needed a ¼ MiB filter. This could maybe be reduced slightly by tweaking some constants to over-fit the data, but it seemed very unlikely that a small enough filter could be found for this to be an efficient approach.

One intrinsic downside of the filter is that you can’t iterate the entries. This means that you can’t ship your answers as 16-bit indexes into the valid-word list; you need to use full 24-bit possible-word indexes. However, even if you want to store a year of answers this is a small extra cost (only 365 extra bytes).

I still think this idea has value even if it doesn’t meet my goal of no false positives at a reasonable size. If you can compromise on requirements it could be a good choice: with a 16 KiB filter you can get a 1:127 false-positive rate, and with 8 KiB you can get 1:11. If the goal is to prevent accidental submissions from typos and to stop people just guessing “aeiou” as their first word, then these may be reasonable tradeoffs. The word list is full of strange words like “abcee” anyway, so I don’t think anyone would notice a couple of false positives.
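
Those rates line up with the textbook Bloom filter formula. As a quick sanity check (the 12 972-word count is the only input besides the filter size; this is not code from the repo):

// With n items in m bits and the optimal number of hashes k = (m / n) * ln 2,
// the expected false-positive rate is roughly (1 - e^(-k * n / m))^k.
fn optimal_fpr(n: f64, m_bits: f64) -> f64 {
	let k = (m_bits / n * std::f64::consts::LN_2).round();
	(1.0 - (-k * n / m_bits).exp()).powf(k)
}

fn main() {
	let n = 12_972.0; // total valid words
	for kib in [16.0, 8.0] {
		let m_bits = kib * 1024.0 * 8.0;
		println!("{kib} KiB -> ~1:{:.0}", 1.0 / optimal_fpr(n, m_bits));
	}
}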

Results

It is very fast (and I could have used a faster hash), but if you want no false positives this approach isn’t great.

Conclusion

The best approach was delta encoding at 15 KiB after compression. This is a significant reduction from the simple approach at 25 KiB. Of course, it is easy to argue that this isn’t a worthwhile optimization compared to the simplicity of embedding the lists in the JavaScript and checking membership with dictionary.includes(guess) || answers.includes(guess).