Default to Less Than Quadratic

When programming you often need to pick an algorithm for a small task. For example, if you need to select an account by name, a couple of options probably spring to mind: a linear search over the array, or adding a hashmap so you can look the account up by name in constant time. Which one should you choose?
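To make the two options concrete, here is a minimal sketch, assuming a hypothetical Account struct with a name field:

// Hypothetical account type, for illustration only.
struct Account {
	name: String,
	// ...other fields elided
}

// Option 1: linear search, O(n) per lookup but no extra data structure.
fn find_account<'a>(accounts: &'a [Account], name: &str) -> Option<&'a Account> {
	accounts.iter().find(|account| account.name == name)
}

// Option 2: index the accounts once, then each lookup is expected O(1).
fn find_account_indexed<'a>(
	by_name: &'a std::collections::HashMap<String, Account>,
	name: &str,
) -> Option<&'a Account> {
	by_name.get(name)
}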

It is well studied that simpler algorithms can perform better on small datasets despite their worse complexity. For example, bubble sort is one of the fastest algorithms for sorting small arrays, and a linear search can scan hundreds of integers in the time it takes to compute even a moderately complex hash.

So which algorithm do you choose? It can be tempting to pick one of the simpler ones if you know your data will typically be small. Most users are probably signed into a single account, and few if any people will have over ten. So optimizing for a small number of accounts will be faster for almost all (if not all) users, and it is likely simpler to implement.

My default advice is to choose the simplest algorithm with less than quadratic time and space complexity. Every situation is different, but unless I have a good reason for an exception I stick to this rule.

The reasons for this are simple:

I really like this quote from Bruce Dawson:

O(n²) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there

If you are frequently putting quadratic algorithms into your code, your users are going to see it fall down when they push it along a dimension that you expected to “always be small”.

It is also important to remember this observation from the Accidentally Quadratic blog:

The property of linearity, however, is not closed under composition. If you do a linear operation linearly many times, you get quadratic!

Even if you always pick sub-quadratic algorithms, their composition may not be sub-quadratic, so you should consider the complexity of the system as a whole. A linear search to find an account in a list is fast enough for my rule, but if you are doing that search once per account then you have an O(n²) system overall, and I would recommend optimizing it.
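Here is a sketch of that shape of problem, using plain name lists and hypothetical helper functions rather than real account data. A linear contains check run once per name is quadratic overall, while building a set first keeps the whole pass linear.

// O(n * m) overall: a linear `contains` check, run once per name.
fn names_with_accounts_quadratic<'a>(names: &[&'a str], account_names: &[&'a str]) -> Vec<&'a str> {
	let mut found = Vec::new();
	for name in names {
		if account_names.contains(name) {
			found.push(*name);
		}
	}
	found
}

// O(n + m) overall: build the set once, then each membership check is expected O(1).
fn names_with_accounts_linear<'a>(names: &[&'a str], account_names: &[&'a str]) -> Vec<&'a str> {
	let account_set: std::collections::HashSet<&str> = account_names.iter().copied().collect();
	let mut found = Vec::new();
	for name in names {
		if account_set.contains(name) {
			found.push(*name);
		}
	}
	found
}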

Example

As a concrete example, suppose we want to make a list of unique elements. Two functions come to mind. The first is an O(n²) algorithm that does the simple thing of checking whether each element is already in the output list before adding it.

fn unique<'a>(names: &[&'a str]) -> Vec<&'a str> {
	let mut result = Vec::new();
	for name in names {
		// `contains` scans everything collected so far, so this is O(n) per element.
		if !result.contains(name) {
			result.push(*name);
		}
	}
	result
}

However, I would advise against this function unless you know for sure that the inputs will never be large. I would instead prefer to write this O(n) algorithm.

pub fn unique<'a>(names: &[&'a str]) -> Vec<&'a str> {
	let mut result = Vec::new();
	let mut seen = std::collections::HashSet::new();
	for name in names {
		// `insert` returns true only the first time a name is seen, in expected O(1).
		if seen.insert(name) {
			result.push(*name);
		}
	}
	result
}

This is slightly more complex, but does that hurt us for the small datasets that we expect? Yes, but it probably doesn’t matter. The two algorithms cross over at about 60 elements, and at that point the runtime is less than 2µs. At worst the linear (hash-based) version is going to cost you about an extra 600ns at 30 elements. It just doesn’t matter in 99.9% of cases.

[Gnuplot chart: slowdown in ns vs. number of elements]

If this function does ever show up in your profiles, it is easy to add a length check that selects the optimal algorithm for the data size. That way you can have the best of both worlds: great performance on small datasets and linear growth on large ones. However, I don’t think this complexity is worth it for most code. I default to the sub-quadratic algorithm, then profile my code and spend my time where it counts.
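If you do go down that route, a sketch of the hybrid might look like this, using the roughly 60-element crossover from the plot above as the cutoff (a number worth re-measuring for your own workload):

pub fn unique<'a>(names: &[&'a str]) -> Vec<&'a str> {
	// Rough crossover point from the measurements above; re-measure for your workload.
	const CUTOFF: usize = 60;
	let mut result = Vec::new();
	if names.len() <= CUTOFF {
		// Small input: the quadratic scan avoids any hashing overhead.
		for name in names {
			if !result.contains(name) {
				result.push(*name);
			}
		}
	} else {
		// Large input: pay for the HashSet up front and keep growth linear.
		let mut seen = std::collections::HashSet::new();
		for name in names {
			if seen.insert(name) {
				result.push(*name);
			}
		}
	}
	result
}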