Multidimensional Service Configuration

Working as an SRE, you often have to configure a whole bunch of similar services. For example, you need the production instance (of course) and a staging instance. Maybe you have another instance for load testing and another for canary testing. Some of these you probably run redundantly, so you may have copies in different countries.

At the end of the day you end up with a number of individual deployments which are similar, but different in critical ways. A common pattern I see to handle this is what I call the “Override Pattern”.

Override Pattern

The Override Pattern is very simple, which is likely why it is a common first step. You define a “base” configuration and then a (hopefully small) “override” configuration for each particular instance.

Here is a simple configuration to use in examples.

# default.cfg
logging-endpoint: https://us.logging.example
tasks: 3
database-uri: postgresql://api:passw0rd@prod-db-primary.internal
database-replica-uri: postgresql://api:passw0rd@prod-db-replica.internal
threadpool-size: 8

# prod-us-west.cfg
hostname: api.company.example
tasks: 50

# prod-us-east.cfg
hostname: api.company.example
tasks: 28

# prod-de-central.cfg
hostname: api.company.example
logging-endpoint: https://eu.logging.example
tasks: 41

# qa-us-east.cfg
hostname: beta.api.company.example

# qa-de-central.cfg
hostname: beta.api.company.example

# dev-us-east.cfg
hostname: dev-api.internal
logging-endpoint: https://beta.logging.example
database-uri: postgresql://api:passw0rd@dev-db-primary.internal
database-replica-uri: postgresql://api:passw0rd@dev-db-primary.internal

Now that doesn’t look too bad! We have just the information we need in each instance and it is very obvious where to make changes when you want to do so. It is also easy to implement and we can easily write scripts to update the values.
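
To make "easy to implement" concrete, here is a minimal sketch of the merge in Python (not the format used in this post). The dictionaries are hypothetical in-memory stand-ins for default.cfg and two of the override files, and `resolve` is a name made up for this illustration.

```python
# Hypothetical in-memory stand-ins for default.cfg and two override files.
DEFAULT = {
    "logging-endpoint": "https://us.logging.example",
    "tasks": 3,
    "threadpool-size": 8,
}
OVERRIDES = {
    "prod-de-central": {
        "hostname": "api.company.example",
        "logging-endpoint": "https://eu.logging.example",
        "tasks": 41,
    },
    "qa-us-east": {"hostname": "beta.api.company.example"},
}


def resolve(instance: str) -> dict:
    """Return the base config with the instance's overrides applied on top."""
    return {**DEFAULT, **OVERRIDES[instance]}


print(resolve("prod-de-central")["tasks"])  # overridden in the instance file
print(resolve("qa-us-east")["tasks"])  # falls through to the default
```

The whole "engine" is a single dictionary merge, which is exactly why this pattern is such a tempting first step.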

However, you can do better. You quickly spot duplication: hostname is shared by all prod services, and the logging endpoint is shared by the services in each continent. So you add env:prod.cfg, env:qa.cfg, env:dev.cfg, country:us.cfg and country:de.cfg files and put the appropriate overrides in each one. This removes some of the redundancy but still has a number of pretty severe issues.

Problems

Forgetting a Value

It is easy to forget a value. For example imagine that I forgot the database-uri from dev-us-east.cfg. Now our dev instance is going to connect to the production database! Of course you can try to avoid this by ensuring that default only has values that are safe defaults. With good validation you can even ensure that no values were missed. You can add intermediate override files to help target values but you always end up with a loss. For example most of your instances are in the US, so you don’t want to repeat the logging endpoint, and it isn’t too bad if someone logs across the world. But this problem will keep creeping up on you.
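
A small Python sketch of both the failure and the kind of validation mentioned above. The `MUST_OVERRIDE` policy table and the `missing_overrides` helper are assumptions invented for this illustration, not part of any particular tool.

```python
DEFAULT = {
    "database-uri": "postgresql://api:passw0rd@prod-db-primary.internal",
    "database-replica-uri": "postgresql://api:passw0rd@prod-db-replica.internal",
}

# dev-us-east.cfg with both database keys accidentally left out.
DEV_US_EAST = {"hostname": "dev-api.internal"}

# Hypothetical policy: keys an environment must set itself rather than inherit.
MUST_OVERRIDE = {"dev": {"database-uri", "database-replica-uri"}}


def missing_overrides(env: str, override: dict) -> set:
    """Return the keys this environment was required to override but did not."""
    return MUST_OVERRIDE.get(env, set()) - set(override)


resolved = {**DEFAULT, **DEV_US_EAST}
print(resolved["database-uri"])  # silently inherits the production database
print(sorted(missing_overrides("dev", DEV_US_EAST)))  # what validation would flag
```

Note that the policy table is one more thing to keep in sync with the config itself, which is part of why the problem keeps coming back.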

Hierarchy and Order

When you start adding intermediate overrides files you end up with two major options. You can have a strict hierarchy (country:us in env:prod is completely independent from country:us in env:qa), or you can define an override order. Both of these approaches end up with problems.

The hierarchy is easier to understand. If you structure your files in subdirectories, it is clear that each override file only affects the files under it, and that files deeper in the directory structure have higher priority. However, the major problem is that if you want to set a US logging-endpoint for country:us you can no longer do it in one place. You must do it once for prod and once for qa.

default.cfg
dev
├── default.cfg
└── us
    └── east.cfg
prod
├── default.cfg
├── de
│   └── central.cfg
└── us
    ├── default.cfg
    ├── east.cfg
    └── west.cfg
qa
├── default.cfg
├── de
│   └── central.cfg
└── us
    └── east.cfg

The natural solution to this is no hierarchy, but an order. This solves a lot of the problems, as anything that is uniform along one aspect can be set in the file for that aspect, and you can override exceptions in the individual instance files. However, this is much harder to understand, as you need to know the order in which the overrides apply. Even then this system isn’t perfect. If country overrides env, I can’t effectively set the dev logging endpoint because it will be overridden by the per-country option. If env overrides country, I will have the same problem for other options. This can work well if your dimensions are mostly hierarchical, but inevitably you will get configuration options that just don’t fit your chosen order, and you fall back to repeating things.

default.cfg

country:de.cfg
country:us.cfg
env:dev.cfg
env:prod.cfg
env:qa.cfg

dev-us-east.cfg
prod-de-central.cfg
prod-us-east.cfg
prod-us-west.cfg
qa-de-central.cfg
qa-us-east.cfg
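
The ordering problem described above can be sketched in a few lines of Python. The layer contents mirror the example files, while the `resolve` function and its `order` parameter are hypothetical names used only for this illustration.

```python
DEFAULT = {"logging-endpoint": "https://us.logging.example"}
ENV = {
    "dev": {"logging-endpoint": "https://beta.logging.example"},
    "qa": {},
    "prod": {},
}
COUNTRY = {
    "us": {"logging-endpoint": "https://us.logging.example"},
    "de": {"logging-endpoint": "https://eu.logging.example"},
}


def resolve(env: str, country: str, order: list) -> dict:
    """Merge the layers in the given order; later layers win."""
    layers = {"env": ENV[env], "country": COUNTRY[country]}
    merged = dict(DEFAULT)
    for dimension in order:
        merged.update(layers[dimension])
    return merged


# country applied last: the dev exception is lost
print(resolve("dev", "us", ["env", "country"])["logging-endpoint"])
# env applied last: dev works, but env now beats country for every key
print(resolve("dev", "us", ["country", "env"])["logging-endpoint"])
```

Whichever order you pick is applied to every key at once, so a key that needs the opposite priority has to be repeated in the per-instance files.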

Browsability

Another downside of this system is that it is hard to browse. Say I want to find the value for database-uri. I look at the base file and see the prod database URI. However, I suspect that our test environments connect to different databases, so I check a qa instance. Oh, I guess everything uses the prod database. For envs it maybe isn’t too much work (if you have intermediate overrides), but the more places an override can live, the more likely you are to miss one.

You can work around this with a tool. But that is another tool that everyone needs to find and learn. If people are used to regex search it will take effort to train them off it, especially when regex search gets them most of the info most of the time.
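
For a sense of what such a tool involves, here is a rough Python sketch that reports, for one key, the resolved value per instance and which file supplied it. The `explain` function and the in-memory file contents are hypothetical; intermediate override files are left out to keep it short.

```python
DEFAULT = {"database-uri": "postgresql://api:passw0rd@prod-db-primary.internal"}
INSTANCES = {
    "prod-us-east": {"hostname": "api.company.example"},
    "qa-us-east": {"hostname": "beta.api.company.example"},
    "dev-us-east": {
        "database-uri": "postgresql://api:passw0rd@dev-db-primary.internal",
    },
}


def explain(key: str) -> dict:
    """Map each instance to (resolved value, name of the file that supplied it)."""
    report = {}
    for name, override in INSTANCES.items():
        if key in override:
            report[name] = (override[key], f"{name}.cfg")
        else:
            report[name] = (DEFAULT[key], "default.cfg")
    return report


for name, (value, source) in explain("database-uri").items():
    print(f"{name}: {value}  (from {source})")
```

Even this toy version has to know the merge rules, so it must be kept in step with whatever actually loads the config.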

Cargo Culting

This setup also tends to accumulate configs over time. I mostly blame this on Browsability with a little help from the repetition that tends to form. This means that when you want to turn up a new instance you tend to copy some “similar” instance and edit the file. This is quite error prone. It also means that you end up in weird states with gradual/test rollouts as they tend to just adopt whatever was in their source instance. This often leads to surprise when the person managing the rollout later realizes that their changes are affecting more instances than they expected!

Gradual Rollouts

It is also hard to do gradual rollouts in this form. With intermediate override files people will be tempted just to make the change in env:qa.cfg, then if everything looks OK update env:prod.cfg, rolling out every env:prod instance at once. Ideally you would want to first make the changes in a couple of instances at a time. Then at some point you can flip the default. However this is hard to do without tooling.

I would also argue that once you add enough tooling to this approach you now have a database, not a human-maintained set of configs. That isn’t necessarily a bad idea, but at this point you are dealing with a whole different beast (and this post isn’t particularly relevant).

Expression Pattern

So what do I suggest? The key observation is that whenever you are trying to configure a set of things there are a number of ways in which they need to differ. I will refer to these as dimensions. You may have dimensions for the environment type, location, particular instances, canary state or even lifecycle. In the Override Pattern you had to pick the order in which to slice the dimensions once, for all of the fields in all of the config files. Furthermore, adding a new dimension generally brought a lot of complexity. The key to a maintainable config is to embrace the intrinsic multidimensional complexity and allow your configuration to easily deal with the relevant dimensions.

For example, in the case above you had the country dimension and the environment dimension. In real-world systems the story often gets a lot messier. You may have a client dimension if you run a separate instance of your service for each client (or for some particularly large clients). The Override Pattern forces you to somehow put these dimensions into a hierarchy, or at the very least some sort of order of importance. However, the real world is never that clean. The solution is to deal with dimensions separately for each key. Most keys in your config will only care about a small number of dimensions (usually zero, often one, occasionally two and very rarely more), which makes each key easy to understand.

Let’s convert the above example to this form:

local full-instance = "{args.env}-{args.country}-{args.instance}"

hostname = {
	prod = "api.company.example"
	qa = "beta.api.company.example"
	dev = "dev-api.internal"
}[args.env]

logging-endpoint =
	if args.env == "dev" then "https://beta.logging.example"
	else {
		us = "https://us.logging.example"
		de = "https://eu.logging.example"
	}[args.country]

tasks =
	if args.env != "prod" then 3
	else {
		prod-de-central = 41
		prod-us-east = 28
		prod-us-west = 50
	}[full-instance]

local database-info = {
	prod = {
		primary = "postgresql://api:passw0rd@prod-db-primary.internal"
		replica = "postgresql://api:passw0rd@prod-db-replica.internal"
	}
	qa = prod
	dev = {
		primary = "postgresql://api:passw0rd@dev-db-primary.internal"
		replica = primary
	}
}[args.env]

database-uri = database-info.primary
database-replica-uri = database-info.replica

threadpool-size = 8

Wow, that is more complicated, isn’t it? Well, kind of… but the complexity is organized in one place. Even more importantly, instead of being sorted by instance (and spread around different files) the complexity is sorted by key! I find this a much more helpful and meaningful way to digest this information. In fact, reading this you can fairly clearly see that the only value that is per-instance is tasks, as nothing else accesses the args.instance item.
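
The expression language above is illustrative, so for readers who want something they can run, here is a rough Python translation of the same logic. The `config` function name and its three parameters (standing in for args.env, args.country and args.instance) are choices made for this sketch.

```python
def config(env: str, country: str, instance: str) -> dict:
    """Compute every key for one instance from its dimensions."""
    full_instance = f"{env}-{country}-{instance}"

    hostname = {
        "prod": "api.company.example",
        "qa": "beta.api.company.example",
        "dev": "dev-api.internal",
    }[env]

    if env == "dev":
        logging_endpoint = "https://beta.logging.example"
    else:
        logging_endpoint = {
            "us": "https://us.logging.example",
            "de": "https://eu.logging.example",
        }[country]

    if env != "prod":
        tasks = 3
    else:
        tasks = {
            "prod-de-central": 41,
            "prod-us-east": 28,
            "prod-us-west": 50,
        }[full_instance]

    prod_db = {
        "primary": "postgresql://api:passw0rd@prod-db-primary.internal",
        "replica": "postgresql://api:passw0rd@prod-db-replica.internal",
    }
    dev_primary = "postgresql://api:passw0rd@dev-db-primary.internal"
    database_info = {
        "prod": prod_db,
        "qa": prod_db,  # qa deliberately shares the prod database
        "dev": {"primary": dev_primary, "replica": dev_primary},
    }[env]

    return {
        "hostname": hostname,
        "logging-endpoint": logging_endpoint,
        "tasks": tasks,
        "database-uri": database_info["primary"],
        "database-replica-uri": database_info["replica"],
        "threadpool-size": 8,
    }


print(config("prod", "de", "central"))
```

Calling `config("dr", "us", "east")` raises a KeyError at the first lookup that has no entry for the new environment, so an unknown environment cannot silently inherit production values.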

Solved Problems

Let’s compare this approach to the earlier problems:

Forgetting a Value

This format doesn’t force us to give a default value; we can decide whether there is a reasonable default for each key individually. For example, database-info explicitly gives a value to each environment. prod and qa share the database, but that choice is explicit. If I add a dr environment I will get an error until I add a value. There is no unsafe default in this case, so we can easily force humans to make a choice.

In the example we have done it by using a map lookup. Another great solution is to use a separate file for each instance, where each file contains only the values that are unique to that instance. It is often useful to have these files for machine-updated values. For example, you could have a tool which checks your metrics every day and commits an adjustment to the data file. (If you take this approach, be sure to keep these files as small as possible, otherwise you might slip out of this pattern.)
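
Here is a small Python sketch of that split, assuming a hypothetical machine-maintained tasks.json. Its contents are inlined as a string so the example is self-contained; a real setup would read the file from disk. `tasks` and `set_tasks` are illustrative names.

```python
import json

# Hypothetical contents of a machine-maintained tasks.json data file.
TASKS_JSON = '{"prod-de-central": 41, "prod-us-east": 28, "prod-us-west": 50}'


def tasks(env: str, full_instance: str, data: str = TASKS_JSON) -> int:
    """Human-written logic; only the per-instance numbers come from the data file."""
    if env != "prod":
        return 3
    return json.loads(data)[full_instance]  # KeyError for an unknown prod instance


def set_tasks(data: str, full_instance: str, count: int) -> str:
    """What an automated tool might do: change one number and re-serialise."""
    table = json.loads(data)
    table[full_instance] = count
    return json.dumps(table, indent=2, sort_keys=True)


updated = set_tasks(TASKS_JSON, "prod-us-east", 30)
print(tasks("prod", "prod-us-east", updated))
```

The tool only ever touches plain data, while the comments and conditionals stay in the human-edited expression.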

Hierarchy and Order

As mentioned above, the dimension priority can now be defined per key. For example, logging-endpoint is primarily decided by country, with a high-priority exception for dev that is encoded in a clear way. Each key can express the order of its input variables clearly, which ensures that we have no more repetition than required.

Browsability

I find this format incredibly easy to look at. If I want to see where each instance sends the logs it is trivial to find out. dev sends to the beta endpoint of our logging service and everything else sends by country. No need to open each and every file to check for an override, the complete logic is right there.

I’d like to emphasize that the logic is right there. In the overrides approach you need to look at all of the values and try to reverse-engineer the logic yourself. In this case you see the raw logic, not the results. It even provides a nice place to add a comment about why the logic is that way.

Cargo Culting

This is related to the Forgetting a Value point. Since many keys will have defaults, when creating a new instance you will get most of your config for free. The couple of places that need a value will raise an error so you can decide what the correct value is. The problem isn’t completely solved, because you might just copy the value from a similar instance but at least someone did that intentionally and decided that it was the right thing to do. For example you won’t accidentally extend a single-instance canary to a new instance because you will naturally fall into the default of that expression.

Furthermore the choices made are incredibly obvious in a diff. Unlike the Override Pattern where the entire file is copied and there is no context for each value, each value here will show up as a separate diff surrounded by the relevant context.

Gradual Rollouts

It isn’t super obvious how this system can help partial rollouts but an example makes it clear. Let’s say that we want to change the size of a threadpool from 8 to 16. How to do env:dev and env:qa rollouts to the whole environment is obvious but let us see how we can slowly roll out to env:prod.

# Canary deployment of larger threadpool https://bugs.example/123
threadpool-size =
	if args.env != "prod" || full-instance == "prod-us-west" then
		16
	else
		8

Note how, as discussed in Cargo Culting, new instances won’t copy this experiment by mistake: any new non-env:prod instances will get the new value and any new env:prod instances will get the old, tested value. It is also clear to any reader what is happening.

Going further can be done in a myriad of ways. You can simply list more test instances, then replace the conditional with the new value once you are happy. However I like to make it even easier for the user with something like this:

# Decides if the current instance should be part of a canary.
#
# id: The name of the canary. This is used to avoid picking the same instances for all canaries. Typically you can use the name of the variable that you are canarying.
#
# amount: is the canary fraction from 0 to 1.
#
# Note that instances marked as canary nodes will always be included. Even at amount == 0!
local canary = id: amount:
	args.canary || (sha1("{id} {full-instance}") < sha1-max * amount)

# Canary deployment of larger threadpool https://bugs.example/123
threadpool-size =
	if args.env != "prod" || canary("threadpool-size", 1/10) then
		16
	else
		8

Of course this isn’t guaranteed to be the most even split possible, but it is good enough for most gradual rollouts. It is easy to imagine more complex solutions, but I find that making canaries as easy as possible is the better tradeoff, as you end up with fewer people skipping them.
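
A runnable approximation of the canary helper in Python, using the standard hashlib module. Passing `full_instance` and `is_canary_node` as explicit parameters (rather than reading them from args) is a choice made for this sketch, as are the function names.

```python
import hashlib

SHA1_MAX = 2 ** 160  # SHA-1 digests are 160 bits long


def canary(canary_id: str, full_instance: str, amount: float,
           is_canary_node: bool = False) -> bool:
    """Decide whether this instance is in the canary for `canary_id`.

    `amount` is the canary fraction from 0 to 1. Instances marked as canary
    nodes are always included, even when amount == 0.
    """
    digest = hashlib.sha1(f"{canary_id} {full_instance}".encode()).digest()
    return is_canary_node or int.from_bytes(digest, "big") < SHA1_MAX * amount


def threadpool_size(env: str, full_instance: str,
                    is_canary_node: bool = False) -> int:
    # Canary deployment of larger threadpool https://bugs.example/123
    if env != "prod" or canary("threadpool-size", full_instance, 0.1, is_canary_node):
        return 16
    return 8


for name in ["prod-us-west", "prod-us-east", "prod-de-central"]:
    print(name, threadpool_size("prod", name))
```

Because the decision depends only on the canary name and the instance name, the same instances are selected on every run, and raising `amount` only ever adds instances to the canary.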

Downsides

There are a couple of downsides to this method, but I think they are minor, especially when compared to the upsides.

More Complex Processing

Instead of the input being data-only it is now an expression. Personally I prefer this option anyways as I find it gives more chance to keep your config DRY and avoid having dependent values drift from each other. However now you probably need a dedicated tool to compile the configuration into raw data, instead of the simple merge that could be done in a couple of lines in most languages. However I would also argue that this is a feature. Having multiple tools reading the same human configuration format is a recipe for divergence down the line, even if your format is just a couple of merges. No matter how simple your configuration format is I recommend having a dedicated step to compile your configs into a raw data file for each instance, giving you a single interpretation that you can diff as part of code review and deploy without chance of bugs, making this a moot point.
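
A minimal sketch of such a compile step in Python. The `evaluate` function is a hypothetical stand-in for evaluating the real expression config, and the instance list and output file names are assumptions for the example. Nothing is written to disk here; the rendered text is returned so it can be diffed or saved by the caller.

```python
import json


def evaluate(env: str, country: str, instance: str) -> dict:
    """Hypothetical stand-in for evaluating the expression config."""
    hostname = {
        "prod": "api.company.example",
        "qa": "beta.api.company.example",
        "dev": "dev-api.internal",
    }[env]
    return {"threadpool-size": 8, "hostname": hostname}


INSTANCES = [("prod", "us", "west"), ("qa", "de", "central"), ("dev", "us", "east")]


def compile_all() -> dict:
    """Render one raw JSON document per instance, keyed by output file name."""
    rendered = {}
    for env, country, instance in INSTANCES:
        data = evaluate(env, country, instance)
        # Sorted keys and fixed indentation keep the output stable for diffing.
        text = json.dumps(data, indent=2, sort_keys=True) + "\n"
        rendered[f"{env}-{country}-{instance}.json"] = text
    return rendered


for file_name, text in compile_all().items():
    print(file_name)
    print(text)
```

Committing these rendered files next to the source means a reviewer sees both the change to the logic and its effect on every instance in the same diff.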

Harder to Write

In simple cases writing code is harder than updating data files. However, every data-only configuration system that I have seen slowly grew in complexity until it was really just an ad-hoc, informally-specified, bug-ridden programming language. I really recommend jumping straight to a real programming language, and I consider this a very minor downside.

Chance of Non-determinism

Since you are running code to generate the configuration there is always a chance of non-determinism. This can be problematic as the production config can alternate between deploys or even between diff and deploy! This is a serious danger point. I recommend mitigating it in the following ways:

Not Machine Editable

Unlike data files, source code is generally not machine editable in a generic way. However, I find this is not a major issue because you can easily put data that needs to be machine editable in a separate data file. In fact, I think this split is ideal anyway, because mixing machine-edited and human-edited content in the same file is fraught with annoyance. (You either limit yourself to a small set of tools that preserve comments and formatting, or drop those things, which makes the file difficult for humans.)

Conclusion

I have found that override files combined with a simple merge are not a maintainable way to configure complex systems. I highly recommend using some sort of expression language that lets you express the value of each configuration option based on the intrinsic dimensions of your problem, instead of trying to fit your overrides into the best places in an inheritance tree.