Problems with Cloudflare Bot Blocking
Cloudflare is very quick to sell you automated “bot” blocking services. These are aimed at blocking malicious requests to your site. However, these services have a major problem: in the interest of simplicity they are far too blunt. I have seen a number of cases where this breaks websites for legitimate visitors.
- Blocking assets such as JS, CSS and images with CAPTCHA challenges. These challenges obviously cannot be solved, because the browser discards the response when it has the wrong content type. I have even seen this occur with requests using SRI (Subresource Integrity), resulting in a hash mismatch.
- Blocking desirable automated traffic. For example, their own blog feed https://blog.cloudflare.com/rss/ returns an HTTP 503 from many public cloud addresses. I run FeedMail on DigitalOcean, and whenever I get a new IP there is a good chance that Cloudflare will start blocking me from checking their blog. (FeedMail implements exponential backoff on failure, so I am only checking once a week from that IP.)
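To see why a challenge page can never satisfy an SRI check, here is a minimal Python sketch of the comparison a browser performs. The script body and the challenge HTML are made-up placeholders, and `sri_hash` and `passes_sri` are hypothetical helper names, not a browser API.

```python
import base64
import hashlib


def sri_hash(body: bytes, algorithm: str = "sha384") -> str:
    """Return an SRI integrity value such as 'sha384-<base64 digest>'."""
    digest = hashlib.new(algorithm, body).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def passes_sri(body: bytes, integrity: str) -> bool:
    """Check a response body against a single-hash integrity value."""
    algorithm, _, _ = integrity.partition("-")
    return sri_hash(body, algorithm) == integrity


# Placeholder content, assumed for illustration only.
real_script = b"console.log('hello');\n"
challenge_page = b"<html><body>Please complete the CAPTCHA</body></html>"

# The page author pins the hash of the real script.
integrity = sri_hash(real_script)

print(passes_sri(real_script, integrity))     # the real asset matches
print(passes_sri(challenge_page, integrity))  # the challenge page does not
```

Whatever bytes the challenge page contains, they are not the bytes that were hashed, so the browser refuses to use the response.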
What’s worse is that in almost all cases these requests are completely harmless. They are requesting public, cacheable assets. There is no reason to deny the request! The information is not sensitive, so there is no need for scraping protection, and the files are not particularly large, so blocking the request is nearly as expensive as just serving it. In fact, rendering a CAPTCHA is likely more expensive than just serving the request from cache!
So if you do use Cloudflare (it is a very nice low-cost CDN) I recommend you do the following:
- Exempt as much as possible from the automatic blocking rules. This can be tricky as even cacheable requests can be abused by adding query parameters to avoid the cache. At the very least try to exclude robot-targeted resources such as RSS feeds.
- Use a lower protection level to reduce the block rate. Choosing this level can be tricky: Cloudflare provides very little guidance, and the blocking analytics (at least on the free plan) are very primitive.
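The first recommendation can be sketched as a predicate over the request URL. This is plain Python for illustration, not Cloudflare's rule syntax; the `EXEMPT_PATHS` set and the `is_exempt` name are hypothetical, and the paths are only examples. It exempts a known robot-targeted path only when there is no query string, so cache-busting parameters fall back to the normal rules.

```python
from urllib.parse import urlsplit

# Hypothetical robot-targeted paths for an example site.
EXEMPT_PATHS = {"/rss/", "/feed.xml", "/robots.txt"}


def is_exempt(url: str) -> bool:
    """Exempt known feed paths, but only without a query string."""
    parts = urlsplit(url)
    return parts.path in EXEMPT_PATHS and parts.query == ""


print(is_exempt("https://example.com/rss/"))              # plain feed fetch
print(is_exempt("https://example.com/rss/?nocache=123"))  # cache-busting attempt
print(is_exempt("https://example.com/login"))             # not robot-targeted
```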
But at the end of the day there is only so much a customer can do. I would love to see the following improvements from Cloudflare:
- Provide an option to avoid blocking on a hot cache. If a website isn’t trying to prevent scraping there is no harm in serving from cache. To take this a step further they could avoid blocking popular URLs that are known to be cacheable. For example: even if the RSS feed isn’t currently in the cache it is a known “safe” URL, so it doesn’t hurt to request it once and pull it into the cache again.
- Reduce the likelihood of blocking resource requests. There is no way for a CAPTCHA to be presented here, so it would be preferable to serve the CAPTCHA for the top-level HTML, where the CAPTCHA or automatic browser check can actually be completed. This sometimes isn’t possible if the domain in question is only used for assets (not allowed on the free plan anyway), but in that case you probably want to set the bot protection to “Essentially Off” anyway.
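These two suggestions amount to a small decision procedure. The following Python sketch only illustrates that wish list and is not how Cloudflare works; every name in it (`Request`, `decide`, and the fields) is hypothetical. It also simplifies: the text asks to reduce the likelihood of challenging assets, while the sketch never challenges them.

```python
from dataclasses import dataclass


@dataclass
class Request:
    suspicious: bool         # the bot heuristics flagged this request
    is_cached: bool          # the response is already in the edge cache
    known_cacheable: bool    # popular URL known to be cacheable
    is_top_level_html: bool  # a page navigation rather than a subresource


def decide(req: Request, prevent_scraping: bool = False) -> str:
    """Return 'serve' or 'challenge' under the proposed policy."""
    if not req.suspicious:
        return "serve"
    # The cache shortcut only applies when the site does not care about scraping.
    if not prevent_scraping and (req.is_cached or req.known_cacheable):
        return "serve"
    if not req.is_top_level_html:
        return "serve"  # simplification: a CAPTCHA cannot be completed on an asset
    return "challenge"


rss_fetch = Request(suspicious=True, is_cached=False,
                    known_cacheable=True, is_top_level_html=False)
page_load = Request(suspicious=True, is_cached=False,
                    known_cacheable=False, is_top_level_html=True)

print(decide(rss_fetch))  # flagged feed fetch on a known cacheable URL
print(decide(page_load))  # flagged page navigation that is not cacheable
```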
To take a step back, it appears that Cloudflare is trying to solve several main problems with one solution, resulting in it doing a suboptimal job.
1. DoS Attack Protection.
2. Brute-Force Attack Protection.
3. Spamming Prevention.
4. Scraping Prevention.
However, these are mostly distinct requirements. I think that most sites need 1, some need 2 and 3, while few need 4. It would be great if Cloudflare would ask you which problems you are concerned about. For FeedMail I am only worried about 1. I have my own, more precise, protection against 2, while 3 and 4 are non-issues for my site. If we consider the Cloudflare Blog example, 1 is also the only concern. However, at their scale it may make sense to just absorb the extra traffic, in which case they can get away with none of the above. In any case, it seems that the RSS fetch was likely blocked as scraping prevention. Since that is not the desired goal, turning it off would avoid the issue.
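One way to picture the “ask which problems you are concerned about” idea is a set of opt-in flags. This is a hypothetical Python sketch, not an existing Cloudflare setting; the `Concern` flags mirror the four items in the list above, and classifying a request into a suspected concern is assumed to happen elsewhere.

```python
from enum import Flag, auto


class Concern(Flag):
    DOS = auto()
    BRUTE_FORCE = auto()
    SPAM = auto()
    SCRAPING = auto()


def should_block(suspected: Concern, enabled: Concern) -> bool:
    """Block only when the suspected problem is one the site opted into."""
    return bool(suspected & enabled)


# FeedMail, per the text, only wants DoS protection from Cloudflare.
feedmail = Concern.DOS

print(should_block(Concern.SCRAPING, feedmail))  # an RSS fetch flagged as scraping
print(should_block(Concern.DOS, feedmail))       # a request flagged as part of a DoS
```

With scraping prevention not selected, a feed fetch that merely looks like scraping would be served rather than blocked.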
Overall I think that Cloudflare is a good service, but it is important to remember that any time you attempt to block unwanted requests you will also be blocking some portion of desirable traffic. If Cloudflare themselves can’t ensure that their own blog’s RSS feed is accessible, it is clear Cloudflare isn’t trivial to use correctly.