What is CloudFlare?

CloudFlare is a US-based company offering DDoS-protection/mitigation services for websites, as well as last-mile TLS encryption, content caching, and blocking of common attacks on web services.

As it provides useful attack-mitigation services even on its free plan, CloudFlare is increasingly popular and is now used by a wide range of websites, from small, seldom-visited personal sites to huge ones such as reddit or HackerNews.

What's wrong with CloudFlare?

Before going any further, I want to make clear that I am by no means insinuating that CloudFlare is doing, has done or will do any of what I'm going to describe. I am simply discussing their technical ability to do so. In fact, they have a somewhat reasonable privacy policy and provide regular transparency reports. They also appear to be pretty engaged in promoting Free Speech and, on a side note, they contribute quite a lot to Free Software. Furthermore, there are most certainly other companies in similar positions, but I had to pick one, and CloudFlare it is.

That being said, while all the services provided by CloudFlare are useful, they come at a cost: the centralization of the Web. Indeed, by design, all the web traffic to CloudFlare-using websites goes through CloudFlare. Worse: when the traffic is encrypted, CloudFlare is the TLS endpoint and can thus spy on or tamper with that traffic. In order to provide its services, CloudFlare has built an infrastructure that is perfectly fit for mass surveillance, and the growing number of websites centralized in this manner is a worrying trend.

Spying more practical than with hosting providers

One could argue that CloudFlare is no worse than a hosting provider, and while that is certainly true in theory, CloudFlare's position offers it some practical advantages. A typical hosting provider would have to make some effort to manually inspect a VPS's configuration in order to snoop while remaining unnoticed, whereas CloudFlare always sees the traffic in clear-text. Snooping does not require any active action on their part, and altering the traffic is already part of their services (in order to provide information when the backend server is down, for instance, or to set cookies and challenge certain users, such as Tor users, with a CAPTCHA). A bug in the part responsible for altering the traffic has recently resulted in a significant security issue.

A wide, transversal view

While a "second-party" having this level of access to a single website is already bad enough, CloudFlare has this level of access to a host of different websites, and may aggregate information stemming from the traffic going through all of them. Those websites include, amongst others:

globalsign
Edward Snowden's very own freedom.press
A good number of the websites for the US presidential campaigns
Various pseudonymous/anonymous communities: 4chan, reddit, HackerNews, …
Various torrent directories: t411, BTDigg, The Pirate Bay, …
Lots of websites talking about Bitcoin and similar technologies: blockchain.info, www.titanbtc.com, bitcoin.it, etherapps.info, etherchain.org, ethereumpyramid.com, …
Various political groups' websites around the world
Some UK Police Forces' websites: Cleveland, Gwent, Leicester, …
News websites: The Register, …
Various software projects: cyanogenmod, discourse, jquery, moodle, …
Lots of pornographic websites
Various platforms such as is.gd, puu.sh, pastebin.com, stackexchange, mywot, …
Websites that recently hit the news: Ashley Madison, Hacking Team, …

I would like to stress that CloudFlare has access to all of the traffic, including passwords and other credentials, and can tamper with it, sending fake information or software packages, for instance.

IP ranges

CloudFlare's services use several ranges of IPv4 and IPv6 addresses, which they publish on their website. At the time I am writing this, however, some of the IP ranges they might be using are not listed (or those could be just people reverse-proxying CloudFlare's reverse-proxies):

185.122.0.0/22
2a06:98c0::/29
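
Checking whether a given server address falls within such ranges is straightforward. The following is only a minimal sketch using Python's standard ipaddress module; the names UNLISTED_RANGES and in_ranges are made up for illustration, and the list contains only the two ranges quoted above, not CloudFlare's published list.

```python
import ipaddress

# Only the two ranges quoted in the text; CloudFlare's published list
# is deliberately not reproduced here.
UNLISTED_RANGES = [
    ipaddress.ip_network("185.122.0.0/22"),
    ipaddress.ip_network("2a06:98c0::/29"),
]


def in_ranges(address, networks=UNLISTED_RANGES):
    """Return True if `address` (a string) belongs to any of `networks`."""
    ip = ipaddress.ip_address(address)
    # Testing an IPv4 address against an IPv6 network (or the reverse)
    # simply evaluates to False, so both families can share one list.
    return any(ip in net for net in networks)
```

For instance, in_ranges("185.122.1.10") is true while in_ranges("185.122.4.1") is not, since a /22 only spans the third octets 0 to 3.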

Statistics

While no direct security conclusion can be drawn from such statistics, I attempted to get the most exhaustive list possible of CloudFlare-using websites, both to study how CloudFlare is actually used and to try to discover things like unannounced CloudFlare IP addresses or other technical properties.

The dataset analyzed here has been constructed by crawling and using various heuristics. A few datapoints might be incorrect due to connectivity issues or unusual domain configurations.

The dataset includes XXX domains, YYY of which use CloudFlare's nameservers, and ZZZ of which use CloudFlare to actually serve webpages, AAA of them also exposing non-CloudFlare servers. In addition, XXX domains seem to be served by non-CloudFlare servers reverse-proxying CloudFlare servers. Further statistics will exclude this last category, as they will only be computed on servers having a CloudFlare IP address.

CloudFlare usage by TLD

CloudFlare adoption by TLD could be an interesting statistic. It may also be useful to check how representative the dataset might be, as the crawling process I used might be biased towards some TLDs.

Server names

Most — if not all — CloudFlare servers identify themselves as “cloudflare-nginx” or “yunjiasu-nginx” (for their Baidu partnership).
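
As an illustration, a crawler could use these two names to flag responses that come from a CloudFlare server. This is a hedged sketch rather than the code used to build the dataset; is_cloudflare_server is a hypothetical helper, and it assumes the response headers are available as a plain mapping of names to values.

```python
# The two Server values mentioned in the text.
CLOUDFLARE_SERVER_NAMES = {"cloudflare-nginx", "yunjiasu-nginx"}


def is_cloudflare_server(headers):
    """Return True if the Server header matches a known CloudFlare name.

    `headers` is assumed to be a dict-like mapping of header names to
    values; header names are compared case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "server":
            return value.strip().lower() in CLOUDFLARE_SERVER_NAMES
    # No Server header at all: nothing to identify the server with.
    return False
```

Since the text only claims that most servers identify themselves this way, a negative result here does not prove that a domain is not behind CloudFlare.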

HTTP Status code

HTTP status codes give an idea of how many domains in the dataset are “real websites”: those returning 404 or 302 responses probably aren't. Furthermore, some status codes are of particular interest (525, 526), as they indicate that some domains have SSL enabled between the backend and CloudFlare. The high number of 302 responses can easily be explained by domains redirecting to their “www” subdomain.

HTTP status codes 403 and 503 can be caused by CloudFlare issuing challenges to the user (CAPTCHA) or her browser (JavaScript, cookies, …).
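
The reading of status codes described above can be summarized as a small bucketing function. It is only a sketch of the heuristic: the function names and category labels are invented for this example, and a real crawl would need more nuance (a 403 is not always a challenge, for instance).

```python
from collections import Counter


def classify_status(code):
    """Map an HTTP status code to a rough category (labels are hypothetical)."""
    if code in (525, 526):
        # CloudFlare-specific errors (SSL handshake failed, invalid SSL
        # certificate): SSL is enabled between CloudFlare and the backend.
        return "backend-ssl"
    if code in (403, 503):
        # Possibly a CloudFlare challenge (CAPTCHA, JavaScript, cookies).
        return "possible-challenge"
    if code in (302, 404):
        # Often a redirect to "www" or a missing page rather than a site.
        return "probably-not-a-site"
    return "other"


def summarize(codes):
    """Count how many status codes fall into each category."""
    return Counter(classify_status(code) for code in codes)
```

Calling summarize on the list of status codes collected by a crawl gives per-category counts, which is enough to reproduce the kind of breakdown discussed here.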

X.509 Certificates used for HTTPS

While CloudFlare is always the HTTPS end-point, the user can provide her own certificate. In some rare cases, she might avoid giving away her private key by using what CloudFlare calls “Keyless SSL”.

Due to the crawling process, the dataset might be biased against custom certificates.

Domains with no SSL at all are likely to be static data served by cdn.cloudflare.net.