CloudFlare is a US-based company offering DDoS-protection/mitigation services for websites, as well as last-mile TLS encryption, content caching, and blocking of common attacks on webservices.
Because it provides useful attack-mitigation services even on its free plan, CloudFlare is increasingly popular and is now used by a wide range of websites, from small, seldom-visited personal sites to huge ones like reddit or HackerNews.
Before going any further, I want to make clear that I am by no means insinuating that CloudFlare is doing, has done or will do any of what I am about to describe. I am simply discussing their technical ability to do so. In fact, they have a somewhat reasonable privacy policy and provide regular transparency reports. They also appear to be pretty engaged in promoting Free Speech and, on a side note, they contribute quite a lot to Free Software. Furthermore, there are most certainly other companies in similar positions, but I had to pick one, and CloudFlare it is.
That being said, while all the services provided by CloudFlare are useful, they come at a cost: the centralization of the Web. Indeed, by design, all the web traffic to CloudFlare-using websites goes through CloudFlare. Worse: when the traffic is encrypted, CloudFlare is the TLS endpoint and can thus spy on or tamper with that traffic. In order to provide its services, CloudFlare has built an infrastructure that is perfectly suited to mass surveillance, and the growing number of websites centralized in this manner is a worrying trend.
While a "second-party" having this level of access to a single website is already bad enough, CloudFlare has this level of access to a host of different websites, and may aggregate information stemming from the traffic going through all of them. Those websites include, amongst others:
While no direct security conclusion can be drawn from such statistics, I attempted to build the most exhaustive list possible of CloudFlare-using websites, both to study how CloudFlare is actually used and to discover things like unannounced CloudFlare IP addresses or other technical properties.
The dataset analyzed here has been constructed by crawling and using various heuristics. A few datapoints might be incorrect due to connectivity issues or unusual domain configurations.
The dataset includes XXX domains, YYY of which use CloudFlare's nameservers and ZZZ of which use CloudFlare to actually serve webpages, AAA of the latter also exposing non-CloudFlare servers. In addition, XXX domains seem to be served by non-CloudFlare servers reverse-proxying CloudFlare servers. The statistics that follow exclude this last category, as they are only computed on servers that have a CloudFlare IP address.
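The categories above boil down to two checks per domain: whether its nameservers sit under ns.cloudflare.com, and whether its addresses fall inside a CloudFlare range. The following is a minimal sketch of that classification, not the crawler actually used to build this dataset. The function names are hypothetical, the input is assumed to be already-resolved DNS data, and the two ranges are only an illustrative sample of the list CloudFlare publishes, which should be fetched from CloudFlare for any real use.

```python
import ipaddress

# Illustrative sample only (assumption): the complete, current list of
# ranges has to be taken from CloudFlare's own published list.
SAMPLE_CF_RANGES = [
    ipaddress.ip_network(net)
    for net in ("173.245.48.0/20", "198.41.128.0/17")
]


def uses_cf_nameservers(nameservers):
    """True if the list is non-empty and every name is under ns.cloudflare.com."""
    names = [ns.rstrip(".").lower() for ns in nameservers]
    return bool(names) and all(n.endswith(".ns.cloudflare.com") for n in names)


def is_cf_ip(addr, ranges=SAMPLE_CF_RANGES):
    """True if addr belongs to one of the given networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in ranges)


def classify(nameservers, addresses):
    """Sort one domain into the categories used in the text."""
    cf_addresses = [a for a in addresses if is_cf_ip(a)]
    return {
        "cf_nameservers": uses_cf_nameservers(nameservers),
        "served_by_cf": bool(cf_addresses),
        # Served by CloudFlare, but some addresses are outside its ranges.
        "exposes_non_cf": bool(cf_addresses) and len(cf_addresses) < len(addresses),
    }
```

Domains reverse-proxying CloudFlare from a non-CloudFlare address cannot be detected this way, since their addresses are by definition outside the ranges; that category needs a check on the response itself.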
CloudFlare adoption by TLD could be an interesting statistic. It may also be useful to check how representative the dataset might be, as the crawling process I used might be biased towards some TLDs.
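As a rough sketch of how such a per-TLD breakdown could be computed, the snippet below counts, for each TLD, how many domains of the dataset use CloudFlare and how many there are in total. The function names are hypothetical, and the TLD is naively taken to be the last label of the domain, so "example.co.uk" is counted under "uk" rather than under a public suffix.

```python
from collections import Counter


def tld_of(domain):
    """Last label of the domain, lowercased (naive: no public-suffix handling)."""
    return domain.rstrip(".").lower().rsplit(".", 1)[-1]


def adoption_by_tld(all_domains, cf_domains):
    """Map each TLD to (CloudFlare-using domains, total domains in the dataset)."""
    total = Counter(tld_of(d) for d in all_domains)
    using_cf = Counter(tld_of(d) for d in cf_domains)
    return {tld: (using_cf[tld], count) for tld, count in total.items()}
```

Comparing the totals per TLD with the known size of each TLD gives a first idea of how biased the crawl is.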
Most, if not all, CloudFlare servers identify themselves as “cloudflare-nginx” or “yunjiasu-nginx” (for their Baidu partnership) in the HTTP Server header.
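Checking for these banners is straightforward once the response headers are available. The sketch below assumes the banner is read from the HTTP Server response header and that headers are given as a plain dictionary; the function name is hypothetical.

```python
# The two banners mentioned in the text.
CF_SERVER_BANNERS = {"cloudflare-nginx", "yunjiasu-nginx"}


def is_cloudflare_banner(headers):
    """True if the Server header carries one of the CloudFlare banners."""
    for name, value in headers.items():
        # Header names are case-insensitive in HTTP.
        if name.lower() == "server":
            return value.strip().lower() in CF_SERVER_BANNERS
    return False
```

A matching banner on an address outside CloudFlare's ranges is one possible hint of a server reverse-proxying CloudFlare.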
HTTP status codes give an idea of how many domains in the dataset are “real websites”: those returning a 404 error or a 302 redirect probably aren't. Furthermore, some status codes are of particular interest (525, 526), as they indicate that the domain has SSL enabled between the backend and CloudFlare. The high number of 302 responses is easily explained by domains redirecting to their “www” subdomain.
HTTP statuses 403 and 503 can be caused by CloudFlare issuing challenges to the user (CAPTCHA) or her browser (JavaScript, cookies, …).
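The reading of status codes described above can be summed up in a small lookup. This is only a sketch: the function name and the category labels are hypothetical, and a 403 or 503 may just as well come from the backend itself, hence the "possible" in the label.

```python
def interpret_status(code):
    """Map an HTTP status code to the interpretation used in the text."""
    if code in (525, 526):
        # CloudFlare-specific errors: SSL between backend and CloudFlare failed,
        # which implies it was enabled in the first place.
        return "ssl-to-backend"
    if code in (403, 503):
        # May be a CloudFlare challenge (CAPTCHA, JavaScript, cookies).
        return "possible-challenge"
    if code == 302:
        # Often just a redirect to the "www" subdomain.
        return "redirect"
    if code == 404:
        return "probably-not-a-real-website"
    if 200 <= code < 300:
        return "ok"
    return "other"
```

Telling a challenge apart from an ordinary 403 or 503 would require inspecting the response body, which this sketch does not attempt.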
While CloudFlare is always the HTTPS endpoint, the website owner can provide her own certificate. In some rare cases, she might avoid giving away her private key by using what CloudFlare calls “Keyless SSL”.
Due to the crawling process, the dataset might be biased against custom certificates.
Domains with no SSL at all are likely to be static data served by cdn.cloudflare.net.