Recently I got curious about how well Drupal performs overall and how well it can be made to scale. I’ve been experimenting with different setups and benchmarking them, and thought I’d offer, in one place, a summary of the options available to scale Drupal websites along with my opinion of each. The first part covers the basic built-in Drupal performance features and is a good introduction for new users. After that we’ll explore some more advanced scaling techniques.
Please note, any benchmarks I mention aren’t all that useful. All my throughput testing was done by simply hammering a single page using apache bench (the page being the front page of a very complex production site). This is unlikely to be similar to any real world scenario, outside of perhaps having a single page linked to from a popular website which floods you with traffic. In addition my goal isn’t to try to get the highest performance, it’s just to compare the different options, so I didn’t really spend much time tweaking server settings, apache/php config, etc.
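For anyone who wants to repeat this kind of single-page hammering without apache bench, here is a rough Python sketch of the same idea. It is not the tool used for the numbers in this article; the function names and the example URL are made up for illustration, and it measures far less carefully than apache bench does.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_page(url):
    """Fetch one page and return the number of bytes received."""
    with urlopen(url) as response:
        return len(response.read())


def hammer(fetch, url, total_requests, concurrency):
    """Call fetch(url) total_requests times using `concurrency` threads.

    Returns (requests_per_second, total_bytes).
    """
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        sizes = list(pool.map(fetch, [url] * total_requests))
    # Guard against a zero elapsed time on very fast fake fetchers.
    elapsed = max(time.perf_counter() - started, 1e-9)
    return total_requests / elapsed, sum(sizes)


# Example call (hypothetical URL, left commented out so nothing is hit):
# rps, total_bytes = hammer(fetch_page, "http://localhost/", 1000, 10)
```

Passing the fetch function in as an argument keeps the timing loop separate from the network code, so the loop can be tried out with a fake fetcher first.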
A preface - Drupal's architecture sucks
Put down the pitchforks everyone. I love Drupal (that is a strong word, let’s say I love what it does). However, Drupal is designed first and foremost to be flexible, extensible and customizable. Some of the things it does, were you writing a one off application from scratch, would get you fired. I’m not bashing Drupal here, for what it is designed to do its architecture works remarkably well. But it’s just not built for performance. Fundamentally there are two big problems with drupal.
The number of queries issued to generate a single page is ridiculous. I’ve had complex drupal sites that issue over 200 queries to draw a single page. Just imagine that for a second. Worse yet, a number of those queries are duplicates. The problem stems from the modular nature of drupal, where various modules do their own thing without interacting with the other parts all that much. While drupal has a static node cache (so a node object loaded during a request cycle won’t be reloaded again), modules also query a lot of other things. These queries can quickly add up. Not to mention the number of queries that can be generated just to handle path aliases when you have multiple nodes displayed on a page…
The module system, while flexible, adds a significant overhead - How much is questionable, but loading all the module files into memory, calling hooks, passing node data around, it’s a fair bit of overhead. An opcode cache helps here, as does using fewer modules. However, a reasonably complex drupal site will be using quite a few modules…
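To make the static node cache mentioned above a bit more concrete, here is a minimal Python sketch of the idea: remember each node for the duration of a request so repeated loads don't hit the database again. This is not Drupal's code (Drupal is PHP); the class and method names are invented for illustration, and the "database" only counts queries.

```python
class FakeDatabase:
    """Stand-in for the database. It only counts how many queries it gets."""

    def __init__(self):
        self.query_count = 0

    def load_node(self, nid):
        self.query_count += 1
        return {"nid": nid, "title": f"Node {nid}"}


class StaticNodeCache:
    """Per-request cache: each node is fetched from the database at most once."""

    def __init__(self, db):
        self.db = db
        self._nodes = {}

    def node_load(self, nid):
        if nid not in self._nodes:
            self._nodes[nid] = self.db.load_node(nid)
        return self._nodes[nid]


db = FakeDatabase()
cache = StaticNodeCache(db)
for _ in range(5):
    # Imagine five different modules all asking for the same two nodes.
    cache.node_load(1)
    cache.node_load(2)
print(db.query_count)  # 2
```

Ten loads cost two queries here. The trouble described above is everything that is not covered by a cache like this, which still gets queried again and again.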
The only solution? Caching.
Out of the box, Drupal is set up with caching turned completely off. This is fine and dandy for development. In this mode, every time a page is loaded by anyone, it is generated completely from scratch. If you’re lucky, you might hit the mysql query cache. Unless you’re repeatedly loading the same page again and again, or you have a seriously huge mysql query cache, this is quite unlikely. In this mode, you can likely expect Drupal to serve somewhere between 0.5 and 2 requests per second, with an average of maybe 1-2 seconds of processing to generate the page. While this is probably super for a small site, any bit of load can easily crash it. With absolutely no caching on, I was able to DOS an Amazon EC2 micro instance running nothing but a Drupal site with a mere 3 concurrent requests (side note, I’m not terribly impressed with the performance of the Amazon Micro instances). In addition, even if everything is responding ok, for users with a fast internet connection a 1-2 second load time (plus rendering time) is going to feel a bit pokey.
What about APC?
For any Drupal site, your #1 bottleneck will be the database. For a complex site issuing numerous queries per page, the database will be enough of a bottleneck that APC is largely irrelevant. However, once we enable various forms of caching, APC or other opcode caches become more important as the caching allows us to reduce the queries and thus we spend more time in PHP than in the database, and APC offers quite a nice boost. To prove this, I did some benchmarking on two Amazon EC2 Large instances. With a stock drupal installation and no APC, I was able to get a whopping 2.54 requests per second. This is actually quite impressive. On modest hardware you’ll be lucky to get 1 request per second. With APC enabled, I was able to get 2.24 requests per second. Yes it’s less, but it’s statistically insignificant. Without any sort of database caching, APC makes no difference. Now, once you enable the page cache, you’ll see APC start to shine. With a normal page cache and no APC, I was able to get 105 requests per second. With APC enabled, it jumped up to 259 requests per second. Note that the server running had 15 gigs of ram and 6 CPU cores. On smaller hardware the benefit of APC will be less, but still significant.
Tuning for Anonymous Users - Surviving a slashdotting/redditing/digging/whatevering
First we’ll take a look at what we can do to improve performance for anonymous users. This is the primary concern of a site where the experience is largely read only and not personalized, such as a blog, corporate site, etc. This sort of tuning will make pages feel snappier to users (provided they have a reasonably fast internet connection) and will help to survive the most likely infrequent “I just got linked on reddit” syndrome. Maybe.
Turn on the cache already
Page caching - When page caching is enabled, drupal sticks the rendered html of the site in the database. Future requests will find this in there, and rather than running through all the drupal code and all the database queries, drupal will instead query the page cache (one query) and spit out the html. Page caching also supports an aggressive mode. For aggressive caching, drupal will skip a few hooks that are normally called even when serving up a cached page (namely, hook_boot). This will further lower CPU usage on your server and theoretically allow you to serve more pages. However, it’s not very likely that at this stage your CPU is a bottleneck, and the aggressive cache won’t play nice with a few modules (notably statistics, which keeps track of things like individual page views, etc - not that you should really use this on a site which is being thrashed).
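As a rough illustration of the page cache flow described above, here is a small Python sketch. It is a simplified model, not Drupal's implementation: a dictionary stands in for the cache table in the database, the names are made up, and it ignores cache expiry and logged-in users entirely.

```python
class PageCache:
    """Toy model of a page cache: render once, then serve the stored HTML."""

    def __init__(self, render_page):
        self.render_page = render_page  # the expensive full page build
        self.cache_table = {}           # stands in for the database cache table
        self.full_renders = 0

    def serve(self, path):
        cached = self.cache_table.get(path)  # the "one query"
        if cached is not None:
            return cached
        # Cache miss: run all the code and queries, then store the result.
        self.full_renders += 1
        html = self.render_page(path)
        self.cache_table[path] = html
        return html


site = PageCache(lambda path: f"<html>page for {path}</html>")
for _ in range(100):
    site.serve("/")
print(site.full_renders)  # 1
```

One hundred requests for the front page cost a single full render. Everything after that is a lookup.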
Potential Gotchas with page caching
The Performance page says that normal caching has no side effects and should be turned on in any production site. I agree with half of this. It absolutely should be turned on for any production site. It absolutely is not without side effects. Luckily, the side effects aren’t super common, however here are a few that I’ve run into.
- You are using a poorly written module - I haven't actually seen this happen in a while but it's something to look out for. Sometimes a contributed module will do things in hook_boot or hook_init that it shouldn't. The last time I hit this was a few years ago when a wysiwyg module (I believe it was the FCKEditor standalone module but I'm not entirely sure) did some initialization in one of those hooks. I don't entirely remember the details, however, when I enabled page caching drupal started throwing out errors left and right. Like I said, in recent years I've had no problems with this but it's something to keep in mind especially if you are dealing with older or poorly maintained modules, or in house modules.
Beyond the basics
For 99% of drupal sites out there, you are done. You can stop reading. By simply turning on the page cache, on a fairly complex D6 site, I was able to support 155 requests per second on a 512MB Linode VPS with an average request time of 109ms. If you’re not quick on networking math, that, for the page size given, equates to 45 MB/s. Linode throttles outgoing data at 50 MB/s, so given network overhead, a stock D6 install with the page cache enabled is being bottlenecked by the network. Not the hardware. Not the app. The network pipe. Want more proof? If I turn on gzip headers in apache bench, I now get 413 requests per second. Keep in mind, this is on a $20 a month VPS with a whopping 512M of memory. The rest of this section will be directed towards those curious, or people who are planning on having to support more traffic than this (note: this is unlikely). Skim it if you like, or wait for Part 2 - tuning for authenticated users.
Ok, so you have your drupal site running and somehow need to serve more than a few hundred requests a second. Is it time to plan your horizontal or vertical scaling? No way! A single server drupal installation with a properly tuned setup should be able to serve up millions of anonymous page views a day. You probably don’t need to scale nearly as soon as you think. So how do we move forward? I’m going to present the options in the order I think they should be pursued and comment on the effectiveness of various techniques.
Aggressive page cache
The first option available to you is the Aggressive Page cache. This works almost identically to the normal page cache, however it skips two additional hooks, hook_boot and hook_exit. With the normal page cache, these hooks are called for all enabled modules, which means that you spend a decent bit of time calling hooks in php just to serve up a cached page. The aggressive page cache skips these hooks, thus lowering the overhead in getting out the cached page. So why not use it? The only real reason is that you may have modules installed which don’t work with it. The biggest one would be the core statistics module, which lets you log statistics for each page view. But, let’s be realistic here, you want a high performance site, you don’t want to be logging statistics. Stick with google analytics, logfile analyzers, etc. Some other modules may not work with the aggressive cache either, a notable one being mobile tools (which uses hook_boot to redirect mobile browsers to an alternate theme). The settings page will tell you what, if any, modules won’t work with the aggressive page cache. If none are listed, you are good to go.
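The difference between the two modes can be sketched in a few lines of Python. This is only a model of the behaviour described above, with an invented function name and an illustrative module list. Which hooks each module implements here is an assumption for the example, not something read from a real site.

```python
def serve_from_page_cache(cached_html, modules, aggressive=False):
    """Serve a cached page and report which module hooks ran on the way."""
    hooks_called = []

    if not aggressive:
        # Normal mode: modules implementing hook_boot still run on a cache hit.
        for name, hooks in modules.items():
            if "boot" in hooks:
                hooks_called.append(name + "_boot")

    html = cached_html  # the cached page itself is the same in both modes

    if not aggressive:
        # Normal mode: modules implementing hook_exit run afterwards.
        for name, hooks in modules.items():
            if "exit" in hooks:
                hooks_called.append(name + "_exit")

    return html, hooks_called


# Illustrative module list (assumed hook sets, for demonstration only).
EXAMPLE_MODULES = {
    "mobile_tools": {"boot"},
    "statistics": {"exit"},
    "views": set(),
}

print(serve_from_page_cache("<html>", EXAMPLE_MODULES)[1])
print(serve_from_page_cache("<html>", EXAMPLE_MODULES, aggressive=True)[1])
```

In normal mode the two hook-implementing modules still run for every cached page. In aggressive mode nothing runs, which is exactly why those modules stop working.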
So how much benefit will this give you? I was able to jump from 260 requests per second to 485 requests per second by switching from the normal cache to the aggressive cache. This was between two EC2 Large instances, so network limitations were not in play and I had plenty of CPU and memory at hand. Part of this may be that my test site had several modules installed which used hook_boot. By turning on the aggressive cache I just skipped that code. I suspect the difference will be less if you don’t have any modules which use them.
Next up, we have memcache. Memcache is a fantastic product and is essential to reducing database load and scaling websites. However, memcache is optimal in saving queried data that would otherwise be complex to load. In the case of a drupal site with the page cache enabled, this is not the case. For such a site, the only real database load is fetching one row from an indexed page cache table. If your site is being hammered such as my test, by requesting only a single page, memcache is mostly useless because the cache result is going to be stored in the mysql query cache anyway, effectively being stored in memory just like memcache. I set up my test site with the Memcache API module and installed a local memcache instance with 2G of ram available (overkill, I know). I was able to serve up 216 requests per second, compared to 260 requests per second without memcache. Yes it’s lower. No I don’t know why. Could be a random thing, could be that memcache has a greater overhead. The upshot being, for a single server installation where you are concerned with a few key pages getting hammered, it has no impact.
Now, on the flip side, if your site is getting a large amount of traffic on numerous pages, memcache might offer some benefit. In that case, you might have too much cache data being returned for the mysql query cache to cope with, and memcache would offer better performance. However, in such a case given a machine with sufficient RAM, you could simply increase the mysql query cache significantly and get the same result.
In my opinion (just that) memcache is of limited use for a single machine site. When you’re talking about a cluster with shared data and multiple drupal backends, however, then memcache will shine. Up until then I would suggest skipping it. Stay tuned for next time when we’ll find out if memcache makes a difference for authenticated users though.
If you need to scale beyond standard page caching, boost is probably what you want. Boost is brilliant. And simple. What does it do? Boost stores cached pages on the filesystem. That’s it. The first time a page is visited it writes the html to a .html file in your site. You then configure your server to check for the html version of any request before calling on drupal (using mod_rewrite rules on apache or similar rules on nginx). With this setup, your server will completely skip drupal (and php) if a page is cached, instead just serving a plain old html file. Brilliant isn’t it? This greatly reduces CPU load (think thousands of hits a second with a load < 1 ), and lets the web server do what it is meant to do (serve up plain old files, not run programs). On top of that, boost can also store cached versions of ajax requests, your css/js files, and will gzip the whole caboodle.
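The "check for a file first, only then call drupal" idea is simple enough to sketch in Python. To be clear, this is not how Boost itself is written: in the real setup the web server's rewrite rules do the file check and the module writes the files. The path layout and function names below are assumptions for illustration, and the sketch does no sanitising of request paths.

```python
import os
import tempfile


def cache_file_for(cache_root, request_path):
    """Map a request path such as '/about' to a file under cache_root."""
    name = request_path.strip("/") or "index"
    return os.path.join(cache_root, name + ".html")


def handle_request(cache_root, request_path, run_drupal):
    """Serve the static file when present, otherwise fall through to Drupal
    and store what it produced for next time."""
    cache_file = cache_file_for(cache_root, request_path)
    if os.path.isfile(cache_file):
        with open(cache_file, encoding="utf-8") as fh:
            return fh.read(), "static"
    html = run_drupal(request_path)
    os.makedirs(os.path.dirname(cache_file), exist_ok=True)
    with open(cache_file, "w", encoding="utf-8") as fh:
        fh.write(html)
    return html, "drupal"


# Demo with a throwaway directory and a stand-in for Drupal.
with tempfile.TemporaryDirectory() as demo_root:
    first = handle_request(demo_root, "/about", lambda p: "<html>about</html>")
    second = handle_request(demo_root, "/about", lambda p: "<html>about</html>")
    print(first[1], second[1])  # drupal static
```

The second request never reaches the stand-in for drupal at all, which is the whole trick.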
How fast is it? Hold on to your hats folks. I got 5,549 requests per second. Think about that. Sustained for a full day, that’s roughly 479 MILLION hits. Now again, this is on an amazon Large High Memory instance. It has some horsepower. And a fat pipe. How will this do on smaller hardware? I was able to get about 1100 requests per second on an amazon micro instance. That equated to about 80 MBps. Whether that’s because of bandwidth throttling or not I don’t know (Micro instances have less bandwidth than larger instances). The only thing I’ve found suggested that micro instances have about 250 MB of throughput and large has 1GB. I have no idea how accurate that is.
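The arithmetic behind figures like these is worth spelling out. A day has 86,400 seconds, so a sustained rate in requests per second converts to hits per day with one multiplication, and a throughput figure divided by the request rate gives the implied average response size. The sketch below takes the "MBps" figure above at face value as megabytes per second, which is an assumption about the units.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def hits_per_day(requests_per_second):
    """Hits per day if the given rate were sustained around the clock."""
    return requests_per_second * SECONDS_PER_DAY


def implied_response_kb(megabytes_per_second, requests_per_second):
    """Average response size in kB implied by a throughput figure.

    Assumes decimal units (1 MB = 1000 kB).
    """
    return megabytes_per_second * 1000 / requests_per_second


print(hits_per_day(5549))                       # 479433600
print(round(implied_response_kb(80, 1100), 1))  # 72.7
```

So 5,549 requests per second works out to about 479 million hits a day, and 1100 requests per second at 80 MB per second implies responses of roughly 73 kB each.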
What’s the downside? None. Well almost none. Boost has the same limitations as aggressive page caching. Beyond that, it will work anywhere, including a shared host (provided you have full htaccess support to handle the rewrite rules, if clean urls work on your host you are set). When combined with gzip your site will be so snappy to visitors that I’d recommend this setup even if you don’t need to scale. It’s just blazingly fast.
Use a Content Delivery Network (CDN)
A great option to boost performance and ability to handle load is to use a CDN such as Amazon CloudFront or CloudFlare (and numerous others). A content delivery network uses a series of geographically distributed servers that cache and feed data to visitors. It’s almost like having a Varnish reverse proxy, only you don’t have to maintain it and you have multiple boxes around the world. Using a CDN allows you to serve files faster to visitors (as they may be downloaded geographically closer to the user) as well as reduce the load on your server (as if the CDN server has a cached version, there is no need to hit up your server for the files). Different CDNs may work in slightly different ways (some are optimized for handling files, others as almost full reverse proxy cache servers). For a small site or company, I’ve heard a lot of good things about CloudFlare. In addition to caching complete pages, javascript/css and images on your page, they also provide a wealth of security filters which limit traffic to your site from spambots, email address harvesters, etc. The downside to their service (which has a completely free version) is that you have to hand over DNS for your site to them, and it’s a potential single point of failure in your system (and they really haven’t been around terribly long). But, if you’re looking for a cheap way to boost performance I’d give them a shot.
There is also some benefit to setting up a poor man’s CDN if you have the hardware resources to support it. By creating DNS entries for images.yourdomain.com, css.yourdomain.com, etc, you can get around the pesky limitation of the web browser limiting how many simultaneous files it will download from a single domain. For a resource heavy site this could improve performance for visitors, however probably not significantly and nowhere near as much as a real CDN.
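The poor man's CDN boils down to rewriting asset URLs so that different kinds of files come from different hostnames. Here is a small Python sketch of that mapping. The extension table, the www default host and the function name are assumptions for illustration. Only the images and css subdomains come from the example above.

```python
import posixpath

# Which hostname serves which file type (illustrative mapping).
ASSET_HOSTS = {
    ".png": "images.yourdomain.com",
    ".jpg": "images.yourdomain.com",
    ".gif": "images.yourdomain.com",
    ".css": "css.yourdomain.com",
}


def asset_url(path, default_host="www.yourdomain.com"):
    """Build a URL for a file, picking the hostname by file extension."""
    extension = posixpath.splitext(path)[1].lower()
    host = ASSET_HOSTS.get(extension, default_host)
    return "http://" + host + "/" + path.lstrip("/")


print(asset_url("/sites/default/files/logo.png"))
print(asset_url("/sites/default/files/css/main.css"))
```

Picking the host from the file extension means a given file always gets the same URL, so browser caches keep working.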
Varnish is a reverse proxy web accelerator. That’s fancy talk for an in memory cache that passes through to a web application. For a typical setup, varnish will run on a front end server on port 80. It passes through requests to your drupal site, then caches the result in memory. For the next hit, if it has it in cache, it will just serve it up. Varnish of course can be installed on the same server as your drupal site, however doing so would really need a machine with a good chunk of RAM.
Varnish has several advantages over Boost. First, it stores data in memory. This is, obviously, faster than reading an html file from the disk. Well, sometimes. In the case of a single page (or a few) getting hammered, your modern OS will not read that html file from the disk each time. It’s going to cache it as well, and just read it from memory. However, this is less likely if you have thousands of anonymous and authenticated users on your site at once. In that case, Varnish will win. On the flip side though, Varnish is obviously more complex to set up than Boost. It’s really not that bad, but if you’re not hip to unix configuration files, it can seem a bit daunting. The second problem is, varnish loves ram. And if it runs out, my tests indicate it’s not all that happy.
There is one other major problem with Varnish. It does not work with drupal 6. Varnish, sensibly, does not cache or serve up any requests that have cookies set. This makes sense, as you don’t want your logged in users being served a cached page (without their account links, etc). Drupal 6, out of the box, always sets a cookie, even for anonymous users. To get around this, you either need to run Drupal 7, or Pressflow. Pressflow is an alternate distribution of Drupal 6 which has several key performance patches applied (key among them, not setting cookies for anonymous users). On the plus side, pressflow just works. It’s drupal, but optimized. All contrib modules work, your existing db works. To convert an existing drupal site to pressflow, just copy over your sites folder and you are done, there are no database changes required. The downside is, some contrib modules will set cookies or start a session which will prevent varnish from hitting the cache. Finding these can be a little tricky (tip: var_dump $_SESSION at the top of page.tpl.php and use that info to try and find it).
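The cookie rule is easier to see in a small model. The Python sketch below mimics the behaviour described above: requests carrying cookies go straight to the backend, and responses that set a cookie are not stored. It is a toy, not Varnish configuration, and the two backend functions are hypothetical stand-ins for "always sets a cookie" (stock Drupal 6 as described) and "only sets one when a session exists" (the Pressflow behaviour as described).

```python
class CookieAwareCache:
    """Toy reverse proxy cache that refuses to cache anything with cookies."""

    def __init__(self, backend):
        self.backend = backend  # callable: (path, cookies) -> (body, sets_cookie)
        self.store = {}
        self.backend_hits = 0

    def handle(self, path, request_cookies=None):
        if request_cookies:
            # A request with cookies is passed through and never cached.
            self.backend_hits += 1
            body, _ = self.backend(path, request_cookies)
            return body
        if path in self.store:
            return self.store[path]
        self.backend_hits += 1
        body, sets_cookie = self.backend(path, {})
        if not sets_cookie:
            self.store[path] = body  # only cookie-free responses are kept
        return body


def always_sets_cookie(path, cookies):
    """Hypothetical backend that sets a cookie on every response."""
    return "<html>" + path + "</html>", True


def cookie_only_for_sessions(path, cookies):
    """Hypothetical backend that stays cookie-free for anonymous requests."""
    return "<html>" + path + "</html>", bool(cookies)


for backend in (always_sets_cookie, cookie_only_for_sessions):
    proxy = CookieAwareCache(backend)
    for _ in range(10):
        proxy.handle("/")
    print(backend.__name__, proxy.backend_hits)
```

With the first backend all ten requests reach the backend. With the second only the first one does, and the cache serves the rest.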
Now, if we’re looking for a single server setup, in my opinion Varnish isn’t much of a consideration. Boost provides great performance with much simpler setup. And varnish requires a lot more RAM. And really, if my single server has that much RAM, I think I’d rather just give it to the mysql query cache. Varnish really shines when you have a cluster set up and one or more varnish instances can act as a front end for several drupal servers. In such a setup, it would be a required part of your server strategy.
Note: still compiling data on varnish performance vs boost
Ok, now what
If you are at the point where you need to scale beyond anything here, you’re ready for a nice cluster. A load balancer, plus several varnish instances sitting in front of several drupal and database servers. At this point, you either know enough about high performance scaling that you don’t need this article, or you need to hire someone who does. To give a brief overview of what a seriously high performance drupal setup would look like:
- A varnish instance running on a high memory box would intercept all http requests (or, a load balancer feeding requests to multiple varnish boxes).
- On a cache miss, varnish would proxy requests to a load balancer
- The load balancer would route requests to 2 or more drupal installations running on high CPU machines
- Drupal installations would connect to a memcache server for all cache data (sessions, block cache, etc)
- An NFS server (or higher performance alternative) to share modules/themes/files between drupal instances
- A CDN for static files would be a fine option, or perhaps an NGINX instance for serving those up
- One or more mysql database machines, with high CPU and memory and a tuned query cache
This is just a generic idea. The specific setup would depend on the hardware and site needs, as well as budget obviously. In addition you’d probably want to investigate sharding solutions for MySQL or using something like MongoDB for D7 (not that I would suggest this as currently mongodb does not support journaling or durability in the stable release).
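The request path through the front of that cluster can be sketched as a front cache, a load balancer and a couple of backends. The Python below is only a model with invented names. Round-robin is an assumed balancing strategy (the overview above doesn't prescribe one), and memcache, NFS and the database tier are left out entirely.

```python
import itertools


class RoundRobinBalancer:
    """Hands each request to the next backend in turn (assumed strategy)."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def route(self, path):
        backend = next(self._cycle)
        return backend(path)


class FrontCache:
    """Stands in for the varnish tier: only cache misses reach the balancer."""

    def __init__(self, balancer):
        self.balancer = balancer
        self.store = {}

    def handle(self, path):
        if path not in self.store:
            self.store[path] = self.balancer.route(path)
        return self.store[path]


def make_backend(name):
    """Hypothetical drupal box that just reports who rendered the page."""
    return lambda path: name + " rendered " + path


front = FrontCache(RoundRobinBalancer([make_backend("drupal-1"),
                                       make_backend("drupal-2")]))
for path in ("/a", "/b", "/a", "/c"):
    print(front.handle(path))
```

The repeated request for /a is answered from the front cache, so only three of the four requests ever reach a drupal box.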
Next time, authenticated users
Everything on this page is really only applicable to anonymous users. With authenticated users, the whole ballgame changes. Now you can no longer really cache entire pages as each page is now user specific (in terms of blocks, permissions, perhaps content). Next time we’ll take a look at scaling for authenticated users, taking into account exciting options such as:
- Block caching
- Reducing the number of modules your site uses
- Views/CCK vs a custom module
- Ajaxify Regions
- Logging to syslog
- MySQL caching: database in ram
Some notes: Detailed tabular results of performance coming soon. Test machines I used include a 512MB Linode VPS (4 virtual CPU cores), an Amazon EC2 Micro instance (613MB, 1 virtual CPU core which can boost up to 2, low I/O performance) and an Amazon EC2 Large High Memory instance (17.1GB, 2 potent virtual cores, moderate I/O). As an aside, performance on the EC2 micro sucks, the 512 Linode easily blows it away. Performance on the Large High Memory is obviously quite stellar. All EC2 instances were running Amazon 64 bit linux (optimized CentOS). Linode was running Ubuntu 10.10 server. Linode also by default throttles outgoing bandwidth to 50 MB (including between machines in the same data center on a private network) but will increase it upon request. I was able to get 750 Mbits/sec out of a large EC2 instance. The server was apache with php as a module, using stock settings (I believe on the amazon linux AMI, amazon has it tuned for their hardware though). I did attempt to test boost with Nginx as well but didn’t have much luck. Since there are no packages for amazon linux, and I wasn’t feeling patient enough to compile from source, I set up an ubuntu instance and installed nginx with php-fpm. With this setup I wasn’t able to get anywhere near what I did with apache/boost (at most about 2,000 requests/second). What does that mean? I was essentially going with the stock nginx settings which I’m sure are intended for an average machine. On the flip side, the OS as well as apache is likely highly tuned by amazon for their top shelf hardware. I’ve heard Nginx can blow away apache on serving static files, however from this experiment it would appear that a tuned apache can do quite well on its own.