How'd He Do That? Running Open Atrium At 200+ Req/Second

joshk

Looks like I generated a little buzz when I mentioned in my SXSW talk that I'd recently done a proof of concept scaling Open Atrium to 200+ requests/second using EC2. Since a number of people have asked about it, here's the skinny:

Disclaimer

First of all, I need to point out that this was a proof-of-concept build designed to tell me whether the use case I was pursuing (thousands of logged-in users in a "burst" scenario) was even feasible. I cannot guarantee these results for your business case; your mileage may vary, etc.

The Basic Setup

The basic architecture I used in this case was six c1.medium EC2 instances running the Mercury web stack on the front end, and a Large RDS instance running the database.

I balanced traffic across the c1.medium servers using Amazon's Elastic Load Balancing, and I used a database dump from an existing but relatively small Open Atrium development site we are working on.

My basic test case was to use a logged-in session cookie and test throughput on the Dashboard homepage using ab (ApacheBench). I stepped this up from two c1.medium instances in front of a Small RDS instance to six in front of a Large, at which point I consistently hit over 200 req/sec and was satisfied that Open Atrium could scale.
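For anyone wanting to try something similar, here is a minimal sketch of how an ab run with a logged-in session cookie might be assembled. It is not the exact command used in this test: the URL, cookie name, cookie value, request count, and concurrency are all placeholders you would replace with values from your own site and browser session.

```python
import shlex


def build_ab_command(url, cookie_name, cookie_value, requests=1000, concurrency=50):
    """Build an argv list for ApacheBench (ab) that sends a session cookie.

    -n is the total number of requests, -c the number of concurrent
    requests, and -C attaches a "name=value" cookie to every request.
    """
    return [
        "ab",
        "-n", str(requests),
        "-c", str(concurrency),
        "-C", f"{cookie_name}={cookie_value}",
        url,
    ]


# Hypothetical values: copy the real cookie from a logged-in browser session.
cmd = build_ab_command(
    "http://atrium.example.com/dashboard",  # assumed Dashboard URL
    "SESSplaceholder",                      # assumed session cookie name
    "placeholder-session-id",               # assumed session cookie value
)

# Print a shell-safe command line that can be pasted into a terminal.
print(shlex.join(cmd))
```

Building the command as a list keeps the cookie value from being mangled by shell quoting, and the same list can be handed to subprocess.run if you want to script a series of runs at increasing concurrency.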

To be clear, here are the things that I didn't do but will in subsequent tests:

  • No NFS filesystem mount (though this shouldn't really impact performance)
  • No shared memcache cloud
  • No testing of how content-entry affects performance

However, I feel these test results are quite encouraging. While I do need to go further and test content-entry (and other cache-busting events) on a much more loaded-up database, I also expect to see gains from a shared memcache, and am not worried about page or block cache being invalidated by node submits, since we are dealing exclusively with logged-in users anyway.

Other notes

One big takeaway from my work was that for high-performance use cases, the c1.medium EC2 instance is the ticket to ride for Drupal webservers with complex module stacks. The application will become CPU-bound well before it runs into the (relatively) low memory allotment, and this is only more true on the 64-bit instances. While those may be good as MySQL or memcache role-players, for actually generating page results, you get a lot more mileage out of the high-CPU instance types.
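As a back-of-the-envelope illustration, the numbers from this test work out to a bit over 33 req/sec per c1.medium. The sketch below does that arithmetic and extrapolates to a target rate. It assumes throughput scales linearly with the number of webservers and that the database keeps up, neither of which was verified here, and the 250 req/sec target is purely hypothetical.

```python
import math


def per_instance_throughput(total_rps, instances):
    """Average requests/second handled by each webserver."""
    return total_rps / instances


def instances_needed(target_rps, rps_per_instance):
    """Webservers needed for a target rate, assuming linear scaling."""
    return math.ceil(target_rps / rps_per_instance)


# Figures from the test: a little over 200 req/sec across six c1.medium instances.
observed = per_instance_throughput(200, 6)

# Hypothetical target, for illustration only.
needed = instances_needed(250, observed)

print(f"{observed:.1f} req/sec per instance, {needed} instances for 250 req/sec")
```

Treat the result as a starting point for the next round of load tests rather than a promise; the real ceiling depends on the dataset and on where the database becomes the bottleneck.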

I should note here that Mercury as a stack isn't providing the magic. Varnish doesn't help us very much when we have 100% logged-in users. I used our AMI as a quick way of getting a best-practice hosting environment, but this is definitely an example of "scaling through superior hardware."

To end on a high note, it was a pretty painless experience to use the RDS MySQL-in-a-black-box instance, and our ability to scale those instances up or down offers some interesting options. I noticed between 10 and 15 minutes of downtime when up/downsizing the RDS instance, so it's not a bulletproof answer for highly available sites. Still, that's not a ton of downtime to have in a planned maintenance window, or when responding to a major event where people need to use a site and can wait 15 minutes while it revs itself up.

Comments

Cool post

rjbrown99

Cool post. I have been staying away from RDS instances figuring mysql can be better tuned by hand than delegating control to the Amazon folks. What do you think of it so far? Similar performance on RDS small vs regular small? Or RDS large vs regular large?

Unsure yet

joshk

I haven't done a head to head comparison. My assumption is that you can probably hand-tune slightly better, if you invest a lot of time and know your application. However, I presume RDS has a lot of common-sense optimizations, and the ability to "upsize" your database instance within a planned maintenance window is pretty attractive.

Atrium

nnewton

Great post, Josh. I recently launched a larger Atrium site (30k users, 10k active) and did extensive load testing before launch. I believe you and I were testing different aspects of Atrium, though. My load test was a jmeter run with 200 threads logging in to the site, visiting 10-15-20 pages and logging out, with some content creation threads. Infrastructure was 4 quad-core Xeon machines running Nginx for static file serving, Apache, PHP 5.2, APC, MySQL-5.1-Percona and Memcache. My concern was less about concurrent users on a small dataset and more about how Atrium scaled as the dataset grew. We found some issues, mainly related to the activity stream view on the dashboard, which breaks at 12k nodes and around 100k comments. We fixed this for launch by rewriting the view to be a bit simpler, but lost many of the activity stream features we liked. We are working on getting these back, perhaps using tracker2.

We also discovered that notifications_team breaks with larger user numbers. A common use case for us was having a single global group with every user in it. When this got to 15k users, notifications_team was trying to load the entire users table to render the team checkboxes for admin roles. We ended up just disabling this.

With these issues fixed, we got Atrium to scale up to 40k nodes and 300k comments.

-N

Good work

joshk

You obviously went quite a bit further than I did with my initial testing, which was focused on proving that we can do the front-side "elasticity" thing with Amazon. I'll be looking at big datasets on my end in the coming months, but this is excellent intelligence.

Hopefully you are feeding this back to DevelopmentSeed? I know they're hungry for real-world results from using their tools in stressful situations.

Oh also

joshk

Any interest in sharing a sanitized JMX file? I'll promise to update it and/or do the same on my end. ;)

How did you integrate Open Atrium into Mercury?

luismiguel2001us

So, how did you get the Open Atrium install to work on top of Mercury? Did you just copy modules, or perform a file diff? Could you elaborate on the process, since I would like to do the same thing? I have a Mercury installation already up and running and I would like to get Open Atrium to operate within that environment. Thanks for your help.

Cheers,
Luis