
No One Said Scaling Was Easy

Over the past week and a half, LayerVault hasn’t gone down, but it’s been noticeably slow for a few users. We have tools in place that let us monitor each customer’s individual experience of the service. If anything looks amiss, we recreate the problem and tighten the screws. We’ve been experiencing some great growth over the past few weeks, and with growth comes growing pains.

For now, it seems like we can see the edge of this forest. I wanted to write a blog post to talk about scaling past that initial machine and scaling Rails. As always: if you’re having issues with the service, please send us an email right away at support@layervault.com. Chances are, we’re already working on a solution.

Here we go.

Scaling Your Web Service

At LayerVault, we built our service using the Ruby on Rails framework, version 3.1. Our database is MySQL. Our hosting provider is Slicehost. We use Amazon S3 and CloudFront for serving out static files.

Rails is one of the best frameworks in existence, if not the best; it is designed to take you from your initial idea to a sustainable business. In the early days of LayerVault (just a few months ago), the entire service was powered by one box. That box ran Apache, our Rails instances and the database. Scary, right? But it served our initial need: getting the service online quickly, without maintaining servers we didn’t need at prices we couldn’t afford.

As we got more customers, our needs changed. The megabytes of daily data turned into gigabytes. We had to make sure that a heavy request might slow down one box for a short while, but would never slow down the whole service. To do this, we slowly started peeling apart the logical functions of our web application. Now our service looks like so:

Whew, quite a bit more complicated. That’s what scaling looks like. Let’s break down some of the more important things we did.

Identify the bottlenecks

Scaling for scaling’s sake is a bad idea. You’ll probably introduce complexity that you don’t need. Make sure that you have tools that let you isolate issues. We use New Relic, the oink gem, Pingdom, internal monitoring tools and Unix’s top. Before you begin, watch all of the New Relic Scaling Rails podcast, even if you don’t use Rails.

Above all, fix one thing at a time and measure the results.

Set up a reverse proxy

A “proxy” in web parlance is an intermediate server. Proxy servers accomplish different things: they can help you visit otherwise blocked sites, or they can act as a traffic cop. A reverse proxy is usually the traffic cop for a mid-sized web service: it directs individual web requests to different application servers. It can also cache common requests, but I won’t get into that. Reverse proxy applications have many other names and uses as well; they are sometimes referred to as HTTP accelerators or load balancers. (This is grossly simplified for the sake of this post.)

Technically, LayerVault uses two reverse proxies: Varnish and Pound. (We used a heavily modified version of this setup.) When a client makes an HTTPS request, Pound takes the request and translates it to a vanilla HTTP request inside our firewall and then feeds it to Varnish. Varnish takes all HTTP requests and appropriately doles them out to our application servers. We have an entire 512MB slice dedicated to being our traffic cop.
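
For a rough idea of the Varnish side of that, a round-robin director over the app servers looks something like the sketch below. This is Varnish 2.x VCL, and the backend hostnames and ports are placeholders rather than our actual configuration:

backend app1 { .host = "10.0.0.10"; .port = "8080"; }
backend app2 { .host = "10.0.0.11"; .port = "8080"; }

director app_pool round-robin {
  { .backend = app1; }
  { .backend = app2; }
}

sub vcl_recv {
  # Every incoming request gets doled out to one of the app servers.
  set req.backend = app_pool;
}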

As many app servers as we need

Thanks to our reverse proxy, we now just add application servers as we need them. For LayerVault at this point, the bottlenecks come in the form of not enough CPU time and/or memory. We can horizontally scale our app servers as necessary. All it takes is turning on a new app server and adding it to the server rotation in Varnish. Nifty.

We use 2GB slices for our application servers.

Set up memcached

memcached is an in-memory caching system originally developed by the folks behind LiveJournal. It is fast: serving a cached fragment from memory is orders of magnitude faster than rebuilding it from the database. LayerVault customers will notice that the heaviest pages now load in less than 400ms with a warm cache, i.e. when they have visited the page recently. This makes navigating your account that much more pleasant.

We use the Dalli cache gem and caching strategies similar to those described in @dhh's 37signals post: "How Key-based Cache Expiration Works".
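
As a minimal sketch of what that setup boils down to (the memcached host and the cached markup are placeholders, not our exact configuration): point Rails’ cache store at memcached via Dalli, then key view fragments off a record’s cache_key.

# config/environments/production.rb -- use memcached through Dalli
# (the memcached host below is a placeholder)
config.cache_store = :dalli_store, 'memcached.internal:11211'

<%# In a view: the fragment key includes the record's cache_key, which %>
<%# changes whenever updated_at changes, so stale fragments are never read. %>
<% cache project do %>
  ... expensive file listing markup ...
<% end %>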

Caching is difficult, mostly due to the question of “When do you expire this?”. There is no silver bullet. In the future, we’ll be tightening the screws on our caching strategies to shave even more milliseconds off each request.

Rails 3.1 static asset compilation, CloudFront CDN

Each HTTP request has a certain amount of overhead. In general, a page should minimize the total number of requests it makes. LayerVault is not a simple site: we have plenty of CSS and JavaScript. We write our JavaScript in a highly modular way that lets us neatly encapsulate behaviors. Our style and patterns are prettier than your average CoffeeScript (no, seriously). For example, we may have an entire file that controls the behavior of a button on a single page.

We first built LayerVault on Rails 3.0. We recently upgraded to Rails 3.1 to take advantage of its static asset pipeline. The asset pipeline allows us to work in separate files until we deploy the code. Upon deploy, all of our JavaScript is smartly concatenated and served through Amazon CloudFront, a content delivery network (CDN), with our application server as the origin. It let us remove a slow deployment step: uploading all of our static assets to Amazon S3.

So, instead of serving up dozens of CSS and JavaScript files, we serve up two. Better yet, these files come from a CDN node near you. Boss sauce. We also take advantage of the Rails asset_data_uri method. It’s amazing. We use it to inline small images in our CSS to cut down on the number of requests. All of this brought our front-page load times from 2 seconds (yuck) down to a few hundred milliseconds in most cases (getting there).
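
For reference, the moving parts amount to roughly this sketch; the CloudFront hostname and the file names are placeholders, not our actual assets:

# config/environments/production.rb -- serve compiled assets through CloudFront
# (the CloudFront hostname is a placeholder)
config.action_controller.asset_host = "d1234abcd.cloudfront.net"

/* app/assets/stylesheets/buttons.css.erb -- inline a tiny image as a data URI */
.signup-button {
  background-image: url(<%= asset_data_uri "signup-arrow.png" %>);
}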

Things that made our lives much easier

Drop the protocol declaration in as many externally-linked assets as you can.

You can visit any page on LayerVault using HTTPS. This is not an easy thing to accomplish, especially when serving up assets from several different domains and worrying about things like page caching. An easy trick is to simply drop the protocol declaration from all asset requests. Thus:

<link rel="stylesheet" href="http://s3.amazonaws.com/bucket/my.css">

Becomes:

<link rel="stylesheet" href="//s3.amazonaws.com/bucket/my.css">

This makes statically caching the page much easier. It will cause older versions of IE (7 and 8) to download the stylesheet twice. Seeing as how 90% of the people visiting LayerVault use a WebKit browser, we couldn’t care less about IE.

Offload tasks to worker processes and test them

No duh. But the problem with having worker processes is that they often become something extra to test and maintain. We use the delayed_job gem to background anything that takes longer than a few milliseconds. We also have a huge suite of Cucumber tests. This line in our environments/test.rb allows us to essentially follow all actions to their very end:

Delayed::Worker.delay_jobs = false

Now we know that the work the workers perform will pass tests too.
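
As a sketch of how the backgrounding looks in practice (the model method here is a hypothetical stand-in, not our actual code), delayed_job lets you mark a slow method to run on a worker instead of inside the request:

class VoreFile < ActiveRecord::Base
  def generate_preview
    # heavy image processing we never want inside a web request
  end

  # delayed_job queues calls to generate_preview for a worker process
  handle_asynchronously :generate_preview
end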

Switch from Rails-provided lookup helpers to SQL

Rails and ActiveRecord are all flowers and butterflies for a while.

user.projects.files.select{ |f| f.file_name == "Cool.psd" }

Ah, so nice. Or:

User.all.select{ |u| u.plan.price > 20 && /[Kk]elly/ =~ u.name }

(Not the best examples, I know.) But while ActiveRecord is great for getting things up-and-running quickly, it’s very greedy. While it’s my fault for using select like I have here, each line loads many more records than needed from the DB. Each record loaded gets instantiated into an ActiveRecord object then tinkered with. ActiveRecord is sufficiently complex and doesn’t go easy on the memory. Too many instantiations of ActiveRecord objects and your request eats up all the memory, your box starts to swap, and your users notice a significant degradation in performance.

So now, we’re slowly moving to more SQL-based statements in our application:

VoreFile.all(:conditions => {
  :user_id    => user.id,
  :project_id => user.projects.map{ |p| p.id },
  :file_name  => "Cool.psd"
}) 

When we are explicit like this, we only load precisely what is needed. In general, the database layer is much better than the application at selecting records with certain sets of criteria. We could even get better about only loading the project IDs in the example above.
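
For example, assuming a standard has_many :projects association, that last tweak might look like this sketch:

VoreFile.all(:conditions => {
  :user_id    => user.id,
  :project_id => user.project_ids,  # selects only the id column, not full Project objects
  :file_name  => "Cool.psd"
})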

Fine-tune Apache and Passenger

If you’re using Apache and Phusion Passenger, fine-tune the settings. Specifically, we found the following Apache directives to be helpful:

PassengerMinInstances
PassengerMaxRequests
PassengerMaxPoolSize
PassengerPoolIdleTime

Because we handle a bunch of heavy requests, we recycle Rails processes more frequently than most. Our PassengerMaxRequests is kept pretty low. We keep a tight window on the number of Rails instances running on each application server. This keeps us in a sweet spot for memory.
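
To make that concrete, here is the shape of the configuration; the numbers below are illustrative placeholders, not our production values, and the right ones depend on your own memory and traffic profile:

# Inside the Apache virtual host (or global) config
PassengerMinInstances 3
PassengerMaxPoolSize  6
PassengerMaxRequests  500   # recycle Rails processes often to cap memory growth
PassengerPoolIdleTime 300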

Onward and Upward

We’ve had the phrase “good problems to have” bashed into our skulls here at LayerVault. Getting customers is great. Getting lots of customers is even better. Getting lots of customers to your data-intensive application is a Good Problem to Have.

As we move forward, we’re measuring requests coming across the wire and seeing which users have difficulties. We religiously write tests against any bugs and corner cases that are found.

I’ll write another post in the future to check in and see what kind of issues we’re dealing with then.

Discussion on Hacker News.

Feel free to tweet us any specific questions.

—Kelly
