Scalable Background Job Processing With Ruby on Rails and Skynet

On a recent project, I needed to be able to run a large number of background jobs. Each job would likely last between 30 seconds and 5 minutes, depending on the amount of work involved. We needed something that could do this heavy lifting, so we spent some time surveying the Ruby/Rails job processing landscape.

Many Ruby/Rails Job Processing Libraries

There are quite a few Ruby and Rails job processing libraries – in fact, too many to list. Most of the libraries can be divided into the following groups:

  1. Spawners – spawn a new process and wait or fork and forget
  2. Single Process Workers – launch a single process, watch a queue or database table, and run each job in turn
  3. Multiple Process Workers – launch as many processes as you want and let each one run a job. Coordination between the processes is done by a) a managing process, b) a message broker, or c) database tables

Since the workload was going to be heavy and the hardware we had was capable of running multiple processes, I opted for the multiple process worker type. The next step was to figure out which library to use.

Why Skynet for Ruby on Rails Job Processing?

Enter Skynet, a Ruby-based library written by Adam Pisoni. Skynet implements the MapReduce model, which Google patented and uses to power many of its services, including search. Here is Adam’s summary of MapReduce:

At its simplest level, a MapReduce job defines a data set, a map method and a reduce method. It may also define a partition method. The MapReduce server evenly splits up (partitions) the data given to it and sends those chunks of data, along with a copy of the code in the map method, to workers that execute the map method against the data it was given. The output from each worker is sent back to the MapReduce server. At this point the MapReduce server evenly partitions the RESULT data returned from the workers and sends those chunks of data along with the reduce code to the workers to be executed. The reducers return the final result which is returned to the process that requested the job be done in the first place. Not all jobs need a reduce step, some may just have a map step.

You can also read the full paper from Google at http://labs.google.com/papers/mapreduce.html.
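
To make the summary concrete, here is a minimal plain-Ruby sketch of the same flow: partition the data, map each chunk, then reduce the partial results. It is only an illustration and does not use Skynet's API; the method names (partition, map_chunk, reduce_counts, word_count) and the word-counting task are assumptions chosen for the example, and everything runs in a single process.

```ruby
# A plain-Ruby illustration of the map/partition/reduce flow.
# Nothing here is Skynet API; all names are hypothetical.

# Split the data into roughly equal chunks, one per (simulated) worker.
def partition(data, worker_count)
  chunk_size = (data.size / worker_count.to_f).ceil
  chunk_size = 1 if chunk_size < 1
  data.each_slice(chunk_size).to_a
end

# Map step: each worker counts the words in its own chunk of lines.
def map_chunk(lines)
  counts = Hash.new(0)
  lines.each { |line| line.split.each { |word| counts[word.downcase] += 1 } }
  counts
end

# Reduce step: merge the partial counts from every worker into one result.
def reduce_counts(partials)
  partials.each_with_object(Hash.new(0)) do |partial, total|
    partial.each { |word, n| total[word] += n }
  end
end

# Tie the steps together the way a MapReduce server would.
def word_count(lines, worker_count = 2)
  partials = partition(lines, worker_count).map { |chunk| map_chunk(chunk) }
  reduce_counts(partials)
end

lines = ["the quick brown fox", "jumps over the lazy dog", "The end"]
result = word_count(lines)
puts result["the"] # prints 3
```

In a real MapReduce system each call to map_chunk would run on a different worker, which is where the speedup comes from.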

So, How Does It Work?

Skynet works by launching one or more processes that are either a) masters (coordinators), b) workers, or c) both. Any master pulls a job off the queue (stored in the database) and dispatches the work, or runs it locally if the process is also a worker. The worker performs the map step first and then the reduce step, if one is provided. When the job has completed, the results are stored in the database for later retrieval, or returned to the caller if one is waiting.
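
The coordination described above can be sketched in a single Ruby process. This is not how Skynet is implemented: threads stand in for its separate worker processes, an in-memory Queue stands in for the database-backed job queue, and run_job with its parameters is a hypothetical name introduced for the example.

```ruby
require "thread"

# In-process sketch of the master/worker flow. Threads stand in for separate
# processes and a Queue stands in for the database-backed job queue.
# All names are hypothetical, not Skynet API.
def run_job(chunks, worker_count: 3, map:, reduce: nil)
  tasks   = Queue.new # work handed out by the "master"
  results = Queue.new # map output sent back to the "master"

  chunks.each_with_index { |chunk, index| tasks << [index, chunk] }
  worker_count.times { tasks << :stop } # one shutdown marker per worker

  workers = Array.new(worker_count) do
    Thread.new do
      while (task = tasks.pop) != :stop
        index, chunk = task
        results << [index, map.call(chunk)]
      end
    end
  end
  workers.each(&:join)

  # Restore the original chunk order before the optional reduce step.
  mapped = Array.new(results.size) { results.pop }.sort_by(&:first).map(&:last)
  reduce ? reduce.call(mapped) : mapped
end

total = run_job(
  [[1, 2], [3, 4], [5]],
  map:    ->(chunk) { chunk.sum },
  reduce: ->(sums) { sums.sum }
)
puts total # prints 15
```

Leaving out the reduce argument returns the mapped results as they are, which mirrors jobs that only need a map step.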

The Project Specifics

For this project, we have long-running jobs that can be broken into smaller units of work and then reassembled. The nice thing about Skynet is that it allows us to break up this work without having to track the results and assemble them ourselves. This allowed us to define a map step of processing the smaller work unit, then a reduce step of reassembling the pieces back together. We could launch as few or as many workers as we needed, to balance memory consumption and time-to-completion.

The results were quite impressive: a 5-minute job would be broken into approximately 10 smaller units of work. The total running time to create the job, let Skynet map and reduce it, and receive the results was approximately 45 seconds on a dual quad-core server with plenty of memory. We further optimized this by using memcached to cache calculations and prevent redundant work, which cut the total time for a job to as little as 15 seconds.
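
The caching idea can be sketched as follows. This is an assumption-laden illustration rather than our production code: CalculationCache is a hypothetical class in which a Hash plays the role of memcached, and price_for is an invented stand-in for an expensive calculation. A real setup would call a memcached client in place of the Hash.

```ruby
# Sketch of caching an expensive calculation so repeated units of work skip it.
# CalculationCache is hypothetical: a Hash plays the role of memcached here.
class CalculationCache
  attr_reader :misses

  def initialize
    @store  = {}
    @misses = 0
  end

  # Return the cached value for key, computing and storing it on a miss.
  def fetch(key)
    return @store[key] if @store.key?(key)

    @misses += 1
    @store[key] = yield
  end
end

cache = CalculationCache.new

# Hypothetical expensive step; the key must capture every input to the result.
def price_for(cache, sku, quantity)
  cache.fetch("price:#{sku}:#{quantity}") { sku.length * quantity * 100 }
end

3.times { price_for(cache, "ABC", 2) }
puts cache.misses # prints 1
```

Because every worker shares the same cache, a calculation done by one unit of work is free for all the others.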

Tips for Using Skynet with Ruby on Rails

Skynet can have a steep learning curve. We have over 10 years of distributed computing experience, and it still took us some time to familiarize ourselves with the gem. Here are some tips and suggestions:

  1. Skynet requires a restart after a code change. So, write your job code in a separate class and test it. This may require you to design for testability, but it will allow you to ensure your code is stable before testing it live within Skynet
  2. By default, Skynet will start 2 dual master/workers and some worker-only processes. If your masters get overwhelmed, your map/reduce jobs can stall because there is no master free to coordinate the effort. Launch master-only processes to prevent this
  3. Skynet will dispatch work as it sees fit. If you want to ensure that your jobs are processed in a certain order, you may need to throttle the jobs into Skynet by fronting it with a message broker and queue processor. This is overkill for most projects, but can help ensure that a job that has been split into smaller jobs is fully completed before the next one starts
  4. The last version we used required that we refresh all ActiveRecord connections at the start of our job. Otherwise, AR would receive a timeout from our database connection and fail to process
  5. Use good logging. Skynet provides a logging mechanism, so use it. You may need to include specific identifiers to provide context, as the log fills quickly when many processes are running at once
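
Tips 1 and 5 can be combined in a small sketch: a job class with no Skynet dependency, so it can be tested without restarting any workers, that tags every log line with identifying context. ReportJob and its methods are hypothetical names, and Ruby's standard Logger stands in for Skynet's own logging mechanism, which is not shown here.

```ruby
require "logger"
require "stringio"

# Hypothetical job class kept free of any Skynet dependency so it can be
# exercised in a plain test run, without restarting workers.
class ReportJob
  def initialize(job_id, logger)
    @job_id = job_id
    @logger = logger
  end

  # Map step: process one unit of work and log it with identifying context.
  def map(unit)
    log("mapping unit #{unit[:id]}")
    unit[:values].sum
  end

  # Reduce step: reassemble the partial results.
  def reduce(partials)
    log("reducing #{partials.size} partial results")
    partials.sum
  end

  private

  # Prefix every line with the job id and process id so interleaved
  # output from many workers can be told apart.
  def log(message)
    @logger.info("[job=#{@job_id} pid=#{Process.pid}] #{message}")
  end
end

output = StringIO.new # capture log lines in memory for the example
job = ReportJob.new("report-42", Logger.new(output))
partials = [{ id: 1, values: [1, 2] }, { id: 2, values: [3] }].map { |u| job.map(u) }
puts job.reduce(partials) # prints 6
```

Once the class behaves correctly on its own, the code that hands it to Skynet only needs to call its map and reduce methods.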

You can view the full Skynet documentation at its RubyForge home, or grab the latest source code from GitHub.

Finally, there seems to be a fork of the Skynet project by Brendan Baldwin. He looks to be making some really nice improvements, including Nagios monitoring support and dispatching jobs to Skynet from any object with a single call. We hope to evaluate his project fork soon.
