Guest post by Jonathan Rosenblatt
This post is not intended to be a comprehensive review, but rather a “getting started” guide. If I did not mention an important tool or package, I apologize, and I invite readers to contribute in the comments.
I recently had the delight of participating in a “Brain Hackathon” organized as part of the OHBM2013 conference. As the hackathon was supported by Amazon, participants were provided with Amazon credit in order to promote analysis using Amazon Web Services (AWS). We badly needed this computing power, as we had 14×10^9 p-values to compute in order to localize genetic associations in the brain, leading to Figure 1.
Figure 1: Brain volumes significantly associated with genotype.
While imaging genetics is an interesting research topic, and the hackathon was a great idea in itself, it is AWS that I wish to present in this post. Starting with the conclusion:
Storing your data and analyzing it on the cloud, be it AWS, Azure, Rackspace or others, is a quantum leap in analysis capabilities. I fell in love with my new cloud powers and I strongly recommend all statisticians and data scientists get friendly with these services. I will also note that if statisticians do not embrace these new-found powers, we should not be surprised if data analysis becomes synonymous with Machine Learning and not with Statistics (if you have no idea what I am talking about, read this excellent post by Larry Wasserman).
As motivation for analysis in the cloud consider:
- The ability to do your analysis from any device, be it a PC, tablet or even smartphone.
- The ability to instantaneously augment your CPU and memory to any imaginable configuration just by clicking a menu. Then scaling down to save costs once you are done.
- The ability to instantaneously switch between operating systems and system configurations.
- The ability to launch hundreds of machines creating your own cluster, parallelizing your massive job, and then shutting it down once done.
Here is a quick FAQ before going into the setup stages.
Q: How does R fit in?
A: Very naturally, especially if you have an RStudio Server lying around.
Run it on your cloud machine, access it from anywhere using your browser and enjoy all the power of RStudio. Also note that RStudio server is not the only solution: you can also access your cloud machine using Remote Desktop (or VNC and the likes), VIM-R over SSH, ESS over SSH, or any of your preferred remote access schemes.
Q: Do I need to be a computer genius to set up my cloud machine?
A: The initial setup of your environment will require some knowledge. If you are completely unfamiliar with system administration, you should ask for assistance. Setting up a single cloud machine running RStudio server is a beginner’s level exercise in system administration and plenty of documentation is available (start at Louis Aslett’s site).
Once your machine has been properly set with your RStudio Server, the upload of data, the analysis and the retrieval of results is a breeze.
Q: Do I need to be a computer genius to set up a cloud cluster?
A: If you want a whole cluster of machines in order to parallelize your jobs, you will probably require a little more technical expertise than for a single machine, certainly for the initial setup. Then again, once the cluster is up and running, there are many tools and packages facilitating the parallelization of jobs for the end user. An overview of these tools can be found in Schmidberger et al. (2009). In particular, note the parallel package, which will probably have all you need, and the HPC CRAN Task View for a comprehensive list of tools.
Q: Isn’t this expensive?
A: It depends…
Running a small number of low-performance machines is pretty cheap. In fact, Amazon provides new users with about 750 hours of computing on a subset of services; have a look at the AWS Free Usage Tier. On the other hand, running 100 high-performance machines might cost about 100 USD per hour(!). Luckily, the parallelization itself will typically not increase costs: if you have 100 hours of computing to do, you will not pay more for one hour on 100 machines than for 100 hours on one machine. You can also reduce costs by tailoring the hardware to the problem at hand, and by using Amazon’s Spot Instances for large cost reductions on non-critical computations.
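The back-of-the-envelope arithmetic behind “parallelization does not increase costs” can be made explicit. The hourly rate below is purely hypothetical; check current AWS pricing for real numbers:

```r
# Hypothetical on-demand rate in USD per instance-hour (not a real AWS price)
rate <- 1.00
total_compute_hours <- 100

cost_one_machine  <- rate * total_compute_hours  # 1 machine for 100 hours
cost_100_machines <- rate * 100 * 1              # 100 machines for 1 hour each

cost_one_machine == cost_100_machines            # same bill, 100x the speed
```

The bill depends only on total instance-hours, so scaling out buys wall-clock time for free (ignoring small per-instance overheads such as boot time).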
Q: I am very happy running all my analysis on my laptop using Dropbox for sharing and backup. Why bother with “the cloud”?
A: An interesting anecdote is that Dropbox actually stores your data on Amazon’s S3 storage, so you are actually using AWS without realizing it.
But more to the point: Dropbox (and the like) is actually a pretty good way to share and back up data, if it can handle your needs. For modern-day problems, however, your laptop’s hard disk and memory may simply not suffice.
The day you wish you had another 100 laptops, it is time to “go cloud”.
Q: My institution already provides me with a cluster. Why do I need the Amazon cloud?
A: If you have all the computing power you need with easy remote access, it seems you already have your own cloud; no need to look for a new one (unless you are planning to change institution and want your research to be portable).
Possible considerations include ease of use, and the institution’s queuing system versus your own dedicated cluster. This cost-efficiency analysis should be done for your own specifics.
You should also note that some institutions and organizations have trouble with Amazon’s billing model, as your costs are not predetermined when you use the service.
Before we go through the setup process, we will need some terminology.
- AWS: “Amazon Web Services”. Includes all of Amazon’s cloud services. Most are actually not needed for mere data analysis.
- EC2: “Elastic Compute Cloud”. The framework for running your cloud machines. This is the main service for our purposes.
- EBS: “Elastic Block Storage”. A storage framework for attaching disk volumes to your EC2 cloud machines.
- S3: “Simple Storage Service”. A central storage framework, for when you want all your machines to read and write to a single cloud store.
- AMI: “Amazon Machine Image”. A snapshot of a machine’s configuration, for backup and reuse on other machines.
With the terminology in place, here are the setup stages for a single machine:
- Open an AWS account.
- In the EC2 Console, create your first machine instance. This is my favorite guide.
- In your newly created instance, setup RStudio server. I like Randy Zwitch’s guide.
- Consider adding a Dropbox account to your cloud machine to facilitate data transfers. Here is a guide.
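As a rough sketch of the RStudio Server step, on a fresh Ubuntu instance the installation might look like the following. The package file name and version are assumptions (they change frequently; take the current one from the RStudio Server download page), and `analyst` is a hypothetical user name:

```shell
# Run on the EC2 instance (Ubuntu); assumes port 8787 is open
# in the instance's security group.
sudo apt-get update
sudo apt-get install -y r-base gdebi-core

# The file name/version below is hypothetical -- substitute the current
# release from the RStudio Server download page.
wget https://download2.rstudio.org/rstudio-server-1.0.0-amd64.deb
sudo gdebi -n rstudio-server-1.0.0-amd64.deb

# Create a user to log in with at http://<public-dns>:8787
sudo adduser analyst
```

Once this runs, pointing your browser at the instance’s public DNS on port 8787 should give you the RStudio login page.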
You can now connect to your machine using RStudio or directly by SSH and start working.
There are some existing AMIs with R and RStudio Server preinstalled; in particular, note Louis Aslett’s AMI. I admit I did not use them, as I like to tailor my working environment.
The way you will be working depends on the way you connect to your remote machine. In the following, I will assume you are connecting via SSH or RStudio Server to a Linux machine.
If you are using Windows, you might prefer a Remote Desktop connection. You will then have a full GUI, so working on the remote machine will be no different from working on your local machine.
- Upload data:
Dropbox is easy. Also consider scp, wget and sshfs.
- Analyze data:
With RStudio Server, you have all of RStudio’s capabilities in your browser.
Otherwise, use VIM-R or ESS to work over SSH.
- Recover results: Once the analysis is over you might have several types of output.
Follow the instructions in this guide.
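The transfer options mentioned above can be sketched as shell one-liners. Host names, key files, and paths are all placeholders:

```shell
# Push a local file to the instance (key file and host are placeholders)
scp -i mykey.pem data.csv ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:~/project/

# Pull a publicly hosted data set directly onto the instance
wget http://example.com/data.csv

# Or mount the remote project directory locally over SSH
sshfs -o IdentityFile=mykey.pem \
    ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:/home/ubuntu/project ~/remote-project
```

`scp` works in both directions, so the same pattern recovers results afterwards by swapping source and destination.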
Unlike the single-machine setup, with a cluster you will have to pay attention to the following:
- Make sure all your machines have the same open ports so that you can connect to them directly (explained in the guide).
- Make sure all the machines have user accounts and RSA keys on the other machines so that they can “talk” to each other (explained in the guide).
- To exchange data between machines you have several options:
- Boot instances with pre-loaded data using your AMI. A large data set can be stored on your attached EBS volume. For super-quick access to small data sets, you could actually cache them on the ephemeral disk of each machine.
- Pull and push data from your S3 bucket using the S3 API.
- Pull and push data from one of your EBS volumes. Consider setting up an NFS server for this purpose.
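For the S3 option, pushing and pulling with Amazon’s command-line tools might look like the following. The bucket and file names are placeholders, and an installed, credential-configured `aws` CLI is assumed:

```shell
# Copy results up to a bucket, and fetch input back down.
# (Bucket and file names are placeholders; requires configured AWS credentials.)
aws s3 cp results.RData s3://my-analysis-bucket/results.RData
aws s3 cp s3://my-analysis-bucket/input.RData input.RData
```

Each machine in the cluster can run the same commands, so a shared bucket doubles as the collection point for all workers’ output.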
In my own case, I loaded the input data into my AMI and wrote the output to my S3 bucket. This had the advantage that the input data was ready to analyze on startup, and the output data was not needlessly duplicated across machines. As noted earlier, for small data and quick access, use your EC2 machine’s ephemeral disk rather than EBS.
If you have more than a handful of users, consider setting up an LDAP server to provide and manage user accounts.
For sensitive data, consider hiding your cluster behind a Virtual Private Cloud, which will block all the machines from the internet.
Now that you have a cluster up and running comes possibly the most interesting question in this post: how do you parallelize your jobs?
As you will see, parallelization is still a delicate art rather than a “plug-and-play” process, as there are many ways of doing things and many considerations to bear in mind:
- If you have a one-time massive computation to do and are unfamiliar with system administration, try the parallel R package. In particular, use makePSOCKcluster() to tell R about your many machines, and have it manage the parallelization for you using parApply()-type commands. As previously remarked, some packages can help you set up the cluster, so they will already know which machines are available, unlike the parallel package, which has to be told explicitly. On the other hand, if something goes sour, having another layer on top of your computations might complicate debugging.
- If you have a one-time massive computation to do and you are familiar with system administration, consider batch processing it manually: have your machines communicate over SSH and dump their output to an S3 bucket or your NFS server.
This might take some time to set up, but you will have a clear idea of what is going on and how to change/fix things.
- If you have many jobs to perform and wish for some smart queuing and fault tolerance, consider setting up a task management system. HTCondor and SGE are popular solutions. Amazon also provides such a manager, called Simple Workflow Service. As all of these managers require some installation steps and their own scripting syntax, consider them mostly for recurring tasks, or if your cluster will be used by more than just yourself, so that queuing is essential.
- If you have MASSIVE data files (tens of GB and beyond), want fault-tolerant storage, and use lapply-type functions (i.e., easy to parallelize, as detailed here), consider setting up a Hadoop cluster with the HDFS file system and the rmr2 package. Note, however, that the parallelization abstraction layer, while adding fault tolerance and speed, comes with some system-administration overhead. I would thus recommend it only for recurring tasks.
Remark: If you are completely unfamiliar with all the solutions mentioned, I would suggest managing your tasks with one of the R packages, and getting help at the installation stage.
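To make the first option concrete, here is a minimal sketch using the parallel package. Run locally, it starts two worker processes on one machine; on EC2 you would instead pass the instances’ host names (the host names in the comment are hypothetical):

```r
library(parallel)

# Locally: two worker processes on this machine.
# On a real EC2 cluster you would pass host names instead, e.g.
#   makePSOCKcluster(c("ec2-host-1", "ec2-host-2"), user = "ubuntu")
cl <- makePSOCKcluster(2)

# A toy embarrassingly parallel job: one independent computation per element,
# the same shape as computing many p-values.
res <- parSapply(cl, 1:100, function(i) i^2)
stopCluster(cl)

head(res)  # 1 4 9 16 25 36
```

The key point is that `parSapply()` has the same interface as `sapply()`, so once the cluster object exists, existing lapply-style code parallelizes with almost no changes.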
Schmidberger, Markus, Martin Morgan, Dirk Eddelbuettel, Hao Yu, Luke Tierney, and Ulrich Mansmann. 2009. “State of the Art in Parallel Computing with R.” Journal of Statistical Software 31 (1).