Tag Archives: cloud computing

brain_image01

Analyzing Your Data on the AWS Cloud (with R)

Guest post by Jonathan Rosenblatt

Disclaimer:
This post is not intended to be a comprehensive review, but more of a “getting started guide”. If I did not mention an important tool or package I apologize, and invite readers to contribute in the comments.

Introduction

I have recently had the delight to participate in a “Brain Hackathon” organized as part of the OHBM2013 conference. Being supported by Amazon, the hackathon participants were provided with Amazon credit in order to promote the analysis using Amazon’s Web Services (AWS). We badly needed this computing power, as we had 14*109 p-values to compute in order to localize genetic associations in the brain leading to Figure 1.

Figure 1- Brain volumes significantly associated to genotype.
brain_image01

While imaging genetics is an interesting research topic, and the hackathon was a great idea by itself, it is the AWS I wish to present in this post. Starting with the conclusion: 

Storing your data and analyzing it on the cloud, be it AWSAzureRackspace or others, is a quantum leap in analysis capabilities. I fell in love with my new cloud powers and I strongly recommend all statisticians and data scientists get friendly with these services. I will also note that if statisticians do not embrace these new-found powers, we should not be surprised if data analysis becomes synonymous with Machine Learning and not with Statistics (if you have no idea what I am talking about, read this excellent post by Larry Wasserman).

As motivation for analysis in the cloud consider:

  1. The ability to do your analysis from any device, be it a PC, tablet or even smartphone.
  2. The ability to instantaneously augment your CPU and memory to any imaginable configuration just by clicking a menu. Then scaling down to save costs once you are done.
  3. The ability to instantaneously switch between operating systems and system configurations.
  4. The ability to launch hundreds of machines creating your own cluster, parallelizing your massive job, and then shutting it down once done.

Here is a quick FAQ before going into the setup stages.

FAQ

Q: How does R fit in?

Continue reading

Richard Stallman talk+Q&A at the useR! 2010 conference (audio files attached)

The audio files of the full talk by Richard Stallman are attached to the end of this post.

—————–

Videos of all the invited talks of the useR! 2010 conference can be viewed on the R User Group blog

—————–

Last week I had the honor of attending the talk given by Richard Stallman, the last keynote speaker on the useR 2010 conference.  In this post I will give a brief context for the talk, and then give the audio files of the talk, with some description of what was said in the talk.

Context for the talk

Richard Stallman can be viewed as (one of) the fathers of free software (free as in speech, not as in beer).

He is the man who led the GNU project for the creation of a free (as in speech, not as in beer) operation systems on the basis of which GNU-Linux, with its numerous distributions, was created.
Richard also developed a number of pieces of widely used software, including the original Emacs,[4] the GNU Compiler Collection,[5], the GNU Debugger[6], and many tools in the GNU Coreutils

Richard also initiated the free software movement and in October 1985 he also founded it’s formal foundation and co-founded the League for Programming Freedom in 1989.

Stallman pioneered the concept of “copyleft” and he is the main author of several copyleft licenses including the GNU General Public License, the most widely used free software license.

You can read about him in the wiki article titles “Richard Stallman

The useR 2010 conference is an annual 4 days conference of the community of people using R.  R is a free open source software for data analysis and statistical computing (Here is a bit more about what is R).

The conference this year was truly a wonderful experience for me.  I  had the pleasure of giving two talks (about which I will blog later this month), listened to numerous talks on the use of R, and had a chance to meet many (many) kind and interesting people.

Richard Stallmans talk

The talk took place on July 23rd 2010 at NIST U.S.  and was the concluding talk for the useR2010 conference.  The talk consisted of a two hour lecture followed by a half-hour question and answer session.

On a personal note, I was very impressed by Richards talk.  Richard is not a shy computer geek, but rather a serious leader and thinker trying to stir people to action.  His speech was a sermon on free software, the history of GNU-Linux, the various versions of GPL, and his own history involving them.

I believe this talk would be of interest to anyone who cares about social solidarity, free software, programming and the hope of a better world for all of us.

I am eager for your thoughts in the comments (but please keep a kind tone).

Here is Richard Stallmans  (2 hours) talk:

Continue reading

Syncing files across computers using DropBox

Motivation

In the past few months I have been using DropBox for syncing my work files between my home and work computer. It has saved me from numerous mistakes and from sending the files to myself via e-mail.

Recently I found this service highly useful for sharing files with 4 other people with whom I am working on a data analysis project. Being so happy with it (and also by gaining more storage space by inviting friends to use it), I thought of sharing my experience here with other R users that might benefit from this cool (free) service.

What is Dropbox?

Dropbox is a Software/Web2.0 file hosting service which enable users to synchronize files and folders between computers across the internet.
This is done by installing a software and then picking a “shared folder” on your computer. From that moment on, that folder will be synced with any computer you choose to install the software on (for example, your home/work computer, your laptop – and so on)

DropBox also enables users to share some of their folders with other DropBox users. This seamless integration of the service with your OS file system (Windows, Mac or Linux) is what’s making this service so comfortable, by allowing me to work with co-workers and have the same “project tree” of folders, all of which are always synced.

You could also share a file “online”, by getting a link to it which you could share with others. So for example, you could write an R code, share it online, and call to it later with source(). This is the easiest way I know of how to do this.

Dropbox is a “cloud computing” Web2.0 file hosting service offering both free and paid services. The free version (which I use) offers 2GB of “shared storage” (unless you invite other users, in which case you get some extended storage space. Which is one of my motivations in writing this post).

Dropbox has other non-trivial uses allowing one to:

The service’s major competitors are Box.net, Sugarsync and Mozy, non of which I have had the chance of trying.

How to start?

Simply go to: DropBox.com
Sign up, install the software, use the new shared folder, and let me know if it helped you :)

How to get Extra space?

You can:

  • Earn another 750MB of space by connecting your dropbox to your twitter/facebook account and sending a status update about them. To get this bonus, head over to “Get extra space free!” page.
  • Refer a friend to open a dropbox account (every friend joining earns you another 250MB of space). This bonus is bounded by a total of 8GB of added space (after that, you won’t be allowed any more extra space)
  • Upgrade – pay 10$ a month and get extra 50GB

“The next big thing”, R, and Statistics in the cloud

A friend just e-mailed me about a blog post by Dr. AnnMaria De Mars titled “The Next Big Thing”.

In it Dr. De Mars wrote (I allowed myself to emphasize some parts of the text):

Contrary to what some people seem to think, R is definitely not the next big thing, either. I am always surprised when people ask me why I think that, because to my mind it is obvious. [...]
for me personally and for most users, both individual and organizational, the much greater cost of software is the time it takes to install it, maintain it, learn it and document it. On that, R is an epic fail. It does NOT fit with the way the vast majority of people in the world use computers. The vast majority of people are NOT programmers. They are used to looking at things and clicking on things.

Here are my two cents on the subject:
Continue reading