• Hadoop, NoSQL & NewSQL
  • Experfy Editor
  • JUN 08, 2014

Can You Set Up an R-Hadoop System on Your Own?

Compared to the traditional data warehousing model, big data analytics delivers competitive advantage in two ways, data scientists claim. The first claim is that big data analytics can do the job with a simple, smart algorithm “applied to large volumes of data,” volumes far beyond the scope of a traditional data warehouse. The implication is that the algorithm itself is not the competitive advantage; rather, the algorithm’s ability to build models from huge amounts of data is!

The second claim is that vendor-supplied algorithms can do a better job than data scientists. To challenge both claims, companies and data scientists can look beyond packaged data models and learn to innovate with newer statistical programming languages.

As the amounts of data collected by organizations and enterprises explode, especially unstructured data, Hadoop is rapidly becoming a technology of choice for storing and processing it.

A comment in Hadoop: The Definitive Guide, Second Edition contrasts HBase with traditional DBMSs: “We currently have tables with hundreds of millions of rows and tens of thousands of columns; the thought of storing billions of rows and millions of columns is exciting, not scary.”

In relation to big data and Hadoop, you might assume that most data scientists reach for technologies such as Hive, Pig, and Impala as their main tools. Surprisingly, if you ask a data analyst or a data scientist, they will tell you that their primary tool in Hadoop and big data environments is in fact R: an open-source statistical modeling language particularly suited for the data preparation, analytics, and correlation tasks required in a big data project.

Currently, many enterprises are turning to the R statistical programming language in combination with Hadoop as a potential solution to this unmet need for large-scale analytics. To get started, you may follow this link: Big Data Analytics with R and Hadoop

You can also watch this video: Integrating R and Hadoop with RHadoop


Marriage of Hadoop and R 

With both Hadoop and R being open source, the marriage of the two seems a natural one. But some fundamental differences between them need to be addressed to make the marriage work. R supports an iterative process: beginning with a hypothesis, exploring the data, trying different statistical models, and drilling down to the exact solution. Hadoop, on the other hand, supports batch processing, with jobs queued and executed in sequence. R is designed for in-memory execution, while Hadoop works on parallel slices of data distributed across a cluster. Combined, R and Hadoop can form a robust data analytics engine that applies algorithms to large-scale datasets in a scalable manner.
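To make that contrast concrete, here is a minimal sketch of how the RHadoop packages bridge the two models: you write ordinary R functions, and the rmr2 package runs them as the map and reduce phases of a Hadoop job. This assumes rmr2 is installed; the "local" backend shown here runs the same code in-process, without a cluster, which suits R's iterative, exploratory style during development.

```r
# A sketch of a MapReduce job written in R with the rmr2 package
# (from the RHadoop project). Requires rmr2 to be installed.
library(rmr2)
rmr.options(backend = "local")  # switch to "hadoop" on a real cluster

# Push a small sample of data into the (local) DFS
ints <- to.dfs(1:1000)

# Map: emit (remainder mod 10, squared value)
# Reduce: sum the squared values for each key
result <- mapreduce(
  input  = ints,
  map    = function(k, v) keyval(v %% 10, v^2),
  reduce = function(k, vv) keyval(k, sum(vv))
)

from.dfs(result)  # collect the per-key sums back into the R session
```

Setting `rmr.options(backend = "hadoop")` submits the identical code as a Hadoop streaming job, which is where the batch-oriented, distributed side of the marriage takes over.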

R is gradually becoming a de facto standard for data scientists, as it gives full control over the statistical models and also allows more automated execution of tests after development. As with all effective data analysis, higher volumes of data yield more insight, but they also drive in-memory processing requirements very high. Because the memory constraints of even the most powerful single machine hinder such memory-intensive processing, it is imperative that R leverage the parallel computing available in the Hadoop environment to deliver full-blown, actionable intelligence in real time. Ever thought of setting up your own R-Hadoop system? Begin here: Step-by-Step Guide to Setting Up an R-Hadoop System.
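As a rough sketch of what the R side of such a setup involves, assuming Hadoop itself is already installed (the paths and version numbers below are illustrative examples only; consult a setup guide for your own environment):

```r
# Tell the RHadoop packages where Hadoop lives.
# These paths are examples and will differ per installation.
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING =
  "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar")

# rmr2 and rhdfs are distributed as source tarballs from the RHadoop
# repository rather than CRAN, so install their CRAN dependencies first.
install.packages(c("Rcpp", "RJSONIO", "bitops", "digest", "functional",
                   "reshape2", "stringr", "plyr", "caTools"))

# Then install the downloaded RHadoop tarballs (version numbers will vary).
install.packages("rmr2_3.1.0.tar.gz",  repos = NULL, type = "source")
install.packages("rhdfs_1.0.8.tar.gz", repos = NULL, type = "source")
```

With the environment variables set and the packages installed, the R session can reach HDFS through rhdfs and submit MapReduce jobs through rmr2.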
