Addressing Big Data analytics with Hadoop: Q & A with chief Yellowfin architect, Peter Damen

Big Data is only becoming bigger, and rapidly. Recognizing that more organizations are in the midst of undertaking initiatives to explore and exploit their ever-expanding data assets, the latest release of Yellowfin’s Business Intelligence (BI) software – Yellowfin 6.1 – offers the ability for customers to connect to a significantly extended number of databases, including Hadoop.

But why are Big Data applications like Hadoop becoming increasingly important and popular?

The Zettabyte age

According to a recent research report by the International Data Corporation (IDC), global data will grow to 2.7 zettabytes in 2012 – up 48% on 2011. IDC predicts this figure will balloon to eight zettabytes by 2015. And if that’s not astonishing, maybe a quick calculation will help: One zettabyte = 1,000,000,000,000,000,000,000 bytes. Not only is data creation accelerating, but the capacity to store it has doubled every 3.5 years since 1980, according to Hilbert and López’s 2011 journal article The World’s Technological Capacity to Store, Communicate, and Compute Information. IBM claims that some 2.5 quintillion bytes of data have been created every day since the beginning of 2012.

Big Data is also big business

So why do organizations continue to mine more and more data? Well, it’s often been said that data is the new oil of the twenty-first century: more knowledge about customers, transactions and marketplace trends leads to more effective and efficient products and services and, hence, to increased profitability.

According to the globe’s largest open-network enterprise IT research and advisory firm Wikibon, and its formative Big Data Market Size and Vendor Revenues report, the Big Data marketplace will generate $5.1 billion in revenue during 2012, reaching a staggering $53.4 billion by 2017.

The need for Big Data technologies, like Hadoop, is pretty clear. So how does Yellowfin 6.1 support Hadoop?

5 questions with chief Yellowfin architect, Peter Damen

Firstly, what is Hadoop?

“Hadoop is a free open-source Java-based framework from the Apache Software Foundation that supports the distribution and running of applications on clusters of servers with thousands of nodes and petabytes of data.”

How does Yellowfin connect to, and extract data from, Hadoop?

“Yellowfin can take advantage of Hadoop’s awesome Big Data processing power by connecting to Hive.

“Hive is a database system that runs on top of Hadoop, providing an SQL-like language for tabular data selection and filtering.

“Hive can take advantage of the distributed nature of the Hadoop file system, allowing for petabytes of storage, using Map-Reduce functions to implement SQL-like queries.
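As a rough picture of the idea, a query such as SELECT region, COUNT(*) FROM sales GROUP BY region can be thought of as a map step that emits a (region, 1) pair per row, followed by a reduce step that sums the pairs for each key. The sketch below is plain Java running in a single process, not Hive or Hadoop code; the class name, method names and the "sales by region" data are made up for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration only. A real Hive query is compiled into
// jobs that run the map and reduce steps across many cluster nodes.
class GroupByCountSketch {

    // Map phase: emit a (key, 1) pair for every input row.
    static List<Map.Entry<String, Integer>> map(List<String> regions) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String region : regions) {
            pairs.add(new AbstractMap.SimpleEntry<>(region, 1));
        }
        return pairs;
    }

    // Shuffle and reduce phase: group the pairs by key and sum their values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Similar in spirit to: SELECT region, COUNT(*) FROM sales GROUP BY region
    static Map<String, Integer> countByRegion(List<String> regions) {
        return reduce(map(regions));
    }
}
```

On a cluster, the map step runs in parallel on the nodes that hold the data, which is what lets this style of query scale to very large tables.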

“The Yellowfin interface allows users to quickly author reports and visualizations on Big Data, just like any other relational data source.

“When support for a new database management system is added to Yellowfin, an interface is developed describing the capabilities of the database. This allows Yellowfin to hide functionality that may not be available on a particular system. However, the SQL supported by Hive allows all functions to be made available.
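The capability-interface approach described here can be sketched roughly as follows. This is not Yellowfin's actual code: the interface DatabaseCapabilities, its two methods and the ReportBuilderMenu helper are hypothetical names chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical description of what a given database system can do.
interface DatabaseCapabilities {
    boolean supportsSubQueries();
    boolean supportsUnions();
}

// Hypothetical UI helper: only offer features the database reports as available.
class ReportBuilderMenu {
    static List<String> visibleOptions(DatabaseCapabilities db) {
        List<String> options = new ArrayList<>();
        options.add("Basic report"); // assumed to be available everywhere
        if (db.supportsSubQueries()) {
            options.add("Sub-query");
        }
        if (db.supportsUnions()) {
            options.add("Union");
        }
        return options;
    }
}
```

With a design like this, adding a new database means writing one implementation of the interface, and the rest of the application decides what to show by asking that implementation.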

“The implementation of a database interface for Hive in Yellowfin means that a connection can be made using the connection wizard, prompting for host and port details, rather than entering a straight JDBC URL.”
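A wizard of this kind presumably assembles the JDBC URL from the individual fields. The following is a minimal sketch, assuming the HiveServer2 URL form jdbc:hive2://host:port/database (the original HiveServer used the jdbc:hive:// prefix instead). The class and method names are hypothetical, and nothing here reflects Yellowfin's internal code.

```java
// Hypothetical helper that turns wizard fields into a Hive JDBC URL.
class HiveJdbcUrl {

    static String build(String host, int port, String database) {
        if (host == null || host.isEmpty()) {
            throw new IllegalArgumentException("host is required");
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port must be between 1 and 65535");
        }
        // Assumption: fall back to Hive's "default" database when none is given.
        String db = (database == null || database.isEmpty()) ? "default" : database;
        return "jdbc:hive2://" + host + ":" + port + "/" + db;
    }
}
```

For example, HiveJdbcUrl.build("hadoop01", 10000, "sales") produces jdbc:hive2://hadoop01:10000/sales, where the host name and database are placeholders.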

What were the challenges associated with integrating support for Hadoop in Yellowfin 6.1?

“One issue that we encountered whilst implementing Hive support was that the JDBC driver has a plethora of dependencies. In fact, to work correctly, the Hive JDBC driver required nearly all of the Hive binaries and Hadoop libraries to be within scope.

“This was an issue for Yellowfin as JDBC drivers are loaded in a shared Class-Loader within Yellowfin itself. Some of the dependencies required for the Hive driver conflicted with Yellowfin dependencies.

“We resolved this potential problem by constructing a separate JDBC driver that could be loaded within the shared Class-Loader, but that loaded dependencies using a separate Class-Loader from a particular location on the file system.”
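The general technique described here, a thin driver that delegates to a real driver loaded through its own class loader, can be sketched as below. This is one common way to do it with the standard java.net.URLClassLoader and java.sql.Driver APIs, not Yellowfin's implementation. The class name IsolatedDriver, the jarUrls helper and the idea of passing a directory and a driver class name are all assumptions made for illustration.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.logging.Logger;

// Hypothetical wrapper: small enough to live in a shared class loader,
// while the real driver and its dependencies are loaded in isolation.
class IsolatedDriver implements Driver {
    private final ClassLoader isolated;
    private final Driver delegate;

    IsolatedDriver(File driverDir, String driverClassName)
            throws ReflectiveOperationException, MalformedURLException {
        // Use the parent of the system class loader so that classes on the
        // application classpath cannot clash with the jars in driverDir.
        ClassLoader parent = ClassLoader.getSystemClassLoader().getParent();
        this.isolated = new URLClassLoader(jarUrls(driverDir), parent);
        this.delegate = (Driver) Class.forName(driverClassName, true, isolated)
                .getDeclaredConstructor().newInstance();
    }

    // Collect every .jar file in the given directory as a URL.
    static URL[] jarUrls(File dir) throws MalformedURLException {
        File[] files = dir.listFiles((d, name) -> name.endsWith(".jar"));
        List<URL> urls = new ArrayList<>();
        if (files != null) {
            for (File f : files) {
                urls.add(f.toURI().toURL());
            }
        }
        return urls.toArray(new URL[0]);
    }

    @Override
    public Connection connect(String url, Properties info) throws SQLException {
        // Some libraries look classes up through the thread context class
        // loader, so point it at the isolated loader while connecting.
        Thread thread = Thread.currentThread();
        ClassLoader previous = thread.getContextClassLoader();
        thread.setContextClassLoader(isolated);
        try {
            return delegate.connect(url, info);
        } finally {
            thread.setContextClassLoader(previous);
        }
    }

    @Override public boolean acceptsURL(String url) throws SQLException {
        return delegate.acceptsURL(url);
    }
    @Override public DriverPropertyInfo[] getPropertyInfo(String url, Properties info)
            throws SQLException {
        return delegate.getPropertyInfo(url, info);
    }
    @Override public int getMajorVersion() { return delegate.getMajorVersion(); }
    @Override public int getMinorVersion() { return delegate.getMinorVersion(); }
    @Override public boolean jdbcCompliant() { return delegate.jdbcCompliant(); }
    @Override public Logger getParentLogger() throws SQLFeatureNotSupportedException {
        return delegate.getParentLogger();
    }
}
```

The wrapper itself depends only on JDK classes, so it can sit alongside other drivers without conflict. Everything the real driver needs is resolved from the directory passed to the constructor.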

Under what circumstances would you suggest a customer move to Hadoop over a traditional relational database?

“Essentially, Hadoop’s major advantage – when compared to traditional relational databases – is its ability to allow users to run queries against many petabytes of data.

“The term ‘Big Data’ is a contextual one – theoretically, if a small organization is struggling to process and leverage gigabytes of data then, relative to their size and capabilities, they have a Big Data issue to address. [Find out more in a presentation by Yellowfin CEO, Glen Rabie – Big Data: It’s not the size; it’s how you use it].

“So Hadoop is an excellent tool for dealing with particularly large volumes of data, but – despite the continuing proliferation of data, data types and sources – it’s not a tool that smaller organizations will need to consider using in the near future.”

How does a customer configure Yellowfin to use Hadoop?

“Connecting Yellowfin to Hadoop is relatively easy, but requires one extra step that most other Yellowfin database connections don’t require.

“Besides the standard information required when connecting Yellowfin to a database (like host, port and database), Hive requires a driver path to where required dependencies for the JDBC driver reside.

“The files in this path are loaded with a separate Class-Loader to avoid conflicting with other resources. Yellowfin presents a wizard for connecting, so you don’t need to enter a JDBC URL. However, the option is available if desired.”

Where to next?

Check out our official Yellowfin 6.1 release page to find out more about how we’re helping to make Big Data analytics easy.