Monday, July 23, 2012

Why is Big Data trending?


Over the past couple of years we witnessed “Cloud” as the hot, fashionable trend in the technology realm. Anything accessible on the Internet was promptly appended with “Cloud” and instantly it was “hot”.

Technology trends are analogous to fashion trends in the sense that they change pretty quickly and the world has a tough time catching up. By the time you catch up, there is a new buzzword trending in the marketplace.

Now the latest fashion is Big Data, and I wouldn’t be surprised if a new entrant comes up before I finish writing this post. A new trend may well emerge, but my guess is that Big Data is here to stay for a while.

So what is this “Big Data”? It is exactly what it means: large, massive amounts of data. Today, with the rapid adoption of technology all over the world through social media, email and Cloud (yes, Cloud!), we generate around 2.5 quintillion bytes of data every day [1]. If we take the entire set of data created from the beginning of the computer era until today, we will find that 90% of it was created in just the past two years [1]. I can almost envision a new law coming into force, just like Moore’s law, predicting that the data created in any given period will be 90% (or some multiple) of all the data created in the preceding few years.
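
To put 2.5 quintillion bytes in perspective, here is a quick back-of-the-envelope calculation in Python; it is just a sketch, and the only assumptions are the SI definitions of the units:

    # Back-of-the-envelope: how much is 2.5 quintillion bytes a day?
    BYTES_PER_DAY = 2.5e18   # 2.5 quintillion bytes (IBM estimate, ref. 1)
    EXABYTE = 1e18           # SI exabyte
    ZETTABYTE = 1e21         # SI zettabyte

    print(BYTES_PER_DAY / EXABYTE)            # 2.5 EB created per day
    print(BYTES_PER_DAY * 365 / ZETTABYTE)    # ~0.91 ZB created per year

In other words, at that daily rate the world produces close to a full zettabyte of new data every year.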

Whenever we see a new trend rising, it all boils down to plain old cold, hard cash. It is usually the case that the revenue opportunity exceeds the capital and operational investment. The cost reductions realized through distributed computing and increased storage density have contributed to the popularity of analyzing Big Data. The vast amount of data generated each day contains trending information that could give any company an edge over its competitors. Thus an entire market has been established around selling, storing, manipulating, structuring and analyzing data. So we can safely say that distributed computing, improved storage and the sheer availability of data have contributed to the Big Data trend.

Distributed computing has contributed to Big Data’s success by increasing compute power and reducing costs through harnessing the combined power of many commodity PCs (Intel/AMD dual- or quad-core configurations costing around $500 to $1,000). The ability to distribute computation across hundreds of thousands of commodity PCs running free, open source software like Linux is extremely cost effective. Apart from the initial capital investment, the operational costs of maintaining proprietary hardware and software (licenses, resources and personnel) are simply eliminated with commodity PC clusters.
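
To make the divide-and-conquer idea concrete, here is a minimal sketch of the map/reduce pattern that cluster frameworks such as Hadoop popularized. The word-count task and the use of a local process pool to stand in for a cluster are my own illustration, not a real Hadoop job:

    from collections import Counter
    from multiprocessing import Pool

    def map_count(chunk):
        """Map step: each worker counts words in its own slice of the data."""
        return Counter(chunk.split())

    def reduce_counts(partials):
        """Reduce step: merge the per-worker counts into a single result."""
        total = Counter()
        for partial in partials:
            total.update(partial)
        return total

    if __name__ == "__main__":
        # In a real cluster each chunk would live on a different machine;
        # here a pool of local processes stands in for the commodity PCs.
        chunks = ["big data is big", "data about data", "clusters of cheap pcs"]
        with Pool(processes=3) as pool:
            partials = pool.map(map_count, chunks)
        print(reduce_counts(partials).most_common(3))

The appeal is that the map step parallelizes across as many cheap machines as you can afford, and only the small per-machine summaries travel over the network for the reduce step.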

In addition to the strides made in distributed computing, it is the increase in storage capacity and throughput that has made analyzing Big Data feasible. SANs (Storage Area Networks) have been available for a while to provide petabytes of storage. When analyzing Big Data, however, access and read/write performance matter, so that petabytes of data can be analyzed within days rather than weeks. Flash-based SSDs are one example of high-performance storage. Breakthroughs in storage technology have made it possible to scale flash-based SSDs while retaining their low latency and high throughput. High-capacity flash drives today can scale up to 500 TB in a single rack while offering a throughput of more than a million read/write operations per second. Other configurations eliminate rack-mounted storage altogether and place the storage (500 GB to more than 1 TB) in the PCs/servers that form the cluster, providing 200K to 500K read/write operations per second.
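
A rough calculation shows why throughput matters as much as capacity. The operation rates below come from the paragraph above; the 4 KB operation size is my own assumption about a typical I/O size:

    # How long does it take to scan 1 PB of data, 4 KB per operation?
    PETABYTE = 1e15
    OP_SIZE = 4 * 1024              # assumed 4 KB per read/write operation

    def scan_days(ops_per_second):
        seconds = PETABYTE / (OP_SIZE * ops_per_second)
        return seconds / 86400      # seconds in a day

    print(scan_days(1000000))       # rack at 1M IOPS: ~2.8 days
    print(scan_days(200000))        # node at 200K IOPS: ~14 days

At a million operations per second, a petabyte can indeed be scanned in days rather than weeks, which is exactly the threshold that makes Big Data analysis practical.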

Although distributed computing and improved storage are key factors contributing to Big Data research, we need to go to the core of the issue. The root is the sheer availability of data generated by computers, blogs, websites and people all over the world. Gartner estimates that there are 329 exabytes of personal data today, which will balloon to 4.1 zettabytes by 2016 (a zettabyte is more than 1 trillion gigabytes) [2].
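
Taking the Gartner figures at face value, the implied growth rate is striking. Here is a quick compound-growth calculation; reading the estimate as a four-year horizon (2012 to 2016) is my assumption:

    # Implied annual growth from 329 EB (today) to 4.1 ZB (2016), ref. 2
    start_eb = 329.0                # exabytes of personal data today
    end_eb = 4100.0                 # 4.1 zettabytes = 4,100 exabytes
    years = 4                       # 2012 -> 2016

    growth = (end_eb / start_eb) ** (1.0 / years) - 1
    print(growth)                   # ~0.88, i.e. roughly 88% growth per year

That is personal data very nearly doubling every single year.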

With this amalgamation of distributed computing, storage and data availability, it is evident that Big Data will remain an important buzzword for the next few years.

References:
  1. http://www-01.ibm.com/software/data/bigdata
  2. http://www.networkworld.com/news/2012/062512-gartner-cloud-260450.html
  3. http://www.zdnet.com/blog/open-source/amazon-ec2-cloud-is-made-up-of-almost-half-a-million-linux-servers/10620
  4. http://www.techdirt.com/blog/?tag=kryder's+law
  5. http://en.wikipedia.org/wiki/Mark_Kryder
  6. http://mashable.com/2012/03/06/one-day-internet-data-traffic/