Over the past couple of years we have witnessed “Cloud” as the hot, fashionable trend in the
technology realm. Anything accessible over the Internet was promptly appended
with “Cloud” and instantly became “hot”.
Technology trends are analogous to fashion trends in the
sense that they change quickly and the world has a tough time catching
up. By the time you catch up, there is a new buzzword trending in the
marketplace.
Now the latest fashion is Big Data, and I wouldn’t be
surprised if a new entrant comes along before I finish writing this blog post. A new trend may well emerge, but my guess is that
Big Data is here to stay for a while.
So what is this “Big Data”? It is exactly what it sounds like: large, massive amounts of data. Today, with
the rapid adoption of technology all over the world through social media, email
and Cloud (yes, Cloud!), we generate around 2.5 quintillion bytes of data every
day [1]. If we take the entire set of data created from the beginning
of the computer era until today, we find that 90% of it was created
in just the past two years [1]. I can almost envision a new law coming
into force, just like Moore’s law, predicting that 90% (or some similar
fraction) of all existing data will always have been created in the past few
years.
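To put that daily figure in perspective, here is a quick back-of-the-envelope unit conversion (a sketch using decimal SI units; the 2.5 quintillion bytes/day estimate is IBM’s, cited in reference 1):

```python
# Rough unit arithmetic for the daily data-volume figure cited above.
# 2.5 quintillion bytes = 2.5 * 10^18 bytes = 2.5 exabytes (decimal units).
BYTES_PER_DAY = 2.5e18

EXABYTE = 1e18     # decimal (SI) exabyte
ZETTABYTE = 1e21   # decimal (SI) zettabyte

daily_eb = BYTES_PER_DAY / EXABYTE            # exabytes generated per day
yearly_zb = BYTES_PER_DAY * 365 / ZETTABYTE   # zettabytes generated per year

print(f"{daily_eb:.1f} EB/day, ~{yearly_zb:.2f} ZB/year")
# prints "2.5 EB/day, ~0.91 ZB/year"
```

In other words, at the current rate the world produces close to a zettabyte of new data every year.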
Whenever we see a new trend rising, it all boils down to
plain cold hard cash: it is usually the case that the revenue opportunity
exceeds the capital and operational investment. The cost reductions realized by
distributed computing and increased storage density have contributed to the
popularity of analyzing Big Data. The vast amount of data generated each day
holds trending information that could give any company an edge over its
competitors. Thus an entire market has been established around selling data, storing data,
manipulating data, structuring data and analyzing data.
So we can safely say that distributed computing, improved storage and the
sheer availability of data have all contributed to the Big Data trend.
Distributed computing has contributed to Big Data’s
success by increasing compute power and reducing costs through harnessing the
combined power of multiple commodity PCs (Intel/AMD dual- or quad-core
configurations costing around $500–$1,000). The ability to distribute computation
across hundreds of thousands of commodity PCs running free open source software
like Linux is extremely cost effective. Beyond the lower initial capital investment, the
ongoing operational costs of maintaining proprietary hardware and software —
licenses, resources and personnel — are largely eliminated with commodity
PC clusters.
In addition to the strides made in distributed computing, it
is the increase in storage capacity and throughput that has made analyzing Big
Data feasible. SANs (Storage Area Networks) have been available for a while to
provide petabytes of data storage. However, when analyzing Big Data, access and
read/write performance matter, so that petabytes of data can be analyzed
within a day rather than weeks. One example of high-performance storage is
flash-based SSDs. Breakthroughs in storage technology have
made it possible to scale flash-based SSDs while retaining their low
latency and high throughput. High-capacity flash arrays today can scale up to
500 TB in a single rack while offering a throughput of more than a million read/write
operations per second. There are also configurations that eliminate rack-mounted
storage altogether and place 500 GB to more than 1 TB of storage in each of the
PCs/servers that form the cluster, providing 200K–500K read/write
operations per second.
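A rough calculation shows why those I/O rates matter for the “days, not weeks” claim above. This is only a sketch: it assumes a hypothetical 4 KiB block per I/O operation, which the figures above do not specify.

```python
# Back-of-the-envelope: how long does a full scan of 1 PB of data take?
PETABYTE = 1e15          # bytes, decimal units
BLOCK_SIZE = 4 * 1024    # bytes per I/O operation (assumed, 4 KiB)

def scan_days(iops):
    """Days needed to read 1 PB at the given I/O operations per second."""
    seconds = PETABYTE / (iops * BLOCK_SIZE)
    return seconds / 86400  # 86,400 seconds in a day

# At 1M IOPS (the rack-scale flash figure above) vs. a tenth of that:
fast = scan_days(1_000_000)   # roughly 3 days
slow = scan_days(100_000)     # roughly 4 weeks
```

At a million operations per second a petabyte-scale scan finishes in days; at a tenth of that rate the same job stretches into weeks, which is exactly the difference between a usable analysis and an impractical one.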
Although distributed computing and improved storage
are key factors contributing to Big Data research, we need to go to the
core of the issue. The root is the sheer availability of data generated by
computers, blogs, websites and people all over the world. Gartner estimates that there are 329 exabytes of personal data today, which will balloon to 4.1 zettabytes by 2016 (a zettabyte is more than a trillion gigabytes) [2].
With this amalgamation of distributed computing, storage and
data availability, it is evident that Big Data will remain an important buzzword
for the next few years.
References:
1. http://www-01.ibm.com/software/data/bigdata
2. http://www.networkworld.com/news/2012/062512-gartner-cloud-260450.html
3. http://www.zdnet.com/blog/open-source/amazon-ec2-cloud-is-made-up-of-almost-half-a-million-linux-servers/10620
4. http://www.techdirt.com/blog/?tag=kryder's+law
5. http://en.wikipedia.org/wiki/Mark_Kryder
6. http://mashable.com/2012/03/06/one-day-internet-data-traffic/