Why Hadoop is Mostly Batch
posted by Anna Mar, December 14, 2012
Hadoop is probably the most widely discussed cloud technology today. With all the buzz, it's easy to lose track of what Hadoop actually does.
One of the biggest points of confusion is whether or not Hadoop is a batch processing platform. The answer is yes … and no.
Hadoop and Batch
Hadoop implements MapReduce, a parallel processing approach developed by Google. MapReduce divides processing among many "map" steps and merges the results using "reduce" steps.
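To make the map and reduce steps concrete, here is a minimal single-process sketch in Python that counts words. It is not Hadoop's API: the function names (map_step, shuffle, reduce_step, word_count) are made up for illustration, and a real cluster would run the map calls on many machines rather than in a loop.

```python
from collections import defaultdict

def map_step(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, the work done between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(key, values):
    """Merge all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Each document could be mapped on a different machine; here we just loop.
    pairs = [pair for doc in documents for pair in map_step(doc)]
    return dict(reduce_step(k, v) for k, v in shuffle(pairs).items())
```

Calling word_count(["big data", "big batch"]) maps each document separately and then merges the per-word counts, which is the divide-then-merge pattern described above.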
This approach is useful for batch processing big data and importing the results into relational databases for real-time use. This is exactly how Hadoop is most commonly used.
For example, a search index can be built from billions of unstructured documents and loaded into databases to support real-time search.
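The search example can be sketched the same way. The Python below is an assumption-laden toy, not how any particular search engine works: build_index stands in for the batch job, search stands in for the real-time lookup, and both names are hypothetical.

```python
from collections import defaultdict

def map_document(doc_id, text):
    """Map step: emit (term, doc_id) for each distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]

def build_index(documents):
    """Batch job: reduce the emitted pairs into term -> sorted document ids."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            postings[term].add(emitted_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search(index, term):
    """Real-time lookup against the precomputed index; no documents are read."""
    return index.get(term.lower(), [])
```

The expensive pass over every document happens once, in batch. Each search afterward is a single dictionary lookup, which is why the results can serve real-time queries.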
Real-time Hadoop
There is great interest in using Hadoop for real-time transaction processing and decision support. It can be done. For example, Apache Hive provides data warehouse infrastructure and SQL-like query capabilities built on top of Hadoop.
The overhead of dividing processing among many physical machines and merging the results is a significant consideration for real-time applications.
Real-time platforms for Hadoop employ strategies to reduce this overhead such as data indexes.
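As a rough illustration of why an index reduces that overhead, the Python sketch below compares a full scan with an indexed lookup. It makes no claim about how any specific Hadoop platform implements its indexes; the record layout and the function names (scan, build_customer_index, lookup) are assumptions chosen for the example.

```python
def scan(records, customer_id):
    """Full scan: examine every record, as a plain batch job would."""
    return [r for r in records if r["customer_id"] == customer_id]

def build_customer_index(records):
    """Precompute customer_id -> positions of that customer's records."""
    index = {}
    for position, record in enumerate(records):
        index.setdefault(record["customer_id"], []).append(position)
    return index

def lookup(records, index, customer_id):
    """Indexed lookup: touch only the matching records."""
    return [records[i] for i in index.get(customer_id, [])]
```

Both functions return the same records, but scan does work proportional to the whole data set on every query, while lookup does work proportional to the number of matches once the index has been built.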
It's likely that Hadoop's future lies with these real-time platforms.