Hadoop: What It Is and How It Works

posted November 30, 2012

Hadoop is an open source framework from Apache that can be used to process big data sets using distributed systems (such as cloud infrastructure).

What can Hadoop do?



Hadoop can break big processing tasks into many small tasks and distribute those tasks to commodity computers (e.g. on a cloud). In other words, it allows large problems to be solved using large numbers of computers.

Hadoop also includes a distributed, fault-tolerant filesystem that can handle big data.

One of the advantages of Hadoop is that it can process structured data (e.g. XML) and unstructured data (e.g. images) together.

Hadoop is used to solve a variety of business and scientific problems. For example, the New York Times used Hadoop to create 11 million PDFs from 4 terabytes of images in 24 hours.

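Search companies such as Yahoo have used Hadoop to build search indexes and calculate metrics from big sets of unstructured data. Google itself relies on its own MapReduce implementation rather than Hadoop.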

What is MapReduce?



Hadoop implements MapReduce — a model for parallel processing developed by Google. MapReduce solves problems by breaking them into two steps:

1. Map
A master node takes a problem and divides it into sub-problems. It distributes the sub-problems to worker nodes. Worker nodes may also break problems into sub-problems and distribute them. Each worker node solves its sub-problem and returns the result.

2. Reduce
The master node collects the solutions to sub-problems from worker nodes and combines them to form the answer to the problem.

Worker nodes may also process the reduce step.
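
The two steps can be illustrated with a word count, the usual introductory MapReduce example. This is a minimal single-machine sketch in plain Python, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the grouping step between map and reduce is written out explicitly here, whereas a framework would normally handle it.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Group all emitted values by key so each key is reduced once."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    mapped = []
    for doc in documents:  # on a cluster, each split would go to a different worker
        mapped.extend(map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    splits = ["big data big cluster", "big problem small tasks"]
    print(word_count(splits))
```

Because each call to map_phase depends only on its own split, and each call to reduce_phase only on one key, both steps can run on many machines at once.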

How does Hadoop work?



Hadoop implements MapReduce using two logical layers.

[Figure: Hadoop architecture details]

Hadoop Distributed File System
Hadoop Distributed File System (HDFS) is a scalable distributed filesystem that can store very large files and/or large numbers of files. Files are replicated to make the file system fault tolerant.

Hadoop can also be deployed with an alternative distributed file system such as the Amazon S3 filesystem.
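
To see why replication makes a file system fault tolerant, the sketch below splits a file into blocks and copies each block to several nodes. It is an illustration only: the round-robin placement, the node names and the tiny block size are assumptions made for the example, and real HDFS uses a more sophisticated, rack-aware placement policy. The replication factor of 3 matches the commonly used HDFS default.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin (an assumed policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement

def readable_after_failure(placement, failed_node):
    """A file stays readable if every block has a replica on some other node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )

if __name__ == "__main__":
    blocks = split_into_blocks(b"x" * 10, block_size=4)  # blocks of 4, 4 and 2 bytes
    nodes = ["node-a", "node-b", "node-c", "node-d"]      # hypothetical node names
    placement = place_replicas(len(blocks), nodes, replication=3)
    print(placement)
    print(readable_after_failure(placement, "node-b"))
```

With a replication factor of 1, losing the single node that holds a block makes the file unreadable, which is the failure that replication guards against.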

MapReduce Engine
Hadoop's MapReduce Engine uses a service known as the JobTracker to break problems into sub-problems, which it assigns to worker nodes known as TaskTrackers.

Hadoop attempts to keep each sub-problem close to the data it requires (ideally on the same physical machine).
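
The idea of keeping work close to its data can be sketched as a simple greedy scheduler. This is not the JobTracker's actual algorithm: the function name assign_tasks, the node names and the "first free node" fallback are assumptions chosen to keep the example short.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that already stores the block.

    block_locations: dict of block id -> list of nodes holding a replica
    free_slots: dict of node -> number of tasks the node can still accept
    Returns a dict of block id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block_id, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No free slot next to the data: fall back to any free node,
            # which means the block must be read over the network.
            candidates = [node for node, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free task slots left")
            node, is_local = candidates[0], False
        slots[node] -= 1
        assignment[block_id] = (node, is_local)
    return assignment

if __name__ == "__main__":
    locations = {"b0": ["n1"], "b1": ["n2"], "b2": ["n1"]}  # hypothetical layout
    print(assign_tasks(locations, {"n1": 1, "n2": 1, "n3": 1}))
```

In this run the tasks for b0 and b1 land on the nodes that store their data, while b2 has to run remotely on n3 because n1 has no free slot left.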

How do I use Hadoop?



A typical enterprise deployment of Hadoop passes big data from social, enterprise, legacy and industry data sources to a Hadoop instance for processing.

[Figure: Hadoop architecture]

When Hadoop processing completes, the result is imported into a database (RDBMS) or an enterprise application such as a business intelligence (BI) tool, analytics engine or ERP.
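
The import step might look like the following sketch, which loads job output into a relational table. Several things here are assumptions for illustration: the output is taken to be tab-separated word and count lines, SQLite stands in for the RDBMS, and the table name word_counts and function name load_results are hypothetical.

```python
import sqlite3

def load_results(lines, connection):
    """Insert tab-separated (word, count) lines into a results table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS word_counts (word TEXT PRIMARY KEY, total INTEGER)"
    )
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, total = line.split("\t")
        rows.append((word, int(total)))
    connection.executemany("INSERT OR REPLACE INTO word_counts VALUES (?, ?)", rows)
    connection.commit()
    return len(rows)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sample_output = ["big\t3\n", "data\t1\n", "cluster\t1\n"]  # made-up sample data
    load_results(sample_output, conn)
    query = "SELECT word, total FROM word_counts ORDER BY total DESC, word"
    print(conn.execute(query).fetchall())
```

Once the results are in a table, a BI tool or reporting query can work with them like any other relational data.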



