
Big Data – Understanding Hadoop and Its Ecosystem

By Sudhi Seshachala on June 1, 2015

It is quite interesting to envision how we could adopt the Hadoop ecosystem within the realms of DevOps; I will try to cover that in an upcoming series. Hadoop, managed by the Apache Software Foundation, is a powerful open-source platform written in Java that is capable of processing large amounts of heterogeneous data sets at scale, in a distributed fashion, on clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage, and has become an in-demand technical skill. Hadoop is an Apache top-level project built and used by a global community of contributors and users.

 
 

Hadoop Architecture

 

The Apache Hadoop framework includes the following four modules:

 
     
  • Hadoop Common: Contains the Java libraries and utilities needed by other Hadoop modules. These libraries provide filesystem- and OS-level abstractions and comprise the essential Java files and scripts required to start Hadoop.
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data on commodity machines, giving very high aggregate bandwidth across the cluster.
  • Hadoop YARN: A resource-management framework responsible for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based programming model for parallel processing of large data sets.
 

[Diagram: the four modules of the Hadoop framework — Common, HDFS, YARN and MapReduce]

 

All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common and should be handled automatically in software by the framework. Beyond HDFS, YARN and MapReduce, the entire Apache Hadoop “platform” is now commonly considered to include a number of related projects as well: Apache Pig, Apache Hive, Apache HBase and others.

 

Hadoop Ecosystem

 

Hadoop has gained popularity due to its ability to store, analyze and access large amounts of data quickly and cost-effectively across clusters of commodity hardware. It would not be wrong to say that Apache Hadoop is actually a collection of several components rather than a single product.

 

Within the Hadoop ecosystem there are several commercial and open source products that are broadly used to make Hadoop accessible to laymen and more usable.

 

The following sections provide additional information on the individual components:

 

MapReduce

 

Hadoop MapReduce is a software framework for easily writing applications that process large amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. In terms of programming, there are two tasks at the heart of MapReduce:

 
     
  • The Map Task: The master node takes the input, divides it into smaller sub-problems and distributes them to worker nodes. Each worker node processes its own sub-problem and returns its answer to the master node.
  • The Reduce Task: The master node collects the answers from all the worker nodes and combines them into the output, which is the answer to the original distributed problem.
 

Typically, both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing any tasks that fail.
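
To make this concrete, below is a minimal word-count job written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). It is a sketch rather than a production job; the input and output paths are supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map task: emit (word, 1) for every token in the input split.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }
      // Reduce task: sum the counts the framework has grouped under each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) sum += val.get();
          context.write(key, new IntWritable(sum));
        }
      }
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The mapper emits a (word, 1) pair for every token; the framework groups the pairs by key, and the reducer sums the counts for each word.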

 

Hadoop Distributed File System (HDFS)

 

HDFS is a distributed file system that provides high-throughput access to data. When data is pushed to HDFS, it is automatically split into multiple blocks, which are stored and replicated across machines, ensuring high availability and fault tolerance.

 

Note: A file is stored as a sequence of large blocks (64MB and above).

 

Here are the main components of HDFS:

 
     
  • NameNode: It acts as the master of the system. It maintains the name system (directories and files) and manages the blocks that are present on the DataNodes.
  • DataNodes: They are the slaves deployed on each machine that provide the actual storage. They are responsible for serving read and write requests from clients.
  • Secondary NameNode: It is responsible for performing periodic checkpoints. In the event of a NameNode failure, you can restart the NameNode using the latest checkpoint.
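
As a brief illustration, the sketch below uses Hadoop's Java FileSystem API to write and then read a small file in HDFS. The path is hypothetical, and the Configuration object is assumed to pick up the cluster settings (core-site.xml, hdfs-site.xml) from the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // reads cluster config from classpath
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");  // hypothetical path
        // Write: the NameNode records the metadata, DataNodes store the replicated blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.writeUTF("hello hdfs");
        }
        // Read the file back.
        try (FSDataInputStream in = fs.open(file)) {
          System.out.println(in.readUTF());
        }
      }
    }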
 

Hive

 

Hive is part of the Hadoop ecosystem and provides an SQL-like interface to Hadoop. It is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries and the analysis of large datasets stored in Hadoop-compatible file systems.

 

It provides a mechanism to project structure onto this data and query the data using an SQL-like language called HiveQL. Hive also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

 

The main building blocks of Hive are:

 
     
  1. Metastore – Stores metadata about columns, partitions and the system catalogue.
  2. Driver – Manages the lifecycle of a HiveQL statement.
  3. Query Compiler – Compiles HiveQL into a directed acyclic graph of tasks.
  4. Execution Engine – Executes the tasks produced by the compiler in the proper order.
  5. HiveServer – Provides a Thrift interface and a JDBC/ODBC server.
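
To illustrate the last building block, HiveServer2 exposes a JDBC endpoint that clients can query with HiveQL. The sketch below assumes a HiveServer2 instance listening on localhost:10000 and a hypothetical logs table; the hive-jdbc driver must be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // register the HiveServer2 driver
        Connection con = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "", ""); // assumed host/port, no auth
        try (Statement stmt = con.createStatement()) {
          // DDL and queries are plain HiveQL strings.
          stmt.execute("CREATE TABLE IF NOT EXISTS logs (ts STRING, level STRING, msg STRING)");
          ResultSet rs = stmt.executeQuery("SELECT level, COUNT(*) FROM logs GROUP BY level");
          while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
          }
        }
        con.close();
      }
    }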
 

HBase (Hadoop Database)

 

HBase is a distributed, column-oriented database built on top of HDFS, which it uses for its underlying storage. As said earlier, HDFS follows a write-once, read-many-times pattern, but this isn't always what an application needs: we may require real-time read/write random access to a huge dataset, and this is where HBase comes into the picture.

 

Here are the main components of HBase:

 
     
  • HBase Master: It is responsible for negotiating load balancing across all RegionServers and maintains the state of the cluster. It is not part of the actual data storage or retrieval path.
  • RegionServer: It is deployed on each machine and hosts data and processes I/O requests.
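
For a feel of the client side, the sketch below uses the HBase Java client API (HBase 1.x style) to write and read back a single cell. The table name, row key and column family are hypothetical, and the table is assumed to already exist.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRoundTrip {
      public static void main(String[] args) throws Exception {
        // hbase-site.xml on the classpath supplies the cluster/ZooKeeper settings.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("metrics"))) { // assumed table
          Put put = new Put(Bytes.toBytes("row1"));                        // row key
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes("42"));
          table.put(put);                                                  // random write
          Result result = table.get(new Get(Bytes.toBytes("row1")));       // random read
          System.out.println(Bytes.toString(
              result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("value"))));
        }
      }
    }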
 

ZooKeeper

 

ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services, all of which are very useful for a variety of distributed systems. HBase is not operational without ZooKeeper.
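
A minimal sketch with the ZooKeeper Java client, publishing and reading back a piece of shared configuration. The znode path and value are hypothetical, a server is assumed at localhost:2181, and for brevity the sketch does not wait for the connection event as robust code would.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigDemo {
      public static void main(String[] args) throws Exception {
        // Connect to the ensemble (single local server assumed; 3000ms session timeout).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
        // Publish shared configuration under a hypothetical znode.
        zk.create("/app-config", "replication=3".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Any client in the cluster can now read the same value.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));
        zk.close();
      }
    }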

 

Mahout

 

Mahout is a scalable machine learning library that implements a variety of approaches to machine learning. At present, Mahout contains four main groups of algorithms:

 
     
  • Recommendations, also known as collaborative filtering
  • Classification, also known as categorization
  • Clustering
  • Frequent itemset mining, also known as parallel frequent pattern mining
 

Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion and have been written to be executable in MapReduce. Mahout scales to reasonably large data sets by leveraging algorithm properties or by providing implementations built on Apache Hadoop.
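
As an example of the first group, Mahout's "Taste" collaborative-filtering API can assemble a user-based recommender in a few lines. The ratings.csv file and its userID,itemID,preference layout are hypothetical.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class UserRecommender {
      public static void main(String[] args) throws Exception {
        // Hypothetical input: one "userID,itemID,preference" triple per line.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        GenericUserBasedRecommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> top = recommender.recommend(1L, 3); // top 3 items for user 1
        for (RecommendedItem item : top) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }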

 

Sqoop (SQL-to-Hadoop)

 

Sqoop is a tool designed for efficiently transferring structured data from relational databases such as SQL Server and SQL Azure into HDFS, where it can then be used in MapReduce and Hive jobs. One can even use Sqoop to move data from HDFS back into SQL Server.

 

Apache Spark

 

Apache Spark is a general compute engine that offers fast data analysis on a large scale. Spark can run on top of HDFS but bypasses MapReduce, using its own data-processing framework instead. Common use cases for Apache Spark include real-time queries, event stream processing, iterative algorithms, complex operations and machine learning.
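
A small sketch of Spark's core RDD API in Java, counting error lines in a log file. The HDFS path is a placeholder, and local[*] is used only so the sketch runs without a cluster.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkErrorCount {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ErrorCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log"); // placeholder path
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));
        errors.cache(); // keep in memory so repeated queries avoid re-reading HDFS
        System.out.println("error lines: " + errors.count());
        sc.stop();
      }
    }

Caching the RDD in memory between actions is what makes Spark attractive for the iterative workloads mentioned above.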

 

Pig

 

Pig is a platform for analyzing and querying huge data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. Pig's built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.

 

Pig has three key properties:

 
     
  • Extensibility
  • Optimization opportunities
  • Ease of programming
 

The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. At present, Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs.
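
Pig Latin scripts are usually run from the Grunt shell or as standalone scripts, but they can also be embedded in Java via PigServer, as in this sketch; the access.log layout and the aliases are hypothetical.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigHitCount {
      public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL); // ExecType.MAPREDUCE on a cluster
        // Each registerQuery adds one Pig Latin statement to the logical plan.
        pig.registerQuery(
            "logs = LOAD 'access.log' USING PigStorage(' ') AS (ip:chararray, url:chararray);");
        pig.registerQuery("by_ip = GROUP logs BY ip;");
        pig.registerQuery("hits = FOREACH by_ip GENERATE group, COUNT(logs);");
        // store() triggers compilation into MapReduce job(s) and runs them.
        pig.store("hits", "hits_by_ip");
      }
    }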

 

Oozie

 

Apache Oozie is a workflow/coordination system to manage Hadoop jobs.

 

Flume

 

Flume is a framework for harvesting, aggregating and moving huge amounts of log data or text files into and out of Hadoop. Agents are deployed throughout one's IT infrastructure, inside web servers, application servers and mobile devices. Flume itself has a query-processing engine, so it's easy to transform each new batch of data before it is shuttled to the intended sink.

 

Ambari

 

Ambari was created to help manage Hadoop. It offers support for many of the tools in the Hadoop ecosystem, including Hive, HBase, Pig, Sqoop and ZooKeeper. The tool features a management dashboard that keeps track of cluster health and can help diagnose performance issues.

 

Conclusion

 

Hadoop is powerful because it is extensible and easy to integrate with any component. Its popularity is due in part to its ability to store, analyze and access large amounts of data quickly and cost-effectively across clusters of commodity hardware. Apache Hadoop is not actually a single product but rather a collection of several components, and when all of them are put together, they make Hadoop very user friendly.

  