Apache Hadoop

DistributedSystem

Apache Hadoop

데먕 2019. 9. 5. 03:44

1. Overview

Apache Hadoop is a set of software technology components that together form a scalable system optimized for analyzing data. Data analyzed on Hadoop has several typical characteristics.

Structured: For example, customer data, transaction data, and clickstream data that is recorded when people click links while visiting websites
Unstructured: For example, text from web-based news feeds, text in documents and text in social media such as tweets
Very large in volume
A high rate of speed for creation and arrival
Following ETL
- Extract: fetching the data from multiple sources
- Transform: convert the existing data to fit into the analytical needs
- Load: Right system to derive value in it

2. Description

2.1 Components

Components	Description
Hadoop Common	Contains libraries and utilities needed by other Hadoop modules
Hadoop Distributed File System(HDFS)	A distributed file-system that store data on the commodity machines, providing very high aggregate bandwidth across the cluster
Hadoop YARN	A resource-management platform responsible for managing to compute resources in clusters and using them for scheduling of users' applications
Hadoop MapReduce	A programming model for large scale data processing

2.2 Categorizing Big Data

Features	Description
Structured	Which stores the data in rows and columns like relational data sets
Unstructured	Here data cannot be stored in rows and columns like video, images, etc.
Semi-structured	Data in formal XML are readable by machines and human

2.3 Advantages of Hadoop

Give access to the user to rapidly write and test the distributed systems
Automatically distributes the data and works across the machines
Utilizes the primary parallelism of the CPU cores
Hadoop library is developed to find/search and handle the failures at the application layer
A server can be added or removed from the cluster dynamically at any point of time
Open-source based on Java applications and hence compatible on all the platforms

2.4 Features

Features	Description
Distributed Processing	The data storage is maintained in a distributed manner in HDFS across the cluster, data is processed in parallel on a cluster of nodes
Fault Tolerance	By default, the three replicas of each block are stored across the cluster in Hadoop If any node goes down, the data on that node can be recovered easily and automatically from other nodes
Reliability	Data can be stored on the cluster of machine despite the machine failures thanks to the replication
High Availability	Data is available and accessible even there occurs a hardware failure due to multiple copies of data
Scalability	Providing horizontal scalability which means new nodes can be added on the top without any downtime
Economic	Not very expensive as it runs on a cluster of commodity hardware Not require any specialized machine for it Easy to add more nodes provides huge cost reduction Increasing of nodes without any downtime and without any much of pre-planning
Easy to use	No need of the client to deal with distributed computing, framework takes care of all the things
Data Locality	Works on data locality principle Movement of computation of data instead of data to computation Client submits his algorithm, then the algorithm is moved to data in the cluster instead of bringing data to the location where an algorithm is submitted and then processing it

3. References

https://en.wikipedia.org/wiki/Apache_Hadoop

https://www.ibmbigdatahub.com/blog/what-hadoop

https://www.bernardmarr.com/default.asp?contentID=1080

https://intellipaat.com/blog/tutorial/hadoop-tutorial/introduction-hadoop/

https://towardsdatascience.com/a-brief-summary-of-apache-hadoop-a-solution-of-big-data-problem-and-hint-comes-from-google-95fd63b83623