-
HadoopDistributedSystem/HadoopEcyosystem 2020. 3. 9. 21:57
1. Overview
Hadoop is a framework that allows you to first store Big Data in a distributed environment, so that, you can process it parallelly. There are basically two components in Hadoop:
1.1 Hadoop Distributed File System (HDFS)
HDFS allows dumping any kind of data across the cluster
1.2 Yet Another Resource Negotiator (YARN)
YARN allows parallel processing of the data stored in HDFS
2. Hadoop Distributed File System (HDFS)
HDFS creates an abstraction, let me simplify it for you. Similar to virtualization, you can see HDFS logically as a single unit for storing Big Data, but actually you are storing your data across multiple nodes in a distributed fashion. HDFS follows master-slave architecture.
In HDFS, Namenode is the master node and Datanodes are the slaves. Namenode contains the metadata about the data stored in data nodes, such as which data block is stored in which data node, where are the replications of the data block kept, etc. The actual data is stored in Data Nodes.
I also want to add, we actually replicate the data blocks present in Data Nodes, and the default replication factor is 3. Since we are using commodity hardware and we know the failure rate of this hardware is pretty high, so if one of the DataNodes fails, HDFS will still have a copy of those lost data blocks. You can also configure the replication factor based on your requirements.
3. Hadoop-as-a-Solution
3.1 Problem1: Storing Big data.
HDFS provides a distributed way to store Big data. Your data is stored in blocks across the DataNodes and you can specify the size of blocks. Basically, if you have 512MB of data and you have configured HDFS such that, it will create 128 MB of data blocks. So HDFS will divide data into 4 blocks as 512/128=4 and store it across different DataNodes, it will also replicate the data blocks on different DataNodes. Now, as we are using commodity hardware, hence storing is not a challenge.
It also solves the scaling problem. It focuses on horizontal scaling instead of vertical scaling. You can always add some extra data nodes to HDFS cluster as and when required, instead of scaling up the resources of your DataNodes. Let me summarize it for you basically for storing 1 TB of data, you don’t need a 1TB system. You can instead do it on multiple 128GB systems or even less.
3.2 Problem2: Storing a variety of data.
With HDFS you can store all kinds of data whether it is structured, semi-structured or unstructured. Since in HDFS, there is no pre-dumping schema validation. And it also follows write once and read many models. Due to this, you can just write the data once and you can read it multiple times for finding insights.
3.3 Problem3: Accessing & processing the data faster.
Yes, this is one of the major challenges with Big Data. In order to solve it, we move processing to data and not data to processing. What does it mean? Instead of moving data to the master node and then processing it. In MapReduce, the processing logic is sent to the various slave nodes & then data is processed parallelly across different slave nodes. Then the processed results are sent to the master node where the results are merged and the response is sent back to the client.
4. Yet Another Resource Negotiator (YARN)
In YARN architecture, we have ResourceManager and NodeManager. ResourceManager might or might not be configured on the same machine as NameNode. But, NodeManagers should be configured on the same machine where DataNodes are present.
ResourceManager is again a master node. It receives the processing requests and then passes the parts of requests to corresponding NodeManagers accordingly, where the actual processing takes place. NodeManagers are installed on every DataNode. It is responsible for the execution of the task on every single DataNode.
5. When to use Hadoop
Hadoop is used for:
- Search – Yahoo, Amazon, Zvents
- Log processing – Facebook, Yahoo
- Data Warehouse – Facebook, AOL
- Video and Image Analysis – New York Times, Eyealike
Till now, we have seen how Hadoop has made Big Data handling possible. But there are some scenarios where Hadoop implementation is not recommended.
6. When not to use Hadoop
- Low Latency data access: Quick access to small parts of data
- Multiple data modification: Hadoop is a better fit only if we are primarily concerned about reading data and not modifying data.
- Lots of small files: Hadoop is suitable for scenarios, where we have few but large files.
After knowing the best suitable use-cases, let us move on and look at a case study where Hadoop has done wonders
7. Reference
'DistributedSystem > HadoopEcyosystem' 카테고리의 다른 글
Big Data (0) 2019.09.25 MapReduce (0) 2019.09.25 Difference between Hadoop and Spark (0) 2019.09.25 Hadoop Yet Another Resource Negotiator(Yarn) (0) 2019.09.14 Hadoop Distributed File System(HDFS) (0) 2019.09.08