DistributedSystem/HadoopEcyosystem
-
Hadoop 2020. 3. 9. 21:57
1. Overview Hadoop is a framework that first stores Big Data in a distributed environment so that it can then be processed in parallel. Hadoop has two main components: 1.1 Hadoop Distributed File System (HDFS): HDFS stores any kind of data across the cluster. 1.2 Yet Another Resource Negotiator (YARN): YARN allows parallel processing of the data stored in HDFS. 2. Hadoo..
-
Big Data 2019. 9. 25. 13:35
1. Overview Around 90% of the world's data was created in the last two years alone. Moreover, 80% of that data is unstructured or comes in widely varying structures, such as images, line streaming records, videos, sensor records, and GPS tracking details, which are difficult to analyze. Traditional systems are useful for working with structured data (within limits), but they can't manage such a la..
-
MapReduce 2019. 9. 25. 05:08
1. Overview MapReduce is a processing technique and programming model for distributed computing, based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. The major advantage of MapReduce is that it makes it easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data process..
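The mapper and reducer primitives described in this excerpt can be sketched in a few lines. This is a minimal single-process illustration in Python, not Hadoop's Java API: the function names `mapper`, `shuffle`, `reducer`, and `word_count` are hypothetical, and the word-count task is an assumed example. A real cluster would run the map and reduce steps on separate nodes.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, the step a framework performs
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    mapped = chain.from_iterable(mapper(line) for line in lines)
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data big cluster", "data node"])
```

Because each call to `mapper` depends only on its own line and each call to `reducer` only on its own key, both steps could be distributed across nodes without changing the logic.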
-
Difference between Hadoop and Spark 2019. 9. 25. 04:26
1. Overview Clarify the difference between Hadoop and Spark. 2. Description Difference between Hadoop and Spark
Features: Hadoop / Spark
Data processing: batch processing only / batch processing as well as real-time processing
Processing speed: slower than Spark because of disk I/O latency / up to 100x faster in memory and 10x faster when running on disk
Category: data processing engine / data analytics engine ..
-
Hadoop Yet Another Resource Negotiator (YARN) 2019. 9. 14. 16:28
1. Overview YARN is a platform responsible for managing computing resources in a cluster and using them to schedule users' applications. YARN allows different kinds of data processing engines, such as graph processing, interactive processing, stream processing, and batch processing, to run on and process data stored in HDFS (Hadoop Distributed File System). Apart from resource management, YARN also does J..
-
Hadoop Distributed File System (HDFS) 2019. 9. 8. 21:28
1. Overview Apache HDFS, or Hadoop Distributed File System, is a block-structured file system in which each file is divided into blocks of a predetermined size. These blocks are stored across a cluster of one or more machines. HDFS follows a master/slave architecture, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes). HDFS can be deploye..
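To make the block structure concrete, here is a rough Python sketch of how a file's size maps onto fixed-size blocks. It is an illustration only, not HDFS code: `split_into_blocks` and `place_blocks` are hypothetical helpers, the 128 MiB block size is an assumed example value, and the round-robin placement is a deliberate simplification that ignores replication and whatever placement policy a real NameNode applies.

```python
MIB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the sizes (in bytes) of the blocks needed for a file.

    Every block is full except possibly the last one, which holds
    only the remaining bytes.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_blocks(num_blocks, data_nodes):
    """Assign block indexes to data nodes round-robin (simplified,
    hypothetical placement with no replication)."""
    if not data_nodes:
        raise ValueError("at least one data node is required")
    return {i: data_nodes[i % len(data_nodes)] for i in range(num_blocks)}


# Assumed example: a 300 MiB file with a 128 MiB block size.
blocks = split_into_blocks(300 * MIB, 128 * MIB)
placement = place_blocks(len(blocks), ["datanode-1", "datanode-2"])
```

Under these assumptions the file occupies two full blocks plus one 44 MiB block, spread over the two hypothetical data nodes.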