DistributedSystem
-
Message Broker
DistributedSystem/Streaming 2020. 5. 3. 14:59
1. Overview A message broker is intermediary software (also called middleware) that passes messages between senders and receivers. It may provide additional capabilities such as data transformation, validation, queuing, and routing. Most importantly, it fully decouples senders from receivers. 1.1 Motivation 1.1.1 Synchronous Network Communication One of the properties of dire..
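The decoupling described in this excerpt can be sketched with a toy in-memory broker. This is only an illustration under simplifying assumptions (single process, no persistence, no network); the `Broker` class and its `subscribe`, `publish`, and `deliver` methods are hypothetical names, not the API of any real broker.

```python
from collections import defaultdict, deque


class Broker:
    """Toy in-memory message broker (hypothetical, for illustration only)."""

    def __init__(self):
        self._queues = defaultdict(deque)      # topic -> queued messages
        self._subscribers = defaultdict(list)  # topic -> receiver callbacks

    def subscribe(self, topic, callback):
        # Register a receiver for a topic.
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The sender only talks to the broker; it never sees the receivers.
        self._queues[topic].append(message)

    def deliver(self, topic):
        # Drain the queue and route each message to every subscriber.
        delivered = 0
        queue = self._queues[topic]
        while queue:
            message = queue.popleft()
            for callback in self._subscribers[topic]:
                callback(message)
            delivered += 1
        return delivered


received = []
broker = Broker()
broker.publish("orders", {"id": 1})          # queued before any receiver exists
broker.subscribe("orders", received.append)  # receiver attaches later
count = broker.deliver("orders")
```

Because the message is queued inside the broker, the sender can publish before any receiver has subscribed, which is the kind of decoupling the overview refers to.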
-
Cluster Management, Registration, and Discovery
DistributedSystem/Manager 2020. 3. 26. 13:42
1. Service Registry and Service Discovery 1.1 Motivation When a group of computers starts up, the only device each one is aware of is itself, even if they are all connected to the same network. To form a logical cluster, the different nodes need to find out about each other somehow: they need to learn who else is in the cluster and, most importantly, how to communicate with those other nodes. The obvious..
-
Hadoop
DistributedSystem/HadoopEcyosystem 2020. 3. 9. 21:57
1. Overview Hadoop is a framework that lets you first store Big Data in a distributed environment so that you can then process it in parallel. There are basically two components in Hadoop: 1.1 Hadoop Distributed File System (HDFS) HDFS allows storing any kind of data across the cluster. 1.2 Yet Another Resource Negotiator (YARN) YARN allows parallel processing of the data stored in HDFS. 2. Hadoo..
-
Big Data
DistributedSystem/HadoopEcyosystem 2019. 9. 25. 13:35
1. Overview Around 90% of the world's data was created in the last two years alone. Moreover, 80% of the data is unstructured or available in widely varying structures, such as images, line streaming records, videos, sensor records, and GPS tracking details, which are difficult to analyze. Traditional systems are useful for working with structured data (to a limited extent), but they can't manage such a la..
-
MapReduce Vs Spark RDD
DistributedSystem/Spark 2019. 9. 25. 08:16
1. Overview MapReduce is widely adopted for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It allows users to write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance. However, data sharing is slow in MapReduce because of replication, serialization, and disk I/O. Most of the Hadoop a..
-
RDD Lineage and Logical Execution Plan
DistributedSystem/Spark 2019. 9. 25. 05:37
1. Overview RDD lineage (also known as the RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD, and it forms a logical execution plan. The logical execution plan starts with the earliest RDDs (those that have no dependencies on other RDDs or that reference cached data) and ends with the RDD that produces the result of the acti..
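The lineage idea in this excerpt can be sketched with a toy model. This is not Spark's RDD API; the `Node` class and its methods are hypothetical and assume a single parent per node. Each transformation only records its parent, and nothing is computed until an action walks the graph from the earliest node.

```python
class Node:
    """Toy stand-in for an RDD (hypothetical, single-parent lineage only)."""

    def __init__(self, name, parent=None, fn=None, data=None):
        self.name = name      # label of the operation that created this node
        self.parent = parent  # the node this one was derived from
        self.fn = fn          # how to compute this node's rows from the parent's
        self.data = data      # only set on source nodes

    def map(self, f):
        # Transformation: record the dependency, compute nothing yet.
        return Node("map", parent=self, fn=lambda rows: [f(x) for x in rows])

    def filter(self, pred):
        return Node("filter", parent=self, fn=lambda rows: [x for x in rows if pred(x)])

    def lineage(self):
        # Walk back through the parents, then list the earliest node first.
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]

    def collect(self):
        # Action: evaluate the plan from the source up to this node.
        if self.parent is None:
            return list(self.data)
        return self.fn(self.parent.collect())


source = Node("parallelize", data=[1, 2, 3, 4])
result = source.map(lambda x: x * 10).filter(lambda x: x > 20)
plan = result.lineage()
rows = result.collect()
```

Here `plan` lists the operations from the source node to the final one, and `rows` is only computed when `collect()` is called.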
-
MapReduce
DistributedSystem/HadoopEcyosystem 2019. 9. 25. 05:08
1. Overview MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data process..
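The mapper and reducer roles named in this excerpt can be sketched as a word count. This is a single-process illustration in plain Python, not Hadoop's Java API; the function names `mapper`, `shuffle`, `reducer`, and `map_reduce` are hypothetical, and the explicit shuffle step is an assumption about how intermediate pairs are grouped by key.

```python
from collections import defaultdict


def mapper(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    # Group the intermediate values by key before reducing.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


def map_reduce(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


counts = map_reduce(["big data big cluster", "data node"])
```

In a real cluster the mapper calls would run on different nodes and the shuffle would move data over the network; the sketch keeps only the data flow.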
-
Difference between RDD and DSM
DistributedSystem/Spark 2019. 9. 25. 04:30
1. Overview The RDD (Resilient Distributed Dataset) is the core data structure of Spark. DSM (distributed shared memory) is a common memory data abstraction. In DSM, applications can read and write to any location in the global address space. The main difference between RDD and DSM is that not only can the RDD be created by bulk transformation (i.e. "write"), but it can ..