DistributedSystem/Spark
-
MapReduce vs Spark RDD | DistributedSystem/Spark | 2019. 9. 25. 08:16
1. Overview MapReduce is widely adopted for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It allows users to write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance. But data sharing is slow in MapReduce due to replication, serialization, and disk IO. Most of the Hadoop a..
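The excerpt is cut off, but the contrast it draws is easy to demonstrate. A minimal Scala sketch (assuming a local Spark session; the object name and data are illustrative, not taken from the post): where MapReduce re-reads its input from disk between jobs, Spark can keep an RDD in memory and reuse it across actions.

    import org.apache.spark.sql.SparkSession

    object CacheDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("CacheDemo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // In MapReduce, each pass over the data is a separate job that re-reads
        // its input from HDFS; in Spark, the RDD can be cached in memory and
        // shared across multiple actions without recomputation.
        val numbers = sc.parallelize(1 to 1000000)
        val squares = numbers.map(n => n.toLong * n).cache() // materialized once, then reused

        println(squares.sum())   // first action: computes and caches the partitions
        println(squares.count()) // second action: served from memory

        spark.stop()
      }
    }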
-
RDD Lineage and Logical Execution Plan | DistributedSystem/Spark | 2019. 9. 25. 05:37
1. Overview RDD lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD, and it forms a logical execution plan. The logical execution plan starts with the earliest RDDs (those with no dependencies on other RDDs, or those that reference cached data) and ends with the RDD that produces the result of the acti..
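Spark exposes this lineage directly through RDD.toDebugString. A short sketch (assuming a local Spark session; the sample data is illustrative) that builds a chain of transformations and prints its dependency graph:

    import org.apache.spark.sql.SparkSession

    object LineageDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("LineageDemo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // Build a small chain of transformations; nothing executes yet,
        // Spark only records the lineage.
        val words  = sc.parallelize(Seq("a", "b", "a", "c"))
        val pairs  = words.map(w => (w, 1))
        val counts = pairs.reduceByKey(_ + _)

        // toDebugString prints the RDD's lineage, i.e. the logical
        // execution plan from the source RDDs down to this RDD.
        println(counts.toDebugString)

        spark.stop()
      }
    }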
-
Difference between RDD and DSM | DistributedSystem/Spark | 2019. 9. 25. 04:30
1. Overview The RDD (Resilient Distributed Dataset) is the core data structure of Spark. DSM (Distributed Shared Memory) is a general in-memory data abstraction. In DSM, applications can read and write to any location in a global address space. The main difference between RDD and DSM is that not only can an RDD be created through coarse-grained bulk transformations (i.e. "writes"), but it can ..
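A small sketch of that distinction (assuming a local Spark session; names are illustrative): RDDs are immutable, so there is no DSM-style write to an arbitrary cell. A "write" is a coarse-grained transformation over the whole dataset that yields a new RDD.

    import org.apache.spark.sql.SparkSession

    object RddVsDsmDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("RddVsDsmDemo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val rdd = sc.parallelize(Seq(1, 2, 3, 4))

        // DSM would allow a fine-grained update such as "set element 2 to 42".
        // An RDD is immutable, so the same effect is expressed as a bulk
        // transformation applied to every element, producing a new RDD.
        val updated = rdd.map(x => if (x == 3) 42 else x)

        println(updated.collect().mkString(", ")) // 1, 2, 42, 4

        spark.stop()
      }
    }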
-
Apache Spark | DistributedSystem/Spark | 2019. 9. 20. 00:55
1. Overview An open-source, distributed, general-purpose cluster-computing framework with a mostly in-memory data-processing engine. It can do ETL, analytics, machine learning, and graph processing on large volumes of data at rest (batch processing) or in motion (stream processing), with rich, concise, high-level APIs for the programming languages Scala, Python, Java, R, and SQL. 2. Description 2.1 A..
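To make the high-level API concrete, a minimal sketch in Scala (the classic word count; the input path and object name are illustrative, not from the post):

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // A SparkSession is the entry point to Spark's APIs.
        val spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()

        // Word count expressed against the RDD API: a few high-level
        // operators instead of hand-written map and reduce tasks.
        val counts = spark.sparkContext
          .textFile("input.txt")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        spark.stop()
      }
    }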
-
Resilient Distributed Dataset (RDD) | DistributedSystem/Spark | 2019. 9. 8. 22:32
1. Overview A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes. There are two ways to create RDDs: parallelizi..
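The excerpt breaks off mid-list; a sketch of the two standard creation paths (assuming a local Spark session; the file path and partition count are illustrative):

    import org.apache.spark.sql.SparkSession

    object RddCreationDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("RddCreationDemo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // Way 1: parallelize an existing collection in the driver program.
        val fromCollection = sc.parallelize(Seq("alpha", "beta", "gamma"), numSlices = 2)

        // Way 2: reference a dataset in an external storage system.
        val fromFile = sc.textFile("hdfs:///data/sample.txt")

        // Each RDD is split into logical partitions, computed on cluster nodes.
        println(fromCollection.getNumPartitions) // 2

        spark.stop()
      }
    }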