Difference between RDD and DSM
DistributedSystem/Spark 2019. 9. 25. 04:30
1. Overview
The RDD (Resilient Distributed Dataset) is the core data structure of Spark. DSM (Distributed Shared Memory) is a general-purpose memory abstraction in which applications can read and write any location in a global address space. The main difference between the two is that an RDD can only be created ("written") through coarse-grained bulk transformations, whereas DSM allows reads and writes to any individual memory location. This restricts RDDs to applications that perform bulk writes, but it makes fault tolerance much cheaper. In particular, because a lost partition can be recovered from its lineage (the record of transformations that produced it), RDDs largely avoid checkpointing overhead. After a failure, only the lost RDD partitions need to be recomputed, and they can be recomputed in parallel on different nodes without rolling back the entire program.
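The recovery behaviour described above can be illustrated with a small model. The following is a minimal sketch in plain Python, not Spark's API: `ToyRDD`, `lose_partition`, and `compute_log` are hypothetical names introduced only for illustration, and the base data is assumed to sit in stable storage. It shows that a dataset is only ever "written" by a whole-dataset transformation, and that a lost partition can be rebuilt from its lineage without touching the others.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable, partitioned, and aware of its lineage."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source    # base data, assumed to live in stable storage
        self._parent = parent    # lineage: the dataset this one was derived from
        self._fn = fn            # lineage: the function applied to every element
        self.num_partitions = (
            len(source) if source is not None else parent.num_partitions
        )
        self._cache = [None] * self.num_partitions   # in-memory partitions
        self.compute_log = []    # indices of partitions computed from lineage

    def map(self, fn):
        """Coarse-grained 'write': builds a whole new dataset, never edits one element."""
        return ToyRDD(parent=self, fn=fn)

    def partition(self, i):
        if self._source is not None:
            return self._source[i]
        if self._cache[i] is None:   # missing or lost: rebuild from the parent
            self.compute_log.append(i)
            self._cache[i] = [self._fn(x) for x in self._parent.partition(i)]
        return self._cache[i]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

    def lose_partition(self, i):
        """Simulate a node failure that wipes one in-memory partition."""
        self._cache[i] = None


base = ToyRDD(source=[[1, 2], [3, 4], [5, 6]])
squares = base.map(lambda x: x * x)
shifted = squares.map(lambda x: x + 1)

before = shifted.collect()       # first run computes and caches every partition
shifted.compute_log.clear()
squares.compute_log.clear()

shifted.lose_partition(1)        # one partition of the final dataset is lost
after = shifted.collect()        # only that partition is recomputed
```

After partition 1 is lost, `shifted.compute_log` contains only `1` and `after` equals `before`: nothing was rolled back and the surviving partitions were reused as they were.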
2. Motivation
- Iterative algorithms
- Interactive data mining tools
- DSM (Distributed Shared Memory) is a very general abstraction, but this generality makes it harder to implement efficiently and fault-tolerantly on commodity clusters
3. Description
Aspect: RDD vs. DSM

Read
- RDD: either coarse-grained (an operation applies to the whole dataset, not to an individual element) or fine-grained (an operation can target an individual element of the dataset)
- DSM: fine-grained

Write
- RDD: coarse-grained
- DSM: fine-grained

Consistency
- RDD: high (RDDs are immutable in nature)
- DSM: guaranteed only if the programmer follows the rules

Fault-Recovery Mechanism
- RDD: lost data can be recovered at any moment using the lineage graph; each transformation forms a new RDD and RDDs are immutable, so recovery is easy
- DSM: achieved by checkpointing, which allows an application to roll back to a recent checkpoint rather than restarting

Straggler Mitigation
- RDD: possible using backup tasks
- DSM: quite difficult to achieve

Behavior if not enough RAM
- RDD: RDDs are shifted to disk
- DSM: performance decreases once the RAM runs out

Bulk Operation
Data locality

3.1 DSM
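A DSM-style system can be modelled as a mutable address space that protects itself with checkpoints. This is a toy sketch under that assumption: `ToyDSM` and its methods are hypothetical names, not any real DSM library, and a single in-process list stands in for memory spread across nodes. It shows fine-grained writes and what a rollback to the last checkpoint costs.

```python
class ToyDSM:
    """Toy stand-in for distributed shared memory: a flat, mutable address space."""

    def __init__(self, size):
        self._memory = [0] * size
        self._checkpoint = list(self._memory)
        self.writes_since_checkpoint = 0

    def read(self, addr):
        """Fine-grained read of a single location."""
        return self._memory[addr]

    def write(self, addr, value):
        """Fine-grained write: any single location can be updated in place."""
        self._memory[addr] = value
        self.writes_since_checkpoint += 1

    def checkpoint(self):
        """Copy the whole address space; this is the overhead paid up front."""
        self._checkpoint = list(self._memory)
        self.writes_since_checkpoint = 0

    def fail_and_rollback(self):
        """After a failure, restore the last checkpoint and report the lost writes."""
        lost = self.writes_since_checkpoint
        self._memory = list(self._checkpoint)
        self.writes_since_checkpoint = 0
        return lost


dsm = ToyDSM(size=4)
dsm.write(0, 10)
dsm.write(1, 20)
dsm.checkpoint()                 # state [10, 20, 0, 0] is now safe
dsm.write(2, 30)
dsm.write(0, 11)                 # in-place update of an existing location
lost_writes = dsm.fail_and_rollback()
state_after_rollback = [dsm.read(addr) for addr in range(4)]
```

In this run the two writes made after the checkpoint are lost and memory returns to `[10, 20, 0, 0]`, so the application has to redo that work. Because locations are overwritten in place, there is no record from which a single lost value could be recomputed, which is why this model falls back on copying state.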
3.2 RDD
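The comparison lists backup tasks as the way RDDs mitigate stragglers. The idea is that a task over an immutable partition is deterministic and has no side effects, so a second copy can be run safely and whichever copy finishes first is used. The sketch below is a simplified illustration using Python threads, not Spark's scheduler: `run_partition_task` and `run_with_backup` are hypothetical helpers, the delays are artificial, and the backup is launched immediately rather than after slowness is detected.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_partition_task(partition, delay, label):
    """Pure function of its input partition; `delay` simulates a slow node."""
    time.sleep(delay)
    return label, [x * 2 for x in partition]


def run_with_backup(partition, primary_delay, backup_delay):
    """Run the same task twice and return the result of the first copy to finish."""
    pool = ThreadPoolExecutor(max_workers=2)
    futures = [
        pool.submit(run_partition_task, partition, primary_delay, "primary"),
        pool.submit(run_partition_task, partition, backup_delay, "backup"),
    ]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)    # do not block on the straggling copy
    return next(iter(done)).result()


# The primary copy is the straggler here, so the backup's result is used.
winner, doubled = run_with_backup([1, 2, 3], primary_delay=0.3, backup_delay=0.01)
```

Both copies would produce the same output, so it does not matter which one wins. With in-place fine-grained writes, as in DSM, two copies of a task could interfere with each other's updates, which is one way to read the "quite difficult" entry in the comparison.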
3.2.1 Iterative Operations on Spark RDD
3.2.2 Interactive Operations on Spark RDD
4. References
https://data-flair.training/blogs/spark-rdd-tutorial/
http://hadooptutorial.info/resilient-distributed-dataset/
http://andrewalexanderprice.com/blog20151021.php#.XYpwNpMzZTY
https://stackoverflow.com/questions/3766845/coarse-grained-vs-fine-grained
https://topic.alibabacloud.com/a/the-difference-between-rdd-and-dsm_8_8_10274074.html