Difference between RDD and DSM

DistributedSystem/Spark 2019. 9. 25. 04:30

1. Overview

The RDD (Resilient Distributed Dataset) is the core data structure of Spark. DSM (Distributed Shared Memory) is a general-purpose memory abstraction in which applications can read and write any location in a global address space. The main difference between the two is that an RDD can only be created ("written") through coarse-grained, bulk transformations, whereas DSM allows reads and writes to any individual memory location. Restricting applications to bulk writes is what makes efficient fault tolerance possible. In particular, because an RDD can use its lineage to recover a lost partition, it largely avoids the overhead of checkpointing. After a failure, only the lost RDD partitions need to be recomputed, and they can be recomputed in parallel on different nodes without rolling back the entire program.
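The lineage-based recovery described above can be sketched in plain Python. This is only a toy model, not Spark's implementation: the `ToyRDD` class, its `cache` dictionary, and the `lose_partition` helper are hypothetical names introduced for illustration, and only per-partition (narrow) transformations are modeled.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable data plus a lineage pointer."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source   # list of partitions; only set on the root dataset
        self.parent = parent   # lineage: the dataset this one was derived from
        self.fn = fn           # per-partition function that derives it
        self.cache = {}        # partition index -> materialized data

    @property
    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions

    def map(self, f):
        # Coarse-grained: the same function is applied to every element.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self, i):
        # Return partition i, recomputing it from the lineage if it is missing.
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            data = list(self.source[i])   # assumed to live in stable storage
        else:
            data = self.fn(self.parent.compute(i))
        self.cache[i] = data
        return data

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]

    def lose_partition(self, i):
        # Simulate a node failure that drops one materialized partition.
        self.cache.pop(i, None)


base = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
squares = base.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

before = evens.collect()

# Lose partition 1 of both derived datasets; partition 0 is untouched.
evens.lose_partition(1)
squares.lose_partition(1)

recovered = evens.compute(1)   # rebuilt by replaying map and filter on partition 1
after = evens.collect()
```

Only the lost partition is replayed through its lineage; partition 0 stays cached, and nothing is rolled back.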

2. Motivation

Iterative algorithms
Interactive data mining tools
DSM (Distributed Shared Memory) is a very general abstraction, but this generality makes it harder to implement efficiently and with fault tolerance on commodity clusters

3. Description

Coarse-grained means an operation applies to the whole dataset rather than to an individual element; fine-grained means an operation can target an individual element of the dataset.

Aspect	RDD	DSM
Read	Coarse-grained or fine-grained	Fine-grained
Write	Coarse-grained	Fine-grained
Consistency	High (RDDs are immutable)	Guaranteed only if the programmer follows the rules
Fault-Recovery Mechanism	Lost data is recovered from the lineage graph; each transformation produces a new, immutable RDD, so lost partitions can be recomputed	Checkpointing, which lets applications roll back to a recent checkpoint rather than restart
Straggler Mitigation	Possible using backup tasks	Difficult to achieve
Behavior if not enough RAM	RDDs are moved to disk	Performance degrades when RAM runs out
Bulk Operation	Data locality
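The Write and Fault-Recovery rows of the table can be illustrated with a small Python sketch. It is a toy contrast under stated assumptions, not a real DSM or Spark API: `ToyDSM` and `bulk_map` are hypothetical names, and the "checkpoint" is simply a full copy of the state.

```python
class ToyDSM:
    """Hypothetical shared memory: any cell can be read or written individually."""

    def __init__(self, size):
        self.cells = [0] * size
        self._checkpoint = list(self.cells)

    def read(self, addr):
        return self.cells[addr]

    def write(self, addr, value):
        self.cells[addr] = value              # fine-grained, in-place update

    def checkpoint(self):
        self._checkpoint = list(self.cells)   # copies the whole state

    def rollback(self):
        self.cells = list(self._checkpoint)   # writes since the checkpoint are lost


def bulk_map(dataset, f):
    """Coarse-grained write: one function over every element, yielding a new dataset."""
    return tuple(f(x) for x in dataset)


# DSM style: individual writes, recovery by rolling back to a checkpoint.
dsm = ToyDSM(4)
dsm.write(2, 7)
dsm.checkpoint()
dsm.write(0, 9)
state_before_failure = list(dsm.cells)
dsm.rollback()

# RDD style: the input is never modified; a new dataset is derived from it.
rdd_like = (1, 2, 3, 4)
doubled = bulk_map(rdd_like, lambda x: x * 2)
```

After `rollback()`, the write to cell 0 is gone while the checkpointed write to cell 2 survives. On the RDD side there is nothing to roll back, because `rdd_like` is never changed and `doubled` can always be recomputed from it.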

3.1 DSM

3.2 RDD

3.2.1 Iterative Operations on Spark RDD

3.2.2 Interactive Operations on Spark RDD

4. References

https://data-flair.training/blogs/spark-rdd-tutorial/

http://hadooptutorial.info/resilient-distributed-dataset/

http://andrewalexanderprice.com/blog20151021.php#.XYpwNpMzZTY

https://stackoverflow.com/questions/3766845/coarse-grained-vs-fine-grained

https://topic.alibabacloud.com/a/the-difference-between-rdd-and-dsm_8_8_10274074.html
