  • Difference between RDD and DSM
    DistributedSystem/Spark 2019. 9. 25. 04:30

    1. Overview

    An RDD (Resilient Distributed Dataset) is the core data structure of Spark. DSM (distributed shared memory) is a general-purpose memory abstraction in which applications can read and write arbitrary locations in a global address space. The main difference between the two is that an RDD can only be created ("written") through coarse-grained, bulk transformations, whereas DSM allows reads and writes to any individual memory location. Restricting applications to bulk writes is what makes efficient fault tolerance possible. In particular, because a lost partition can be recovered from its lineage, RDDs do not need to incur the overhead of checkpointing. After a failure, only the lost RDD partitions have to be recomputed, and they can be recomputed in parallel on different nodes without rolling back the entire program.
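
    The lineage-based recovery described above can be illustrated with a small toy model. This is only a sketch in plain Python, not the Spark API: the class `ToyRDD` and its methods `from_source`, `partition`, `lose_partition`, and the `computed` log are hypothetical names introduced for illustration. Each dataset remembers the function that derives a partition from its parent, so a lost partition can be rebuilt on its own.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Hypothetical, simplified stand-in for an RDD (not the Spark API)."""

    def __init__(
        self,
        num_partitions: int,
        compute: Callable[[int], List[int]],
        parent: Optional["ToyRDD"] = None,
    ) -> None:
        self.num_partitions = num_partitions
        self._compute = compute          # lineage: how to derive partition i
        self.parent = parent
        self._cache: Dict[int, List[int]] = {}
        self.computed: List[int] = []    # log of partition indices computed

    @classmethod
    def from_source(cls, partitions: List[List[int]]) -> "ToyRDD":
        # Assumption: the source data lives in stable storage and is never lost.
        stable = [list(p) for p in partitions]
        return cls(len(stable), lambda i: list(stable[i]))

    def map(self, f: Callable[[int], int]) -> "ToyRDD":
        # Coarse-grained "write": the same function is applied to every element.
        return ToyRDD(
            self.num_partitions,
            lambda i: [f(x) for x in self.partition(i)],
            parent=self,
        )

    def partition(self, i: int) -> List[int]:
        # Compute the partition from lineage only if it is not materialized.
        if i not in self._cache:
            self.computed.append(i)
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a node failure that drops one materialized partition.
        self._cache.pop(i, None)

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyRDD.from_source([[1, 2], [3, 4], [5, 6]])
squared = base.map(lambda x: x * x)

before = squared.collect()
squared.lose_partition(1)      # only partition 1 is lost
after = squared.collect()      # only partition 1 is recomputed
```

    In this sketch the second `collect()` recomputes partition 1 alone, which is recorded in `squared.computed`; the other partitions are untouched and nothing is rolled back. Real Spark tracks lineage per RDD and schedules the recomputation across the cluster, which this single-process model does not attempt to show.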

    2. Motivation

    • Iterative algorithms
    • Interactive data mining tools
    • DSM (Distributed Shared Memory) is a very general abstraction, but this generality makes it harder to implement efficiently and with fault tolerance on a commodity cluster

    3. Description

    Aspect: RDD vs. DSM

    Read
    • RDD: Either coarse-grained or fine-grained
      • Coarse-grained: an operation transforms the whole dataset, not an individual element of the dataset
      • Fine-grained: an operation can transform an individual element of the dataset
    • DSM: Fine-grained

    Write
    • RDD: Coarse-grained
    • DSM: Fine-grained

    Consistency
    • RDD: High (RDDs are immutable in nature)
    • DSM: Guaranteed only if the programmer follows the system's rules

    Fault-recovery mechanism
    • RDD: Lost data can be recovered at any moment using the lineage graph. Each transformation forms a new RDD, and RDDs are immutable, so recovery is straightforward
    • DSM: Checkpointing, which allows applications to roll back to a recent checkpoint rather than restarting

    Straggler mitigation
    • RDD: Possible, using backup tasks
    • DSM: Quite difficult to achieve

    Behavior if not enough RAM
    • RDD: RDDs are spilled to disk
    • DSM: Performance decreases when RAM runs out

    Bulk operation
    • RDD: Tasks can be scheduled based on data locality
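
    The contrast between fine-grained and coarse-grained writes in the comparison above can be made concrete with a minimal sketch. It is plain Python rather than Spark or any real DSM system, and the names `ToySharedMemory` and `bulk_map` are hypothetical, chosen only for this illustration.

```python
from typing import Callable, List, Tuple


class ToySharedMemory:
    """Hypothetical DSM-like store: any address can be read or written."""

    def __init__(self, size: int) -> None:
        self._cells: List[int] = [0] * size

    def read(self, addr: int) -> int:
        return self._cells[addr]

    def write(self, addr: int, value: int) -> None:
        # Fine-grained write: one cell is mutated in place.
        self._cells[addr] = value


def bulk_map(dataset: Tuple[int, ...], f: Callable[[int], int]) -> Tuple[int, ...]:
    """Coarse-grained write: apply f to every element and return a NEW dataset."""
    return tuple(f(x) for x in dataset)


# DSM style: update a single location; the previous value is overwritten.
mem = ToySharedMemory(4)
mem.write(2, 42)

# RDD style: the input is immutable, and the whole dataset is transformed at once.
original = (1, 2, 3, 4)
doubled = bulk_map(original, lambda x: x * 2)
```

    Because `bulk_map` never modifies `original`, the result can always be rebuilt from the input and the function, which is the property lineage-based recovery relies on. After `mem.write(2, 42)` the old cell value is gone, so a DSM-style system has to fall back on something like checkpoints to recover it.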

    3.1 DSM

    3.2 RDD

    3.2.1 Iterative Operations on Spark RDD

     

    3.2.2 Interactive Operations on Spark RDD

    4. References

    https://data-flair.training/blogs/spark-rdd-tutorial/

    http://hadooptutorial.info/resilient-distributed-dataset/

    http://andrewalexanderprice.com/blog20151021.php#.XYpwNpMzZTY

    https://stackoverflow.com/questions/3766845/coarse-grained-vs-fine-grained

    https://topic.alibabacloud.com/a/the-difference-between-rdd-and-dsm_8_8_10274074.html
