Resilient Distributed Dataset (RDD)
1. Overview
Resilient Distributed Datasets (RDDs) are the fundamental data structure of Spark. An RDD is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Spark uses RDDs to achieve faster and more efficient MapReduce-style operations.
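For example, the classic MapReduce word count collapses into a few RDD transformations plus one action. A minimal sketch, assuming a Spark shell where sc: SparkContext is in scope and a hypothetical local file input.txt:
// Assumes a Spark shell (sc in scope); input.txt is a hypothetical file.
val counts = sc.textFile("input.txt")
  .flatMap(_.split("\\W+"))          // "map" phase: split lines into words
  .map(word => (word, 1))            // emit (word, 1) pairs
  .reduceByKey(_ + _)                // "reduce" phase: sum counts per word, with partial aggregation

counts.take(10)                      // action: triggers the whole computation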
2. Description
2.1 Keywords of RDDs
- Resilient: fault-tolerant with the help of the RDD lineage graph, so missing or damaged partitions caused by node failures can be recomputed
- Distributed: data resides on multiple nodes in a cluster
- Dataset: a collection of partitioned data with primitive values or values of values, e.g. tuples or other objects that represent the records of the data you work with
2.2 Traits
- In-Memory: the data inside an RDD is stored in memory, as much and for as long as possible
- Immutable (read-only): an RDD does not change once created; it can only be transformed into new RDDs using transformations
- Lazy evaluated: the data inside an RDD is not available or transformed until an action triggers the execution (see the sketch below)
- Cacheable: able to hold all the data in persistent storage such as memory (the default and most preferred) or disk (the least preferred due to access speed)
- Parallel: processes data in parallel
- Typed: RDD records have types, e.g. Long in RDD[Long] or (Int, String) in RDD[(Int, String)]
- Partitioned: records are split into logical partitions and distributed across nodes in a cluster
- Location-Stickiness: an RDD can define placement preferences so partitions are computed as close to the records as possible
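A minimal Spark-shell sketch of lazy evaluation and caching (assuming sc: SparkContext is in scope, as in the shell): the transformations below only record lineage, the first action materializes the data, and cache keeps the computed partitions in memory for subsequent actions.
// Assumes a Spark shell where sc: SparkContext is already in scope.
val nums = sc.parallelize(1 to 1000000)       // ParallelCollectionRDD, nothing computed yet
val squares = nums.map(n => n.toLong * n)     // transformation: only lineage is recorded
  .filter(_ % 2 == 0)                         // still lazy, no job submitted yet
  .cache()                                    // mark the RDD to be kept in memory once computed

val first = squares.count()                   // action: triggers the actual computation
val again = squares.count()                   // reuses the cached partitions, no recomputation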
2.3 Terminology
- Partitions: the units of parallelism; each partition is comprised of records; the number of partitions of an RDD can be controlled with the repartition or coalesce transformations (see the sketch below)
- Transformations: lazy operations that return another RDD
- Actions: operations that trigger computation and return values
- Data locality: Spark tries to run computation as close to the data as possible, instead of wasting time sending data across the network by means of RDD shuffling
- RDD shuffling: the process of redistributing data across partitions (aka repartitioning), which may or may not move data across JVM processes or even over the wire; partial aggregation is leveraged to reduce data transfer
- Intermediate results: intermediate in-memory results can be reused across multiple data-intensive workloads with no need to copy large amounts of data over the network
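A small sketch (again assuming a Spark shell with sc in scope) of controlling the number of partitions: repartition always shuffles, while coalesce shuffles only when its shuffle flag is set.
// Assumes a Spark shell where sc: SparkContext is in scope.
val rdd = sc.parallelize(1 to 100, numSlices = 8)   // start with 8 partitions
rdd.getNumPartitions                                // 8

val wider  = rdd.repartition(16)                    // always shuffles; 16 partitions
val fewer  = rdd.coalesce(2)                        // narrows to 2 partitions without a shuffle
val fewer2 = rdd.coalesce(2, shuffle = true)        // same partition count, but via a shuffle

wider.getNumPartitions                              // 16
fewer.getNumPartitions                              // 2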
2.4 Contract
- Parent RDDs: the RDD's dependencies
- An array of partitions: the pieces the dataset is divided into
- A compute function: performs the computation on a partition
- Optional preferred locations (aka locality info): hosts for a partition where the records live or are closest to read from
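The contract is visible on any RDD through its public API; a quick illustration in the Spark shell (sc assumed in scope; the compute function itself is internal and is invoked by Spark when tasks run):
// Assumes a Spark shell where sc: SparkContext is in scope.
val rdd = sc.parallelize(1 to 100, 4)

rdd.partitions.length                       // the array of partitions the dataset is divided into (4)
rdd.dependencies                            // parent RDDs; empty for a ParallelCollectionRDD
rdd.partitioner                             // optional partitioner (see 2.5); None here
rdd.preferredLocations(rdd.partitions(0))   // optional locality info for one partition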
2.5 Representing RDDs
- Partitions: atomic pieces of the dataset
- Dependencies: the list of dependencies an RDD has on its parent RDDs or data sources (illustrated in the sketch below)
  - Narrow dependencies: each partition of the parent RDD is used by at most one child partition, e.g. map and filter operations
  - Wide dependencies: multiple child partitions use a single parent partition, e.g. groupByKey
- Iterator: a function that computes an RDD based on its parents
- Partitioner: whether the data is range/hash partitioned
- Preferred locations: nodes where a partition can be accessed faster due to data locality
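To see the two kinds of dependencies, one can inspect an RDD's dependencies and its lineage via toDebugString; a brief Spark-shell sketch (sc assumed in scope):
// Assumes a Spark shell where sc: SparkContext is in scope.
val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
val mapped  = pairs.mapValues(_ * 10)        // narrow dependency: one parent partition per child partition
val grouped = mapped.groupByKey()            // wide dependency: requires a shuffle across partitions

mapped.dependencies                          // List(org.apache.spark.OneToOneDependency@...)
grouped.dependencies                         // List(org.apache.spark.ShuffleDependency@...)
println(grouped.toDebugString)               // lineage graph, with a ShuffledRDD stage boundary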
2.6 Transformation
A transformation is a lazy operation on an RDD that returns another RDD, e.g. map, flatMap, filter, reduceByKey, join, cogroup, etc.
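As a rough sketch (Spark shell, sc in scope, toy data): each transformation below returns a new RDD, and nothing is computed until the final collect action.
// Assumes a Spark shell where sc: SparkContext is in scope.
val sales  = sc.parallelize(Seq(("apple", 3), ("pear", 1), ("apple", 2)))
val prices = sc.parallelize(Seq(("apple", 0.5), ("pear", 0.8)))

val totals  = sales.reduceByKey(_ + _)                          // transformation: RDD[(String, Int)]
val joined  = totals.join(prices)                               // transformation: RDD[(String, (Int, Double))]
val revenue = joined.mapValues { case (qty, price) => qty * price }

revenue.collect()                                               // action: Array((apple,2.5), (pear,0.8))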
2.7 Types of RDDs
- ParallelCollectionRDD: a collection of elements with numSlices partitions and optional location preferences; the result of the SparkContext.parallelize and SparkContext.makeRDD methods; the data collection is split into numSlices slices
- CoGroupedRDD: an RDD that cogroups its pair RDD parents; for each key k in the parent RDDs, the resulting RDD contains a tuple with the list of values for that key
- HadoopRDD: an RDD that provides core functionality for reading data stored in HDFS using the older MapReduce API; the most notable use case is the RDD returned by SparkContext.textFile
- MapPartitionsRDD: the result of calling operations like map, flatMap, filter, mapPartitions, etc.
- CoalescedRDD: the result of repartition or coalesce transformations; the coalesce transformation can trigger RDD shuffling depending on its shuffle flag
- ShuffledRDD: the result of shuffling, e.g. after repartition or coalesce transformations
- PipedRDD: an RDD created by piping elements to a forked external process
- PairRDD (implicit conversion via PairRDDFunctions): an RDD of key-value pairs, e.g. the result of groupByKey and join operations
- DoubleRDD (implicit conversion via org.apache.spark.rdd.DoubleRDDFunctions): an RDD of Double values
- SequenceFileRDD (implicit conversion via org.apache.spark.rdd.SequenceFileRDDFunctions): an RDD that can be saved as a SequenceFile
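Which concrete RDD type an operation produces can be checked directly in the shell; a short sketch assuming sc is in scope (repartition internally goes through a ShuffledRDD):
// Assumes a Spark shell where sc: SparkContext is in scope.
val base     = sc.parallelize(1 to 10)       // ParallelCollectionRDD
val mapped   = base.map(_ * 2)               // MapPartitionsRDD
val shrunk   = mapped.coalesce(2)            // CoalescedRDD (no shuffle by default)
val shuffled = mapped.repartition(4)         // shuffles; backed by a ShuffledRDD

base.getClass.getSimpleName                  // "ParallelCollectionRDD"
mapped.getClass.getSimpleName                // "MapPartitionsRDD"
shrunk.getClass.getSimpleName                // "CoalescedRDD"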
3. Example
3.1 Creating RDDs
3.1.1 SparkContext.parallelize
- Mainly used to learn Spark in the Spark shell
- Requires all the data to be available on a single machine
scala> val rdd = sc.parallelize(1 to 1000)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:25

scala> val data = Seq.fill(10)(util.Random.nextInt)
data: Seq[Int] = List(-964985204, 1662791, -1820544313, -383666422, -111039198, 310967683, 1114081267, 1244509086, 1797452433, 124035586)

scala> val rdd = sc.parallelize(data)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:29
3.1.2 SparkContext.textFile
Creating an RDD by reading files
scala> val words = sc.textFile("README.md").flatMap(_.split("\\W+")).cache
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[27] at flatMap at <console>:24
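At this point no file has actually been read, since textFile and flatMap are lazy. A follow-up sketch of actions on words (the resulting counts depend on the file contents):
// Still in the same shell: the first action reads README.md and fills the cache.
val totalWords  = words.count()             // action: reads the file, runs flatMap, caches the words
val uniqueWords = words.distinct().count()  // later actions reuse the cached words instead of re-reading the file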
4. References
https://data-flair.training/blogs/spark-rdd-tutorial/
https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd.html