spark
-
Difference between Hadoop and Spark
DistributedSystem/HadoopEcyosystem 2019. 9. 25. 04:26
1. Overview
Clarify the difference between Hadoop and Spark.
2. Description
Difference between Hadoop and Spark

Features | Hadoop | Spark
Data processing | Batch processing only | Batch processing as well as real-time processing
Processing speed | Slower than Spark because of disk I/O latency | Up to 100x faster in memory and up to 10x faster when running on disk
Category | Data processing engine | Data analytics engine
..
-
Resilient Distributed Dataset (RDD)
DistributedSystem/Spark 2019. 9. 8. 22:32
1. Overview
The Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes. There are two ways to create RDDs: parallelizi..
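
The two properties described in this excerpt, immutability and division into logical partitions, can be illustrated with a small toy model in plain Python. This is only a sketch under stated assumptions: `ToyRDD`, `from_list`, `map`, and `collect` here are hypothetical names written for this illustration, everything runs in a single process, and none of it is Spark's actual API or implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class ToyRDD:
    """Hypothetical stand-in for an RDD: an immutable, partitioned collection."""

    partitions: Tuple[Tuple[Any, ...], ...]

    @staticmethod
    def from_list(items: Sequence[Any], num_partitions: int) -> "ToyRDD":
        # Split the input into contiguous chunks, one per logical partition.
        n = len(items)
        bounds = [n * i // num_partitions for i in range(num_partitions + 1)]
        return ToyRDD(
            tuple(tuple(items[bounds[i]:bounds[i + 1]]) for i in range(num_partitions))
        )

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        # Each partition is transformed independently; in a real cluster the
        # partitions could be processed on different nodes. A new ToyRDD is
        # returned and the original is left untouched.
        return ToyRDD(tuple(tuple(fn(x) for x in part) for part in self.partitions))

    def collect(self) -> List[Any]:
        # Gather the elements of all partitions back into one list.
        return [x for part in self.partitions for x in part]


numbers = ToyRDD.from_list([1, 2, 3, 4, 5], num_partitions=2)
doubled = numbers.map(lambda x: x * 2)

print(numbers.partitions)  # ((1, 2), (3, 4, 5))
print(doubled.collect())   # [2, 4, 6, 8, 10]
print(numbers.collect())   # [1, 2, 3, 4, 5]
```

Because `map` builds a new object instead of modifying the existing one, the original collection stays valid after the transformation, which is the behavior the word "immutable" refers to above.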