Difference between Hadoop and Spark
2019. 9. 25. 04:26
1. Overview
This post clarifies the differences between Hadoop and Spark.
2. Description
- Difference between Hadoop and Spark
| Features | Hadoop | Spark |
| --- | --- | --- |
| Data processing | Batch processing only | Batch processing as well as real-time processing |
| Processing speed | Slower than Spark because of disk I/O latency | Up to 100x faster in memory and 10x faster when running on disk |
| Category | Data processing engine | Data analytics engine |
| Costs | Less costly than Spark | Costlier because it needs a large amount of RAM |
| Scalability | Limited to 1000 nodes in a single cluster | Limited to 1000 nodes in a single cluster |
| Machine learning | Integrates with Apache Mahout for machine learning | Built-in machine learning APIs |
| Compatibility | Compatible with most data sources and file formats | Integrates with all data sources and file formats supported by a Hadoop cluster |
| Security | More secure than Spark | Security features are still evolving and maturing |
| Scheduler | Depends on an external scheduler | Has its own scheduler |
| Fault tolerance | Uses replication for fault tolerance | Uses RDDs and other data storage models for fault tolerance |
| Ease of use | Somewhat more complex than Spark because of its Java APIs | Easier to use because of its rich APIs |
| Duplicate elimination | Not supported | Eliminates duplicates |
| Language support | Primarily Java, but also supports C, C++, Ruby, Python, Perl, and Groovy | Supports Java, Scala, Python, and R |
| Latency | Very high latency | Much faster than Hadoop MapReduce |
| Complexity | Difficult to write and debug code | Easy to write and debug code |
| Apache community | Open-source framework | Open-source framework |
| Coding | More lines of code | Fewer lines of code |
| Interactive mode | Not interactive | Interactive |
| Infrastructure | Commodity hardware | Mid- to high-end hardware |
| SQL | Supported through Hive Query Language | Supported through Spark SQL |
| File system | Built-in HDFS | Needs an external store such as HDFS, Google Cloud Storage, Amazon S3, or Microsoft Azure |
| Cluster manager | Built-in Hadoop YARN | Runs on YARN, Mesos, or its own Standalone manager |
| Storage | Persistent storage (HDFS) | RDD |

Usages of Hadoop
- Linear processing of large datasets
- No intermediate solution required

Usages of Spark
- Fast and interactive data processing
- Joining datasets
- Graph processing
- Iterative jobs
- Real-time processing
- Machine learning
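
As a rough illustration of the MapReduce model that the Hadoop column refers to, the following single-process Python sketch counts words in three explicit phases (map, shuffle, reduce). It does not use the Hadoop API. The function names map_phase, shuffle_phase, reduce_phase, and word_count, as well as the sample input, are hypothetical and exist only for this example.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a MapReduce mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Group values by key; a real cluster does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word, as a MapReduce reducer would."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, used only for illustration.
lines = ["Hadoop stores data in HDFS", "Spark reads data from HDFS"]
counts = word_count(lines)
print(counts["data"], counts["hdfs"], counts["hadoop"])  # prints: 2 2 1
```

In Spark the same job is usually written as a short chain of RDD transformations such as flatMap, map, and reduceByKey, which is one reason the table lists Spark as needing fewer lines of code.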
3. References
https://www.educba.com/mapreduce-vs-apache-spark/
https://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html