Difference between Hadoop and Spark

데먕 2019. 9. 25. 04:26

1. Overview

This post clarifies the differences between Hadoop and Spark.

2. Description

Difference between Hadoop and Spark

Features	Hadoop	Spark
Data processing	Batch processing only	Batch processing as well as real-time (streaming) processing
Processing speed	Slower than Spark because of disk I/O latency	Up to 100x faster in memory and up to 10x faster when running on disk
Category	Data processing engine	Data analytics engine
Cost	Less costly than Spark	Costlier because it needs a large amount of RAM
Scalability	Scales to thousands of nodes in a single cluster	Scales to thousands of nodes in a single cluster
Machine Learning	Integrates with Apache Mahout for machine learning	Built-in machine learning APIs (MLlib)
Compatibility	Compatible with most data sources and file formats	Can integrate with all data sources and file formats supported by a Hadoop cluster
Security	More secure than Spark	Still evolving and maturing
Scheduler	Depends on an external scheduler	Has its own scheduler
Fault tolerance	Uses data replication for fault tolerance	Uses RDDs and other data storage models for fault tolerance
Ease of Use	Somewhat more complex than Spark because of its Java APIs	Easier to use because of its rich APIs
Duplicate Elimination	Does not support this feature	Eliminates duplicates
Language support	Primarily Java, but also supports C, C++, Ruby, Python, Perl, and Groovy	Supports Java, Scala, Python, and R
Latency	Very high latency	Much lower latency than Hadoop MapReduce
Complexity	Code is difficult to write and debug	Code is easy to write and debug
Apache Community	Open-source framework	Open-source framework
Coding	More lines of code	Fewer lines of code
Interactive Mode	Not interactive	Interactive
Infrastructure	Commodity hardware	Mid- to high-end hardware
SQL	Supported through Hive Query Language (HiveQL)	Supported through Spark SQL
File system	Built-in HDFS	Needs an external storage system such as HDFS, Google Cloud Storage, Amazon S3, or Microsoft Azure
Cluster Manager	Built-in Hadoop YARN	Runs on a cluster manager such as YARN, Mesos, or Spark's standalone mode
Storage	Persistent storage (HDFS)	In-memory RDDs
Use cases	Linear processing of large datasets; no intermediate solution required	Fast and interactive data processing; joining datasets; graph processing; iterative jobs; real-time processing; machine learning
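
The processing speed, latency, and coding rows in the table can be made a little more concrete with a toy word count. The sketch below is plain Python and uses neither the Hadoop nor the Spark API; it only imitates the two processing styles. The names `LINES`, `mapreduce_word_count`, `in_memory_word_count`, and the spill file `map_output.jsonl` are assumptions made up for this illustration. The first function writes its map output to disk before the reduce step reads it back, which is roughly where MapReduce pays its disk I/O cost; the second chains the same steps in memory, which is roughly what Spark does between stages.

```python
import json
import tempfile
from collections import Counter, defaultdict
from pathlib import Path

# Hypothetical input data, used only for this illustration.
LINES = ["spark and hadoop", "hadoop stores data", "spark caches data"]


def mapreduce_word_count(lines, work_dir):
    """MapReduce-style sketch: map output goes to disk before reduce reads it."""
    spill = Path(work_dir) / "map_output.jsonl"  # stand-in for intermediate files

    # Map phase: emit (word, 1) pairs and persist them to disk.
    with spill.open("w") as out:
        for line in lines:
            for word in line.split():
                out.write(json.dumps([word, 1]) + "\n")

    # Shuffle phase: read the pairs back from disk and group them by key.
    groups = defaultdict(list)
    with spill.open() as src:
        for record in src:
            word, count = json.loads(record)
            groups[word].append(count)

    # Reduce phase: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}


def in_memory_word_count(lines):
    """Spark-style sketch: steps are chained and intermediate data stays in memory."""
    words = (word for line in lines for word in line.split())
    return dict(Counter(words))


with tempfile.TemporaryDirectory() as work_dir:
    disk_result = mapreduce_word_count(LINES, work_dir)
memory_result = in_memory_word_count(LINES)

# Both styles compute the same counts; only the path the data takes differs.
print(disk_result == memory_result)
print(memory_result)
```

Both functions return the same counts. On a real cluster the gap comes from the disk and network round trips between stages, which this single-process sketch only hints at, so it should not be read as a benchmark of either framework.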

3. References

https://www.educba.com/mapreduce-vs-apache-spark/

https://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html