Difference between Hadoop and Spark
데먕
2019. 9. 25. 04:26
1. Overview
Clarify the differences between Hadoop (MapReduce) and Spark.
2. Description
- Difference between Hadoop and Spark
Features | Hadoop | Spark |
Data processing | Batch processing only | Batch processing as well as near-real-time (stream) processing |
Processing speed | Slower than Spark because of disk I/O latency | Up to 100x faster in memory and up to 10x faster on disk (commonly cited figures; actual gains depend on the workload) |
Category | Data processing engine | Data analytics engine |
Costs | Less costly than Spark | Costlier because it needs a large amount of RAM |
Scalability | Scales to thousands of nodes in a single cluster | Scales to thousands of nodes in a single cluster |
Machine Learning | Integrates with Apache Mahout for machine learning | Built-in machine learning APIs (MLlib) |
Compatibility | Compatible with most data sources and file formats | Can integrate with all data sources and file formats supported by a Hadoop cluster |
Security | More secure than Spark | Still evolving and maturing |
Scheduler | Depends on an external scheduler | Has its own scheduler |
Fault tolerance | Uses replication for fault tolerance | Uses RDDs and other data storage models for fault tolerance |
Ease of Use | Somewhat more complex than Spark because of its Java APIs | Easier to use because of its rich APIs |
Duplicate Elimination | Not supported | Eliminates duplicates |
Language support | Primarily Java, but also supports C, C++, Ruby, Python, Perl, and Groovy | Supports Java, Scala, Python, and R |
Latency | Very high latency | Much lower latency than Hadoop MapReduce |
Complexity | Difficult to write and debug code | Easy to write and debug code |
Apache Community | Open-source framework | Open-source framework |
Coding | More lines of code | Fewer lines of code |
Interactive Mode | Not interactive | Interactive |
Infrastructure | Commodity hardware | Mid- to high-end hardware |
SQL | Supported through Hive Query Language (HiveQL) | Supported through Spark SQL |
File system | Built-in HDFS | Needs external storage such as HDFS, Google Cloud Storage, Amazon S3, or Microsoft Azure |
Cluster Manager | Built-in Hadoop YARN | Needs a cluster manager such as YARN, Mesos, or Spark's standalone mode |
Storage | Persistent storage (HDFS) | RDD |
Usages |
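The fault tolerance row is perhaps the least intuitive one in the table, so here is a rough sketch of the idea behind it. Hadoop keeps replicated copies of data blocks, while a Spark RDD remembers how it was derived (its lineage) and can rebuild lost in-memory data by re-running those steps. The code below is a toy model in plain Python, not Spark's API: the class ToyRDD and its methods map, collect, and lose_cache are hypothetical names made up for this illustration.

```python
# Toy illustration only: ToyRDD is a hypothetical class, not part of Spark.
from typing import Callable, List, Optional


class ToyRDD:
    """A dataset that remembers how it was derived (its lineage)."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[int], int]] = None):
        self.source = source    # set only for the root dataset (stands in for persistent storage)
        self.parent = parent    # the dataset this one was derived from
        self.fn = fn            # the transformation applied to the parent
        self.cache: Optional[List[int]] = None  # in-memory result, which may be lost

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Transformations are lazy: only the lineage is recorded here.
        return ToyRDD(parent=self, fn=fn)

    def collect(self) -> List[int]:
        # Reuse the in-memory copy if present; otherwise rebuild it from the lineage.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = [self.fn(x) for x in self.parent.collect()]
        return self.cache

    def lose_cache(self) -> None:
        # Simulate a node failure that wipes the in-memory data.
        self.cache = None


base = ToyRDD(source=[1, 2, 3])
squared = base.map(lambda x: x * x)

first = squared.collect()       # computed and kept in memory
squared.lose_cache()            # the in-memory copy disappears
recomputed = squared.collect()  # rebuilt by replaying the recorded transformation
```

In this sketch no second copy of the squared data is ever stored. Recovery costs recomputation time instead of extra storage, which is roughly the trade-off the table describes between replication and RDD lineage.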
3. References
https://www.educba.com/mapreduce-vs-apache-spark/
https://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html