  • Difference between Hadoop and Spark
    DistributedSystem/HadoopEcyosystem 2019. 9. 25. 04:26

    1. Overview

    This post clarifies the differences between Hadoop and Spark.

     

    2. Description

    • Difference between Hadoop and Spark

    | Feature | Hadoop | Spark |
    |---|---|---|
    | Data processing | Batch processing only | Batch processing as well as real-time processing |
    | Processing speed | Slower than Spark because of disk I/O latency | Up to 100x faster in memory and up to 10x faster on disk |
    | Category | Data processing engine | Data analytics engine |
    | Cost | Less costly than Spark | Costlier because it needs a large amount of RAM |
    | Scalability | Limited to 1,000 nodes in a single cluster | Limited to 1,000 nodes in a single cluster |
    | Machine learning | Compatible with Apache Mahout for machine learning | Built-in machine learning APIs |
    | Compatibility | Compatible with most data sources and file formats | Integrates with all data sources and file formats supported by a Hadoop cluster |
    | Security | More secure than Spark | Still evolving and maturing |
    | Scheduler | Depends on an external scheduler | Has its own scheduler |
    | Fault tolerance | Uses replication | Uses RDDs and other data storage models |
    | Ease of use | Somewhat more complex than Spark because of its Java APIs | Easier to use because of rich APIs |
    | Duplicate elimination | Not supported | Eliminates duplicates |
    | Language support | Primarily Java, but also supports C, C++, Ruby, Python, Perl, and Groovy | Supports Java, Scala, Python, and R |
    | Latency | Very high latency | Much lower latency than Hadoop MapReduce |
    | Complexity | Code is difficult to write and debug | Code is easy to write and debug |
    | Apache community | Open-source framework | Open-source framework |
    | Coding | More lines of code | Fewer lines of code |
    | Interactive mode | Not interactive | Interactive |
    | Infrastructure | Commodity hardware | Mid- to high-level hardware |
    | SQL | Supported through Hive Query Language | Supported through Spark SQL |
    | File system | Built-in HDFS | Needs external storage such as HDFS, Google Cloud Storage, Amazon S3, or Microsoft Azure |
    | Cluster manager | Built-in Hadoop YARN | Needs a cluster manager such as YARN, Mesos, or Standalone |
    | Storage | Persistent storage (HDFS) | RDD |

    Usages
    • Hadoop
      • Linear processing of large datasets
      • No intermediate solution required
    • Spark
      • Fast and interactive data processing
      • Joining datasets
      • Graph processing
      • Iterative jobs
      • Real-time processing
      • Machine learning
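
The "Coding" and "Complexity" rows can be illustrated with a toy, single-process sketch in plain Python. It uses neither Hadoop nor Spark: all function names below are hypothetical and only mimic the shape of the two programming models. The first version spells out separate map, shuffle, and reduce phases, as a MapReduce-style job would. The second expresses the same word count as one chained in-memory expression, closer to how a Spark-style API reads.

```python
from collections import Counter, defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the grouped counts for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def mapreduce_word_count(lines):
    """MapReduce-style word count: three explicit phases (toy model only)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


def chained_word_count(lines):
    """Chained style: the same result as a single in-memory pipeline."""
    return dict(Counter(word.lower() for line in lines for word in line.split()))


if __name__ == "__main__":
    sample = ["Spark and Hadoop", "Hadoop and HDFS"]
    print(mapreduce_word_count(sample))
    print(chained_word_count(sample))
```

Both functions return the same counts. The sketch only shows the difference in how much code each style needs. It says nothing about real cluster performance, which depends on disk I/O, memory, and the cluster manager in use.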

     

    3. References

    https://www.educba.com/mapreduce-vs-apache-spark/

    https://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html
