DistributedSystem
-
Difference between Hadoop and SparkDistributedSystem/HadoopEcyosystem 2019. 9. 25. 04:26
1. Overview Clarify the difference between Hadoop and Spark 2. Description Difference between Hadoop and Spark Features Hadoop Spark Data processing Only for batch processing Batch processing as well as real-time processing Processing speed Slower than Spark cause of I/O disk latency 100x faster in memory and 10x faster while running on disk Category Data processing engine Data analytics engine ..
-
Apache SparkDistributedSystem/Spark 2019. 9. 20. 00:55
1. Overview An open-source distributed general-purpose cluster computing framework with mostly in-memory data processing engine that can do ETL, analytics, machine learning, and graph processing on large volumes of data at rest(batch processing) or in motion(streaming processing) with rich concise high-level APIs for the programming languages: Scala, Python, Java, R, and SQL 2. Description 2.1 A..
-
Hadoop Yet Another Resource Negotiator(Yarn)DistributedSystem/HadoopEcyosystem 2019. 9. 14. 16:28
1. Overview A platform that is responsible for managing computing resources in clusters and using them for scheduling users' applications. Yarn allows different data processing engines like graph processing, interactive processing, stream processing as well as batch processing to run and process data stored in HDFS(Hadoop Distributed File System). Apart from resource management, Yarn also does J..
-
Resilient Distributed Dataset(RDD)DistributedSystem/Spark 2019. 9. 8. 22:32
1. Overview Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes. There are two ways to create RDDs − parallelizi..
-
Hadoop Distributed File System(HDFS)DistributedSystem/HadoopEcyosystem 2019. 9. 8. 21:28
1. Overview Apache HDFS or Hadoop Distributed File System is a block-structured file system where each file is divided into blocks of a pre-determined size. These blocks are stored across a cluster of one or several machines. HDFS follows a Master/Slave Architecture, where a cluster comprises a single Name node(Master node) and all the other nodes are Data nodes(slave nodes). HDFS can be deploye..
-
Apache HadoopDistributedSystem 2019. 9. 5. 03:44
1. Overview Apache Hadoop is a set of software technology components that together form a scalable system optimized for analyzing data. Data analyzed on Hadoop has several typical characteristics. Structured: For example, customer data, transaction data, and clickstream data that is recorded when people click links while visiting websites Unstructured: For example, text from web-based news feeds..