Data Engineering
-
AWS MSK | Data Engineering | 2022. 7. 1. 20:05
Overview: An alternative to Kinesis (Kafka vs. Kinesis is covered in the next lecture). Fully managed Apache Kafka on AWS. Lets you create, update, and delete clusters. MSK creates and manages Kafka broker nodes and ZooKeeper nodes for you. Deploy the MSK cluster in your VPC, multi-AZ (up to 3 for HA). Automatic recovery from common Apache Kafka failures. Data is stored on EBS volumes. You can build producers and consumers for y..
-
Collection Introduction | Data Engineering | 2022. 6. 29. 19:37
Real time (immediate actions): Kinesis Data Streams (KDS), Simple Queue Service (SQS), Internet of Things (IoT). Near-real time (reactive actions): Kinesis Data Firehose (KDF), Database Migration Service (DMS). Batch (historical analysis): Snowball, Data Pipeline.
-
Data Format | Data Engineering | 2022. 6. 17. 11:02
Parquet: binary format, machine-readable, splittable, column-wise, good for read-heavy apps, compressible, mostly used in Apache Spark apps. Avro: binary format, machine-readable, splittable, row-wise, good for write-heavy apps, compressible, supports schema evolution, mostly used in Kafka apps. ORC: binary format, machine-readable, splittable, column-wise, good for read-heavy apps, mostly used in Hive apps. Protocol ..
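The row-wise versus column-wise distinction in the excerpt above can be sketched with plain Python structures. This is only a toy illustration of the two layouts, not how Avro, Parquet, or ORC actually encode data on disk; the records and field names are made up.

```python
# Toy records; the field names are hypothetical and only for illustration.
records = [
    {"user": "a", "clicks": 3, "country": "KR"},
    {"user": "b", "clicks": 5, "country": "US"},
    {"user": "c", "clicks": 2, "country": "KR"},
]


def to_row_layout(rows):
    """Row-wise: all fields of one record sit together (Avro-like)."""
    return [tuple(r.values()) for r in rows]


def to_column_layout(rows):
    """Column-wise: all values of one field sit together (Parquet/ORC-like)."""
    return {name: [r[name] for r in rows] for name in rows[0]}


row_layout = to_row_layout(records)
column_layout = to_column_layout(records)

# Write-heavy pattern: appending a record adds one contiguous row.
row_layout.append(("d", 7, "JP"))

# Read-heavy pattern: an aggregate over one field reads only that column.
total_clicks = sum(column_layout["clicks"])
```

Appending touches a single row in the row layout, while the aggregate reads a single list in the column layout, which is the intuition behind "write-heavy" versus "read-heavy".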
-
Apache Airflow | Data Engineering | 2022. 6. 6. 11:32
Introduction: Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. Airflow is an orchestrator allowing you to execute your tasks at the right time, in the right way, in the right order. Airflow allows you to create a data pipeline that will interact with many different tools so that you can execute your tasks at the right time, in the right way, a..
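The "right order" part of orchestration comes down to running tasks only after their dependencies have finished. The sketch below shows that idea with Python's standard `graphlib` module; it does not use Airflow's API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_in_order(deps):
    """Return the tasks in an order that respects every dependency."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        # A real orchestrator would launch the task here; we only record it.
        executed.append(task)
    return executed


order = run_in_order(dependencies)
```

Scheduling ("the right time") and retries or operators ("the right way") are what an orchestrator adds on top of this ordering step.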
-
Kinesis | Data Engineering | 2019. 9. 20. 00:42
Kinesis Data Streams: real-time data stream. Retention between 1 day and 365 days. Ability to reprocess (replay) data. Once data is inserted in Kinesis, it can’t be deleted (immutability). Data that shares the same partition key goes to the same shard (ordering). Producers: AWS SDK, Kinesis Producer Library (KPL), Kinesis Agent. Consumers: write your own (Kinesis Client Library (KCL), AWS SDK) or managed (AWS Lamb..
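The ordering guarantee above follows from how a partition key is mapped to a shard: Kinesis takes an MD5 hash of the key, giving a 128-bit integer, and each shard owns a range of that hash space. A minimal sketch, assuming the shards split the hash space evenly (real streams can have uneven ranges after resharding); `shard_for` is a hypothetical helper, not an AWS SDK call.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so keys in the leftover tail of the space land in the last shard.
    return min(hash_key // range_size, shard_count - 1)


# The same key always lands on the same shard, which preserves per-key ordering.
first = shard_for("user-1", 4)
second = shard_for("user-1", 4)
```

Because the mapping is deterministic, every record with the same partition key is appended to the same shard in arrival order.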
-
Apache Kafka | Data Engineering | 2019. 9. 5. 01:33
1. Overview: Apache Kafka is a community distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Since being created and open-sourced by LinkedIn in 2011, Kafka has quickly evolved from a messaging queue to a full-fledged event streaming platform. Founded by the origin..
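The commit-log abstraction mentioned above can be sketched as an append-only list in which every record gets an increasing offset and each consumer tracks its own read position. This is a single-process toy, not Kafka's implementation or client API; `CommitLog` and the consumer names are hypothetical.

```python
class CommitLog:
    """Append-only log: records are never modified, only added at the end."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Store a record and return the offset it was written at."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


log = CommitLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

# Each consumer keeps its own offset, so consumers read independently
# and can replay by rewinding their offset.
consumer_offsets = {"billing": 0, "analytics": 2}
batch = log.read(consumer_offsets["billing"])
consumer_offsets["billing"] += len(batch)
```

Reading does not remove records, which is what separates a log from a classic message queue: the `analytics` consumer can still read from offset 2 after `billing` has consumed everything.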