All categories
-
Collection Introduction | Data Engineering | 2022. 6. 29. 19:37
Real time (immediate actions): Kinesis Data Streams (KDS), Simple Queue Service (SQS), Internet of Things (IoT). Near-real time (reactive actions): Kinesis Data Firehose (KDF), Database Migration Service (DMS). Batch (historical analysis): Snowball, Data Pipeline.
-
Data Format | Data Engineering | 2022. 6. 17. 11:02
Parquet: binary format, machine-readable, splittable, column-wise, good for read-heavy apps, compressible, mostly used in Apache Spark apps. Avro: binary format, machine-readable, splittable, row-wise, good for write-heavy apps, compressible, supports schema evolution, mostly used in Kafka apps. ORC: binary format, machine-readable, splittable, column-wise, good for read-heavy apps, mostly used in Hive apps. Protocol ..
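The row-wise versus column-wise distinction in this excerpt can be illustrated with a small standard-library sketch. It does not use the real Parquet, Avro, or ORC libraries; the record fields and function names are made up for illustration, and the point is only to show why a column layout suits read-heavy aggregates while a row layout suits appending whole records.

```python
# Toy records; the field names are hypothetical.
records = [
    {"user": "a", "country": "KR", "amount": 10},
    {"user": "b", "country": "US", "amount": 25},
    {"user": "c", "country": "KR", "amount": 5},
]

# Row-wise layout (the idea behind Avro): each record is stored whole,
# so appending one record adds one contiguous entry.
row_store = [tuple(r.values()) for r in records]

# Column-wise layout (the idea behind Parquet and ORC): values of the same
# field are stored together, so an aggregate reads only the column it needs.
column_store = {key: [r[key] for r in records] for key in records[0]}


def total_amount_rows(rows):
    # Walks every full record just to pick out one field (index 2 = amount).
    return sum(row[2] for row in rows)


def total_amount_columns(columns):
    # Reads a single column and ignores the others.
    return sum(columns["amount"])
```

Both functions return the same total; the difference is how much data each has to touch to get there.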
-
Athena | Cloud/AWS | 2022. 6. 17. 10:56
Overview: Interactive query service for S3 (SQL). No need to load data; it stays in S3. Presto under the hood. Serverless. Handles unstructured, semi-structured, or structured data. Supports many data formats: CSV, JSON, ORC, Parquet, Avro. Examples: ad-hoc queries of web logs; querying staging data before loading to Redshift; analyzing CloudTrail/CloudFront/VPC/ELB etc. logs in S3; integration with Jupyter, Zeppelin, RStudio n..
-
Glue | Cloud/AWS | 2022. 6. 15. 17:59
Introduction: Serverless discovery and definition of table definitions and schema: S3 “data lakes”, RDS, Redshift, Athena, EMR, most other SQL databases. Custom ETL jobs: trigger-driven, on a schedule, or on-demand. Fully managed. Uses Apache Spark under the hood (no need to manage the Spark cluster). Glue Crawler: the Glue crawler scans data in S3 and creates the schema. Can run periodically. Populates the Glue Data C..
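To make "scans data and creates the schema" concrete, here is a minimal sketch of schema inference over a CSV sample. It is not how the Glue crawler is implemented and does not call any AWS API; the function names, the sample data, and the type labels (`bigint`, `double`, `string`) are assumptions chosen for illustration.

```python
import csv
import io


def infer_type(values):
    # Try the narrowest type first, then fall back to string.
    for caster, name in ((int, "bigint"), (float, "double")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    # Read all rows, then infer one type per column from its values.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {col: infer_type([row[col] for row in rows]) for col in rows[0]}


# Hypothetical sample file contents.
sample = "id,price,city\n1,9.5,Seoul\n2,12,Busan\n"
schema = infer_schema(sample)
```

A real crawler also has to handle partitions, nested formats, and schema changes over time, none of which this sketch attempts.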
-
Apache Airflow | Data Engineering | 2022. 6. 6. 11:32
Introduction: Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. As an orchestrator, it executes your tasks at the right time, in the right way, and in the right order. Airflow allows you to create a data pipeline that interacts with many different tools so that you can execute your tasks at the right time, in the right way, a..
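The "right order" part of orchestration comes down to running each task only after the tasks it depends on. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9+). It is not Airflow's API; the task names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(deps):
    # static_order() yields each task only after all of its dependencies.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

An orchestrator adds scheduling, retries, and monitoring on top of this ordering step.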
-
AWS Redshift | Cloud/AWS | 2022. 6. 5. 17:30
Overview: Fully managed, petabyte-scale data warehouse service. Advertised as delivering 10 times better performance than other data warehouses, via machine learning, massively parallel query execution (MPP), and columnar storage. Designed for OLAP, not OLTP. Cost-effective. SQL, ODBC, JDBC interfaces. Scale up or down on demand. Built-in replication and backups. Monitoring via CloudWatch/CloudTrail. Use cases: accelerate analytics workloads; Unifie..
-
Lake Formation | Cloud/AWS | 2022. 4. 26. 11:43
Introduction: Can tie to IAM users/roles, SAML, or external AWS accounts. Can use policy tags on databases, tables, or columns. Can select specific permissions for tables or columns. Overview: “Makes it easy to set up a secure data lake in days”. Loading data and monitoring data flows. Setting up partitions. Encryption and managing keys. Defining transformation jobs and monitoring them. Access control. Auditing ..
-
Lambda | Cloud/AWS | 2021. 3. 9. 09:46
1. Comparison between EC2 and Lambda. 1.1 EC2: virtual servers in the cloud; limited by RAM and CPU; continuously running; scaling means intervention to add/remove servers. 1.2 Lambda: virtual functions, no servers to manage; limited by time (short executions); runs on demand; scaling is automated. 2. Benefits of AWS Lambda. 2.1 Easy pricing: pay per request and compute time; free tier of 1,000,000 AWS Lambd..
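A "virtual function" in the Python runtime is an ordinary function that receives an event and a context object. The sketch below follows the conventional `lambda_handler(event, context)` signature; the `name` field in the event and the shape of the returned dictionary are assumptions for illustration and depend on how the function is actually invoked.

```python
import json


def lambda_handler(event, context):
    # `event` carries the request payload; `context` carries runtime metadata
    # and is unused in this sketch.
    name = event.get("name", "world")  # hypothetical input field
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

Because the handler is a plain function, it can be called locally with a dictionary and `None` for a quick check before deploying.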