Glue

    Introduction

    1. Serverless discovery and definition of table definitions and schemas for:
      1. S3 “Data Lakes”
      2. RDS
      3. Redshift
      4. Athena
      5. EMR
      6. Most other SQL databases
    2. Custom ETL jobs
      1. Trigger-driven, on a schedule, or on-demand (see the sketch after this list)
      2. Fully managed
      3. Uses Apache Spark under the hood (no need to manage the Spark cluster)
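
    A minimal sketch of starting a Glue job on demand with the AWS SDK for Java v2, usable from Scala; the job name "my-etl-job" is a placeholder:

    import software.amazon.awssdk.services.glue.GlueClient
    import software.amazon.awssdk.services.glue.model.StartJobRunRequest

    // Kick off a Glue ETL job on demand; Glue provisions the Spark resources for the run
    val glue = GlueClient.create()
    val run = glue.startJobRun(
      StartJobRunRequest.builder().jobName("my-etl-job").build())
    println(run.jobRunId())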

    Glue Crawler

    • The Glue Crawler scans data in S3 and infers a schema
    • Can run periodically or on demand (see the sketch after this list)
    • Populates the Glue Data Catalog
      • Stores only table definitions
      • Original data stays in S3
    • Once cataloged, you can treat your unstructured data like it’s structured
      • Redshift Spectrum
      • Athena
      • EMR
      • Quicksight
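
    A minimal sketch, assuming a crawler named "s3-data-lake-crawler" and a catalog database "data_lake" (both placeholders), of starting a crawl and listing the resulting table definitions via the AWS SDK for Java v2:

    import software.amazon.awssdk.services.glue.GlueClient
    import software.amazon.awssdk.services.glue.model.{GetTablesRequest, StartCrawlerRequest}

    val glue = GlueClient.create()
    // Start the crawler; it scans S3 and writes table definitions into the Data Catalog
    glue.startCrawler(StartCrawlerRequest.builder().name("s3-data-lake-crawler").build())
    // Later, once the crawl has finished, list what was cataloged
    glue.getTables(GetTablesRequest.builder().databaseName("data_lake").build())
      .tableList()
      .forEach(t => println(t.name()))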

    Glue + Hive

    • Hive lets you run SQL-like queries from EMR
    • The Glue Data Catalog can serve as a Hive “metastore” (configuration sketch after this list)
    • You can also import a Hive metastore into Glue
    • Glue catalog ↔ Hive metastore
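
    A sketch of the EMR configuration that points Hive at the Glue Data Catalog; the "hive-site" classification and factory class follow the pattern AWS documents for this integration:

    [
      {
        "Classification": "hive-site",
        "Properties": {
          "hive.metastore.client.factory.class":
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
      }
    ]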

    Glue ETL

    Glue ETL is a system that automatically processes and transforms your data. You define the transformation through a graphical interface, and Glue carries it out with Apache Spark under the hood, generating Scala or Python code that you can modify. A minimal job skeleton follows the list below.

    • Automatic code generation
    • Scala or Python
    • Encryption
      • Server-side (at rest)
      • SSL (in transit)
    • Can be event-driven
    • Can provision additional “DPUs” (data processing units) to increase the performance of underlying Spark jobs
      • Enabling job metrics can help you understand the maximum DPU capacity you need
    • Errors reported to CloudWatch
      • Could tie into SNS for notification
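
    A minimal sketch of the skeleton Glue generates for a Scala job; the transformation itself is elided, and everything between init and commit is up to you:

    import com.amazonaws.services.glue.GlueContext
    import com.amazonaws.services.glue.util.{GlueArgParser, Job}
    import org.apache.spark.SparkContext
    import scala.collection.JavaConverters._

    object GlueApp {
      def main(sysArgs: Array[String]): Unit = {
        val glueContext = new GlueContext(new SparkContext())
        // Resolve the job name that the Glue job system passes in
        val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
        Job.init(args("JOB_NAME"), glueContext, args.asJava)

        // source -> transformations -> sink go here

        Job.commit()
      }
    }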

    Glue ETL - Functions

    • Transform, clean, and enrich data (before analysis)
      • Generates ETL code in Python or Scala; you can modify the code
      • Can provide your own Spark or PySpark scripts
      • Target can be S3, JDBC (RDS, Redshift), or tables in the Glue Data Catalog (see the sketch after this list)
    • Fully managed, cost-effective, pay only for the resources consumed
    • Jobs are run on a serverless Spark platform
    • Glue Scheduler to schedule the jobs
    • Glue Triggers automate job runs based on “events”
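
    A minimal sketch, assuming the placeholder path "s3://my-bucket/output/", of writing a transformed DynamicFrame to an S3 target as Parquet:

    import com.amazonaws.services.glue.util.JsonOptions

    // Write the transformed DynamicFrame out to S3 in Parquet format
    glueContext.getSinkWithFormat(
        connectionType = "s3",
        options = JsonOptions(Map("path" -> "s3://my-bucket/output/")),
        format = "parquet"
      ).writeDynamicFrame(projectedEvents)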

    Glue ETL - The DynamicFrame

    // Read a table defined in the Glue Data Catalog as a DynamicFrame
    val pushdownEvents = glueContext.getCatalogSource(
        database = "githubarchive_month", tableName = "data"
      ).getDynamicFrame()

    // Remap fields: (source name, source type, target name, target type)
    val projectedEvents = pushdownEvents.applyMapping(Seq(
      ("id", "string", "id", "long"), ("type", "string", "type", "string"),
      ("actor.login", "string", "actor", "string"), ("repo.name", "string", "repo", "string")
      ...
    ))
    • A DynamicFrame is a collection of DynamicRecords
      • DynamicRecords are self-describing, have a schema
      • Very much like a Spark DataFrame, but with more ETL stuff
      • Scala and Python APIs

    Glue ETL - Transformation

    • Bundled Transformations (a chaining sketch follows this list)
      • DropFields, DropNullFields: remove (null) fields
      • Filter: Specify a function to filter records
      • Join: to enrich data
      • Map: Add fields, delete fields, perform external lookups
    • Machine Learning Transformations
      • FindMatches ML: Identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly
    • Format Conversion: CSV, JSON, Avro, Parquet, ORC, XML
    • Apache Spark transformations (e.g., K-Means)
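
    A minimal sketch of chaining bundled transformations on the DynamicFrame built earlier; the dropped field and the filter predicate are illustrative:

    // Drop an unneeded field, then keep only GitHub push events
    val cleaned = projectedEvents
      .dropFields(Seq("repo"))
      .filter(rec => rec.getField("type").contains("PushEvent"))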

    Glue ETL - ResolveChoice

    Deals with ambiguities in a DynamicFrame and returns a new one: for example, when the same field appears with two different types. A sketch follows the list below.

    • make_cols: Creates a new column for each type.
      • price_double, price_string
    • cast: Casts all values to a specified type
    • make_struct: Creates a structure that contains each data type
    • project: Projects every type to a given type, for example, project:string
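
    A minimal sketch, assuming a DynamicFrame dyf whose "price" column holds a mix of string and double values:

    // cast: coerce every "price" value to double
    val resolved = dyf.resolveChoice(specs = Seq(("price", "cast:double")))

    // make_cols: split the ambiguity into price_double and price_string columns
    val split = dyf.resolveChoice(specs = Seq(("price", "make_cols")))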

    Glue ETL: Modifying the Data Catalog

    ETL scripts can update your schema and partitions if necessary (a sketch follows the list below).

    • Adding new partitions
      • Re-run the crawler
      • Have the script use the enableUpdateCatalog and partitionKeys options
    • Updating table schema
      • Re-run the crawler
      • Use enableUpdateCatalog / updateBehavior from script
    • Creating new tables
      • enableUpdateCatalog / updateBehavior with setCatalogInfo
    • Restrictions
      • S3 only
      • JSON, CSV, Avro, and Parquet only
      • Parquet requires special code
      • Nested schemas are not supported
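
    A minimal sketch of a catalog-updating S3 sink, following the documented pattern; the path, database, and table names are placeholders, and "glueparquet" is the special format that Parquet output requires:

    import com.amazonaws.services.glue.util.JsonOptions

    // An S3 sink that also adds partitions / updates the table schema in the Data Catalog
    val sink = glueContext.getSink("s3", JsonOptions(Map(
      "path" -> "s3://my-bucket/output/",
      "enableUpdateCatalog" -> true,
      "updateBehavior" -> "UPDATE_IN_DATABASE",
      "partitionKeys" -> Seq("year", "month"))))
    sink.setCatalogInfo("my_database", "my_table")
    sink.setFormat("glueparquet") // the "special code" Parquet requires
    sink.writeDynamicFrame(projectedEvents)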
