Introduction
- Serverless discovery and definition of table definitions and schemas
  - S3 “Data Lakes”
  - RDS
  - Redshift
  - Athena
  - EMR
  - Most other SQL databases
- Custom ETL jobs
  - Trigger-driven, on a schedule, or on-demand
  - Fully managed
  - Uses Apache Spark under the hood (you don’t need to manage the Spark cluster)
Glue Crawler
- The Glue crawler scans data in S3 and creates a schema
- Can run periodically
- Populates the Glue Data Catalog
  - Stores only the table definitions
  - Original data stays in S3
- Once cataloged, you can treat your unstructured data like it’s structured
  - Redshift Spectrum
  - Athena
  - EMR
  - QuickSight
Glue + Hive
- Hive lets you run SQL-like queries from EMR
- The Glue Data Catalog can serve as a Hive “metastore”
- You can also import a Hive metastore into Glue
- Glue catalog ↔ Hive metastore
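As a rough illustration of the catalog acting as a Hive metastore, the sketch below points a Spark session at the Glue Data Catalog and lists tables with SQL. It assumes an EMR cluster (or similar) where the AWS Glue catalog client is installed and the IAM role can read the catalog. On EMR this setting is usually applied at cluster level instead of in code. The database name is borrowed from the DynamicFrame example in these notes, and the object name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name; a sketch, not the only way to wire this up.
object GlueCatalogFromSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("glue-catalog-as-hive-metastore")
      // Ask the Hive metastore client to talk to the Glue Data Catalog.
      .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
      .enableHiveSupport()
      .getOrCreate()

    // Tables defined by a crawler are now visible to Spark SQL / Hive-style queries.
    spark.sql("SHOW TABLES IN githubarchive_month").show()

    spark.stop()
  }
}
```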
Glue ETL
Glue ETL lets you process and transform your data automatically. You define the transformation in a graphical interface, and Glue generates Scala or Python code that runs it on Apache Spark under the hood.
- Automatic code generation
  - Scala or Python
- Encryption
  - Server-side (at rest)
  - SSL (in transit)
- Can be event-driven
- Can provision additional DPUs (data processing units) to increase the performance of the underlying Spark jobs
  - Enabling job metrics can help you understand the maximum capacity in DPUs that you need
- Errors are reported to CloudWatch
  - Could tie into SNS for notification
Glue ETL - Functions
- Transform, clean, and enrich data (before doing analysis)
- Generates ETL code in Python or Scala, which you can modify
- Can provide your own Spark or PySpark scripts
- Target can be S3, JDBC (RDS, Redshift), or the Glue Data Catalog
- Fully managed, cost-effective, pay only for the resources consumed
- Jobs are run on a serverless Spark platform
- Glue Scheduler to schedule the jobs
- Glue Triggers automate job runs based on “events”
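A minimal sketch of what such a job script can look like in Scala, assuming it runs inside a Glue job where the Glue Scala library is on the classpath. It reads a cataloged table and writes it to S3 as Parquet. The database and table names come from the DynamicFrame example in these notes; the S3 path is a hypothetical placeholder.

```scala
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(new SparkContext())

    // Glue passes JOB_NAME when it starts the job run.
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Source: a table registered in the Glue Data Catalog.
    val events = glueContext
      .getCatalogSource(database = "githubarchive_month", tableName = "data")
      .getDynamicFrame()

    // Target: Parquet files in S3 (hypothetical bucket and prefix).
    glueContext
      .getSinkWithFormat(
        connectionType = "s3",
        options = JsonOptions(Map("path" -> "s3://example-bucket/github-events/")),
        format = "parquet")
      .writeDynamicFrame(events)

    // Marks the run as complete (also needed for job bookmarks).
    Job.commit()
  }
}
```

The same skeleton applies whether the run is started by the scheduler, by a trigger, or on demand; only the way the job is launched differs.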
Glue ETL - The DynamicFrame
    val pushdownEvents = glueContext.getCatalogSource(
      database = "githubarchive_month",
      tableName = "data").getDynamicFrame()
    val projectedEvents = pushdownEvents.applyMapping(Seq(
      ("id", "string", "id", "long"),
      ("type", "string", "type", "string"),
      ("actor.login", "string", "actor", "string"),
      ("repo.name", "string", "repo", "string")
      ...
    ))
- A DynamicFrame is a collection of DynamicRecords
- DynamicRecords are self-describing: each record carries its own schema
- Very much like a Spark DataFrame, but with more ETL-oriented features
- Scala and Python APIs
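Because a DynamicFrame is so close to a Spark DataFrame, it can be converted in both directions. The sketch below assumes the Glue Scala library is available and that the frame has a `type` column, as in the mapping above. The object and method names are hypothetical.

```scala
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.sql.DataFrame

// Hypothetical helper, shown only to illustrate the round trip.
object DynamicFrameInterop {
  def countByType(events: DynamicFrame, glueContext: GlueContext): DynamicFrame = {
    // DynamicFrame -> DataFrame: use ordinary Spark SQL operators.
    val df: DataFrame = events.toDF()
    val counts: DataFrame = df.groupBy("type").count()

    // DataFrame -> DynamicFrame: back to the Glue ETL transforms and sinks.
    DynamicFrame(counts, glueContext)
  }
}
```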
Glue ETL - Transformation
- Bundled Transformations
  - DropFields, DropNullFields: remove (null) fields
  - Filter: Specify a function to filter records
  - Join: to enrich data
  - Map: Add fields, delete fields, perform external lookups
- Machine Learning Transformations
  - FindMatches ML: Identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly
- Format Conversion: CSV, JSON, Avro, Parquet, ORC, XML
- Apache Spark Transformations (ex. K-Means)
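Two of the bundled transformations can be sketched in Scala as follows, assuming the Glue Scala library and a frame shaped like `projectedEvents` above (with `type` and `repo` fields). The `"PushEvent"` value and the object and method names are illustrative assumptions.

```scala
import com.amazonaws.services.glue.{DynamicFrame, DynamicRecord}

// Hypothetical helper object for illustration.
object BundledTransformsSketch {
  def pushEventsWithoutRepo(events: DynamicFrame): DynamicFrame = {
    events
      // Filter: keep only records whose "type" field equals "PushEvent".
      // getField returns an Option, so records without the field are dropped.
      .filter((rec: DynamicRecord) => rec.getField("type").exists(_ == "PushEvent"))
      // DropFields: remove a column that is not needed downstream.
      .dropFields(Seq("repo"))
  }
}
```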
Glue ETL - ResolveChoice
Deals with ambiguities in a DynamicFrame and returns a new one. For example, a field that appears under the same name with two different types.
- make_cols: Creates a new column for each type.
  - price_double, price_string
- cast: Casts all values to a specified type
- make_struct: Creates a structure that contains each data type
- project: Projects every type to a given type, for example, project:string
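The actions above might be applied from Scala roughly as follows, assuming the Glue Scala library. The `price` field follows the make_cols example in the list; the object and method names are hypothetical.

```scala
import com.amazonaws.services.glue.{ChoiceOption, DynamicFrame}

// Hypothetical helper object for illustration.
object ResolveChoiceSketch {
  // Resolve one specific field: cast every "price" value to double.
  def priceAsDouble(frame: DynamicFrame): DynamicFrame =
    frame.resolveChoice(specs = Seq(("price", "cast:double")))

  // Resolve every ambiguous field the same way: one column per type,
  // e.g. price_double and price_string.
  def splitAllChoices(frame: DynamicFrame): DynamicFrame =
    frame.resolveChoice(choiceOption = Some(ChoiceOption("make_cols")))
}
```

Per-field specs suit cases where the intended type is known; a single choice option is convenient when you first want to see which types actually occur.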
Glue ETL - Modifying the Data Catalog
ETL scripts can update your schema and partitions if necessary.
- Adding new partitions
  - Re-run the crawler
  - Have the script use the enableUpdateCatalog and partitionKeys options
- Updating table schema
  - Re-run the crawler
  - Use enableUpdateCatalog / updateBehavior from the script
- Creating new tables
  - enableUpdateCatalog / updateBehavior with setCatalogInfo
- Restrictions
  - S3 only
  - JSON, CSV, Avro, Parquet only
  - Parquet requires special code
  - Nested schemas are not supported
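A sketch of a script-side catalog update in Scala, assuming the Glue Scala library and a JSON target in S3. The bucket, database, table, and partition column names are hypothetical placeholders.

```scala
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import com.amazonaws.services.glue.util.JsonOptions

// Hypothetical helper object for illustration.
object CatalogUpdateSketch {
  def writeAndUpdateCatalog(frame: DynamicFrame, glueContext: GlueContext): Unit = {
    val options = JsonOptions(Map(
      "path" -> "s3://example-bucket/output/",   // hypothetical S3 target
      "partitionKeys" -> Seq("year", "month"),   // hypothetical partition columns
      "enableUpdateCatalog" -> true,             // let the job modify the catalog
      "updateBehavior" -> "UPDATE_IN_DATABASE")) // apply schema changes to the table

    val sink = glueContext.getSink("s3", options).withFormat("json")

    // setCatalogInfo names the table to create or update (hypothetical names).
    sink.setCatalogInfo("example_db", "example_table")
    sink.writeDynamicFrame(frame)
  }
}
```

With this in place the job registers new partitions and schema changes as it writes, so a crawler re-run is not needed for that table.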