Glue

    Introduction

    1. Serverless discovery and definition of table definitions and schemas for:
      1. S3 “Data Lakes”
      2. RDS
      3. Redshift
      4. Athena
      5. EMR
      6. Most other SQL databases
    2. Custom ETL jobs
      1. Trigger-driven, on a schedule, or on-demand (see the sketch after this list)
      2. Fully managed
      3. Uses Apache Spark under the hood (no need to manage the Spark cluster)
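
    A minimal sketch of starting a Glue job on demand with the AWS SDK for Java v2, usable from Scala; the job name "my-etl-job" is a placeholder:

    import software.amazon.awssdk.services.glue.GlueClient
    import software.amazon.awssdk.services.glue.model.StartJobRunRequest

    // Kick off a Glue ETL job on demand; Glue provisions the Spark resources for the run
    val glue = GlueClient.create()
    val run = glue.startJobRun(
      StartJobRunRequest.builder().jobName("my-etl-job").build())
    println(run.jobRunId())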

    Glue Crawler

    • The Glue Crawler scans data in S3 and infers a schema
    • Can run periodically or on demand (see the sketch after this list)
    • Populates the Glue Data Catalog
      • Stores only table definitions
      • Original data stays in S3
    • Once cataloged, you can treat your unstructured data like it’s structured
      • Redshift Spectrum
      • Athena
      • EMR
      • Quicksight
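
    A minimal sketch, assuming a crawler named "s3-data-lake-crawler" and a catalog database "data_lake" (both placeholders), of starting a crawl and listing the resulting table definitions via the AWS SDK for Java v2:

    import software.amazon.awssdk.services.glue.GlueClient
    import software.amazon.awssdk.services.glue.model.{GetTablesRequest, StartCrawlerRequest}

    val glue = GlueClient.create()
    // Start the crawler; it scans S3 and writes table definitions into the Data Catalog
    glue.startCrawler(StartCrawlerRequest.builder().name("s3-data-lake-crawler").build())
    // Later, once the crawl has finished, list what was cataloged
    glue.getTables(GetTablesRequest.builder().databaseName("data_lake").build())
      .tableList()
      .forEach(t => println(t.name()))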

    Glue + Hive

    • Hive lets you run SQL-like queries from EMR
    • The Glue Data Catalog can serve as a Hive “metastore” (configuration sketch after this list)
    • You can also import a Hive metastore into Glue
    • Glue catalog ↔ Hive metastore
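
    A sketch of the EMR configuration that points Hive at the Glue Data Catalog; the "hive-site" classification and factory class follow the pattern AWS documents for this integration:

    [
      {
        "Classification": "hive-site",
        "Properties": {
          "hive.metastore.client.factory.class":
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
      }
    ]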

    Glue ETL

    Glue ETL is a system that automatically processes and transforms your data. You define the transformation through a graphical interface, and Glue carries it out with Apache Spark under the hood, generating Scala or Python code that you can modify. A minimal job skeleton follows the list below.

    • Automatic code generation
    • Scala or Python
    • Encryption
      • Server-side (at rest)
      • SSL (in transit)
    • Can be event-driven
    • Can provision additional “DPUs” (data processing units) to increase the performance of underlying Spark jobs
      • Enabling job metrics can help you understand the maximum DPU capacity you need
    • Errors reported to CloudWatch
      • Could tie into SNS for notification
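
    A minimal sketch of the skeleton Glue generates for a Scala job; the transformation itself is elided, and everything between init and commit is up to you:

    import com.amazonaws.services.glue.GlueContext
    import com.amazonaws.services.glue.util.{GlueArgParser, Job}
    import org.apache.spark.SparkContext
    import scala.collection.JavaConverters._

    object GlueApp {
      def main(sysArgs: Array[String]): Unit = {
        val glueContext = new GlueContext(new SparkContext())
        // Resolve the job name that the Glue job system passes in
        val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
        Job.init(args("JOB_NAME"), glueContext, args.asJava)

        // source -> transformations -> sink go here

        Job.commit()
      }
    }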

    Glue ETL - Functions

    • Transform, clean, and enrich data (before analysis)
      • Generates ETL code in Python or Scala; you can modify the code
      • Can provide your own Spark or PySpark scripts
      • Target can be S3, JDBC (RDS, Redshift), or tables in the Glue Data Catalog (see the sketch after this list)
    • Fully managed, cost-effective, pay only for the resources consumed
    • Jobs are run on a serverless Spark platform
    • Glue Scheduler to schedule the jobs
    • Glue Triggers automate job runs based on “events”
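
    A minimal sketch, assuming the placeholder path "s3://my-bucket/output/", of writing a transformed DynamicFrame to an S3 target as Parquet:

    import com.amazonaws.services.glue.util.JsonOptions

    // Write the transformed DynamicFrame out to S3 in Parquet format
    glueContext.getSinkWithFormat(
        connectionType = "s3",
        options = JsonOptions(Map("path" -> "s3://my-bucket/output/")),
        format = "parquet"
      ).writeDynamicFrame(projectedEvents)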

    Glue ETL - The DynamicFrame

    // Read a table defined in the Glue Data Catalog as a DynamicFrame
    val pushdownEvents = glueContext.getCatalogSource(
        database = "githubarchive_month", tableName = "data"
      ).getDynamicFrame()

    // Remap fields: (source name, source type, target name, target type)
    val projectedEvents = pushdownEvents.applyMapping(Seq(
      ("id", "string", "id", "long"), ("type", "string", "type", "string"),
      ("actor.login", "string", "actor", "string"), ("repo.name", "string", "repo", "string")
      ...
    ))
    • A DynamicFrame is a collection of DynamicRecords
      • DynamicRecords are self-describing, have a schema
      • Very much like a Spark DataFrame, but with more ETL stuff
      • Scala and Python APIs

    Glue ETL - Transformation

    • Bundled Transformations (a chaining sketch follows this list)
      • DropFields, DropNullFields: remove (null) fields
      • Filter: Specify a function to filter records
      • Join: to enrich data
      • Map: Add fields, delete fields, perform external lookups
    • Machine Learning Transformations
      • FindMatches ML: Identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly
    • Format Conversion: CSV, JSON, Avro, Parquet, ORC, XML
    • Apache Spark transformations (e.g., K-Means)
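
    A minimal sketch of chaining bundled transformations on the DynamicFrame built earlier; the dropped field and the filter predicate are illustrative:

    // Drop an unneeded field, then keep only GitHub push events
    val cleaned = projectedEvents
      .dropFields(Seq("repo"))
      .filter(rec => rec.getField("type").contains("PushEvent"))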

    Glue ETL - ResolveChoice

    Deals with ambiguities in a DynamicFrame and returns a new one: for example, when the same field appears with two different types. A sketch follows the list below.

    • make_cols: Creates a new column for each type.
      • price_double, price_string
    • cast: Casts all values to a specified type
    • make_struct: Creates a structure that contains each data type
    • project: Projects every type to a given type, for example, project:string
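
    A minimal sketch, assuming a DynamicFrame dyf whose "price" column holds a mix of string and double values:

    // cast: coerce every "price" value to double
    val resolved = dyf.resolveChoice(specs = Seq(("price", "cast:double")))

    // make_cols: split the ambiguity into price_double and price_string columns
    val split = dyf.resolveChoice(specs = Seq(("price", "make_cols")))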

    Glue ETL: Modifying the Data Catalog

    ETL scripts can update your schema and partitions if necessary (a sketch follows the list below).

    • Adding new partitions
      • Re-run the crawler
      • Have the script use the enableUpdateCatalog and partitionKeys options
    • Updating table schema
      • Re-run the crawler
      • Use enableUpdateCatalog / updateBehavior from script
    • Creating new tables
      • enableUpdateCatalog / updateBehavior with setCatalogInfo
    • Restrictions
      • S3 only
      • JSON, CSV, Avro, and Parquet only
      • Parquet requires special code
      • Nested schemas are not supported
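
    A minimal sketch of a catalog-updating S3 sink, following the documented pattern; the path, database, and table names are placeholders, and "glueparquet" is the special format that Parquet output requires:

    import com.amazonaws.services.glue.util.JsonOptions

    // An S3 sink that also adds partitions / updates the table schema in the Data Catalog
    val sink = glueContext.getSink("s3", JsonOptions(Map(
      "path" -> "s3://my-bucket/output/",
      "enableUpdateCatalog" -> true,
      "updateBehavior" -> "UPDATE_IN_DATABASE",
      "partitionKeys" -> Seq("year", "month"))))
    sink.setCatalogInfo("my_database", "my_table")
    sink.setFormat("glueparquet") // the "special code" Parquet requires
    sink.writeDynamicFrame(projectedEvents)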
