AWS Redshift

Cloud/AWS 2022. 6. 5. 17:30

    Overview

    • Fully managed, petabyte-scale data warehouse service
    • Up to 10x better performance than other data warehouses
      • Via machine learning, massively parallel query execution (MPP), and columnar storage
    • Designed for OLAP, not OLTP
    • Cost effective
    • SQL, ODBC, JDBC interfaces
    • Scale up or down on demand
    • Built-in replication & backups
    • Monitoring via CloudWatch/CloudTrail

    Use Cases

    • Accelerate analytics workloads
    • Unified data warehouse & data lake
    • Data warehouse modernization
    • Analyze global sales data
    • Store historical stock trade data
    • Analyze ad impressions & clicks
    • Aggregate gaming data
    • Analyze social trends

    Redshift Architecture

    • A cluster is the core infrastructure component of an AWS Redshift Data Warehouse
    • A cluster is composed of a leader node and compute nodes
    • A cluster can have between 1 and 128 compute nodes, depending on node type
    • Each cluster can contain one or more databases
    • User data is stored on the compute nodes
    • The leader node manages communication between client programs and the cluster, and all communication with the compute nodes; it is the interface between your external clients and the compute nodes under the hood
    • The leader node receives queries from client applications, parses them, and develops execution plans, which are ordered sets of steps to process those queries. It coordinates the parallel execution of those plans with the compute nodes and aggregates the intermediate results from those nodes
    • Each compute node has its own dedicated CPU, memory, and attached disk storage

    Node types

    Dense storage (DS) Node type

    • DS allows you to create a very large data warehouse using hard disk drives (HDDs) for a very low price point.
      • xl: 3 HDDs with a total of 2 TB of magnetic storage
      • 8xl: 24 HDDs with a total of 16 TB of magnetic storage

    Redshift Spectrum

    • Query exabytes of unstructured data in S3 without loading
    • Limitless concurrency
    • Horizontal scaling
    • Separate storage & compute resources
    • Wide variety of data formats: Avro, CSV, Grok, Ion, JSON, ORC, Parquet, RCFile, RegexSerDe, SequenceFile, TextFile, TSV, etc.
    • Supports Gzip and Snappy compression
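
    A minimal sketch of querying data in S3 through Spectrum, assuming an existing Glue Data Catalog; the schema, database, IAM role, and bucket names are all made up:

    -- Register an external schema backed by the Glue Data Catalog
    CREATE EXTERNAL SCHEMA spectrum_schema
    FROM DATA CATALOG
    DATABASE 'my_glue_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Define an external table over Parquet files in S3
    CREATE EXTERNAL TABLE spectrum_schema.sales (
        sale_id   INT,
        amount    DECIMAL(10,2),
        sale_date DATE
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/sales/';

    -- Query the S3 data directly; nothing is loaded into the cluster
    SELECT sale_date, SUM(amount) FROM spectrum_schema.sales GROUP BY sale_date;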

    Redshift Performance

    • Massively Parallel Processing (MPP)
    • Columnar Data Storage
      • Uses a block size of 1 MB, which is more efficient and reduces the number of I/O requests needed for database loading and other operations that are part of query execution
    • Column Compression

    Redshift Durability

    • Replication within cluster
    • Backup to S3: Asynchronously replicated to another region
    • Automated snapshots
    • Failed drives/nodes automatically replaced
    • However - limited to a single availability zone (AZ)

    Scaling Redshift

    • Vertical and horizontal scaling on demand
    • During scaling
      • A new cluster is created while your old one remains available for reads
      • CNAME is flipped to new cluster (a few minutes of downtime)
      • Data moved in parallel to new compute nodes

    Redshift Distribution Styles

    AUTO

    Redshift figures it out based on the size of the data

    EVEN

    • Rows distributed across slices in round-robin fashion
    • Appropriate when a table does not participate in joins, or when there is no clear choice between KEY and ALL distribution

    KEY

    • Rows are distributed according to the values in one column
    • The leader node places matching values on the same node slice, so matching values from the common column are physically stored together
    • Comes in handy if you typically query based on a specific column in your data

    ALL

    • Entire table is copied to every node
    • Ensures that every row is co-located for every join that the table participates in
    • ALL distribution multiplies the storage required by the number of nodes in the cluster
    • Loading, updating, or inserting data takes much longer, because it must be written to every node
    • Appropriate for relatively slow-moving tables: tables that are not updated frequently or extensively
    • Small dimension tables do not benefit significantly from ALL distribution, because the cost of redistributing them is low
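
    A sketch of how each distribution style is declared at table creation time; the table and column names are made up:

    -- KEY: co-locate rows that share a join column
    CREATE TABLE orders (
        order_id    INT,
        customer_id INT,
        total       DECIMAL(10,2)
    )
    DISTSTYLE KEY DISTKEY (customer_id);

    -- ALL: copy a small, slow-moving dimension table to every node
    CREATE TABLE regions (
        region_id INT,
        name      VARCHAR(64)
    )
    DISTSTYLE ALL;

    -- EVEN: round-robin distribution when there is no clear key
    -- (omit DISTSTYLE entirely to get AUTO)
    CREATE TABLE events (
        event_id INT,
        payload  VARCHAR(256)
    )
    DISTSTYLE EVEN;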

    Sort Keys

    • Rows are stored on disk in sorted order based on the column you designate as a sort key
    • Like an index
    • Makes for fast range queries
    • Choosing a sort key: Recency, Filtering, Joins
      • If you frequently join a table, specify the join column as both the sort key and the distribution key; that enables the query optimizer to choose a sort-merge join instead of a slower hash join
    • Single vs. compound vs. interleaved sort keys

    Sort Keys: Single Column

    • Sorts the data using just a single column
    • Useful if you're consistently querying for data within a single filter column
      • For example, if you were querying primarily by date, choose a single column sort key by date

    Sort Keys: Compound

    • Sorts by all the columns listed in the sort key definition, in the order they are listed
    • Most useful when a query's filters and joins use a prefix of the sort key columns
    • Compound is the default sort type
    • Also helps improve compression

    Sort Keys: Interleaved

    • Gives equal weight to each column or subset of columns in the sort key
    • This can be useful if multiple queries use different columns for filters
    • Uses an internal compression scheme for zone map values that enables it to better discriminate among column values
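
    A sketch of declaring each kind of sort key; the tables and columns are made up:

    -- Single-column / compound sort key: best when filters use a prefix
    -- of the listed columns (e.g. always by date, sometimes also by customer)
    CREATE TABLE clicks (
        click_date  DATE,
        customer_id INT,
        url         VARCHAR(256)
    )
    COMPOUND SORTKEY (click_date, customer_id);

    -- Interleaved sort key: equal weight to each column, for workloads
    -- where different queries filter on different columns
    CREATE TABLE impressions (
        ad_id    INT,
        site_id  INT,
        shown_at TIMESTAMP
    )
    INTERLEAVED SORTKEY (ad_id, site_id);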

    Importing/Exporting data

    Copy command

    • Parallelized, efficient
      • Read from multiple data files or multiple data streams simultaneously
    • From S3, EMR, DynamoDB, remote hosts
    • S3 requires a manifest file and IAM role
    • Use COPY to load large amounts of data from outside of Redshift (see the sketch below)
    • If your data is already in Redshift in another table
      • Use INSERT INTO … SELECT
      • Or CREATE TABLE AS
    • COPY can decrypt data as it is loaded from S3
      • Hardware-accelerated SSL is used to keep it fast
    • Gzip, lzop, and bzip2 compression supported to speed it up further
    • Automatic compression option
      • Analyzes the data being loaded and figures out the optimal compression scheme for storing it
    • Special case: narrow tables (lots of rows, few columns)
      • Load with a single COPY transaction if possible
      • Otherwise hidden metadata columns consume too much space
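
    A sketch of a COPY load from S3 with a manifest, an IAM role, and Gzip compression, plus the in-Redshift alternatives; the bucket, role, and table names are made up:

    -- Bulk load gzipped CSV files listed in a manifest file
    COPY orders
    FROM 's3://my-bucket/orders/orders.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
    MANIFEST
    GZIP
    CSV;

    -- If the data is already in Redshift, stay in SQL instead of COPY:
    INSERT INTO orders_archive SELECT * FROM orders WHERE order_date < '2022-01-01';
    CREATE TABLE orders_2021 AS SELECT * FROM orders WHERE order_date < '2022-01-01';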

    Redshift copy grants for cross-region snapshot copies

    • Let’s say you have a KMS-encrypted Redshift cluster and a snapshot of it
    • You want to copy that snapshot to another region for backup
    • In the destination AWS region:
      • Create a KMS key if you don’t have one already
      • Specify a unique name for your snapshot copy grant
      • Specify the KMS key ID for which you’re creating the copy grant
    • In the source AWS region:
      • Enable copying of snapshots to the copy grant you just created

    UNLOAD command

    • Unload from a table into files in S3
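
    A minimal sketch; the query, bucket, and role are made up:

    -- Write query results to S3 as gzipped files, in parallel from all slices
    UNLOAD ('SELECT * FROM orders WHERE order_date >= ''2022-01-01''')
    TO 's3://my-bucket/exports/orders_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
    GZIP
    PARALLEL ON;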

    Enhanced VPC routing

    • Force all of your copy and unload traffic between your cluster and the repositories through your Amazon VPC
    • Otherwise, all of that copy and unload traffic will be routed through the internet
    • You might need to set up VPC endpoints or NAT gateways or Internet gateways within your VPC to make sure that copy and unload can communicate with the repositories

    DBLINK

    • Connect Redshift to PostgreSQL (possibly in RDS)
    • Good way to copy and sync data between PostgreSQL and Redshift
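
    A sketch of the pattern from the PostgreSQL side, which works because Redshift speaks the PostgreSQL wire protocol; the endpoint, database, credentials, and query are all made up:

    -- Run on the PostgreSQL instance (e.g. RDS)
    CREATE EXTENSION dblink;

    -- Query the Redshift cluster remotely and type the result columns
    SELECT *
    FROM dblink(
        'host=my-cluster.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=dev user=awsuser password=secret',
        'SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id')
    AS t(customer_id INT, total_sum DECIMAL(12,2));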

    Redshift Workload Management (WLM)

    • Prioritize short, fast queries vs. long, slow queries
    • Query queues
    • Via console, CLI, or API

    Automatic Workload Management

    • Creates up to 8 queues
    • Default 5 queues with even memory allocation
    • Large queries (e.g. big hash joins) —> concurrency lowered
    • Small queries (e.g. inserts, scans, aggregations) —> concurrency raised
    • Configuring query queues
      • Priority
      • Concurrency scaling mode
      • User groups
      • Query groups
      • Query monitoring rules

    Manual Workload Management

    • One default queue with concurrency level of 5 (5 queries at once)
    • Superuser queue with concurrency level 1
    • Define up to 8 queues, up to concurrency level 50
      • Each can have defined concurrency scaling mode, concurrency level, user groups, query groups, memory, timeout, query monitoring rules
      • Can also enable query queue hopping
        • Timed out queries “hop” to next queue to try again
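
    A sketch of steering a session's queries into a particular queue via its query group; the group name 'fast' is made up and must match a query group defined in the WLM configuration:

    -- Route subsequent queries in this session to the matching queue
    SET query_group TO 'fast';
    SELECT COUNT(*) FROM orders;
    RESET query_group;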

    Short Query Acceleration (SQA)

    • Prioritize short-running queries over longer-running ones
    • Short queries run in a dedicated space, won’t wait in queue behind long queries
    • Can be used in place of WLM queues for short queries
    • Works with:
      • CREATE TABLE AS (CTAS)
      • Read-only queries (SELECT statements)
    • Uses machine learning to predict a query’s execution time
    • Can configure how many seconds is “short”

    VACUUM command

    • Recovers space from deleted rows
    • VACUUM FULL
    • VACUUM DELETE ONLY
    • VACUUM SORT ONLY
    • VACUUM REINDEX
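
    Example invocations (table names are made up):

    VACUUM FULL orders;         -- reclaim space and re-sort; the default behavior
    VACUUM DELETE ONLY orders;  -- reclaim space from deleted rows, skip the sort
    VACUUM SORT ONLY orders;    -- re-sort rows without reclaiming space
    VACUUM REINDEX clicks;      -- re-analyze interleaved sort keys, then run a full vacuum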

    Redshift anti-patterns

    Concurrency Scaling

    • Automatically adds cluster capacity to handle increase in concurrent read queries
    • Support virtually unlimited concurrent users & queries
    • WLM queues manage which queries are sent to the concurrency scaling cluster

    Resizing Redshift Clusters

    Elastic resize

    • Quickly add or remove nodes of the same type (it can change node types, but not without dropping connections - it creates a whole new cluster)
    • Cluster is down for a few minutes
    • Tries to keep connections open across the downtime
    • Limited to doubling or halving for some dc2 and ra3 node types

    Classic resize

    • Change node type and/or number of nodes
    • Cluster is read-only for hours to days

    Snapshot, restore, resize

    • Used to keep cluster available during a classic resize
    • Copy cluster, resize new cluster

    New Features

    RA3 nodes with managed storage

    • Enable independent scaling of compute and storage
    • SSD-based

    Redshift data lake export

    • Unload Redshift query to S3 in Apache Parquet format
    • Parquet is 2x faster to unload and consumes up to 6x less storage
    • Compatible with Redshift Spectrum, Athena, EMR, SageMaker
    • Automatically partitioned
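
    A sketch of the export; the table, bucket, and role are made up:

    -- Unload query results as Parquet, partitioned by a column
    UNLOAD ('SELECT * FROM sales')
    TO 's3://my-bucket/lake/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
    FORMAT AS PARQUET
    PARTITION BY (sale_date);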

    Spatial data types

    • GEOMETRY, GEOGRAPHY
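
    A sketch using a GEOMETRY column with Redshift's spatial functions (ST_Point, ST_DistanceSphere); the table and coordinates are made up:

    CREATE TABLE stores (
        store_id INT,
        location GEOMETRY
    );

    -- Stores within roughly 10 km of a point (distance in meters)
    SELECT store_id
    FROM stores
    WHERE ST_DistanceSphere(location, ST_Point(-122.3, 47.6)) < 10000;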

    Cross-Region Data Sharing

    • Share live data across Redshift clusters without copying
    • Requires new RA3 node type
    • Secure, across regions and across accounts
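
    A sketch of the producer/consumer flow; the share, schema, table, and namespace GUIDs are made up:

    -- On the producer cluster: create and populate a datashare, then grant it
    CREATE DATASHARE sales_share;
    ALTER DATASHARE sales_share ADD SCHEMA public;
    ALTER DATASHARE sales_share ADD TABLE public.sales;
    GRANT USAGE ON DATASHARE sales_share TO NAMESPACE 'consumer-namespace-guid';

    -- On the consumer cluster: surface the share as a database and query it live
    CREATE DATABASE sales_db FROM DATASHARE sales_share OF NAMESPACE 'producer-namespace-guid';
    SELECT COUNT(*) FROM sales_db.public.sales;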

    AQUA

    • Advanced Query Accelerator
    • Available on ra3.4xl, ra3.16xl
    • Pushes reduction and aggregation queries closer to the data
    • Up to 10x faster, no extra cost, no code changes
    • Also benefits from high-bandwidth connection to S3
    • All you have to do is turn it on in your cluster configuration (when using the supported node types)

    Redshift Security Concerns

    • Using a Hardware Security Module (HSM)
      • Must use a client and server certificate to configure a trusted connection between Redshift and the HSM
      • If migrating an unencrypted cluster to an HSM-encrypted cluster, you must create the new encrypted cluster and then move data to it
    • Defining access privileges for user or group
      • Use the GRANT or REVOKE commands in SQL
      • Example: GRANT SELECT ON TABLE foo TO bob;
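
    A couple more examples in the same vein; the group name is made up:

    GRANT SELECT, INSERT ON TABLE foo TO GROUP analysts;
    REVOKE INSERT ON TABLE foo FROM GROUP analysts;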

    Redshift Serverless

    • Automatic scaling and provisioning for your workload
    • Optimizes costs & performance
      • Pay only when in use
    • Uses ML to maintain performance across variable & sporadic workloads
    • Easy spinup of development and test environments
    • Easy ad-hoc business analysis
    • You get back a serverless endpoint, JDBC/ODBC connection, or just query via the console’s query editor

    Getting Started

    • Need an IAM role with this policy
    {
    	"Version": "2012-10-17",
    	"Statement": [
    		{
    			"Effect": "Allow",
    			"Action": "redshift-serverless:*",
    			"Resource": "*"
    		}
    	]
    }
    
    • Define your
      • Database name
      • Admin user credentials
      • VPC
      • Encryption settings: AWS-owned KMS by default
      • Audit logging
    • Can manage snapshots & recovery points after creation

    Resource Scaling in Redshift Serverless

    • Capacity measured in Redshift Processing Units (RPUs)
    • You pay for RPU-hours (billed per second) plus storage
    • Base RPUs
      • You can adjust the base capacity
      • Defaults to AUTO
      • But you can adjust it from 32 to 512 RPUs to improve query performance
    • Max RPUs
      • Can set a usage limit to control costs
      • Or, increase it to improve throughput

    Limitations

    • Not available:
      • Redshift Spectrum
      • Parameter Groups
      • Workload Management
      • AWS Partner Integration
      • Maintenance windows/version tracks
    • No public endpoints (yet)
      • Must access within a VPC

    Monitoring

    • Monitoring views
      • SYS_QUERY_HISTORY
      • SYS_LOAD_HISTORY
      • SYS_SERVERLESS_USAGE
      • … and many more
    • CloudWatch logs
      • Connection & user logs enabled by default
      • Optional user activity log data
      • Under /aws/redshift/serverless
    • CloudWatch Metrics
      • QueriesCompletedPerSecond, QueryDuration, QueriesRunning, etc.
      • Dimensions: DatabaseName, latency (short/medium/long), QueryType, stage
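
    A sketch of pulling recent queries from one of the monitoring views; the columns are assumed from the SYS_QUERY_HISTORY documentation:

    SELECT query_id, query_text, status, elapsed_time
    FROM sys_query_history
    ORDER BY start_time DESC
    LIMIT 20;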

    Reference

    https://aws.amazon.com/redshift/
