AWS Redshift (Cloud/AWS, June 5, 2022)
Overview
- Fully-managed, petabyte scale data warehouse service
- Up to 10x better performance than other data warehouses
- Achieved via machine learning, massively parallel query execution (MPP), and columnar storage
- Designed for OLAP, not OLTP
- Cost effective
- SQL, ODBC, JDBC interfaces
- Scale up or down on demand
- Built-in replication & backups
- Monitoring via CloudWatch/CloudTrail
Use Cases
- Accelerate analytics workloads
- Unified data warehouse & data lake
- Data warehouse modernization
- Analyze global sales data
- Store historical stock trade data
- Analyze ad impressions & clicks
- Aggregate gaming data
- Analyze social trends
Redshift Architecture
- A cluster is the core infrastructure component of an AWS Redshift Data Warehouse
- A cluster is composed of a leader node and one or more compute nodes
- A cluster can have between 1 and 128 compute nodes, depending on node type
- Each cluster can contain one or more databases
- The user data is going to be stored on the compute nodes
- The leader node manages all communication between client programs and the compute nodes
- It is the interface between your external clients and the compute nodes under the hood
- The leader node receives queries from client applications, parses them, and develops execution plans, which are ordered sets of steps to process those queries. It coordinates the parallel execution of those plans across the compute nodes and aggregates the intermediate results from those nodes
- Each compute node has its own dedicated CPU, memory, and attached disk storage
Node types
Dense Storage (DS) node type
- DS lets you create a very large data warehouse using hard disk drives (HDDs) at a very low price point
- xl: 3 HDDs with a total of 2 TB of magnetic storage
- 8xl: 24 HDDs with a total of 16 TB of magnetic storage
Redshift Spectrum
- Query exabytes of unstructured data in S3 without loading
- Limitless concurrency
- Horizontal scaling
- Separate storage & compute resources
- Wide variety of data formats: Avro, CSV, ORC, Parquet, JSON, Grok, Ion, RCFile, RegexSerDe, SequenceFile, TextFile, TSV, etc.
- Support of Gzip and Snappy compression
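A minimal Spectrum sketch: define an external schema backed by the Glue Data Catalog, map a table onto files in S3, and query it without loading. The schema, database, table, bucket, and role ARN below are all hypothetical.

```sql
-- External schema backed by the Glue Data Catalog (names and role ARN are hypothetical)
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table over Parquet files sitting in S3
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id   BIGINT,
    sale_date DATE,
    amount    DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';

-- Query it like any other table; the S3 scan runs in the Spectrum layer, not on your cluster
SELECT sale_date, SUM(amount)
FROM spectrum_schema.sales
GROUP BY sale_date;
```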
Redshift Performance
- Massively Parallel Processing (MPP)
- Columnar Data Storage
- A block size of 1 MB, which is more efficient and further reduces the number of I/O requests needed for database loading and other operations that are part of query execution
- Column Compression
Redshift Durability
- Replication within cluster
- Backup to S3: Asynchronously replicated to another region
- Automated snapshots
- Failed drives/nodes automatically replaced
- However - limited to a single availability zone (AZ)
Scaling Redshift
- Vertical and horizontal scaling on demand
- During scaling
- A new cluster is created while your old one remains available for reads
- CNAME is flipped to new cluster (a few minutes of downtime)
- Data moved in parallel to new compute nodes
Redshift Distribution Styles
AUTO
- Redshift figures it out based on the size of the data
EVEN
- Rows distributed across slices in round-robin
- This is appropriate when a table does not participate in joins or when there is not a clear choice between key distribution or all distribution
KEY
- Rows are distributed according to the values in one designated column
- The leader node places matching values on the same node slice, so matching values from the common column are physically stored together
- Comes in handy if your queries typically join or filter on a specific column in your data
ALL
- Entire table is copied to every node
- Ensures that every row is co-located for every join that the table participates in
- ALL distribution multiplies the storage required by the number of nodes in the cluster
- It takes much longer to load, update, or insert data into multiple tables
- Appropriate for relatively slow-moving tables: tables that are not updated frequently or extensively
- As the cost of redistribution is low, small dimension tables do not benefit significantly from ALL distribution
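The distribution styles above can be sketched in DDL; the table and column names here are hypothetical examples of when each style fits.

```sql
-- KEY: co-locate rows that join or filter on user_id
CREATE TABLE events (
    user_id  BIGINT,
    event_ts TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (user_id);

-- EVEN: round-robin, for tables with no dominant join column
CREATE TABLE raw_logs (
    line VARCHAR(1000)
)
DISTSTYLE EVEN;

-- ALL: full copy on every node, for small, slow-moving dimension tables
CREATE TABLE dim_country (
    country_code CHAR(2),
    country_name VARCHAR(100)
)
DISTSTYLE ALL;
```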
Sort Keys
- Rows are stored on disk in sorted order based on the column you designate as a sort key
- Like an index
- Makes for fast range queries
- Choosing a sort key: Recency, Filtering, Joins
- If you frequently join a table, specify the join column as both the sort key and the distribution key; that enables the query optimizer to choose a sort merge join instead of a slower hash join
- Single vs. Compound vs Interleaved sort keys
Sort Keys: Single Column
- Uses a single column to sort the data
- Useful if you consistently query data within a single filter column
- For example, if you query primarily by date, choose the date column as a single-column sort key
Sort Keys: Compound
- Sorts by all the columns listed in the sort key definition, in the order they are listed
- Most useful when a query's filters and joins use a prefix of the sort key columns
- Compound is the default sort type
- Also helps improve compression
Sort Keys: Interleaved
- Gives equal weight to each column or subset of columns in the sort key
- This can be useful if multiple queries use different columns for filters
- Uses an internal compression scheme for zone map values that enables it to better discriminate among column values
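As a sketch, the two multi-column sort key types differ only in the DDL keyword; the tables and columns below are hypothetical.

```sql
-- COMPOUND: best when filters use a prefix of the listed columns (e.g. date first, then region)
CREATE TABLE sales_compound (
    sale_date DATE,
    region    VARCHAR(20),
    amount    DECIMAL(10,2)
)
COMPOUND SORTKEY (sale_date, region);

-- INTERLEAVED: equal weight per column, for workloads where different queries filter on different columns
CREATE TABLE sales_interleaved (
    sale_date DATE,
    region    VARCHAR(20),
    amount    DECIMAL(10,2)
)
INTERLEAVED SORTKEY (sale_date, region);
```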
Importing/Exporting data
COPY command
- Parallelized, efficient
- Read from multiple data files or multiple data streams simultaneously
- From S3, EMR, DynamoDB, remote hosts
- Loading from S3 requires an IAM role; a manifest file can specify exactly which files to load
- Use COPY to load large amounts of data from outside of Redshift
- If your data is already in Redshift in another table
- Use INSERT INTO … SELECT
- Or Create TABLE AS
- COPY can decrypt data as it is loaded from S3
- Hardware-accelerated SSL used to keep it fast
- Gzip, lzop, and bzip2 compression supported to speed it up further
- Automatic compression option
- Analyzes the data being loaded and figures out the optimal compression scheme for storing it
- Special case: narrow tables (lots of rows, few columns)
- Load with a single COPY transaction if possible
- Otherwise hidden metadata columns consume too much space
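A minimal sketch of the loading options above; the table names, bucket, and role ARN are hypothetical.

```sql
-- Load gzip-compressed CSV files from S3 in parallel
COPY sales
FROM 's3://my-bucket/load/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP
CSV
COMPUPDATE ON;  -- let Redshift analyze the data and pick column compression

-- If the data is already in Redshift, use INSERT INTO ... SELECT or CTAS instead of COPY
INSERT INTO sales_archive
SELECT * FROM sales WHERE sale_date < '2022-01-01';

CREATE TABLE sales_2021 AS
SELECT * FROM sales WHERE sale_date BETWEEN '2021-01-01' AND '2021-12-31';
```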
Redshift copy grants for cross-region snapshot copies
- Let’s say you have a KMS-encrypted Redshift cluster and a snapshot of it
- You want to copy that snapshot to another region for backup
- In the destination AWS region:
- Create a KMS key if you don’t have one already
- Specify a unique name for your snapshot copy grant
- Specify the KMS key ID for which you’re creating the copy grant
- In the source AWS region:
- Enable copying of snapshots to the copy grant you just created
UNLOAD command
- Unload from a table into files in S3
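For example, UNLOAD can write a query result to S3 as partitioned Parquet; the bucket, prefix, and role ARN below are hypothetical.

```sql
-- Unload a query result to S3 as Parquet, partitioned by sale_date
UNLOAD ('SELECT sale_id, sale_date, amount FROM sales')
TO 's3://my-bucket/unload/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET
PARTITION BY (sale_date);
```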
Enhanced VPC routing
- Forces all COPY and UNLOAD traffic between your cluster and data repositories through your Amazon VPC
- Otherwise, that COPY and UNLOAD traffic is routed through the internet
- You might need to set up VPC endpoints, NAT gateways, or internet gateways within your VPC so that COPY and UNLOAD can reach the repositories
DBLINK
- Connect Redshift to PostgreSQL (possibly in RDS)
- Good way to copy and sync data between PostgreSQL and Redshift
Redshift Workload Management (WLM)
- Prioritize short, fast queries vs. long, slow queries
- Query queues
- Via console, CLI, or API
Automatic Workload Management
- Creates up to 8 queues
- Default 5 queues with even memory allocation
- Large queries (e.g. big hash joins) → concurrency lowered
- Small queries (e.g. inserts, scans, aggregations) → concurrency raised
- Configuring query queues
- Priority
- Concurrency scaling mode
- User groups
- Query groups
- Query monitoring rules
Manual Workload Management
- One default queue with concurrency level of 5 (5 queries at once)
- Superuser queue with concurrency level 1
- Define up to 8 queues, up to concurrency level 50
- Each can have defined concurrency scaling mode, concurrency level, user groups, query groups, memory, timeout, query monitoring rules
- Can also enable query queue hopping
- Timed out queries “hop” to next queue to try again
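As a sketch, manual queues are defined through the cluster's wlm_json_configuration parameter; the queue settings below are hypothetical values, not recommendations.

```json
[
  {
    "query_group": ["short_queries"],
    "query_concurrency": 10,
    "memory_percent_to_use": 30,
    "max_execution_time": 30000
  },
  {
    "query_concurrency": 5,
    "memory_percent_to_use": 70
  }
]
```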
Short Query Acceleration (SQA)
- Prioritize short-running queries over longer-running ones
- Short queries run in a dedicated space, won’t wait in queue behind long queries
- Can be used in place of WLM queues for short queries
- Works with:
- CREATE TABLE AS (CTAS)
- Read-only queries (SELECT statements)
- Uses machine learning to predict a query’s execution time
- Can configure how many seconds is “short”
VACUUM command
- Recovers space from deleted rows
- VACUUM FULL
- VACUUM DELETE ONLY
- VACUUM SORT ONLY
- VACUUM REINDEX
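The four variants side by side (the table name is hypothetical):

```sql
VACUUM FULL sales;         -- reclaim space from deleted rows AND re-sort
VACUUM DELETE ONLY sales;  -- reclaim space without re-sorting
VACUUM SORT ONLY sales;    -- re-sort without reclaiming space
VACUUM REINDEX sales;      -- re-analyze interleaved sort keys, then run a full vacuum
```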
Redshift anti-patterns
- Small data sets (use RDS instead)
- OLTP workloads (use RDS or DynamoDB instead)
- Unstructured data (ETL it first, e.g. with EMR)
- BLOB data (store references to files in S3 rather than the files themselves)
Concurrency Scaling
- Automatically adds cluster capacity to handle increase in concurrent read queries
- Support virtually unlimited concurrent users & queries
- WLM queues manage which queries are sent to the concurrency scaling cluster
Resizing Redshift Clusters
Elastic resize
- Quickly add or remove nodes of same type (It can change node types, but not without dropping connections - it creates a whole new cluster)
- Cluster is down for a few minutes
- Tries to keep connections open across the downtime
- Limited to doubling or halving for some dc2 and ra3 node types
Classic resize
- Change node type and/or number of nodes
- Cluster is read-only for hours to days
Snapshot, restore, resize
- Used to keep cluster available during a classic resize
- Copy cluster, resize new cluster
New Features
RA3 nodes with managed storage
- Enable independent scaling of compute and storage
- SSD-based
Redshift data lake export
- Unload Redshift query to S3 in Apache Parquet format
- Parquet is 2x faster to unload and consumes up to 6x less storage
- Compatible with Redshift Spectrum, Athena, EMR, SageMaker
- Automatically partitioned
Spatial data types
- GEOMETRY, GEOGRAPHY
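A hypothetical example of the GEOMETRY type with Redshift's spatial functions; the table, coordinates, and SRID are illustrative only.

```sql
CREATE TABLE stores (
    store_id INT,
    location GEOMETRY
);

-- Store a point (longitude, latitude) with SRID 4326
INSERT INTO stores VALUES (1, ST_GeomFromText('POINT(-122.4 37.8)', 4326));

-- Distance between stored points and a reference point,
-- in the units of the coordinate system
SELECT store_id,
       ST_Distance(location, ST_GeomFromText('POINT(-122.3 37.9)', 4326))
FROM stores;
```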
Cross-Region Data Sharing
- Share live data across Redshift clusters without copying
- Requires new RA3 node type
- Secure, across regions and across accounts
AQUA
- Advanced Query Accelerator
- Available on ra3.4xl, ra3.16xl
- Pushes reduction and aggregation queries closer to the data
- Up to 10x faster, no extra cost, no code changes
- Also benefits from high-bandwidth connection to S3
- All you have to do is turn it on in your cluster configuration (when using the supported node types)
Redshift Security Concerns
- Using a Hardware Security Module (HSM)
- Must use a client and server certificate to configure a trusted connection between Redshift and the HSM
- If migrating an unencrypted cluster to an HSM-encrypted cluster, you must create the new encrypted cluster and then move data to it
- Defining access privileges for user or group
- Use the GRANT or REVOKE commands in SQL
- Example: grant select on table foo to bob;
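A few more hypothetical GRANT/REVOKE statements in the same vein (user, group, schema, and table names are made up):

```sql
GRANT SELECT ON TABLE foo TO bob;                -- read access on one table
GRANT ALL ON SCHEMA analytics TO GROUP analysts; -- all privileges on a schema, to a group
REVOKE INSERT ON TABLE foo FROM bob;             -- take a privilege away
```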
Redshift Serverless
- Automatic scaling and provisioning for your workload
- Optimizes costs & performance
- Pay only when in use
- Uses ML to maintain performance across variable & sporadic workloads
- Easy spinup of development and test environments
- Easy ad-hoc business analysis
- You get back a serverless endpoint, JDBC/ODBC connection, or just query via the console’s query editor
Getting Started
- Need an IAM role with this policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "redshift-serverless:*",
      "Resource": "*"
    }
  ]
}
- Define your
- Database name
- Admin user credentials
- VPC
- Encryption settings: AWS-owned KMS by default
- Audit logging
- Can manage snapshots & recovery points after creation
Resource Scaling in Redshift Serverless
- Capacity measured in Redshift Processing Units (RPUs)
- You pay for RPU-hours (billed per second) plus storage
- Base RPUs
- You can adjust base capacity
- Defaults to AUTO
- But you can adjust from 32 to 512 RPUs to improve query performance
- Max RPUs
- Can set a usage limit to control costs
- Or, increase it to improve throughput
Limitation
- Unavailable features:
- Redshift Spectrum
- Parameter Groups
- Workload Management
- AWS Partner Integration
- Maintenance windows/version tracks
- No public endpoints (yet)
- Must access within a VPC
Monitoring
- Monitoring views
- SYS_QUERY_HISTORY
- SYS_LOAD_HISTORY
- SYS_SERVERLESS_USAGE
- … and many more
- CloudWatch logs
- Connection & user logs enabled by default
- Optional user activity log data
- Under /aws/redshift/serverless
- CloudWatch Metrics
- QueriesCompletedPerSecond, QueryDuration, QueriesRunning, etc.
- Dimensions: DatabaseName, latency (short/medium/long), QueryType, stage
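For instance, the monitoring views can be queried with plain SQL; this sketch pulls the most recent queries from SYS_QUERY_HISTORY.

```sql
-- Ten most recent queries with their runtimes (elapsed_time is in microseconds)
SELECT query_id, query_text, elapsed_time
FROM sys_query_history
ORDER BY start_time DESC
LIMIT 10;
```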