AWS Redshift (Cloud/AWS, June 5, 2022)
Overview
- Fully-managed, petabyte scale data warehouse service
- Up to 10x better performance than other data warehouses
- Achieved via machine learning, massively parallel query execution (MPP), and columnar storage
- Designed for OLAP, not OLTP
- Cost effective
- SQL, ODBC, JDBC interfaces
- Scale up or down on demand
- Built-in replication & backups
- Monitoring via CloudWatch/CloudTrail
Use Cases
- Accelerate analytics workloads
- Unified data warehouse & data lake
- Data warehouse modernization
- Analyze global sales data
- Store historical stock trade data
- Analyze ad impressions & clicks
- Aggregate gaming data
- Analyze social trends
Redshift Architecture
- A cluster is the core infrastructure component of an AWS Redshift Data Warehouse
- A cluster is composed of a leader node and one or more compute nodes
- A cluster can have between 1 and 128 compute nodes, depending on node type
- Each cluster can contain one or more databases
- The user data is going to be stored on the compute nodes
- The leader node manages all communication between client programs and the compute nodes
- It is the interface between your external clients and the compute nodes under the hood
- The leader node receives queries from client applications, parses them, and develops execution plans, which are ordered sets of steps to process those queries. It coordinates the parallel execution of those plans across the compute nodes and aggregates the intermediate results from those nodes
- Each compute node has its own dedicated CPU, memory, and attached disk storage
Node types
Dense Storage (DS) node type
- DS lets you create a very large data warehouse using hard disk drives (HDDs) at a very low price point
- xl: 3 HDDs with a total of 2 TB of magnetic storage
- 8xl: 24 HDDs with a total of 16 TB of magnetic storage
Redshift Spectrum
- Query exabytes of unstructured data in S3 without loading
- Limitless concurrency
- Horizontal scaling
- Separate storage & compute resources
- Wide variety of data formats: Avro, CSV, ORC, Parquet, JSON, Grok, Ion, RCFile, RegexSerDe, SequenceFile, TextFile, TSV, etc.
- Support of Gzip and Snappy compression
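A minimal Spectrum sketch: define an external schema backed by the Glue Data Catalog, map a table onto files in S3, and query it without loading. The schema, database, table, bucket, and role ARN below are all hypothetical.

```sql
-- External schema backed by the Glue Data Catalog (names and role ARN are hypothetical)
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table over Parquet files sitting in S3
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id   BIGINT,
    sale_date DATE,
    amount    DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';

-- Query it like any other table; the S3 scan runs in the Spectrum layer, not on your cluster
SELECT sale_date, SUM(amount)
FROM spectrum_schema.sales
GROUP BY sale_date;
```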
Redshift Performance
- Massively Parallel Processing (MPP)
- Columnar Data Storage
- A block size of 1 MB, which is more efficient and further reduces the number of I/O requests needed for database loading and other operations that are part of query execution
- Column Compression
Redshift Durability
- Replication within cluster
- Backup to S3: Asynchronously replicated to another region
- Automated snapshots
- Failed drives/nodes automatically replaced
- However - limited to a single availability zone (AZ)
Scaling Redshift
- Vertical and horizontal scaling on demand
- During scaling
- A new cluster is created while your old one remains available for reads
- CNAME is flipped to new cluster (a few minutes of downtime)
- Data moved in parallel to new compute nodes
Redshift Distribution Styles
AUTO
- Redshift figures it out based on the size of the data
EVEN
- Rows distributed across slices in round-robin
- This is appropriate when a table does not participate in joins or when there is not a clear choice between key distribution or all distribution
KEY
- Rows are distributed according to the values in one designated column
- The leader node places matching values on the same node slice, so matching values from the common column are physically stored together
- Comes in handy if your queries typically join or filter on a specific column in your data
ALL
- Entire table is copied to every node
- Ensures that every row is co-located for every join that the table participates in
- ALL distribution multiplies the storage required by the number of nodes in the cluster
- It takes much longer to load, update, or insert data into multiple tables
- Appropriate for relatively slow-moving tables: tables that are not updated frequently or extensively
- As the cost of redistribution is low, small dimension tables do not benefit significantly from ALL distribution
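The distribution styles above can be sketched in DDL; the table and column names here are hypothetical examples of when each style fits.

```sql
-- KEY: co-locate rows that join or filter on user_id
CREATE TABLE events (
    user_id  BIGINT,
    event_ts TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (user_id);

-- EVEN: round-robin, for tables with no dominant join column
CREATE TABLE raw_logs (
    line VARCHAR(1000)
)
DISTSTYLE EVEN;

-- ALL: full copy on every node, for small, slow-moving dimension tables
CREATE TABLE dim_country (
    country_code CHAR(2),
    country_name VARCHAR(100)
)
DISTSTYLE ALL;
```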
Sort Keys
- Rows are stored on disk in sorted order based on the column you designate as a sort key
- Like an index
- Makes for fast range queries
- Choosing a sort key: Recency, Filtering, Joins
- If you frequently join a table, specify the join column as both the sort key and the distribution key; that enables the query optimizer to choose a sort merge join instead of a slower hash join
- Single vs. Compound vs Interleaved sort keys
Sort Keys: Single Column
- Uses a single column to sort the data
- Useful if you consistently query data within a single filter column
- For example, if you query primarily by date, choose the date column as a single-column sort key
Sort Keys: Compound
- Sorts by all the columns listed in the sort key definition, in the order they are listed
- Most useful when a query's filters and joins use a prefix of the sort key columns
- Compound is the default sort type
- Also helps improve compression
Sort Keys: Interleaved
- Gives equal weight to each column or subset of columns in the sort key
- This can be useful if multiple queries use different columns for filters
- Uses an internal compression scheme for zone map values that enables it to better discriminate among column values
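As a sketch, the two multi-column sort key types differ only in the DDL keyword; the tables and columns below are hypothetical.

```sql
-- COMPOUND: best when filters use a prefix of the listed columns (e.g. date first, then region)
CREATE TABLE sales_compound (
    sale_date DATE,
    region    VARCHAR(20),
    amount    DECIMAL(10,2)
)
COMPOUND SORTKEY (sale_date, region);

-- INTERLEAVED: equal weight per column, for workloads where different queries filter on different columns
CREATE TABLE sales_interleaved (
    sale_date DATE,
    region    VARCHAR(20),
    amount    DECIMAL(10,2)
)
INTERLEAVED SORTKEY (sale_date, region);
```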
Importing/Exporting data
COPY command
- Parallelized, efficient
- Read from multiple data files or multiple data streams simultaneously
- From S3, EMR, DynamoDB, remote hosts
- Loading from S3 requires an IAM role; a manifest file can specify exactly which files to load
- Use COPY to load large amounts of data from outside of Redshift
- If your data is already in Redshift in another table
- Use INSERT INTO … SELECT
- Or Create TABLE AS
- COPY can decrypt data as it is loaded from S3
- Hardware-accelerated SSL used to keep it fast
- Gzip, lzop, and bzip2 compression supported to speed it up further
- Automatic compression option
- Analyzes the data being loaded and figures out the optimal compression scheme for storing it
- Special case: narrow tables (lots of rows, few columns)
- Load with a single COPY transaction if possible
- Otherwise hidden metadata columns consume too much space
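A minimal sketch of the loading options above; the table names, bucket, and role ARN are hypothetical.

```sql
-- Load gzip-compressed CSV files from S3 in parallel
COPY sales
FROM 's3://my-bucket/load/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP
CSV
COMPUPDATE ON;  -- let Redshift analyze the data and pick column compression

-- If the data is already in Redshift, use INSERT INTO ... SELECT or CTAS instead of COPY
INSERT INTO sales_archive
SELECT * FROM sales WHERE sale_date < '2022-01-01';

CREATE TABLE sales_2021 AS
SELECT * FROM sales WHERE sale_date BETWEEN '2021-01-01' AND '2021-12-31';
```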
Redshift copy grants for cross-region snapshot copies
- Let’s say you have a KMS-encrypted Redshift cluster and a snapshot of it
- You want to copy that snapshot to another region for backup
- In the destination AWS region:
- Create a KMS key if you don’t have one already
- Specify a unique name for your snapshot copy grant
- Specify the KMS key ID for which you’re creating the copy grant
- In the source AWS region:
- Enable copying of snapshots to the copy grant you just created
UNLOAD command
- Unload from a table into files in S3
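For example, UNLOAD can write a query result to S3 as partitioned Parquet; the bucket, prefix, and role ARN below are hypothetical.

```sql
-- Unload a query result to S3 as Parquet, partitioned by sale_date
UNLOAD ('SELECT sale_id, sale_date, amount FROM sales')
TO 's3://my-bucket/unload/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET
PARTITION BY (sale_date);
```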
Enhanced VPC routing
- Forces all COPY and UNLOAD traffic between your cluster and data repositories through your Amazon VPC
- Otherwise, that COPY and UNLOAD traffic is routed through the internet
- You might need to set up VPC endpoints, NAT gateways, or internet gateways within your VPC so that COPY and UNLOAD can reach the repositories
DBLINK
- Connect Redshift to PostgreSQL (possibly in RDS)
- Good way to copy and sync data between PostgreSQL and Redshift
Redshift Workload Management (WLM)
- Prioritize short, fast queries vs. long, slow queries
- Query queues
- Via console, CLI, or API
Automatic Workload Management
- Creates up to 8 queues
- Default 5 queues with even memory allocation
- Large queries (e.g. big hash joins) → concurrency lowered
- Small queries (e.g. inserts, scans, aggregations) → concurrency raised
- Configuring query queues
- Priority
- Concurrency scaling mode
- User groups
- Query groups
- Query monitoring rules
Manual Workload Management
- One default queue with concurrency level of 5 (5 queries at once)
- Superuser queue with concurrency level 1
- Define up to 8 queues, up to concurrency level 50
- Each can have defined concurrency scaling mode, concurrency level, user groups, query groups, memory, timeout, query monitoring rules
- Can also enable query queue hopping
- Timed out queries “hop” to next queue to try again
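As a sketch, manual queues are defined through the cluster's wlm_json_configuration parameter; the queue settings below are hypothetical values, not recommendations.

```json
[
  {
    "query_group": ["short_queries"],
    "query_concurrency": 10,
    "memory_percent_to_use": 30,
    "max_execution_time": 30000
  },
  {
    "query_concurrency": 5,
    "memory_percent_to_use": 70
  }
]
```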
Short Query Acceleration (SQA)
- Prioritize short-running queries over longer-running ones
- Short queries run in a dedicated space, won’t wait in queue behind long queries
- Can be used in place of WLM queues for short queries
- Works with:
- CREATE TABLE AS (CTAS)
- Read-only queries (SELECT statements)
- Uses machine learning to predict a query’s execution time
- Can configure how many seconds is “short”
VACUUM command
- Recovers space from deleted rows
- VACUUM FULL
- VACUUM DELETE ONLY
- VACUUM SORT ONLY
- VACUUM REINDEX
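The four variants side by side (the table name is hypothetical):

```sql
VACUUM FULL sales;         -- reclaim space from deleted rows AND re-sort
VACUUM DELETE ONLY sales;  -- reclaim space without re-sorting
VACUUM SORT ONLY sales;    -- re-sort without reclaiming space
VACUUM REINDEX sales;      -- re-analyze interleaved sort keys, then run a full vacuum
```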
Redshift anti-patterns
- Small data sets (use RDS instead)
- OLTP workloads (use RDS or DynamoDB instead)
- Unstructured data (ETL it first, e.g. with EMR)
- BLOB data (store references to files in S3 rather than the files themselves)
Concurrency Scaling
- Automatically adds cluster capacity to handle increase in concurrent read queries
- Support virtually unlimited concurrent users & queries
- WLM queues manage which queries are sent to the concurrency scaling cluster
Resizing Redshift Clusters
Elastic resize
- Quickly add or remove nodes of same type (It can change node types, but not without dropping connections - it creates a whole new cluster)
- Cluster is down for a few minutes
- Tries to keep connections open across the downtime
- Limited to doubling or halving for some dc2 and ra3 node types
Classic resize
- Change node type and/or number of nodes
- Cluster is read-only for hours to days
Snapshot, restore, resize
- Used to keep cluster available during a classic resize
- Copy cluster, resize new cluster
New Features
RA3 nodes with managed storage
- Enable independent scaling of compute and storage
- SSD-based
Redshift data lake export
- Unload Redshift query to S3 in Apache Parquet format
- Parquet is 2x faster to unload and consumes up to 6x less storage
- Compatible with Redshift Spectrum, Athena, EMR, SageMaker
- Automatically partitioned
Spatial data types
- GEOMETRY, GEOGRAPHY
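A hypothetical example of the GEOMETRY type with Redshift's spatial functions; the table, coordinates, and SRID are illustrative only.

```sql
CREATE TABLE stores (
    store_id INT,
    location GEOMETRY
);

-- Store a point (longitude, latitude) with SRID 4326
INSERT INTO stores VALUES (1, ST_GeomFromText('POINT(-122.4 37.8)', 4326));

-- Distance between stored points and a reference point,
-- in the units of the coordinate system
SELECT store_id,
       ST_Distance(location, ST_GeomFromText('POINT(-122.3 37.9)', 4326))
FROM stores;
```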
Cross-Region Data Sharing
- Share live data across Redshift clusters without copying
- Requires new RA3 node type
- Secure, across regions and across accounts
AQUA
- Advanced Query Accelerator
- Available on ra3.4xl, ra3.16xl
- Pushes reduction and aggregation queries closer to the data
- Up to 10x faster, no extra cost, no code changes
- Also benefits from high-bandwidth connection to S3
- All you have to do is turn it on in your cluster configuration (when using the supported node types)
Redshift Security Concerns
- Using a Hardware Security Module (HSM)
- Must use a client and server certificate to configure a trusted connection between Redshift and the HSM
- If migrating an unencrypted cluster to an HSM-encrypted cluster, you must create the new encrypted cluster and then move data to it
- Defining access privileges for user or group
- Use the GRANT or REVOKE commands in SQL
- Example: grant select on table foo to bob;
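A few more hypothetical GRANT/REVOKE statements in the same vein (user, group, schema, and table names are made up):

```sql
GRANT SELECT ON TABLE foo TO bob;                -- read access on one table
GRANT ALL ON SCHEMA analytics TO GROUP analysts; -- all privileges on a schema, to a group
REVOKE INSERT ON TABLE foo FROM bob;             -- take a privilege away
```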
Redshift Serverless
- Automatic scaling and provisioning for your workload
- Optimizes costs & performance
- Pay only when in use
- Uses ML to maintain performance across variable & sporadic workloads
- Easy spinup of development and test environments
- Easy ad-hoc business analysis
- You get back a serverless endpoint, JDBC/ODBC connection, or just query via the console’s query editor
Getting Started
- Need an IAM role with this policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "redshift-serverless:*",
      "Resource": "*"
    }
  ]
}
- Define your
- Database name
- Admin user credentials
- VPC
- Encryption settings: AWS-owned KMS by default
- Audit logging
- Can manage snapshots & recovery points after creation
Resource Scaling in Redshift Serverless
- Capacity measured in Redshift Processing Units (RPUs)
- You pay for RPU-hours (billed per second) plus storage
- Base RPUs
- You can adjust base capacity
- Defaults to AUTO
- But you can adjust from 32 to 512 RPUs to improve query performance
- Max RPUs
- Can set a usage limit to control costs
- Or, increase it to improve throughput
Limitation
- Unavailable features:
- Redshift Spectrum
- Parameter Groups
- Workload Management
- AWS Partner Integration
- Maintenance windows/version tracks
- No public endpoints (yet)
- Must access within a VPC
Monitoring
- Monitoring views
- SYS_QUERY_HISTORY
- SYS_LOAD_HISTORY
- SYS_SERVERLESS_USAGE
- … and many more
- CloudWatch logs
- Connection & user logs enabled by default
- Optional user activity log data
- Under /aws/redshift/serverless
- CloudWatch Metrics
- QueriesCompletedPerSecond, QueryDuration, QueriesRunning, etc.
- Dimensions: DatabaseName, latency (short/medium/long), QueryType, stage
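For instance, the monitoring views can be queried with plain SQL; this sketch pulls the most recent queries from SYS_QUERY_HISTORY.

```sql
-- Ten most recent queries with their runtimes (elapsed_time is in microseconds)
SELECT query_id, query_text, elapsed_time
FROM sys_query_history
ORDER BY start_time DESC
LIMIT 10;
```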