Tuesday, 3 January 2023

Amazon Redshift Overview

Amazon Redshift is a data warehouse product which forms part of the larger cloud-computing platform Amazon Web Services.
- Redshift is based on PostgreSQL, but it's not used for OLTP. 
- It's OLAP - Online Analytical Processing (analysis and data warehousing).
- Built on top of technology from the massively parallel processing (MPP) data warehouse company ParAccel, to handle large-scale data sets and database migrations.
- AWS claims up to 10x better performance than other data warehouses, and it scales to petabytes of data.
- Columnar storage of data (instead of row based data).
- Pay as you go based on the instances provisioned.
- You can run queries through a standard SQL interface.
- BI tools such as Amazon QuickSight or Tableau integrate with it.
- How to load data into Redshift?
Data is loaded from Amazon S3 (using the COPY command), Kinesis Firehose (to load data in near real time), DynamoDB, DMS (Database Migration Service)...
- Capacity depends on the node type: a cluster can have up to 100+ nodes, and each node can have up to 16 TB of storage space.
- Can provision multiple nodes, with Multi-AZ only for some clusters.
- There are two types of nodes in Redshift:
- Leader Node : For query planning and aggregating results from the compute nodes.
- Compute Node : For executing the queries and sending the results to the leader node.
- Because Redshift is a managed service, we get Backup & Restore, security (VPC, IAM for accessing the cluster, KMS for encryption), and monitoring using CloudWatch.
- Redshift Enhanced VPC routing: COPY / UNLOAD goes through VPC (for better performance and lower cost).
- Redshift is provisioned, so it's worth it when you have a sustained usage (use Athena if the queries are sporadic instead).
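
To make the S3 loading path mentioned above more concrete, here is a minimal sketch that only builds the text of a COPY statement. The table name, bucket, and IAM role ARN are placeholders invented for illustration, and the helper `build_copy_statement` is hypothetical, not part of any AWS library. You would run the resulting SQL through whatever SQL client you already use against the cluster.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str,
                         file_format: str = "CSV") -> str:
    """Return a Redshift COPY statement that loads `table` from `s3_uri`.

    Hypothetical helper for illustration only. The values are interpolated
    directly, so pass trusted identifiers, never raw user input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {file_format};"
    )


# All names below are placeholders, not real resources.
copy_sql = build_copy_statement(
    table="sales",
    s3_uri="s3://example-bucket/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(copy_sql)
```

The IAM role is what lets the cluster read from the bucket, so it has to be attached to the cluster and allowed to read the S3 prefix.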

Redshift - Snapshot & DR

- Snapshots are point-in-time backups of your cluster, stored internally in S3.
- Snapshots are incremental (only what has changed in your Redshift cluster is saved).
- You restore a snapshot into a new cluster (you have to create a new cluster to restore the data).
- There are two types of snapshots :
- Automated Snapshot : Taken every 8 hours, every 5 GB of data change in the cluster, or on a schedule you set. You also set a retention period (for example, 30 days, after which the snapshot is deleted automatically).
- Manual Snapshot : On-demand snapshot, retained until you delete it.
This is very similar to how RDS works. One nice extra in Redshift: you can configure it to automatically copy snapshots (automated or manual) of a cluster to another AWS Region, which is very useful for setting up disaster recovery for your Redshift cluster.

Cross-Region Snapshot Copy for a KMS-Encrypted Redshift Cluster

How do you copy a KMS-encrypted Redshift snapshot to another region?

Suppose the snapshot lives in the source region, encrypted with KMS key A, and you want to copy it to a destination region where it will be encrypted with KMS key B. First, create a Redshift "snapshot copy grant" in the destination region. The grant allows the Redshift service to perform encryption operations with KMS key B there. Once the grant exists, you can copy the snapshot from the source region to the destination region, and Redshift encrypts the copy with the correct KMS key. The key point is that the snapshot copy grant has to be created before the copy can happen.
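
The two steps described above can be sketched with boto3. This is a minimal sketch under stated assumptions, not a complete setup: the cluster identifier, regions, grant name, and key alias are placeholders, and the helper names `copy_grant_requests` and `enable_encrypted_cross_region_copy` are hypothetical. The boto3 calls used are `create_snapshot_copy_grant` and `enable_snapshot_copy` on the Redshift client.

```python
def copy_grant_requests(cluster_id: str, dest_region: str, dest_kms_key_id: str,
                        grant_name: str, retention_days: int = 7):
    """Build the parameters for the two Redshift API calls (no AWS access needed)."""
    # Step 1: the grant lets Redshift use KMS key B in the destination region.
    grant_request = {
        "SnapshotCopyGrantName": grant_name,
        "KmsKeyId": dest_kms_key_id,
    }
    # Step 2: turn on cross-region copy for the source cluster, naming the grant.
    enable_request = {
        "ClusterIdentifier": cluster_id,
        "DestinationRegion": dest_region,
        "RetentionPeriod": retention_days,
        "SnapshotCopyGrantName": grant_name,
    }
    return grant_request, enable_request


def enable_encrypted_cross_region_copy(source_region: str, cluster_id: str,
                                       dest_region: str, dest_kms_key_id: str,
                                       grant_name: str) -> None:
    """Send both requests to AWS. Requires boto3 and valid credentials."""
    import boto3  # imported here so the request-building part runs without boto3

    grant_request, enable_request = copy_grant_requests(
        cluster_id, dest_region, dest_kms_key_id, grant_name)
    # The grant is created in the DESTINATION region, where key B lives.
    boto3.client("redshift", region_name=dest_region) \
        .create_snapshot_copy_grant(**grant_request)
    # Snapshot copy is enabled on the cluster in the SOURCE region.
    boto3.client("redshift", region_name=source_region) \
        .enable_snapshot_copy(**enable_request)


# Placeholder values for illustration only.
grant_request, enable_request = copy_grant_requests(
    cluster_id="example-cluster",
    dest_region="eu-west-1",
    dest_kms_key_id="alias/example-key-b",
    grant_name="example-copy-grant",
)
print(grant_request)
print(enable_request)
```

Note the order: the grant request goes to the destination region first, and only then is the copy enabled on the source cluster.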

Redshift Spectrum

- Query data that is already in S3 without loading it.
- Must have a Redshift cluster available to start the query.
- The query is then submitted to thousands of Redshift Spectrum nodes.
How does that work?
Our existing Redshift cluster has a leader node and a number of compute nodes. When we run a query against data in Amazon S3, Redshift spins up many Redshift Spectrum nodes, which scan the data set in S3 and do the computation. Once their results are ready, they are sent back to the compute nodes for aggregation and then passed up to the leader node.
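
As a rough illustration of what querying S3 through Spectrum looks like from the SQL side, the sketch below builds two statements: one that registers an external schema backed by a data catalog database, and one that queries an external table in it. The schema, database, table, column, and role names are all placeholders, the helper `spectrum_setup_sql` is hypothetical, and the external table `sales` is assumed to already be defined in the catalog.

```python
def spectrum_setup_sql(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """Return a CREATE EXTERNAL SCHEMA statement for Redshift Spectrum.

    Hypothetical helper; values are interpolated directly, so only pass
    trusted identifiers.
    """
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


# Placeholder names for illustration only.
setup_sql = spectrum_setup_sql(
    schema="spectrum_schema",
    catalog_db="example_catalog_db",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleSpectrumRole",
)

# The table is referenced like any other; its data stays in S3 and is
# scanned by the Spectrum nodes rather than loaded into the cluster.
query_sql = (
    "SELECT region, COUNT(*) AS orders\n"
    "FROM spectrum_schema.sales\n"
    "GROUP BY region;"
)

print(setup_sql)
print(query_sql)
```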

Redshift WorkLoad Management (WLM)

- It enables you to flexibly manage query priorities within workloads.
- A typical use case is preventing short, fast queries from getting stuck behind long-running queries.
- You can define multiple query queues (a superuser queue, user-defined queues, and so on), and then
- Route each query to the proper queue at runtime.
Let's take an example with three queues in Amazon Redshift: the Super User Queue, the short-running queue, and the long-running queue. The names are fairly explicit. Say we have two kinds of users: Admin and User.


The admin wants their queries done as soon as possible. These are system queries, so they go directly into the Super User Queue, which has priority.
The user may have some short-running queries, and we want these in the short-running queue. That way we are sure the queue only holds short-running queries.
But if the user submits a longer-running query that is going to take a lot of time, we should send it directly to the long-running queue so that it does not block the short-running queries.
So we have two kinds of workload management:
- Automatic WLM : where the queues and resources are managed by Redshift.
- Manual WLM : where the queues and resources are managed by you (i.e. the user).
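
The routing idea in the example above can be sketched as a toy model. This is not Redshift's API: Redshift assigns queries to queues through its WLM configuration (for example by user group or query group), whereas the `Query` class, the `route` function, and the 10-second threshold here are all invented purely to illustrate the three-queue example.

```python
from dataclasses import dataclass

# Hypothetical cutoff between "short" and "long" queries, for illustration.
SHORT_QUERY_LIMIT_SECONDS = 10.0


@dataclass
class Query:
    user_role: str            # "admin" or "user"
    estimated_seconds: float  # assumed runtime estimate for the query


def route(query: Query) -> str:
    """Pick a queue name for a query, mirroring the three-queue example."""
    if query.user_role == "admin":
        # Admin system queries get priority in the superuser queue.
        return "superuser"
    if query.estimated_seconds <= SHORT_QUERY_LIMIT_SECONDS:
        return "short_running"
    # Long queries are kept apart so they cannot block the short ones.
    return "long_running"


for q in (Query("admin", 2.0), Query("user", 3.0), Query("user", 900.0)):
    print(q.user_role, q.estimated_seconds, "->", route(q))
```

Keeping long queries in their own queue is the whole point of the design: the short-running queue stays responsive no matter how much heavy work is in flight.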

Redshift Concurrency Scaling Cluster

- It enables you to provide consistently fast performance with a virtually unlimited number of concurrent users and queries.
- Redshift automatically adds cluster capacity on the fly. This additional capacity is called a Concurrency Scaling Cluster, and it lets you process an increase in requests.
- You can decide which queries are sent to the Concurrency Scaling Cluster using WLM.

Say we have a normal Redshift cluster with a few nodes, which is enough for the number of users we normally have. If tomorrow many more users start submitting queries to Amazon Redshift, concurrency scaling automatically adds cluster capacity to accommodate them.
We can decide which queries get sent to the concurrency scaling cluster using the WLM feature discussed in the previous section. This feature is charged per second.
