Monday 30 January 2023

Amazon Athena Overview

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL.
Use cases: business intelligence / analytics / reporting; analyzing and querying VPC Flow Logs, ELB logs, CloudTrail logs, etc.
- Serverless query service to analyze data stored in Amazon S3.
- Uses standard SQL to query files (built on Presto) - see the query sketch below.
- Supports CSV, JSON, ORC, Avro, and Parquet.
- Pricing: $5.00 per TB of data scanned.
- Commonly used with Amazon QuickSight for reporting/dashboards.
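
As a minimal sketch of what querying Athena from code looks like (using boto3, with hypothetical database, table, and results-bucket names), the client submits the SQL, polls until the asynchronous query finishes, then reads the results:

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical names: adjust database, table, and output bucket to your setup.
resp = athena.start_query_execution(
    QueryString="SELECT elb_status_code, COUNT(*) AS hits "
                "FROM elb_logs GROUP BY elb_status_code ORDER BY hits DESC",
    QueryExecutionContext={"Database": "my_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
query_id = resp["QueryExecutionId"]

# Athena is asynchronous: poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])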


Amazon Athena : Performance Improvement

- Use Columnar Data for cost savings (by scanning less data).
- Apache Parquet or ORC is recommended.
- Huge performance improvement.
- Use Glue to convert your data to Parquet or ORC format (an Athena-side alternative is sketched after this list).
- Compress data for smaller retrievals (bzip2, gzip, lz4, snappy, zlib...)
- Partition datasets in S3 for easy querying on virtual columns
- s3://your_bucket_name/path_to_table
   /[PARTITION_COLUMN_NAME]=[VALUE]
   /[PARTITION_COLUMN_NAME]=[VALUE]
   etc.

- Example s3://athenabucket/flight/parquet/year=1999/month=1/day=1/
- Use larger files (>128MB) to minimize overhead.
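
The notes above mention Glue for format conversion; as an alternative illustration, Athena itself can rewrite a table as partitioned, compressed Parquet with a CTAS query. A minimal sketch, assuming hypothetical table, column, and bucket names (note that partition columns must come last in the SELECT list):

import boto3

athena = boto3.client("athena")

# CTAS: rewrite the raw CSV table as partitioned, Snappy-compressed Parquet
# so later queries scan far less data.
ctas_sql = """
CREATE TABLE flights_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://athenabucket/flight/parquet/',
    partitioned_by = ARRAY['year', 'month', 'day']
) AS
SELECT origin, dest, dep_delay, year, month, day
FROM flights_csv
"""

athena.start_query_execution(
    QueryString=ctas_sql,
    QueryExecutionContext={"Database": "my_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)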

Amazon Athena : Federated Query


- Allows you to run SQL queries across data stored in relational, non-relational, object, and custom data sources (on AWS or on-premises).

- Uses Data Source Connectors that run on AWS Lambda to execute federated queries (e.g., against CloudWatch Logs, DynamoDB, RDS databases, ElastiCache, etc.).

- Results are stored back in Amazon S3.
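
Once a connector Lambda has been deployed and registered as a data catalog, the extra source is referenced by its catalog name in ordinary Athena SQL. A minimal sketch, assuming a hypothetical DynamoDB connector registered under the catalog name dynamo_catalog:

import boto3

athena = boto3.client("athena")

# Join an S3-backed table with a DynamoDB table exposed through a federated
# connector; "dynamo_catalog" is the hypothetical name the connector was
# registered under.
federated_sql = """
SELECT o.order_id, o.total, c.email
FROM my_logs_db.orders o
JOIN "dynamo_catalog"."default"."customers" c
  ON o.customer_id = c.customer_id
LIMIT 10
"""

athena.start_query_execution(
    QueryString=federated_sql,
    QueryExecutionContext={"Database": "my_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)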


Monday 23 January 2023

Amazon DocumentDB Overview

Before talking about DocumentDB, recall Aurora: Aurora is the AWS implementation of PostgreSQL/MySQL. It exposes the same APIs but is built for AWS, which gives it some nice cloud-native benefits.
DocumentDB is the same idea as Aurora, but for MongoDB (which is a NoSQL database).
- MongoDB is used to store, query and index JSON data, and DocumentDB gives you that as a managed service.
- Similar "deployment concept" as Aurora.
- Fully managed, highly available with replication across 3 AZs.
- DocumentDB storage automatically grows in increments of 10 GB.
- Automatically scales to workloads with millions of requests per second.
With DocumentDB you pay for what you use and there is no upfront cost, but you need to understand the architecture of DocumentDB, which is very similar to Aurora's, in order to understand its pricing.
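
Because DocumentDB is MongoDB-compatible, existing MongoDB drivers work against it. A minimal sketch with pymongo, assuming a hypothetical cluster endpoint and credentials, plus the CA bundle file (global-bundle.pem) that AWS publishes for TLS connections:

import pymongo  # pip install pymongo

# Hypothetical endpoint/credentials; global-bundle.pem is the CA file AWS
# provides for TLS connections to DocumentDB.
client = pymongo.MongoClient(
    "mongodb://myuser:mypassword@my-docdb-cluster.cluster-abc123.us-east-1.docdb.amazonaws.com:27017/"
    "?tls=true&tlsCAFile=global-bundle.pem&replicaSet=rs0"
    "&readPreference=secondaryPreferred&retryWrites=false"
)

db = client["appdb"]
db["orders"].insert_one({"order_id": 1, "items": ["book", "pen"], "total": 12.5})
print(db["orders"].find_one({"order_id": 1}))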


DocumentDB - Pricing

- Pay-as-you-go: pay only for what you use, no upfront cost.
- On-Demand Instances (pay per second, with a 10-minute minimum)
- Database I/O: the amount of I/O used by reads/writes (billed per million I/Os)
- Database Storage (per GB/month)
- Backup Storage (per GB/month)

We have database storage, and then we have on-demand instances: a primary instance plus the replica instances you choose. You pay for these on-demand instances per second, with a 10-minute minimum. The instances perform read and write operations against the database storage, so you also pay for database I/O, counted per million I/Os. The data itself lives in the database storage, so you pay for that storage in GB per month, and the same goes for backup storage.
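
As a rough back-of-the-envelope illustration of how those four components add up (the rates below are placeholders, not actual DocumentDB prices; check the AWS pricing page for your region):

HOURS_PER_MONTH = 730

instance_hours = 3 * HOURS_PER_MONTH   # 1 primary + 2 replicas, always on
instance_rate  = 0.30                  # $/instance-hour (placeholder rate)
io_millions    = 50                    # 50 million I/Os over the month
io_rate        = 0.20                  # $ per million I/Os (placeholder rate)
storage_gb     = 200
storage_rate   = 0.10                  # $/GB-month (placeholder rate)
backup_gb      = 100
backup_rate    = 0.02                  # $/GB-month (placeholder rate)

monthly_cost = (instance_hours * instance_rate
                + io_millions * io_rate
                + storage_gb * storage_rate
                + backup_gb * backup_rate)
print(f"Estimated monthly cost: ${monthly_cost:,.2f}")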

Monday 9 January 2023

Amazon Timestream Overview

As the name indicates, Amazon Timestream is a time-series database: fully managed, fast, scalable, and serverless. Use cases for Timestream include IoT applications, operational applications, real-time analytics, and anything else built around time-series data.
- Automatically scales capacity up/down.
- Store and analyse trillions of events per day.
- 1000s of times faster and 1/10th the cost of relational databases.
- Scheduled queries, multi-measure records, and SQL compatibility.
- Data storage tiering: recent data kept in memory and historical data kept in a cost-optimised storage tier.
- Built-in time-series analytics functions (help you identify patterns in your data in near real-time).
- Supports Encryption in-transit and at rest.
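
A minimal sketch of writing a data point into Timestream with boto3 (the database and table names are hypothetical and must already exist):

import time
import boto3

write = boto3.client("timestream-write", region_name="us-east-1")

now_ms = str(int(time.time() * 1000))

write.write_records(
    DatabaseName="iot_db",
    TableName="sensor_metrics",
    Records=[
        {
            "Dimensions": [
                {"Name": "device_id", "Value": "sensor-42"},
                {"Name": "site", "Value": "factory-1"},
            ],
            "MeasureName": "temperature",
            "MeasureValue": "23.7",
            "MeasureValueType": "DOUBLE",
            "Time": now_ms,  # default time unit is milliseconds
        }
    ],
)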

Amazon Timestream Architecture


Timestream can receive data from AWS IoT (Internet of Things), from Kinesis Data Streams, from Prometheus and Telegraf (there are integrations for those), from Kinesis Data Analytics for Apache Flink, and from Amazon MSK through the same process.

In terms of what can connect to Timestream: you can build dashboards using Amazon QuickSight, do machine learning using Amazon SageMaker, and connect Grafana; and because there is a standard JDBC connection into the database, any application that is compatible with JDBC and SQL can leverage Amazon Timestream.
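
And a minimal sketch of querying the same (hypothetical) table back; ago() and bin() are built-in Timestream time-series functions:

import boto3

query = boto3.client("timestream-query", region_name="us-east-1")

sql = """
SELECT device_id,
       bin(time, 5m) AS five_min_bucket,
       avg(measure_value::double) AS avg_temperature
FROM "iot_db"."sensor_metrics"
WHERE measure_name = 'temperature'
  AND time > ago(1h)
GROUP BY device_id, bin(time, 5m)
ORDER BY five_min_bucket
"""

result = query.query(QueryString=sql)
for row in result["Rows"]:
    print([d.get("ScalarValue") for d in row["Data"]])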

Tuesday 3 January 2023

Amazon Redshift Overview

Amazon Redshift is a data warehouse product which forms part of the larger cloud-computing platform Amazon Web Services.
- Redshift is based on PostgreSQL, but it's not used for OLTP. 
- It's OLAP - Online Analytical Processing (analysis and data warehousing).
- Built on top of technology from the massive parallel processing data warehouse company ParAccel, to handle large scale data sets and database migrations.
- 10x better performance than other data warehouses; scales to PBs of data.
- Columnar storage of data (instead of row-based data).
- Pay as you go based on the instances provisioned.
- Can use a SQL interface for performing the queries.
- BI tools such as Amazon QuickSight or Tableau integrate with it.
- How to load data into Redshift?
Data is loaded from Amazon S3 (using the COPY command - see the sketch after this list), Kinesis Data Firehose (to load data in near real-time), DynamoDB, DMS (Database Migration Service)...
- Based on node type : up to 100+ nodes, each node can have up to 16 TB of storage space.
- Can provision multiple nodes, with Multi-AZ only for some clusters.
- There are two types of nodes in Redshift:
- Leader Node: for query planning and result aggregation from the compute nodes.
- Compute Node: for performing the queries and sending results to the leader node.
- As Redshift is a managed service, we only get Backup & Restore, security (VPC enhancements / IAM for accessing the cluster / KMS for encryption), and monitoring using CloudWatch.
- Redshift Enhanced VPC routing: COPY / UNLOAD goes through VPC (for better performance and lower cost).
- Redshift is provisioned, so it's worth it when you have a sustained usage (use Athena if the queries are sporadic instead).
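
A minimal sketch of the COPY path, using the Redshift Data API from boto3 with hypothetical cluster, role, and bucket names:

import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# The COPY command pulls the files from S3 in parallel across the compute nodes.
copy_sql = """
COPY analytics.flights
FROM 's3://my-data-bucket/flights/parquet/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS PARQUET;
"""

resp = rsd.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="analytics_db",
    DbUser="awsuser",
    Sql=copy_sql,
)
print(resp["Id"])  # statement id; can be polled with describe_statement()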

Redshift - Snapshot & DR

- Snapshots are point-in-time backups of your cluster, stored internally in S3.
- Snapshots are incremental (only what has changed in your Redshift cluster will be saved).
- You can restore a snapshot into a new cluster (you have to create a new cluster to restore the data).
- There are two types of snapshots :
- Automated Snapshot : happens every 8 hours, every 5 GB of data change in the cluster, or on a schedule you set, with a retention you configure (for example 30 days; after 30 days the snapshot will be automatically deleted).
- Manual Snapshot : on-demand snapshot, retained until you delete it.
This is very similar to how RDS works. One cool thing in Redshift: you can configure Amazon Redshift to automatically copy snapshots (automated or manual) of a cluster to another AWS Region, which is very useful for setting up disaster recovery for your Redshift cluster.
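
As a minimal sketch (hypothetical cluster and snapshot names), taking a manual snapshot and later restoring it into a brand-new cluster looks like this with boto3:

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Take a manual snapshot of an existing cluster...
redshift.create_cluster_snapshot(
    SnapshotIdentifier="analytics-manual-2023-01-23",
    ClusterIdentifier="my-redshift-cluster",
)

# ...and later restore it into a new cluster (you cannot restore "in place"
# over the existing cluster).
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="my-redshift-cluster-restored",
    SnapshotIdentifier="analytics-manual-2023-01-23",
    NodeType="ra3.xlplus",
    NumberOfNodes=2,
)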

Cross-Region Snapshot Copy for a KMS-Encrypted Redshift Cluster

How do you copy a snapshot of a KMS-encrypted Redshift cluster across Regions?

The snapshot sits in the source Region, encrypted with KMS Key A, and you want the copy in the destination Region to be encrypted with KMS Key B. To do this you create a Redshift "snapshot copy grant" in the destination Region; it allows the Redshift service to perform encryption operations in that Region. Once the grant exists, you can copy the snapshot from Region 1 into Region 2, and Redshift encrypts it there with the correct KMS key. The key takeaway is that for an encrypted cluster you must create a snapshot copy grant before the cross-Region copy can work.
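
A minimal sketch of the two steps with boto3 (cluster, grant, and key names are hypothetical):

import boto3

# The grant must exist in the DESTINATION Region so that Redshift can use
# KMS Key B there.
redshift_dest = boto3.client("redshift", region_name="eu-west-1")
redshift_dest.create_snapshot_copy_grant(
    SnapshotCopyGrantName="my-dr-copy-grant",
    KmsKeyId="arn:aws:kms:eu-west-1:123456789012:key/key-b-id",
)

# Then, on the SOURCE cluster, enable cross-Region snapshot copy and point
# it at that grant.
redshift_src = boto3.client("redshift", region_name="us-east-1")
redshift_src.enable_snapshot_copy(
    ClusterIdentifier="my-redshift-cluster",
    DestinationRegion="eu-west-1",
    RetentionPeriod=7,  # days to keep the copied snapshots
    SnapshotCopyGrantName="my-dr-copy-grant",
)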

Redshift Spectrum

- Query data that is already in S3 without loading it.
- Must have a Redshift cluster available to start the query.
- The query is then submitted to thousands of Redshift Spectrum nodes.
How does that work?
Our existing Redshift cluster has a leader node and a bunch of compute nodes. When we run a query against data in Amazon S3, Redshift spins up many virtual Spectrum nodes; they do the computation and run the query over the data set in Amazon S3, the results are sent back to the compute nodes for aggregation, and the aggregated result is then rolled up to the leader node.
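
A minimal sketch of setting Spectrum up through the Redshift Data API, assuming a hypothetical Glue Data Catalog database and IAM role; the external schema maps the S3-backed tables into the cluster so they can be queried like local tables:

import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

for sql in [
    # Map the Glue Data Catalog database into Redshift as an external schema.
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG
    DATABASE 'my_glue_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """,
    # Query the S3-backed table through Spectrum.
    """
    SELECT year, COUNT(*) AS flights
    FROM spectrum.flights_parquet
    GROUP BY year
    ORDER BY year;
    """,
]:
    rsd.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="analytics_db",
        DbUser="awsuser",
        Sql=sql,
    )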

Redshift WorkLoad Management (WLM)

- It enables you to flexibly manage query priorities within workloads.
- The use case for this is to prevent short, fast-running queries from getting stuck behind long-running queries.
- You can define multiple query queues (e.g., a superuser queue, user-defined queues, and so on).
- Route queries to the proper queue at runtime.
Let's take an example: in Amazon Redshift we have three queues: the superuser queue, a short-running queue, and a long-running queue (the names are pretty explicit). Say we have two kinds of users: an admin and a regular user.


The admin wants their queries to be done as soon as possible, so their system queries go directly into the superuser queue, which has priority.
The user may have some short-running queries, and we want those to go into the short-running queue; this way we are sure that queue only contains short-running queries.
But if the user submits a longer-running query that will take a lot of time, we should send it directly into the long-running queue to make sure it does not block the short-running queries.
So we have two kinds of workload management :
- Automatic WLM : where the queues and resources are managed by Redshift.
- Manual WLM : where the queues and resources are managed by you (i.e. the user).
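
As a minimal sketch of what a manual-WLM setup can look like (the parameter group name is hypothetical, and the exact queue keys should be double-checked against the Redshift WLM documentation), the queue layout is written as JSON into the wlm_json_configuration cluster parameter:

import json
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# A short-running queue, a long-running queue, and a default queue.
wlm_config = [
    {"query_group": ["short"], "query_concurrency": 5, "memory_percent_to_use": 30},
    {"query_group": ["long"], "query_concurrency": 2, "memory_percent_to_use": 60},
    {"query_concurrency": 1},  # default queue for everything else
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-wlm-parameter-group",
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }
    ],
)

At runtime a session can then route its queries to a queue with, for example, SET query_group TO 'short';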

Redshift Concurrency Scaling Cluster

- It enables you to provide consistently fast performance with a virtually unlimited number of concurrent users and queries.
- Redshift automatically adds additional cluster capacity on the fly; this additional capacity is called the Concurrency Scaling Cluster and allows you to process an increase in requests.
- Ability to decide which queries are sent to the Concurrency Scaling Cluster using WLM.

Say we have a normal Redshift cluster with a few nodes, and it sustains the number of users we normally have. If tomorrow a lot more users start submitting queries to Amazon Redshift, the concurrency scaling cluster automatically adds capacity to accommodate them.
We also get the ability to decide which queries are sent to the concurrency scaling cluster using the WLM feature we just discussed, and this feature is charged per second.
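
Building on the WLM sketch above, a queue can opt into concurrency scaling; the snippet below is a hedged illustration (verify the keys and the max_concurrency_scaling_clusters parameter against the Redshift docs):

# Opt the short-running queue into concurrency scaling; write this list to
# the wlm_json_configuration parameter exactly as in the earlier sketch.
wlm_config = [
    {"query_group": ["short"], "query_concurrency": 5, "concurrency_scaling": "auto"},
    {"query_group": ["long"], "query_concurrency": 2},
    {"query_concurrency": 1},  # default queue
]
# The cluster-level cap on extra clusters is the separate
# "max_concurrency_scaling_clusters" parameter.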

Choosing the Right Cloud Database Service offered by cloud providers : AWS, Azure and GCP


The majority of cloud providers started off treating databases as application that could be run on general-purpose compute instances. However, they quickly started adding higher-level application services to their IaaS offerings. Cloud databases have grown to be a crucial area of technological advancement, with manufacturers battling for customers by introducing several database kinds.

Architects, administrators, and developers must be aware of the various database options available because the IT industry is currently experiencing a significant displacement of data-centric workloads to the cloud.
Ensure that the IT team (including the DBA and application teams) chooses the cloud database service that best suits the requirements of your business by following the tips below:

1. Become familiar with the database's options:

The deployment architecture and the database requirements are the two main factors to take into account while selecting cloud database services. Each element will include parameters specific to a given application.

The following are the variables that influence cloud database selection:
- Deployment environment: on-site, in a public cloud or clouds, or a combination of hosted and private resources.
- Database manager: specifically, whether the database will be self-managed or a cloud service (DBaaS).
- Database type: including relational SQL (RDBMS), distributed SQL, NoSQL variants like a key-value store, wide-column or document DB, cache, graph, or various specialty databases like time series and quantum ledger.
- Availability and resilience requirements: these will dictate whether many replicated and synchronized instances are required.
- Performance and capacity requirements: these include transactions per second, query latency, and scalability of both database throughput and capacity.
- Flexibility: defining precise version criteria and the degree of authority required for an IT staff to manage more intricate elements like database setup and configuration.

IT teams, including DBAs and application teams, should investigate the spectrum of database services offered by AWS, Microsoft Azure, and Google Cloud Platform to suit their individual requirements.

2. Examine cloud trade-offs:

Cloud database services are more cost-effective for businesses looking to cut costs than self-managed internal database systems because they do not require hardware, system administration, or DBA overhead. Using cloud products built on open source or in-house developed software, which carry no license costs, further increases the savings for businesses.

Cloud database services work well for businesses that:
- Desire the simplest, fastest implementation possible.
- Desire access to a large selection of database products without having to set up and maintain each one.
- Are reluctant to manage the software themselves.
- Require maximum expandability.
- Desire high availability through a number of widely separated instances.
- Recognize the benefits of the cloud's intrinsic security and its shared security model; additionally, they don't have to micromanage every database option.

In general, cloud users need to understand the limitations on the capacity and performance of specific database instances. Nonetheless, problems may frequently be handled using a distributed architecture that takes advantage of networking and replication capabilities included in cloud solutions.

3. Analyze your consumption and business:

Consider some of the following aspects while making the best choice for the workloads at your organisation.
First, database and cloud infrastructure products are better suited for new applications than for aging ones. IT departments may decide to shift old systems to the cloud, meanwhile, if they wish to get off outdated hardware, experience utilization spikes, and anticipate cost savings from the cloud's consumption-based pricing. IT teams may need to relocate a historical database in order to connect it with more recent apps if an organization is pursuing a cloud-native strategy.

However, a hybrid cloud deployment—where on-premises data is replicated to or accessed from the cloud—is typically a better starting point and provides better risk management than a lift-and-shift database migration if you're hesitant to jump right into the cloud. By using a hybrid strategy, IT can also move its primary system from the on-premises database to the cloud, with the on-premises database serving as a backup.

Systems architects and application developers need to be especially mindful of service costs when designing cloud-based environments, regardless of the cloud database they use. This is because careless usage and setups can quickly negate any potential savings from cloud deployments.