Monday 30 January 2023

Amazon Athena Overview

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL.
Use cases: business intelligence, analytics, and reporting; analyzing and querying VPC Flow Logs, ELB logs, CloudTrail logs, etc.
- Serverless query service to analyse data stored in Amazon S3.
- Uses standard SQL to query files (built on Presto).
- Supports CSV, JSON, ORC, Avro and Parquet.
- Pricing: $5.00 per TB of data scanned.
- Commonly used with Amazon QuickSight for reporting/dashboards.
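
Because pricing depends on the amount of data scanned, a rough cost estimate is simple arithmetic. The sketch below is only an illustration: it uses the $5.00 per TB figure from the notes above, assumes 1 TB means 1024**4 bytes, and ignores any rounding or minimum charge that may apply. The function name `estimate_query_cost` is made up for this example.

```python
# Price taken from the notes above: $5.00 per TB of data scanned.
PRICE_PER_TB_USD = 5.00
# Assumption for this sketch: 1 TB is treated as 1024**4 bytes.
BYTES_PER_TB = 1024 ** 4


def estimate_query_cost(bytes_scanned: int) -> float:
    """Return the approximate cost in USD for scanning `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# A query that scans 2 TB versus one that scans a quarter of a terabyte.
large_scan = estimate_query_cost(2 * BYTES_PER_TB)
small_scan = estimate_query_cost(BYTES_PER_TB // 4)
print(f"2 TB scanned:    ${large_scan:.2f}")
print(f"0.25 TB scanned: ${small_scan:.2f}")
```

The takeaway is that anything reducing the bytes scanned reduces the bill proportionally.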


Amazon Athena : Performance Improvement

- Use columnar data for cost savings (less data is scanned).
- Apache Parquet or ORC is recommended.
- Large performance improvement, since only the columns a query needs are read.
- Use Glue to convert your data to Parquet or ORC format.
- Compress data for smaller retrievals (bzip2, gzip, lz4, snappy, zlib...)
- Partition datasets in S3 for easy querying on virtual columns:
  s3://yourbucketname/pathtotable
      /[PARTITION_COLUMN_NAME]=[VALUE]
      /[PARTITION_COLUMN_NAME]=[VALUE]
      etc.

- Example: s3://athenabucket/flight/parquet/year=1999/month=1/day=1/
- Use larger files (> 128 MB) to minimize overhead.
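
The key=value partition layout described above can be sketched as a small helper that builds the S3 prefix for a given set of partition values. This is a minimal illustration only; the function name `partition_prefix` is hypothetical, and the bucket and path are the ones from the example in the notes.

```python
def partition_prefix(table_location: str, partitions: dict) -> str:
    """Build a key=value partition prefix: <table>/<column>=<value>/.../"""
    base = table_location.rstrip("/")
    # Dicts preserve insertion order, so partition columns stay in the
    # order they are given (year, then month, then day).
    parts = [f"{column}={value}" for column, value in partitions.items()]
    return "/".join([base, *parts]) + "/"


prefix = partition_prefix(
    "s3://athenabucket/flight/parquet",
    {"year": 1999, "month": 1, "day": 1},
)
print(prefix)  # s3://athenabucket/flight/parquet/year=1999/month=1/day=1/
```

A query that filters on year, month and day then only needs to read objects under the matching prefix instead of the whole table location.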

Amazon Athena : Federated Query


- Allows you to run SQL queries across data stored in relational, non-relational, object or custom data sources (AWS or on-premises).

- Uses Data Source Connectors that run on AWS Lambda to run federated queries (e.g. on CloudWatch Logs, DynamoDB, RDS DB, ElastiCache, etc.)

- Stores results back in Amazon S3.
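
As a rough sketch of how the points above fit together, the helper below assembles the parameters for Athena's StartQueryExecution API: the SQL text, the data source (catalog) and database to query, and the S3 location where results are written. The helper `build_query_request`, the catalog name, the table name and the results bucket are all hypothetical placeholders; the sketch assumes a connector has already been registered as a data source under that catalog name.

```python
def build_query_request(sql: str, catalog: str, database: str,
                        output_location: str) -> dict:
    """Assemble keyword arguments for Athena's StartQueryExecution API."""
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an s3:// URI")
    return {
        "QueryString": sql,
        # Catalog is the data source name; for a federated query this would
        # be the data source backed by a Lambda connector.
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        # Query results are written back to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_query_request(
    sql="SELECT * FROM orders LIMIT 10",              # hypothetical table
    catalog="my_dynamodb_catalog",                    # hypothetical data source
    database="default",
    output_location="s3://my-athena-results/federated/",  # hypothetical bucket
)
print(request)

# With boto3 installed and AWS credentials configured, the request
# could be submitted like this:
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```

Keeping the request construction separate from the API call makes it possible to inspect the parameters without touching AWS.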

