Explain the features of Amazon Athena

In this recipe, we will learn about Amazon Athena and its key features.

Recipe Objective - Explain the features of Amazon Athena

Amazon Athena is a widely used, interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and users pay only for the queries they run. It is easy to use: simply point Athena at data in Amazon S3, define the schema, and start querying with standard SQL; most results are delivered within seconds. With Athena, there is no need for complex ETL jobs to prepare data for analysis, which makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. Athena also integrates out of the box with the AWS Glue Data Catalog, allowing users to create a unified metadata repository across various services, crawl data sources to discover schemas, populate the catalog with new and modified table and partition definitions, and maintain schema versioning.

Because Athena is serverless, it is scalable and cost-effective at the same time. Customers are charged on a pay-per-query basis, which in practice means they pay for the amount of data each query scans: the standard charge is 5 USD per terabyte of data scanned from S3. Although that looks like a small amount at first glance, when many queries run against hundreds or thousands of gigabytes of data, the cost can get out of control if left unmonitored.
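As a quick illustration of the "point at S3, define a schema, and query" workflow, here is a minimal sketch using the boto3 Athena client. The database, table, and bucket names are placeholders, and the schema is assumed to already exist in the Glue Data Catalog.

```python
# Minimal sketch of running a standard SQL query with Athena via boto3.
# The database, table, and bucket names below are placeholders.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS requests FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "sampledb"},  # schema defined over data in S3
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # where Athena writes results
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes (Athena runs queries asynchronously).
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```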

Benefits of Amazon Athena

Amazon Athena offers pay-per-query pricing: users pay only for the queries they run, at $5 per terabyte scanned by their queries. Users can save from 30% to 90% on their per-query costs, and get better performance, by compressing, partitioning, and converting their data into columnar formats. Athena queries data directly in Amazon S3, so there are no additional storage charges beyond S3.

With Amazon Athena, users don't have to worry about provisioning enough compute resources to get fast, interactive query performance. Athena automatically executes queries in parallel, so most results come back within seconds, making it fast. Athena uses Presto with ANSI SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet. It is ideal for quick, ad-hoc querying, but it can also handle complex analysis, including large joins, window functions, and arrays. Athena is highly available and executes queries using compute resources across multiple facilities and multiple devices in each facility, and it uses Amazon S3 as its underlying data store, making users' data highly available and durable: open, powerful, and standard.

Finally, Amazon Athena is serverless, so users can quickly query their data without having to set up and manage any servers or data warehouses: just point to the data in Amazon S3, define the schema, and start querying using the built-in query editor. Athena lets users tap into all their data in S3 without setting up complex extract, transform, and load (ETL) processes, so data can be queried almost instantly.
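To make the pay-per-query math concrete, here is a rough back-of-the-envelope sketch using the $5-per-terabyte figure above; the scan volumes are illustrative assumptions, not measured numbers.

```python
# Rough cost estimate at Athena's $5 per terabyte of data scanned.
# The scan sizes below are illustrative assumptions.
PRICE_PER_TB_USD = 5.00
BYTES_PER_TB = 10**12

def query_cost(bytes_scanned: int) -> float:
    """Approximate cost of a single query given the bytes it scans."""
    return PRICE_PER_TB_USD * bytes_scanned / BYTES_PER_TB

raw_csv_scan = 1 * BYTES_PER_TB          # query scans 1 TB of uncompressed CSV
parquet_scan = int(0.1 * BYTES_PER_TB)   # same query after Parquet conversion and partition pruning

print(f"CSV scan:     ${query_cost(raw_csv_scan):.2f}")   # $5.00
print(f"Parquet scan: ${query_cost(parquet_scan):.2f}")   # $0.50
```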


System Requirements

  • Any operating system (Mac, Windows, or Linux)

This recipe explains Amazon Athena and its key features.

Features of Amazon Athena

    • It provides easy querying with standard SQL

Amazon Athena uses Presto, an open-source, distributed SQL query engine optimized for low-latency, ad hoc analysis of data. This means users can run queries against large datasets in Amazon S3 using ANSI SQL, with full support for large joins, window functions, and arrays. Athena supports a wide variety of data formats, such as CSV, JSON, ORC, Avro, and Parquet. Users can also connect to Athena from a wide variety of BI tools using Athena's JDBC driver.
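The query below is a small sketch of the kind of ANSI SQL (a join plus a window function) that Presto under Athena handles directly; the database, table, and bucket names are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical tables: ANSI SQL with a join and a window function.
sql = """
SELECT o.order_id,
       c.region,
       SUM(o.amount) OVER (PARTITION BY c.region) AS region_total
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
"""

athena.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={"Database": "sales"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
```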

    • It provides pay-per-query pricing

With Amazon Athena, users pay only for the queries they run. Users are charged based on the amount of data scanned by each query, and they can get significant cost savings and performance gains by compressing, partitioning, or converting their data to a columnar format, because each of those operations reduces the amount of data that Athena needs to scan to execute a query.
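One common way to apply that advice from within Athena itself is a CREATE TABLE AS SELECT (CTAS) statement that rewrites a raw CSV table as partitioned Parquet. The statement below is a sketch with placeholder table, column, and bucket names.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical CTAS: rewrite a raw CSV table as partitioned Parquet so that
# later queries scan (and pay for) far less data.
ctas = """
CREATE TABLE sales_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-curated-bucket/sales_parquet/',
    partitioned_by = ARRAY['sale_year']
) AS
SELECT order_id, amount, sale_year   -- partition column(s) must come last
FROM sales_csv
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```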

    • It provides fast performance and is highly available & durable

With Amazon Athena, users don't have to worry about managing or tuning clusters to get fast performance. Athena is optimized for fast performance with Amazon S3 and automatically executes queries in parallel, so users get query results in seconds, even on large datasets. Athena is highly available and executes queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable. In addition, Athena uses Amazon S3 as its underlying data store, making users' data highly available and durable: Amazon S3 provides durable infrastructure for storing important data and is designed for 99.999999999% durability of objects, with data redundantly stored across multiple facilities and multiple devices in each facility.

    • It provides security

Amazon Athena allows users to control access to their data by using AWS Identity and Access Management (IAM) policies, access control lists (ACLs), and Amazon S3 bucket policies. With IAM policies, administrators can grant IAM users fine-grained control over their S3 buckets, and by controlling access to the data in S3 they can restrict who is able to query it with Athena. Athena also allows users to easily query encrypted data stored in Amazon S3 and to write encrypted results back to their S3 bucket. Both server-side encryption and client-side encryption are supported.
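As a small illustration of the encryption side, the sketch below asks Athena to encrypt the query results it writes back to S3 with S3-managed keys; the bucket and database names are placeholders, and SSE_KMS or CSE_KMS could be used instead.

```python
import boto3

athena = boto3.client("athena")

# Placeholder names; results written back to S3 are encrypted server-side
# with S3-managed keys (SSE_KMS and CSE_KMS are the other supported options).
athena.start_query_execution(
    QueryString="SELECT COUNT(*) AS events FROM audit_events",
    QueryExecutionContext={"Database": "secure_db"},
    ResultConfiguration={
        "OutputLocation": "s3://my-encrypted-results/",
        "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
    },
)
```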

    • It provides integration with AWS Glue

Amazon Athena integrates out of the box with AWS Glue. With the Glue Data Catalog, users can create a unified metadata repository across various services, crawl data sources to discover data and populate the Data Catalog with new and modified table and partition definitions, and maintain schema versioning. Users can also use Glue's fully managed ETL capabilities to transform data or convert it into columnar formats, which optimizes query performance and reduces costs.
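The sketch below sets up a hypothetical Glue crawler over an S3 prefix so that the discovered tables land in the Data Catalog, where Athena can query them; the crawler name, IAM role ARN, database, and S3 path are all placeholders.

```python
import boto3

glue = boto3.client("glue")

# Placeholder names: a crawler that scans an S3 prefix and registers the
# discovered schema in the Glue Data Catalog for Athena to query.
glue.create_crawler(
    Name="sales-logs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role ARN
    DatabaseName="sales",
    Targets={"S3Targets": [{"Path": "s3://my-raw-bucket/sales/"}]},
)
glue.start_crawler(Name="sales-logs-crawler")
```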

    • It provides federated queries

Amazon Athena enables users to run SQL queries across data stored in relational, non-relational, object, and custom data sources. Users can use familiar SQL constructs to JOIN data across multiple data sources for quick analysis, and store the results in Amazon S3 for subsequent use. Athena executes federated queries using Athena Data Source Connectors that run on AWS Lambda. AWS provides open-source data source connectors for Amazon DynamoDB, Apache HBase, Amazon DocumentDB, Amazon Redshift, Amazon CloudWatch Logs, Amazon CloudWatch Metrics, and JDBC-compliant relational databases such as MySQL and PostgreSQL. Users can use these connectors to run federated SQL queries in Athena, and with the Athena Query Federation SDK they can also build connectors to any data source.
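To show what a federated query looks like in practice, the sketch below joins an S3-backed table with a table exposed through a hypothetical DynamoDB connector catalog named "dynamo_catalog" (assumed to be registered in Athena and backed by a Lambda connector); every name here is a placeholder.

```python
import boto3

athena = boto3.client("athena")

# "dynamo_catalog" is a hypothetical data source registered in Athena and
# backed by the DynamoDB connector running on AWS Lambda.
federated_sql = """
SELECT s.order_id, s.amount, d.loyalty_tier
FROM sales.orders s
JOIN "dynamo_catalog"."default"."customer_profiles" d
  ON d.customer_id = s.customer_id
"""

athena.start_query_execution(
    QueryString=federated_sql,
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```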

