Explain the features of Amazon Athena

In this recipe, we will learn about Amazon Athena. We will also learn about the features of Amazon Athena.

Recipe Objective - Explain the features of Amazon Athena

Amazon Athena is a widely used, interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and users pay only for the queries they run. It is easy to use: simply point to data in Amazon S3, define the schema, and start querying with standard SQL; most results are delivered within seconds. With Athena, there is no need for complex ETL jobs to prepare data for analysis, which makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

Amazon Athena integrates out-of-the-box with the AWS Glue Data Catalog, allowing users to create a unified metadata repository across various services, crawl data sources to discover schemas, populate the catalog with new and modified table and partition definitions, and maintain schema versioning. Because Athena is a serverless query tool, it is scalable and cost-effective at the same time. Customers are charged on a pay-per-query basis, which translates to the amount of data each query scans. The standard charge for scanning 1 TB of data from S3 is 5 USD; although that looks like a small amount at first glance, when users have many queries running over hundreds or thousands of gigabytes of data, the cost can get out of control.
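
The pricing model above can be sketched as a small calculation. This is an illustrative estimate only, assuming the 5 USD per TB rate quoted above and Athena's documented 10 MB per-query billing minimum; always check current AWS pricing.

```python
# Hedged sketch: estimate an Athena query's cost from bytes scanned,
# assuming the $5-per-TB rate quoted above and a 10 MB per-query
# billing minimum.

PRICE_PER_TB = 5.00                  # USD per terabyte scanned (rate cited above)
MIN_BILLED_BYTES = 10 * 1024 ** 2    # 10 MB minimum billed per query

def athena_query_cost(bytes_scanned: int) -> float:
    """Return the approximate USD cost of a single Athena query."""
    billed = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed / 1024 ** 4 * PRICE_PER_TB

# A full 1 TB scan costs the quoted 5 USD.
print(athena_query_cost(1024 ** 4))   # → 5.0
```

Note that even a query scanning a few kilobytes is billed at the minimum, which is why many small queries against unpartitioned data can still add up.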

Benefits of Amazon Athena

Amazon Athena is pay-per-query, i.e. users pay only for the queries they run: users are charged $5 per terabyte scanned by their queries. Users can also save 30% to 90% on per-query costs and get better performance by compressing, partitioning, and converting their data into columnar formats. Athena queries data directly in Amazon S3, and there are no additional storage charges beyond S3.

With Amazon Athena, users don't have to worry about having enough compute resources to get fast, interactive query performance. Athena automatically executes queries in parallel, so most results come back within seconds. Athena uses Presto with ANSI SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet. It is ideal for quick, ad-hoc querying, but it can also handle complex analysis, including large joins, window functions, and arrays.

Amazon Athena is highly available and executes queries using compute resources across multiple facilities and multiple devices in each facility. It uses Amazon S3 as its underlying data store, making users' data highly available and durable, and it is open, powerful, and standard. Athena is serverless, so users can quickly query their data without having to set up and manage any servers or data warehouses: just point to the data in Amazon S3, define the schema, and start querying using the built-in query editor. Athena lets users tap into all their data in S3 without setting up complex processes to extract, transform, and load the data (ETL), so data can be queried instantly.


System Requirements

  • Any Operating System (Mac, Windows, Linux)

This recipe explains Amazon Athena and the Features of Amazon Athena.

Features of Amazon Athena

    • It provides easy querying with standard SQL

Amazon Athena uses Presto, an open-source, distributed SQL query engine optimized for low-latency, ad hoc analysis of data. This means users can run queries against large datasets in Amazon S3 using ANSI SQL, with full support for large joins, window functions, and arrays. Athena supports a wide variety of data formats such as CSV, JSON, ORC, Avro, and Parquet. Users can also connect to Athena from a wide variety of BI tools using Athena's JDBC driver.
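
To illustrate the kind of ANSI SQL construct mentioned above, here is a window-function query. This is illustrative only: SQLite (3.25+, via Python's standard library) stands in for Athena's Presto engine here, and the `orders` table is a made-up example, but the same SQL shape works against a table in Athena.

```python
import sqlite3

# Illustrative stand-in: run an ANSI SQL window function locally with SQLite.
# In Athena, the same SELECT would run against a table backed by files in S3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('a', 10), ('a', 30), ('b', 20);
""")
rows = conn.execute("""
    SELECT customer,
           amount,
           SUM(amount) OVER (PARTITION BY customer) AS customer_total
    FROM orders
    ORDER BY customer, amount
""").fetchall()
print(rows)  # [('a', 10.0, 40.0), ('a', 30.0, 40.0), ('b', 20.0, 20.0)]
```

The `SUM(...) OVER (PARTITION BY ...)` clause computes a per-customer total alongside each row rather than collapsing the rows, which is exactly the style of analysis Athena supports at S3 scale.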

    • It provides pay-per-query pricing

With Amazon Athena, users pay only for the queries that they run. Users are charged based on the amount of data scanned by each query, and they can get significant cost savings and performance gains by compressing, partitioning, or converting their data to a columnar format, because each of those operations reduces the amount of data that Athena needs to scan to execute a query.
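
The partitioning savings can be sketched with simple arithmetic. The figures below are illustrative assumptions (an evenly partitioned 1 TB table with daily partitions), not measured Athena behavior: with Hive-style partition keys in the S3 path (e.g. `dt=2024-01-15/`), Athena scans only the partitions a query's WHERE clause selects.

```python
# Hedged sketch: why partitioning cuts cost. Assumes data is spread evenly
# across partitions and the query's predicate prunes all but a few of them.

def scanned_bytes(total_bytes: int, partitions: int, partitions_hit: int) -> int:
    """Bytes scanned when the table is split evenly over `partitions`
    and the query's predicate selects only `partitions_hit` of them."""
    return total_bytes * partitions_hit // partitions

full = 1024 ** 4                       # a 1 TB table
one_day = scanned_bytes(full, 365, 1)  # daily partitions, query hits one day
print(one_day / full)                  # ≈ 0.0027 → roughly a 99.7% reduction
```

Since billing is proportional to bytes scanned, a query that prunes down to one daily partition of a year's data costs roughly 1/365th of a full-table scan.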

    • It provides fast performance and is highly available & durable

With Amazon Athena, users don’t have to worry about managing or tuning clusters to get fast performance. Athena is optimized for fast performance with Amazon S3 and automatically executes queries in parallel, so that users get query results in seconds, even on large datasets. Athena is highly available and executes queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable. Athena also uses Amazon S3 as its underlying data store, making users' data highly available and durable: Amazon S3 provides durable infrastructure to store important data and is designed for 99.999999999% durability of objects, with user data stored redundantly across multiple facilities and multiple devices in each facility.
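
To put the eleven-nines durability figure in perspective, the arithmetic below works out the expected annual object loss; the object count is an illustrative assumption, not an AWS guarantee about any particular bucket.

```python
# Hedged arithmetic: what S3's 99.999999999% ("eleven nines") design
# durability implies for expected object loss. Object count is illustrative.

durability = 0.99999999999
annual_loss_probability = 1 - durability      # ~1e-11 per object per year

objects_stored = 10_000_000
expected_losses_per_year = objects_stored * annual_loss_probability
print(expected_losses_per_year)  # ≈ 0.0001 → about one object per 10,000 years
```

In other words, storing ten million objects at this design durability implies an expected loss of a single object roughly once every ten thousand years.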

    • It provides Security

Amazon Athena allows users to control access to their data by using AWS Identity and Access Management (IAM) policies, access control lists (ACLs), and Amazon S3 bucket policies. With IAM policies, users can grant IAM users fine-grained control over their S3 buckets, and by controlling access to data in S3, restrict who can query it using Athena. Athena also allows users to easily query encrypted data stored in Amazon S3 and write encrypted results back to their S3 bucket. Both server-side encryption and client-side encryption are supported.
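
As a sketch of the S3 bucket-policy approach mentioned above, the snippet below builds a read-only policy document. The bucket name and principal ARN are hypothetical placeholders; a real policy should be tailored to your account and reviewed against the current IAM policy reference.

```python
import json

# Hedged sketch of an S3 bucket policy granting read-only access to the data
# Athena queries. Bucket name and principal ARN are hypothetical placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowAthenaUserRead",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:user/analyst"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::example-athena-data",
            "arn:aws:s3:::example-athena-data/*",
        ],
    }],
}
print(json.dumps(policy, indent=2))
```

Because Athena reads data with the caller's S3 permissions, denying `s3:GetObject` on a prefix is enough to keep those objects out of a user's query results.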

    • It provides Integration

Amazon Athena integrates out-of-the-box with AWS Glue. With the Glue Data Catalog, users can create a unified metadata repository across various services, crawl data sources to discover data and populate their Data Catalog with new and modified table and partition definitions, and maintain schema versioning. Users can also use Glue's fully managed ETL capabilities to transform data or convert it into columnar formats to optimize query performance and reduce costs.

    • It provides federated queries

Amazon Athena enables users to run SQL queries across data stored in relational, non-relational, object, and custom data sources. Users can use familiar SQL constructs to JOIN data across multiple data sources for quick analysis, and store results in Amazon S3 for subsequent use. Athena executes federated queries using Athena Data Source Connectors that run on AWS Lambda. AWS provides open-source data source connectors for Amazon DynamoDB, Apache HBase, Amazon DocumentDB, Amazon Redshift, Amazon CloudWatch, Amazon CloudWatch Metrics, and JDBC-compliant relational databases such as MySQL and PostgreSQL. Users can use these connectors to run federated SQL queries in Athena, and with the Athena Query Federation SDK, they can build connectors to any data source.
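
A federated query is submitted like any other Athena query, pointed at a registered connector catalog. The sketch below only assembles the parameters a boto3 call such as `athena.start_query_execution(**params)` would take; the catalog, database, and bucket names are hypothetical, and no AWS call is made here.

```python
# Hedged sketch: build the keyword arguments for Athena's StartQueryExecution
# API for a federated query. Names below are hypothetical placeholders.

def federated_query_params(sql: str, catalog: str, database: str,
                           output_s3: str) -> dict:
    """Assemble StartQueryExecution kwargs that point the query at a
    registered data-source connector (the `Catalog` field)."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = federated_query_params(
    "SELECT * FROM orders LIMIT 10",
    catalog="dynamodb_connector",       # hypothetical connector name
    database="default",
    output_s3="s3://example-results/",  # hypothetical results bucket
)
print(params["QueryExecutionContext"])
```

With boto3 installed and credentials configured, passing this dict to `boto3.client("athena").start_query_execution(**params)` would route the query through the Lambda-based connector named in `Catalog`.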


Relevant Projects

Snowflake Azure Project to build real-time Twitter feed dashboard
In this Snowflake Azure project, you will ingest generated Twitter feeds to Snowflake in near real-time to power an in-built dashboard utility for obtaining popularity feeds reports.

Streaming Data Pipeline using Spark, HBase and Phoenix
Build a Real-Time Streaming Data Pipeline for an application that monitors oil wells using Apache Spark, HBase, and Apache Phoenix.

Python and MongoDB Project for Beginners with Source Code-Part 1
In this Python and MongoDB Project, you learn to do data analysis using PyMongo on MongoDB Atlas Cluster.

Snowflake Real Time Data Warehouse Project for Beginners-1
In this Snowflake Data Warehousing Project, you will learn to implement the Snowflake architecture and build a data warehouse in the cloud to deliver business value.

Build an Incremental ETL Pipeline with AWS CDK
Learn how to build an Incremental ETL Pipeline with AWS CDK using Cryptocurrency data

Build an ETL Pipeline with Talend for Export of Data from Cloud
In this Talend ETL Project, you will build an ETL pipeline using Talend to export employee data from the Snowflake database and investor data from the Azure database, combine them using a Loop-in mechanism, filter the data for each sales representative, and export the result as a CSV file.

Hadoop Project-Analysis of Yelp Dataset using Hadoop Hive
The goal of this hadoop project is to apply some data engineering principles to Yelp Dataset in the areas of processing, storage, and retrieval.

Python and MongoDB Project for Beginners with Source Code-Part 2
In this Python and MongoDB Project for Beginners, you will learn how to use Apache Sedona and perform advanced analysis on the Transportation dataset.

Hive Mini Project to Build a Data Warehouse for e-Commerce
In this hive project, you will design a data warehouse for e-commerce application to perform Hive analytics on Sales and Customer Demographics data using big data tools such as Sqoop, Spark, and HDFS.

Migration of MySQL Databases to Cloud AWS using AWS DMS
IoT-based Data Migration Project using AWS DMS and Aurora Postgres aims to migrate real-time IoT-based data from a MySQL database to the AWS cloud.