HANDS-ON-LAB

Spark Job for Filtering and Processing Wikiticker Data

Problem Statement

This hands-on lab exercise builds a Spark job that processes the Wikiticker data and filters it on specific conditions: the job should keep only records where cityName is "London" and delta is greater than 20. It should also print the total count of records after filtering and write the filtered data to a separate folder in S3.

Tasks

  1. Create a new Spark job using the Spark framework.

  2. Load the Wikiticker data into a Spark DataFrame.

  3. Apply filters to the DataFrame to keep only records where cityName is "London" and delta is greater than 20.

  4. Print appropriate statements in the Spark job, including the total count of records after filtering.

  5. Write the filtered data to a specified folder in S3 (a PySpark sketch covering these steps follows this list).
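
A minimal PySpark sketch of the full job is shown below. The input and output paths and bucket name are placeholders, and the JSON format and column names (cityName, delta) are assumptions based on the standard Wikiticker sample dataset; adjust them to match your copy of the data.

    # wikiticker_filter_job.py -- a sketch, not a confirmed solution
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("WikitickerFilterJob").getOrCreate()

    # Load the Wikiticker data into a DataFrame (path and format are placeholders).
    df = spark.read.json("s3://your-bucket/wikiticker/wikiticker-sampled.json")

    # Keep only records where cityName is "London" and delta is greater than 20.
    filtered = df.filter((col("cityName") == "London") & (col("delta") > 20))

    # count() is an action: it triggers the computation and returns the row count.
    print(f"Total records after filtering: {filtered.count()}")

    # Write the filtered data to a separate S3 folder (hypothetical output path).
    filtered.write.mode("overwrite").json("s3://your-bucket/wikiticker/filtered-london/")

    spark.stop()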


Explore the AWS Project for Batch Processing with PySpark on AWS EMR to sharpen your practical batch-processing skills on the platform.

Learnings

  • Hands-on experience in creating and running Spark jobs.

  • Understanding of filtering data using Spark DataFrame operations.

  • Knowledge of working with structured data in Spark.

  • Experience in adding print statements and debugging Spark jobs.

  • Familiarity with writing data to S3 using Spark.

FAQs

Q1. How can I create a Spark job for filtering and processing the Wikiticker data?

To create a Spark job, use the Spark framework and write the code in a language Spark supports, such as Scala or Python (PySpark). Load the Wikiticker data into a Spark DataFrame, apply filters for the records that meet the specified conditions, print the required statements, and write the filtered data to an S3 folder; one way to submit such a job on EMR is sketched below.
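
Once the script is written (see the sketch under Tasks), one way to run it on AWS EMR is to submit it as a step with boto3. The sketch below assumes the script has already been uploaded to S3; the cluster ID, region, step name, and script path are placeholders, not values from the lab.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Submit the PySpark script as an EMR step; command-runner.jar lets the
    # cluster run spark-submit for us.
    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # your EMR cluster ID
        Steps=[
            {
                "Name": "Wikiticker filter job",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",
                        "s3://your-bucket/scripts/wikiticker_filter_job.py",
                    ],
                },
            }
        ],
    )
    print("Submitted step:", response["StepIds"][0])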

Q2. How can I print the total count of records after filtering in the Spark job?

In the Spark job, call the count() action on the filtered DataFrame and print the result. Transformations such as filter() are lazy, so the count is only computed when the action runs; count() returns the number of rows remaining after the filters.
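
For example, assuming filtered is the DataFrame produced by the filters shown earlier, a minimal snippet:

    # count() is an action; it triggers the (lazy) filter and returns the row count.
    total = filtered.count()
    print(f"Total records after filtering: {total}")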