Recap of Apache Spark News for April

News on Apache Spark - April 2016

Databricks launches new APIs to make Apache Spark perform faster for production data-driven applications. April 1, 2016.

Apache Spark is fast becoming the execution engine for Hadoop, and enterprises are adopting it widely to build data-driven applications. Databricks has launched new APIs that let enterprises automate their Spark infrastructure. While DevOps teams work with command-line APIs to automate the infrastructure, data scientists need a more visual platform to run algorithms. There is no unifying structure to bring the different workflows of these two teams together.

High-performance woman working on making Spark simple. April 6, 2016.

Holden Karau, principal software engineer of Big Data at IBM, launched her book "High Performance Spark" with her co-author Rachel Warren. The first four chapters of the book have been released: Introduction to High-Performance Spark; How Spark Works; DataFrames, Datasets and Spark SQL; and Joins (SQL and Core). Addressing the release of these first chapters, Karau says that there are many exciting things happening with Apache Spark 2.0 as it moves from the RDD model to the Dataset model. This will help business analysts work easily with Spark and productionize it with the help of data engineers.

Cloudera launches Cloudera Enterprise 5.7 to bring better performance and operational efficiency to workloads. April 7, 2016.

Cloudera announced the general availability of Cloudera Enterprise 5.7. This release is set to improve performance across key workloads and operations. It is set to speed up data processing by 3x with the added support of Hive-on-Spark. Apache Spark plays a very important role in this new release and is set to replace MapReduce.

IBM has extended Apache Spark to the mainframe to deliver real-time insights on data. April 7, 2016.

IBM’s new z/OS platform for Apache Spark will make the lives of data scientists and developers easier by giving them real-time, secure access to mainframe data. IBM is upholding the commitment it made last year to dedicate 3,500 IBM researchers and developers to Spark projects. As a part of that endeavour, the z Systems team at IBM has also established a GitHub organization for developers to collaborate on Apache Spark.

Think Big expands capabilities for building data lakes with Apache Spark. April 15, 2016.

As interest in Apache Spark continues to grow, Think Big has incorporated Spark into its big data frameworks for developing enterprise-quality data lakes and big data analytic applications. Its customers can now use the Apache Spark framework in the cloud or on commodity Hadoop environments, optimizing them to run enterprise-class, mission-critical big data workloads.


IBM Expands Access and Value of z Systems Mainframe Data with Apache Spark. April 25, 2016.

The IBM z Systems mainframe with the new z/OS platform for Apache Spark is meant to ease and speed up data analysis, so that data scientists and big data developers can apply advanced analytics to large data sets to glean real-time insights. The platform runs on the z/OS mainframe operating system, enabling data scientists to analyse the data on its system of origin without having to extract, transform or load it.


Spark rival Apache Apex hits top-level status. April 26, 2016.

Apache Apex, an open-source batch and stream processing platform compatible with HDFS and YARN, has become a top-level Apache project. It targets the big data needs of enterprises for real-time reporting, monitoring and learning with millisecond data point precision. Apex might seem similar to other open-source data frameworks like Apache Spark, Storm or Samza, but it is likely to rival all these frameworks on usability features.


Apache Spark powers live SQL analytics in SnappyData. April 27, 2016.

Pivotal launched a new database solution called SnappyData, powered by the in-memory transactional data store GemFire and Apache Spark. SnappyData uses Spark's in-memory data analytics engine to perform live SQL analytics on data streams or static data sets. Users can write queries either in SQL or as Spark abstractions. SnappyData extends the features of Spark Streaming in various ways by allowing users to manipulate and query data streams as if they were tables.





