Recap of Apache Spark News for August


News on Apache Spark - August 2016

Apache Spark News

Apache Spark 2.0: MLlib Preview. August 6, 2016.

At the recent Spark Summit 2016, Joseph K. Bradley of Databricks presented “Apache Spark MLlib 2.0 Preview: Data Science and Production”. MLlib 2.0 focuses on the APIs most critical to data science. The pain points it addresses include customising pipelines and improving model persistence for production.



MapR’s $50mn funding leverages Apache Hadoop and Spark. August 9, 2016.

MapR, the popular big data platform company, announced that it has secured $50mn in funding and is now looking at an IPO. The move strengthens the rise of MapR’s core product, the MapR Converged Data Platform, which leverages Hadoop and Spark. The platform offers features such as global event streaming, real-time database capabilities and enterprise storage for developing and running innovative data applications.



Securing Apache Spark Shuffle Using Apache Commons Crypto. August 11, 2016.

Apache Commons Crypto, a cryptographic library optimized with AES-NI (Advanced Encryption Standard New Instructions), will provide performance advantages for Spark shuffle encryption over the existing approach. Programmers can use Apache Commons Crypto to implement high-performance AES encryption and decryption with minimal code and effort. The Apache Commons Crypto project was originally developed by Intel under the name Chimera and is now available to developers as a subproject of Apache Commons.


Can Spark do for machine learning what it’s done for data? August 17, 2016.

Spark is breaking the language barrier by offering algorithm implementations and APIs for multiple languages. As of now, learning tasks run in batch, and predictions can be made using Spark’s Structured Streaming. Apache Spark 2.0 is touted to offer continuous streaming application capabilities and will expand support for training machine learning models.


Spark innovation: Catalyst Optimizer simplifies complicated queries. August 19, 2016.

A fundamental piece of Spark, Spark SQL’s Catalyst Optimizer, simplifies the execution of complicated queries and provides high performance. The Catalyst Optimizer supports Databricks’ new DataFrame API, making big data accessible and simple for users.
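
To give a rough feel for how a rule-based optimizer can simplify a query before it runs, here is a toy sketch in plain Python. It is not Spark code: the `Lit`, `Col` and `Add` classes and the `fold_constants` function are hypothetical names made up for this illustration, and the single rewrite shown (folding constant sub-expressions) is only one of many kinds of rules an optimizer like Catalyst applies to its expression trees.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical mini expression tree. These classes are illustrative
# only and are not Spark's actual Catalyst classes.

@dataclass(frozen=True)
class Lit:
    value: int          # a literal constant, e.g. 3

@dataclass(frozen=True)
class Col:
    name: str           # a reference to a column, e.g. x

@dataclass(frozen=True)
class Add:
    left: "Expr"        # left operand of an addition
    right: "Expr"       # right operand of an addition

Expr = Union[Lit, Col, Add]

def fold_constants(expr):
    """Rewrite the tree bottom-up, replacing Add(Lit, Lit) with one Lit."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            # Both sides are known up front, so compute the sum once here
            # instead of once per row at execution time.
            return Lit(left.value + right.value)
        return Add(left, right)
    # Literals and column references have nothing to simplify.
    return expr

# x + (1 + 2) becomes x + 3
plan = Add(Col("x"), Add(Lit(1), Lit(2)))
optimized = fold_constants(plan)
print(optimized)
```

The point of the sketch is that the user writes the query in whatever form is convenient, and the optimizer rewrites the tree into a cheaper but equivalent one before execution.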


Hadoop Based Data Lakes are augmenting the use of Apache Spark. August 22, 2016.

Developers still take the adoption of Apache Spark with a grain of salt, mostly because Spark lacks a distributed storage layer of its own. Even so, Apache Spark is becoming invaluable for real-time data processing, and companies in gaming, betting and financial fraud detection are swearing by it. But Apache Spark needs Hadoop’s storage, and that is where Hadoop data lakes come in.


Bridging the Gap with Spark and SAP HANA. August 24, 2016.

Hadoop’s capabilities have helped businesses store and access large amounts of data, but organizations still face high-performance demands. Emerging in-memory frameworks like SAP HANA Vora and Apache Spark give organizations tools to overcome the limitations of batch-oriented processing and achieve real-time, iterative access to data on Hadoop clusters.


HPE adapts Vertica analytical database to world with Hadoop, Spark. August 31, 2016. SearchDataManagement

To compete in a field of diverse data tools, HPE has adapted its analytical database: Vertica 8.0 expands support for Hadoop, Spark and Kafka. The high-performance querying capabilities of Vertica 8.0 can now reach into Hadoop and Spark and bring valid result sets back to the database environment.





