Recap of Apache Spark News for November 2016
Redis-ML, a newly released component for the popular in-memory data store, accelerates machine learning functions with Apache Spark, November 2, 2016.

Redis, the in-memory data store, recently expanded its functionality through a module architecture featuring a new machine learning add-on that speeds up the delivery of results by serving trained models rather than training the model itself. The module works with the machine learning components of Apache Spark, which handle the data gathering and training phase. Redis plugs into the Apache Spark cluster through the Redis Spark ML module.


Industry Trends and Apache Spark's Evolving Role in Big Data, November 4, 2016.

Apache Spark has been the biggest trend in the field of big data of late, opening up novel opportunities. Going forward, Apache Spark is expected to be more useful than Hadoop for computational purposes, and Spark workloads will increasingly move into production. Trends also suggest that companies will start drinking their own Spark champagne: Spark will no longer be used only for customer-centric use cases, but also to build internal models, stress test the risk involved in financial instruments, and more.


Machine learning and data science workloads ignite Apache Spark adoption, November 8, 2016.

According to a Cloudera study conducted by the Taneja Group among 7,000 professionals involved in big data, 54% are actively using Apache Spark, while 64% find it invaluable and plan to expand their usage over the next year. With the emergence of machine learning applications, and with 71% employing Apache Spark for data science, Spark is here to stay, supporting a growing number of real-time processing workloads.
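Taken at face value, the survey percentages above translate into rough headcounts, as this quick check shows:

```python
# Rough headcounts implied by the survey percentages (7,000 respondents).
respondents = 7000
active_spark_users = round(respondents * 0.54)    # actively using Spark
find_it_invaluable = round(respondents * 0.64)    # rate Spark as invaluable
use_for_data_science = round(respondents * 0.71)  # employ Spark for data science

print(active_spark_users, find_it_invaluable, use_for_data_science)  # 3780 4480 4970
```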


Databricks Sets New World Record for CloudSort Benchmark Using Apache Spark, November 16, 2016.

Databricks has set a new record in a third-party benchmarking competition for processing large datasets, the CloudSort Benchmark. In collaboration with Nanjing University and Alibaba Group, Databricks architected an efficient cloud platform that sorted 100 TB of data at a cost of $1.44 per TB, outperforming the earlier record of $4.51 per TB. The benchmark measures the lowest possible cost per TB sorted at public cloud pricing.
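The scale of the improvement follows directly from the per-TB figures above:

```python
# Check the CloudSort figures: total cost to sort 100 TB at each rate,
# and the relative saving of the new record over the old one.
terabytes = 100
new_rate, old_rate = 1.44, 4.51  # USD per TB sorted

new_total = terabytes * new_rate   # ~ $144
old_total = terabytes * old_rate   # ~ $451
saving_pct = (1 - new_rate / old_rate) * 100

print(f"${new_total:.2f} vs ${old_total:.2f} (~{saving_pct:.0f}% cheaper)")
```

In other words, the new record cuts the cost of the sort by roughly two thirds.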


Review: Spark lights up machine learning, November 16, 2016.

Apache Spark's machine learning library, MLlib, brings machine learning capabilities to large compute clusters, and can be combined with TensorFlow for deep learning. Users can now make use of Databricks' configuration of Apache Spark clusters with GPUs rather than stock CPUs. GPUs can give users up to 10 times the speed for training complex machine learning algorithms on big data.


DWP explores use of Apache Spark, November 9, 2016.

Data scientists at the Department for Work and Pensions (DWP) are exploring Apache Spark for processing large datasets. The data science team at DWP is working with Spark to investigate AI, machine learning, and new uses of data. A team of more than 20 people has been working on the platform for half a year, hoping to create an application with a specific service that delivers value in a short period of time. Their goal is to build up some in-house capability and explore the kinds of analysis that can be performed using Apache Spark.


Apache Spark 2.0.2 Released, November 14, 2016.

Apache Spark 2.0.2 has recently been released. Databricks strongly recommends that all existing users upgrade to the latest version, as it includes fixes across several areas, Kafka 0.10 support, and runtime metrics support. The release also includes several bug fixes on top of 2.0.1.


Couchbase 4.6 Developer Preview Released, Adds Real-Time Connectors for Apache Spark 2.0 and Kafka, November 28, 2016.

Couchbase 4.6 adds a few new features, including full-text search capability based on Bleve, an open source Go library. Another addition is Cross Datacenter Replication, whose main focus is to ensure that applications used in different geographic locations remain in a consistent state. The other main features include connectors for the real-time analytics technologies Spark 2.0 and Kafka; the Spark 2.0 connector supports structured streaming and automatic flow control on a Couchbase cluster.




Relevant Projects

Real-Time Log Processing using Spark Streaming Architecture
In this Spark project, we bring processing to the speed layer of the lambda architecture, which opens up capabilities to monitor application performance in real time, measure real-time comfort with applications, and raise real-time alerts in case of security issues.

Airline Dataset Analysis using Hadoop, Hive, Pig and Impala
Hadoop Project - Perform basic big data analysis on an airline dataset using the big data tools Pig, Hive and Impala.

Hadoop Project for Beginners-SQL Analytics with Hive
In this hadoop project, learn about the features in Hive that allow us to perform analytical queries over large datasets.

Yelp Data Processing using Spark and Hive Part 2
In this Spark project, we continue building the data warehouse from the previous project, Yelp Data Processing Using Spark And Hive Part 1, and do further data processing to develop diverse data products.

Real-Time Log Processing in Kafka for Streaming Architecture
The goal of this Apache Kafka project is to process log entries from applications in real time, using Kafka for the streaming architecture in a microservices context.

Tough engineering choices with large datasets in Hive Part - 2
This is in continuation of the previous Hive project "Tough engineering choices with large datasets in Hive Part - 1", where we will work on processing big data sets using Hive.

Yelp Data Processing Using Spark And Hive Part 1
In this big data project, we continue from a previous Hive project, "Data engineering on Yelp Datasets using Hadoop tools", and do the entire data processing using Spark.

Explore features of Spark SQL in practice on Spark 2.0
The goal of this Spark project for students is to explore the features of Spark SQL in practice on the latest version of Spark, i.e., Spark 2.0.

Finding Unique URL's using Hadoop Hive
Hive Project - Learn to write a Hive program to find the first unique URL, given 'n' URLs.

Data Warehouse Design for E-commerce Environments
In this hive project, you will design a data warehouse for e-commerce environments.