Spark MLlib for Scalable Machine Learning with Spark
Cloudera co-founder Mike Olson said in a Strata + Hadoop World keynote –
“Spark allowed people to build and deploy scale-out machine learning applications much faster than they had previously done. [Why?] Its flexibility and ease of programming meant that you could build machine learning apps, train up models on massive data very, very quickly. That has led to huge interest in the ecosystem.”
Applying machine learning algorithms to massive datasets is challenging because many widely used machine learning algorithms were not designed for parallel architectures. Because machine learning algorithms are typically iterative, Apache Spark stands out as one of the few big data frameworks for parallel computing that combines in-memory processing, fault tolerance, scalability, speed and ease of programming.
Apache Spark makes it practical to run iterative machine learning algorithms on large datasets. Spark can hold big datasets in cluster memory, paging from disk as required, so an algorithm can make repeated passes over the data without syncing to disk after every pass, which can make it run up to 100 times faster than disk-based processing. The Spark MLlib library builds on this to make machine learning with Spark easy and scalable. A 2015 survey of Spark users found that Apache Spark is gaining importance for machine learning, with 64% of respondents using it for advanced analytics and 44% using it to build recommendation systems.
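The benefit of keeping data in memory across iterations can be illustrated with a small single-machine sketch. It does not use Spark: `DiskBackedDataset`, `cache` and `scan` are hypothetical names invented for this illustration, and a "disk read" is only a counter. The point is that an iterative algorithm (here, gradient descent for a one-parameter linear model) scans the same data on every iteration, so reading it once and reusing it saves one read per iteration.

```python
import random


class DiskBackedDataset:
    """Hypothetical stand-in for a dataset on disk; it only counts simulated reads."""

    def __init__(self, points):
        self._points = points
        self._cache = None
        self.disk_reads = 0

    def cache(self):
        """Read the data once and keep the copy in memory for later scans."""
        self.disk_reads += 1
        self._cache = list(self._points)
        return self

    def scan(self):
        """Return all points, re-reading from 'disk' unless a cached copy exists."""
        if self._cache is not None:
            return self._cache
        self.disk_reads += 1
        return list(self._points)


def fit_slope(dataset, iterations=20, learning_rate=0.05):
    """Gradient descent for y = w * x; every iteration scans the full dataset."""
    w = 0.0
    for _ in range(iterations):
        points = dataset.scan()
        gradient = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        w -= learning_rate * gradient
    return w


random.seed(0)
data = [(x, 3.0 * x) for x in (random.uniform(0, 5) for _ in range(100))]

uncached = DiskBackedDataset(data)
cached = DiskBackedDataset(data).cache()

w_uncached = fit_slope(uncached)
w_cached = fit_slope(cached)

print(uncached.disk_reads, cached.disk_reads)  # prints: 20 1
```

Both runs learn the same slope; only the number of simulated reads differs. In Spark the analogous step is asking the framework to keep a dataset in cluster memory before starting the iterations.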
Why Should You Use Apache Spark for Machine Learning?
With over 140 contributors from across 50 organizations, Spark MLlib provides developers with various tools that simplify the development of machine learning pipelines in production. Spark MLlib is designed mainly for large-scale learning settings that benefit from data parallelism or model parallelism.
Benefits of Spark MLlib
Spark MLlib is tightly integrated with Spark, which eases the development of efficient large-scale machine learning algorithms, as these are usually iterative in nature.
Spark’s open source community has led to the rapid growth and adoption of Spark MLlib. More than 200 individuals from across 75 organizations have contributed over 2,000 patches to MLlib alone.
MLlib is easy to deploy: if a Hadoop 2 cluster is already installed and running, it does not require any pre-installation.
Spark MLlib’s scalability, simplicity, and language compatibility (you can write applications in Java, Scala, and Python) help data scientists solve iterative data problems faster. Data scientists can focus on the data problems that matter whilst transparently leveraging the speed, ease of use and tight integration of Spark’s unified platform.
MLlib delivers substantial performance gains to data scientists and is reported to be 10 to 100 times faster than Hadoop and Apache Mahout. For example, the Alternating Least Squares algorithm run on an Amazon Reviews dataset of 660M users, 2.4M items, and 3.5B ratings completes in 40 minutes on 50 nodes.
What’s in MLlib?
MLlib contains fast and scalable implementations of standard machine learning algorithms. Through Spark MLlib, data engineers and data scientists have access to different types of statistical analysis, linear algebra and various optimization primitives. The Spark machine learning library MLlib includes the following algorithms –
Collaborative Filtering for Recommendations – Alternating Least Squares
Regression and Linear Models for Predictions – Linear Regression, Lasso Regression, Ridge Regression, Logistic Regression and Support Vector Machines (SVM); the last two are used for classification.
Clustering – Latent Dirichlet Allocation (LDA), K-Means and Gaussian Mixture Models.
Classification Algorithms – Naïve Bayes, Ensemble Methods, and Decision Trees.
Dimensionality Reduction – Principal Component Analysis (PCA) and Singular Value Decomposition (SVD).
Features of Spark MLlib Library
MLlib provides algorithmic optimizations for accurate predictions and efficient distributed learning. For instance, the Alternating Least Squares algorithm for making recommendations uses blocking to reduce JVM garbage collection overhead.
MLlib benefits from its tight integration with various Spark components. MLlib leverages the high-level libraries packaged with the Spark framework – Spark Core (which has over 80 operators for data cleaning and featurization), Spark SQL, Spark Streaming and GraphX.
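To make the alternating least squares idea concrete, here is a minimal single-machine sketch in plain Python. It is not MLlib's implementation and does not show the blocking optimization: it assumes a rank-1 model (one number per user and per item), and the function name `als_rank1`, the toy ratings and the regularization value are all illustrative choices. Each pass fixes one side's factors and solves a small regularized least-squares problem for the other side, then alternates.

```python
def als_rank1(ratings, num_users, num_items, iterations=50, reg=0.001):
    """Rank-1 alternating least squares on observed (user, item, rating) triples."""
    user_f = [1.0] * num_users
    item_f = [1.0] * num_items
    for _ in range(iterations):
        # Fix the item factors and solve the least-squares problem for each user.
        for u in range(num_users):
            num = sum(r * item_f[i] for uu, i, r in ratings if uu == u)
            den = sum(item_f[i] ** 2 for uu, i, r in ratings if uu == u) + reg
            user_f[u] = num / den
        # Fix the user factors and solve the least-squares problem for each item.
        for i in range(num_items):
            num = sum(r * user_f[u] for u, ii, r in ratings if ii == i)
            den = sum(user_f[u] ** 2 for u, ii, r in ratings if ii == i) + reg
            item_f[i] = num / den
    return user_f, item_f


# Toy ratings generated from user factors (1, 2, 3) and item factors (1, 2).
# The rating of user 2 for item 1 (true value 6) is held out.
observed = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 2.0), (1, 1, 4.0), (2, 0, 3.0)]

user_f, item_f = als_rank1(observed, num_users=3, num_items=2)
predicted = user_f[2] * item_f[1]
print(round(predicted, 2))  # close to the held-out rating of 6
```

A recommendation system built this way scores unseen user and item pairs with the product of their learned factors, which is how the held-out rating is recovered here.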
MLlib provides a package called spark.ml to simplify the development and performance tuning of multi-stage machine learning pipelines. When working with large datasets, patching together an end-to-end pipeline is labour-intensive and expensive in terms of network overhead. MLlib eases this by providing high-level APIs that make it easy for data scientists to swap a standard learning approach in for their own specialized machine learning algorithms.
MLlib provides fast and distributed implementations of common machine learning algorithms along with a number of low-level primitives and various utilities for statistical analysis, feature extraction, convex optimization, and distributed linear algebra.
The Spark MLlib library has extensive documentation that describes all the supported utilities and methods, with several Spark machine learning code examples and API docs for all the supported languages.
MLlib has a very active open source community and frequent meetups that encourage community contributions and enhancements to the library over time. With a wide range of Spark use cases for MLlib and contributions from a large number of developers, the adoption of Spark MLlib for machine learning is growing rapidly.
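The idea of a multi-stage pipeline whose learning stage can be swapped out can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the spark.ml API: the classes `Scaler`, `ThresholdClassifier`, `MajorityClassifier` and this `Pipeline` are hypothetical, and each stage is assumed to expose just two methods, `fit` and `transform`, on rows of `(feature, label)` pairs.

```python
class Scaler:
    """Feature stage: learns the largest absolute value, then rescales features."""

    def fit(self, rows):
        self.scale = max(abs(x) for x, _ in rows) or 1.0
        return self

    def transform(self, rows):
        return [(x / self.scale, label) for x, label in rows]


class ThresholdClassifier:
    """Learning stage: threshold halfway between the two class means."""

    def fit(self, rows):
        pos = [x for x, label in rows if label == 1]
        neg = [x for x, label in rows if label == 0]
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def transform(self, rows):
        return [(x, 1 if x > self.threshold else 0) for x, _ in rows]


class MajorityClassifier:
    """Alternative learning stage: always predicts the most common training label."""

    def fit(self, rows):
        labels = [label for _, label in rows]
        self.majority = max(set(labels), key=labels.count)
        return self

    def transform(self, rows):
        return [(x, self.majority) for x, _ in rows]


class Pipeline:
    """Fits each stage on the output of the previous one, then replays them in order."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [(1.0, 0), (2.0, 0), (3.0, 0), (8.0, 1), (10.0, 1)]
new_rows = [(4.0, None), (9.0, None)]  # labels unknown at prediction time

model = Pipeline([Scaler(), ThresholdClassifier()]).fit(train)
predictions = [label for _, label in model.transform(new_rows)]

# Swapping the learning stage leaves the rest of the pipeline untouched.
baseline = Pipeline([Scaler(), MajorityClassifier()]).fit(train)
baseline_predictions = [label for _, label in baseline.transform(new_rows)]

print(predictions, baseline_predictions)  # prints: [0, 1] [0, 0]
```

Because every stage shares the same two-method interface, replacing the classifier is a one-line change, which is the property the pipeline abstraction is meant to provide.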
Spark MLlib Use Cases
Some of the common business use cases for the Spark Machine Learning library include – Operational Optimization, Risk Assessment, Fraud Detection, Marketing Optimization, Advertising Optimization, Security Monitoring, Customer Segmentation, and Product Recommendations.
Companies Using Apache Spark MLlib
[24]7 is a predictive analytics company that captures around 2.5B customer interactions and uses this data to build machine learning models that predict customer intent across various channels – chat, online and voice. It uses Spark MLlib for machine learning and automated feature engineering.
Spark MLlib is used for frequent pattern mining and is core to the analytics platform of Huawei’s big data solution, FusionInsight, which is used by more than 100 customers across the world.
Toyota’s Customer 360 Insights Platform leverages the MLlib library for categorizing and prioritizing its customers’ social media interactions in real time.
Spark MLlib is an integral part of OpenTable’s dining recommendations.
ING’s machine learning pipeline uses Spark MLlib’s K-Means Clustering and Decision Tree Ensembles for anomaly detection.
Netflix and Spotify use Spark Streaming and Spark MLlib to make user recommendations that best fit their customers’ tastes and buying histories. These companies use live stream clicks and user preferences to update their recommendation systems every few seconds.
Spark MLlib is in active development, thanks to the many MLlib contributors whose work continues to speed up machine learning development and improve MLlib’s capabilities over time.