Recap of Apache Spark News for February 2018

MapR Simplifies End-to-End Workflow for Data Scientists, February 8, 2018.

MapR announced the availability of its MapR Ecosystem Pack (MEP) 4.1, which lets data engineers and data scientists create scalable deep learning pipelines, makes operational data immediately available for data science, and delivers 2X performance improvements across various ad-hoc queries. MEP 4.1 gives data scientists the ability to build real-time pipelines with support for new programming languages. It adds support for distributing Python archives in PySpark, allowing data scientists to use Python data science libraries in a distributed manner to create scalable deep learning pipelines. As part of MEP 4.1, Python and Java bindings for the MapR-DB OJAI Connector for Spark let developers read from and write to MapR-DB from Apache Spark using Python or Java, so they can build data-intensive business applications in either language.
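The archive-distribution pattern MEP 4.1 enables for PySpark is the standard Spark one, and can be sketched in plain Python. A minimal sketch: the `textutils` module and file names below are hypothetical; only the `--py-files` flag and `addPyFile` call are standard Spark, and nothing here is MapR-specific.

```python
# Sketch: packaging a small Python library as an archive so PySpark can
# ship it to every executor. The "textutils" module is a made-up example.
import os
import zipfile

# A tiny library we want available on every Spark executor.
os.makedirs("deps/textutils", exist_ok=True)
with open("deps/textutils/__init__.py", "w") as f:
    f.write("def clean(s):\n    return s.strip().lower()\n")

# Bundle it into an archive that PySpark can distribute.
with zipfile.ZipFile("deps.zip", "w") as zf:
    zf.write("deps/textutils/__init__.py", "textutils/__init__.py")

# Ship the archive with the job, e.g.:
#   spark-submit --py-files deps.zip job.py
# or at runtime:
#   spark.sparkContext.addPyFile("deps.zip")
# Executors can then `import textutils` inside UDFs and mapPartitions.
```

Once the archive is on the executors' `sys.path`, distributed code can import the library exactly as it would locally, which is what makes scaling single-node data science code out across a cluster straightforward.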



You've got a yottabyte on your hands: How analytics is changing storage, February 9, 2018.

The extensive use of advanced analytics today leaves IT professionals with a huge responsibility for the storage, security, and accessibility of large data pools. Managing the huge volumes of data pouring into an organization is a big challenge, because even the HDD RAID arrays needed to store an exabyte of raw data are likely to hit many companies' budgets hard. Organizations are increasingly using either standalone Spark or Spark on top of Hadoop to serve the software side of big data analytics. Regardless of whether the big data cluster is built on these open source frameworks or on a commercial big data framework, the choice will shape the storage decision. Either way, the ultimate goal should be to scale up for real-time analytics or scale out to include large data sets in the analytics environment, depending on the workload in question.

Qubole and Snowflake Bring Machine Learning to the Cloud Data Warehouse, February 13, 2018.

Qubole, the cloud big-data-as-a-service company, and Snowflake Computing, the only data warehouse built for the cloud, announced a new partnership that allows organizations to use Spark in Qubole with data stored in Snowflake. This new integration will help organizations build, train, and deploy powerful AI and ML models in production using the data stored in Snowflake. Data engineers can now use Qubole to read and write data in Snowflake for advanced data preparation, such as data augmentation and data wrangling, to clean existing Snowflake datasets.


SpaRC: Scalable Sequence Clustering using Apache Spark, February 26, 2018.

SparkReadClust (SpaRC) is an Apache Spark-based scalable sequence clustering application that partitions sequencing reads based on their molecule of origin to facilitate downstream assembly optimization. The SpaRC software can run on various cloud computing platforms without any alteration while delivering the same performance. Notably, SpaRC produces high clustering performance on metagenomes and transcriptomes from both long-read and short-read sequencing technologies. SpaRC is a highly scalable solution for clustering the billions of reads produced by the latest sequencing experiments, and Apache Spark makes it a cost-effective one, with faster deployment cycles for similar large-scale sequence data analysis problems.

Databricks to Showcase Unified Analytics Platform at Gartner Data & Analytics Summit, February 27, 2018.

Databricks will showcase its Unified Analytics Platform as a Silver sponsor at the Gartner Data & Analytics Summit, to be held March 5-8 in Grapevine, Texas. Several organizations already use Databricks’ Unified Analytics Platform, as it provides a simplified way for an organization's data engineering and data science teams to speed up innovation and data-driven business decision making using AI and big data analytics. “Most data and analytics leaders realize that when it comes to embarking on new AI and Machine Learning initiatives, it’s still really about the data first and foremost. Their teams need to figure out how you get a massive amount of data, often in real-time, to your model in a way that supports an iterative process and generates a meaningful business result. The Databricks Unified Analytics Platform addresses precisely this problem and, as such, we expect strong engagement from the attendees of Gartner Data & Analytics Summit, many of whom already use Spark,” said Rick Schultz, chief marketing officer at Databricks.


Spark 2.3.0 Released, February 28, 2018.

The Apache Spark community released Spark 2.3.0, the fourth release in the 2.x line. Spark 2.3.0 adds support for continuous processing in Structured Streaming, together with a new Kubernetes scheduler backend. Other updates in the release include new DataSource and Structured Streaming APIs, along with various PySpark performance enhancements. The release focuses on usability and stability, resolving more than 1,400 tickets.

Winners and Losers from Gartner’s Data Science and ML Platform Report, February 28, 2018.

Gartner released its latest Magic Quadrant for machine learning and data science platforms last week, with 16 vendors making it into the report. Databricks debuts in the Visionaries quadrant for its cloud-based offering built on Apache Spark. Gartner praised Databricks for its flexibility, as seen in its work on machine learning, deep learning, Spark Streaming, and IoT, with support for multiple programming languages such as Python, R, and Scala. A few other prominent winners in the Gartner Magic Quadrant include H2O, Alteryx, Domino Data Lab, Anaconda, and KNIME.

Relevant Projects

Real-Time Log Processing in Kafka for Streaming Architecture
The goal of this Apache Kafka project is to process log entries from applications in real time, using Kafka for the streaming architecture in a microservices setup.

Hadoop Project for Beginners-SQL Analytics with Hive
In this Hadoop project, learn about the features in Hive that allow us to perform analytical queries over large datasets.

Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis
In this PySpark project, you will simulate a complex real-world data pipeline based on messaging. This project is deployed using the following tech stack - NiFi, PySpark, Hive, HDFS, Kafka, Airflow, Tableau and AWS QuickSight.

Yelp Data Processing Using Spark And Hive Part 1
In this big data project, we will continue from a previous Hive project, "Data engineering on Yelp Datasets using Hadoop tools", and do the entire data processing in Spark.

Real-time Auto Tracking with Spark-Redis
Spark project on real-time monitoring of taxis in a city. The real-time data streaming will be simulated using Flume, and the ingestion will be done using Spark Streaming.

Web Server Log Processing using Hadoop
In this Hadoop project, you will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline.

Implementing Slow Changing Dimensions in a Data Warehouse using Hive and Spark
Hive Project - Understand the various types of SCDs and implement these slowly changing dimensions in Hadoop Hive and Spark.

Online Hadoop Projects - Solving the small file problem in Hadoop
In this Hadoop project, we continue the data engineering series by discussing and implementing various ways to solve the Hadoop small file problem.

Tough engineering choices with large datasets in Hive Part - 2
This is in continuation of the previous Hive project "Tough engineering choices with large datasets in Hive Part - 1", where we will work on processing big data sets using Hive.

Hadoop Project-Analysis of Yelp Dataset using Hadoop Hive
The goal of this Hadoop project is to apply some data engineering principles to the Yelp dataset in the areas of processing, storage, and retrieval.