Recap of Apache Spark News for May 2017

Apache Spark Monthly News Update - Learn what happened in the Apache Spark community in the month of May 2017.

BY ProjectPro


Apache Spark News for May 2017

Using Apache Spark Machine Learning for Pattern Detection. RTInsights.com, May 2, 2017.

Predicting consumer behavior is the key to successful marketing, but a major challenge is filtering out the noise to find the customers who are ready to buy. Consumer behavior data is usually on the scale of petabytes, and analytic queries at that scale can tax data stores. Alexander Sadovsky, director of data science at Oracle, said the team is solving the problem by moving Oracle Data Cloud from on-premise, single-machine data processing to cloud-based Hive and, ultimately, to an Apache Spark cluster. Moving to a Spark cluster will help them process data faster and take advantage of Spark's built-in machine learning libraries.

(Source : https://www.rtinsights.com/using-apache-spark-machine-learning-to-predict-consumer-behavior/ )
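
A minimal sketch of what that shift enables, assuming a hypothetical consumer-events dataset (the HDFS path and column names are illustrative, not from the article): a Spark MLlib pipeline that scores how likely each user is to buy.

```scala
// Hypothetical sketch: scoring purchase intent with Spark MLlib on a Spark cluster.
// The input path and column names are assumptions for illustration.
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

val spark = SparkSession.builder().appName("purchase-propensity").getOrCreate()

// Consumer behaviour events, e.g. page views, clicks and past purchases per user
val events = spark.read.parquet("hdfs:///consumer/behavior")   // path is an assumption

val assembler = new VectorAssembler()
  .setInputCols(Array("page_views", "clicks", "past_purchases"))
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setLabelCol("converted")        // 1 = ready to buy, 0 = noise
  .setFeaturesCol("features")

val model = new Pipeline().setStages(Array(assembler, lr)).fit(events)
model.transform(events).select("user_id", "probability").show(10)
```

Because the same cluster holds both the data and the MLlib pipeline, the petabyte-scale filtering the article describes never has to leave Spark.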


Nordea banks on move to machine-led decision making. Diginomica.com, May 9, 2017.

Swedish bank Nordea is moving to machine-led decision making, having realized the need for a better analytics system. The bank has implemented a Cloudera data lake architecture based on Hadoop to enhance its data gathering capabilities. This architecture helps the bank produce, report, and monitor core data at a rapid pace. Alasdair Anderson, Nordea’s head of data engineering, states that the key to the set-up is the cluster technology, Apache Spark, that underlies the Hadoop data lake. For the team at Nordea, Apache Spark is a natural evolution from the MapReduce era of Hadoop.

(Source : http://diginomica.com/2017/05/09/nordea-banks-move-machine-led-decision-making/ )
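
As a rough illustration of that evolution (the table location, schema, and output table name are assumed, not from the article), the kind of reporting aggregation that once needed a hand-written MapReduce job reduces to a few lines of Spark over the Hadoop data lake:

```scala
// Minimal sketch, assuming a Parquet dataset in the data lake and Hive support.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("core-data-report")
  .enableHiveSupport()
  .getOrCreate()

val transactions = spark.read.parquet("hdfs:///datalake/transactions")  // assumed location

// Produce and monitor core data: daily transaction volume per branch
transactions
  .groupBy(col("branch_id"), to_date(col("booked_at")).as("day"))
  .agg(count("*").as("txn_count"), sum("amount").as("total_amount"))
  .write.mode("overwrite")
  .saveAsTable("reports.daily_branch_volume")   // Hive table name assumed
```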

Impetus Technologies to Host Webinar Highlighting Real-Time Data360 on Apache Spark. PRNewswire.com, May 10, 2017.

Impetus Technologies hosted a “Real-Time Data360 on Apache Spark” webinar on May 12 for people who want to learn an all-in-one Apache Spark strategy for big data analytics. The main focus of the webinar was the challenge IT teams face when they choose one vendor for data ingestion, another for data wrangling, a third for machine learning analytics, and yet another for data visualization. For organizations that have already adopted Apache Spark as their big data framework, analytics is easier because Spark is an all-in-one platform. The main challenge in using Spark as that all-in-one platform is finding skilled Scala/Java programmers to build Spark applications.

(Source : http://www.prnewswire.com/news-releases/impetus-technologies-to-host-webinar-highlighting-real-time-data360-on-apache-spark-300455601.html )
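
A hedged sketch of the all-in-one idea discussed in the webinar, with file paths and column names invented for illustration: ingestion, wrangling, and machine learning handled by a single Spark application instead of separate vendor tools.

```scala
// Illustrative only: paths, columns and the clustering step are assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

val spark = SparkSession.builder().appName("all-in-one-spark").getOrCreate()

// 1. Ingestion: raw JSON events straight from a landing zone
val raw = spark.read.json("hdfs:///landing/events.json")

// 2. Wrangling: drop duplicates, fix types, filter bad records
val clean = raw.dropDuplicates("event_id")
  .withColumn("amount", col("amount").cast("double"))
  .filter(col("amount").isNotNull)

// 3. Machine learning: cluster customers with MLlib, no separate analytics vendor
val features = new VectorAssembler()
  .setInputCols(Array("amount", "session_length"))
  .setOutputCol("features")
  .transform(clean)

new KMeans().setK(5).fit(features).transform(features).show(5)
```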


ESG Lab Affirms Trifacta's Photon Compute Engine is the Fastest, Most Efficient Non-Distributed Processing Engine for Data Wrangling. MarketWired.com, May 16, 2017.

Enterprise Strategy Group (ESG) evaluated the efficiency and speed of Trifacta’s Photon Compute Engine for data wrangling. Apart from Photon, Trifacta also supports multi-purpose engines such as Apache Spark and Google Dataflow. However, Trifacta’s in-memory engine, Photon, was evaluated as the fastest and most efficient processing engine for wrangling data sets that do not require parallel processing. In single-node environments, Photon completed transformations 6 times faster than Spark while using 98% less memory.

(Source : http://www.marketwired.com/press-release/esg-lab-affirms-trifactas-photon-compute-engine-is-fastest-most-efficient-non-distributed-2216670.htm)

TPC-H Benchmarks Show Memory1 Improves Spark SQL Performance By As Much As 289% While Lowering Costs By Up To 51%. PRNewswire.com, May 17, 2017.
 

Spark SQL, the Spark module for structured data processing that lets SQL queries run on Spark data, is an ideal application for Memory1. Diablo Technologies released TPC-H benchmark data showcasing the performance benefits of Memory1 for Spark SQL workloads. The benchmark results showed that increasing the cluster memory size with Memory1 improved Spark SQL performance by as much as 289% while reducing Total Cost of Ownership by up to 51%. In other words, by expanding application memory with Diablo Memory1, each server can deliver roughly three times the performance at about half the overall cost.

(Source : http://www.prnewswire.com/news-releases/tpc-h-benchmarks-show-memory1-improves-spark-sql-performance-by-as-much-as-289-while-lowering-costs-by-up-to-51-300459084.html )
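
For context, the Spark SQL workloads that TPC-H exercises look roughly like the sketch below; the data location is assumed, and the query loosely follows TPC-H's Q1 pricing-summary report over the lineitem table.

```scala
// Minimal Spark SQL sketch in the spirit of the TPC-H workload used in the benchmark.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-sql-demo").getOrCreate()

// Register the lineitem data as a SQL view (source path is an assumption)
spark.read.parquet("hdfs:///tpch/lineitem").createOrReplaceTempView("lineitem")

// Roughly TPC-H Q1: pricing summary grouped by return flag and line status
val summary = spark.sql("""
  SELECT l_returnflag, l_linestatus,
         SUM(l_quantity)      AS sum_qty,
         SUM(l_extendedprice) AS sum_base_price,
         AVG(l_discount)      AS avg_disc,
         COUNT(*)             AS count_order
  FROM lineitem
  WHERE l_shipdate <= CAST('1998-09-02' AS DATE)
  GROUP BY l_returnflag, l_linestatus
  ORDER BY l_returnflag, l_linestatus
""")

summary.show()
```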

Cloudera To Accelerate Machine Learning In The Enterprise. CXOToday.com, May 22, 2017.

Cloudera announced the launch of its self-service tool for data scientists, Cloudera Data Science Workbench. The workbench delivers a self-service data science experience with the popular data science languages Python, R, and Scala directly in the web browser. It integrates with many deep learning frameworks, including BigDL, the deep learning library for Apache Spark. The BigDL integration lets data scientists apply deep learning libraries and techniques on CPU architectures without additional hardware or separate environments, making it easy to create Spark data science pipelines and combine them with BigDL and other Spark or Hadoop components in the workbench.

(Source : http://www.cxotoday.com/story/cloudera-to-accelerate-machine-learning-in-the-enterprise/ )
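
The following is a rough, unverified sketch of how a BigDL network is defined on top of Spark, modelled on BigDL's 0.x Scala quick-start style; names such as Engine.createSparkConf, Engine.init, and the layer classes are assumptions and should be checked against the current BigDL documentation.

```scala
// Rough sketch: a CPU-only BigDL model defined over Spark executors.
// Layer sizes are illustrative; API names assumed from BigDL 0.x.
import org.apache.spark.SparkContext
import com.intel.analytics.bigdl.nn.{Sequential, Linear, ReLU, LogSoftMax}
import com.intel.analytics.bigdl.numeric.NumericFloat
import com.intel.analytics.bigdl.utils.Engine

val conf = Engine.createSparkConf().setAppName("bigdl-on-spark")
val sc = new SparkContext(conf)
Engine.init   // configures BigDL to use the Spark executors' CPU cores

// A small feed-forward classifier
val model = Sequential()
  .add(Linear(784, 128))
  .add(ReLU())
  .add(Linear(128, 10))
  .add(LogSoftMax())

// Training would pair this model with BigDL's Optimizer over an RDD of Samples.
```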

Pepperdata Takes On Spark Performance Challenges. Datanami.com, May 24, 2017.

The Apache Spark framework has revolutionized the way big data applications are developed; however, troubleshooting Spark jobs on Hadoop clusters is not easy. Pepperdata has come up with a solution in a new tool named Code Analyzer for Apache Spark. Using Code Analyzer, Spark developers and administrators can track down the root cause of a performance glitch in a Spark job and get the job back up and running at full speed. The new tool can tell developers which stages of a big data application are causing a performance bottleneck and which lines of code are slowing the application down.

(Source : https://www.datanami.com/2017/05/24/pepperdata-takes-spark-performance-challenges/ )
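
Code Analyzer itself is proprietary, but a minimal sketch of the kind of per-stage timing it builds on can be put together with Spark's public listener API:

```scala
// Not Pepperdata's tool: a bare-bones example of logging per-stage runtimes,
// which is the raw signal behind "which stage is the bottleneck".
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stage-timing").getOrCreate()

spark.sparkContext.addSparkListener(new SparkListener {
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val info = stage.stageInfo
    val runtimeMs = for {
      start <- info.submissionTime
      end   <- info.completionTime
    } yield end - start
    // Stages that dominate this log are the first place to look for bottlenecks
    println(s"Stage ${info.stageId} (${info.name}) took ${runtimeMs.getOrElse(-1L)} ms")
  }
})
```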

Unifying Oil and Gas Data at Scale. NextPlatform.com, May 30, 2017.


The oil and gas industry has adopted multiple open-source distributed frameworks to manage storage, processing, and data analysis. Hadoop HDFS, MapReduce, Hive, HBase, and Apache Spark are the big data frameworks used in the industry to analyse the large amounts of data collected. One limitation of Apache Spark, however, is that its Resilient Distributed Datasets cannot be changed over time. The Apache Ignite framework addresses this with mutable RDDs that can be used for distributed caching. Even with multiple open-source frameworks in place, the oil and gas industry is still trying to settle on a comprehensive data architecture. A Lambda Architecture has been proposed for private clouds, wherein Kafka feeds data into the stream and batch layers while Spark Streaming divides live data streams into smaller batches.

(Source : https://www.nextplatform.com/2017/05/30/unifying-oil-gas-data-scale/ )
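
A hedged sketch of the speed layer described above, using Spark's Kafka 0.10 integration; broker addresses, the topic name, the consumer group, and the batch interval are assumptions for illustration.

```scala
// Kafka feeds a Spark Streaming job that slices the live sensor feed into micro-batches.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("sensor-speed-layer")
val ssc = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",               // assumed broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "oilgas-sensors"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("wellhead-readings"), kafkaParams)
)

// Each micro-batch of sensor readings is processed as an ordinary RDD
stream.map(record => record.value).count().print()

ssc.start()
ssc.awaitTermination()
```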



About the Author

ProjectPro

ProjectPro is the only online platform designed to help professionals gain practical, hands-on experience in big data, data engineering, data science, and machine learning related technologies, offering over 270 reusable project templates in data science and big data with step-by-step walkthroughs.
