News on Apache Spark - January 2017
Apache Spark: Promises and Challenges. DZone.com, January 5, 2017.
Apache Spark is well suited to both batch and real-time processing, but it may not be the right fit for every use case. Popular for its speed, Spark can be used for machine learning, analytics, IoT and trending-data use cases. Despite its popularity and its many use cases, it does have some issues, such as difficult memory usage monitoring, tricky deployment, poor documentation, incomplete support for Python, and constant API changes.
(Source : https://dzone.com/articles/apache-spark-promises-and-challenges )
Hazelcast IMDG Adds Connector Support For Apache Spark. InsideBigData, January 11, 2017.
Hazelcast, a leading open source in-memory data grid with 17 million servers and thousands of installed clusters, announced a new solution that integrates Hazelcast IMDG with Apache Spark. The integrated solution will give developers data storage and compute capabilities for big data requirements beyond the historical limitations of a single JVM. Since version 3.7, Hazelcast IMDG has shipped an open source connector that allows it to be used as a storage medium for Apache Spark.
(Source : http://insidebigdata.com/2017/01/11/hazelcast-imdg-adds-connector-support-for-apache-spark/)
Apache Spark 2.1 Improves Structured Streaming. adtmag.com, January 11, 2017.
Better streaming analytics has long been a hot topic in the big data world. Databricks recently announced the release of Spark 2.1, which focuses on stability, refinement and usability and resolves over 1200 tickets. Spark Streaming and in-memory processing were important attractions of Apache Spark when it debuted in 2014. Spark 2.1 enhances the Structured Streaming feature with support for file-based formats such as CSV, Avro, Text and JSON, and with support for Kafka as a source for ingesting and managing data streams. Spark 2.1 also addresses the manageability and visibility requirements that streaming analytics places on the underlying big data system.
(Source : https://adtmag.com/articles/2017/01/11/spark-2-1.aspx )
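As a rough illustration of the Kafka support mentioned above, the following minimal PySpark-style sketch shows how a Kafka topic might be opened as a Structured Streaming source. The helper names kafka_source_options and read_kafka_stream, the broker address and the topic name are hypothetical and not taken from the article. The sketch assumes an existing SparkSession is passed in and that the separate Spark Kafka integration package is available on the cluster.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Build the option map for Spark's Kafka streaming source.

    The keys are option names used by the Structured Streaming Kafka
    source; the values are supplied by the caller.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def read_kafka_stream(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame that reads from one Kafka topic.

    `spark` is assumed to be an existing SparkSession (hypothetical
    caller-provided argument; nothing is created here).
    """
    reader = spark.readStream.format("kafka")
    for key, value in kafka_source_options(bootstrap_servers, topic).items():
        reader = reader.option(key, value)
    return reader.load()


# Hypothetical usage, given a SparkSession named `spark`:
#   events = read_kafka_stream(spark, "localhost:9092", "events")
#   query = events.writeStream.format("console").start()
```

Keeping the options in a plain dictionary makes the configuration easy to inspect without a running cluster.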
Apache Beam and Spark: New coopetition for squashing the Lambda Architecture? Zdnet.com, January 12, 2017.
The latest manifestation of Google’s open technology strategy is Apache Beam. It is an API that separates the definition of a data processing pipeline from the engine on which the pipeline runs. It provides abstractions for specifying the pipeline, the data stream itself (similar to Spark RDDs), functions, transformations, and sources and targets. Along with Apache Spark, Apache Beam is one of the emerging approaches for flattening the Lambda Architecture. It helps combine batch and real-time processing on the same cluster and the same code base.
(Source : http://www.zdnet.com/article/apache-beam-and-spark-new-coopetition-for-squashing-the-lambda-architecture/ )
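The idea of separating a pipeline definition from the engine that runs it can be sketched in a few lines of plain Python. This is a toy illustration only and does not use the Apache Beam API: the Pipeline class and the run_batch and run_streaming functions are hypothetical names invented for this example.

```python
class Pipeline:
    """Engine-agnostic description of processing steps (toy, not Beam)."""

    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self

    def filter(self, predicate):
        self.steps.append(("filter", predicate))
        return self


def run_batch(pipeline, source):
    """'Batch engine': applies each step to the whole dataset at once."""
    data = list(source)
    for kind, fn in pipeline.steps:
        if kind == "map":
            data = [fn(x) for x in data]
        else:
            data = [x for x in data if fn(x)]
    return data


def run_streaming(pipeline, source):
    """'Streaming engine': pushes one record at a time through all steps."""
    for record in source:
        keep = True
        for kind, fn in pipeline.steps:
            if kind == "map":
                record = fn(record)
            elif not fn(record):
                keep = False
                break
        if keep:
            yield record


# The same pipeline definition runs on either engine.
pipeline = Pipeline().map(lambda x: x * 2).filter(lambda x: x > 4)
batch_result = run_batch(pipeline, [1, 2, 3, 4])
stream_result = list(run_streaming(pipeline, iter([1, 2, 3, 4])))
```

Both runners produce the same records from the same definition, which is the property that lets one code base serve batch and real-time workloads.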
Intel Open-Sources BigDL, Distributed Deep Learning Library for Apache Spark. InfoQ, January 13, 2017.
Intel has open-sourced BigDL, a distributed deep learning library that allows execution of deep learning computations and eases the loading of big datasets stored in Hadoop. BigDL is implemented in Scala and is modelled after Torch. It runs on Apache Spark and supports versions 1.5, 1.6 and 2.0. With this open source library, deep learning can be embedded into Spark-based programs. It includes methods that convert Spark RDDs into BigDL datasets, which can be used directly with Spark machine learning pipelines.
(Source : https://www.infoq.com/news/2017/01/bigdl-deep-learning-on-spark)
Percipient Launches SparkPLUS to Solve Apache Spark’s Out-of-memory Problems. InsideBigData.com, January 24, 2017.
There have been several complaints about Apache Spark’s out-of-memory errors when dealing with multiple join conditions and diverse data sources. In such cases, Spark’s memory utilization rises sharply and it runs out of memory. Percipient, a Singapore-based startup, is trying to address the memory issues faced by Spark users. The SparkPLUS solution launched by Percipient is said to multiply the computing space available to Apache Spark, helping users extend its utility for analytical and other real-time applications.
(Source : http://insidebigdata.com/2017/01/24/percipient-launches-sparkplustm-to-solve-apache-sparks-out-of-memory-problems/ )
IBM's Spark-Driven Data Science Experience Cozies Up to GitHub. Ostatic.com, January 27, 2017.
IBM’s first Spark-driven, cloud-based environment for real-time, high-performance analytics will give data scientists the ability to access and ingest data in order to deliver insight-driven models to developers. Available on IBM’s Bluemix cloud platform, the Data Science Experience offers several open source tools, 250 datasets and a shared workspace so that data scientists can uncover and share data-driven insights with developers. This will make it easier to develop big data applications that are infused with intelligence. IBM has recently announced the integration of its Data Science Experience platform with GitHub to improve collaboration among data scientists.
(Source : http://ostatic.com/blog/ibms-spark-driven-data-science-experience-cozies-up-to-github )