News on Apache Spark - February 2018
MapR Simplifies End-to-End Workflow for Data Scientists. GlobeNewsWire.com, February 8, 2018.
MapR announced the availability of MapR Extension Pack (MEP) 4.1, which lets data engineers and data scientists build scalable deep learning pipelines, make operational data immediately available for data science, and obtain up to 2X performance improvements on various ad-hoc queries. MEP 4.1 also lets data scientists build real-time pipelines with support for additional programming languages. It adds support for distributing Python archives for PySpark, so data scientists can use Python data science libraries in a distributed manner. MEP 4.1 also includes Python and Java bindings for the MapR-DB OJAI Connector for Spark, which let developers read from and write to MapR-DB from Apache Spark and so build data-intensive business applications in either language.
(Source : https://globenewswire.com/news-release/2018/02/08/1336073/0/en/MapR-Simplifies-End-to-End-Workflow-for-Data-Scientists.html )
You've got a yottabyte on your hands: How analytics is changing storage. TheRegister.co.uk, February 9, 2018.
The extensive use of advanced analytics leaves IT professionals with a large responsibility for the storage, security, and accessibility of a growing data pool. Managing the huge volumes of data pouring into the organization is a major challenge, because even the HDD RAID arrays needed to store an exabyte of raw data are likely to hit many companies' budgets hard. Organizations increasingly use Apache Spark, Hadoop, or Spark on top of Hadoop for the software side of big data analytics. Whether the big data cluster is built on these open source frameworks or on a commercial big data framework, the choice will affect the storage decision. In either case, the ultimate goal should be to scale up for real-time analytics or to scale out to bring large data sets into the analytics environment, depending on the workload in question.
(Source : https://www.theregister.co.uk/2018/02/09/how_analytics_is_changing_storage/ )
Qubole and Snowflake Bring Machine Learning to the Cloud Data Warehouse. GlobeNewsWire.com, February 13, 2018.
Qubole, the cloud big-data-as-a-service company, and Snowflake Computing, which bills itself as the only data warehouse built for the cloud, announced a new partnership that will allow organizations to use Spark in Qubole with data stored in Snowflake. The integration will help organizations build, train, and deploy powerful AI and ML models in production using the data stored in Snowflake. Data engineers can now use Qubole to read and write data in Snowflake for advanced data preparation, such as data augmentation and data wrangling, to clean existing Snowflake datasets.
(Source : https://globenewswire.com/news-release/2018/02/13/1339908/0/en/Qubole-and-Snowflake-Bring-Machine-Learning-to-the-Cloud-Data-Warehouse.html )
SpaRC: Scalable Sequence Clustering using Apache Spark. InsideHPC.com, February 26, 2018.
SparkReadClust (SpaRC) is an Apache Spark-based scalable sequence clustering application that clusters sequencing reads based on their molecule of origin to facilitate downstream assembly optimization. SpaRC can run on various cloud computing platforms without any alterations while delivering the same performance. It produces high clustering performance on metagenomes and transcriptomes from both long-read and short-read sequencing technologies. SpaRC provides a scalable solution for clustering billions of reads from the latest sequencing experiments, and Apache Spark offers a cost-effective platform with faster deployment cycles for similar large-scale sequence data analysis problems.
(Source : https://insidehpc.com/2018/02/sparc-scalable-sequence-clustering-using-apache-spark/ )
Databricks to Showcase Unified Analytics Platform at Gartner Data & Analytics Summit 2018. GlobeNewsWire.com, February 27, 2018.
Databricks will showcase its Unified Analytics Platform as a Silver sponsor at the Gartner Data & Analytics Summit, to be held in Grapevine, Texas, on March 5-8. Several organizations already use Databricks’ Unified Analytics Platform because it gives data engineering and data science teams a simplified way to speed up innovation and data-driven business decision making with AI and big data analytics. “Most data and analytics leaders realize that when it comes to embarking on new AI and Machine Learning initiatives, it’s still really about the data first and foremost. Their teams need to figure out how you get a massive amount of data, often in real-time, to your model in a way that supports an iterative process and generates a meaningful business result. The Databricks Unified Analytics Platform addresses precisely this problem and, as such, we expect strong engagement from the attendees of Gartner Data & Analytics Summit, many of whom already use Spark,” said Rick Schultz, chief marketing officer at Databricks.
(Source : https://globenewswire.com/news-release/2018/02/27/1396085/0/en/Databricks-to-Showcase-Unified-Analytics-Platform-at-Gartner-Data-Analytics-Summit-2018.html )
Spark 2.3.0 Released. spark.apache.org, February 28, 2018.
The Apache Spark community released Spark 2.3.0, the fourth release in the 2.x line. The release adds support for continuous processing in Structured Streaming, together with a new Kubernetes scheduler backend. Other updates include new DataSource and Structured Streaming APIs, along with various PySpark performance enhancements. The release focuses on usability and stability while resolving around 1400 tickets.
(Source : https://spark.apache.org/releases/spark-release-2-3-0.html )
Winners and Losers from Gartner’s Data Science and ML Platform Report. Datanami.com, February 28, 2018.
Gartner released its latest Magic Quadrant for machine learning and data science platforms last week, with 16 vendors making the report. Databricks debuts in the Visionaries quadrant for its cloud-based offering built on Apache Spark. Gartner praised Databricks for its flexibility, as seen in its work in machine learning, deep learning, Spark Streaming, and IoT, with support for multiple programming languages such as Python, R, and Scala. Other prominent winners in the Gartner Magic Quadrant include H2O, Alteryx, Domino Data Lab, Anaconda, and KNIME.
(Source : https://www.datanami.com/2018/02/28/winners-losers-gartners-data-science-ml-platform-report/ )