News on Apache Spark - January 2018
Top 5 Mistakes when Writing Spark Applications. InsideBigData.com, January 7, 2018.
Apache Spark has eased development, helping developers write distributed programs in just a few lines of code that can run on multiple machines and generate business value. But even though Apache Spark programs are easy to read and write, users can still run into issues such as out-of-memory errors and slow or long-running jobs. Most issues with Apache Spark have nothing to do with Spark's features but with the approach we follow when using it. Mark Grover, a software engineer at Cloudera, highlights common mistakes that developers should avoid when writing Apache Spark applications -
i) No Spark shuffle block should be larger than 2 GB.
ii) Manage the DAG appropriately.
iii) Shade conflicting dependencies to avoid exceptions.
iv) For every Spark application you build, decide carefully on the number of executors, the cores per executor, and the memory per executor.
v) Use two-stage aggregation with salted and unsalted keys to speed up jobs.
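The salting idea in (v) can be illustrated with a minimal plain-Python sketch (not actual Spark code - in Spark the two stages would typically be two reduceByKey or groupBy passes). A random salt is appended to each key so a "hot" key's records are spread across several salted keys in stage one, and the salt is stripped in stage two to combine the much smaller partial results. The function name and data below are illustrative, not from the article:

```python
import random
from collections import defaultdict

def two_stage_sum(records, num_salts=4, seed=42):
    """Sum values per key in two stages, spreading hot keys across salts."""
    rng = random.Random(seed)

    # Stage 1: aggregate on salted keys, e.g. ("hot", 3) instead of "hot".
    # In Spark, each salted key could land in a different partition,
    # so no single task has to process all of a skewed key's records.
    stage1 = defaultdict(int)
    for key, value in records:
        stage1[(key, rng.randrange(num_salts))] += value

    # Stage 2: strip the salt and combine the (now small) partial sums.
    stage2 = defaultdict(int)
    for (key, _salt), partial in stage1.items():
        stage2[key] += partial
    return dict(stage2)

# One heavily skewed key ("hot") and one small key ("cold").
data = [("hot", 1)] * 1000 + [("cold", 2)] * 3
print(two_stage_sum(data))  # {'hot': 1000, 'cold': 6}
```

The final result is identical to a single-pass aggregation; the benefit in a distributed setting is that stage one's work on the skewed key is split across `num_salts` reducers.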
(Source : https://insidebigdata.com/2018/01/07/top-5-mistakes-writing-spark-applications/ )
Databricks Cache Boosts Apache Spark Performance. Databricks.com, January 9, 2018.
Databricks announced the availability of Databricks Cache, a runtime feature of its unified analytics platform that can enhance the scan speed of Apache Spark workloads by 10x without requiring any application code changes. Databricks Cache automatically caches input data for a particular user and load-balances it across a Spark cluster. It also makes use of NVMe SSD hardware with columnar compression techniques, which help improve the performance of interactive and reporting workloads by 10 times. This novel runtime feature can cache 30 times more data than Apache Spark's in-memory cache.
(Source : https://databricks.com/blog/2018/01/09/databricks-cache-boosts-apache-spark-performance.html )
How Big Data Can Help Rebuild America's Aging Infrastructure. ScientificAmerican.com, January 17, 2018.
America's infrastructure systems were built decades ago, and studies show that rising maintenance costs and deferred maintenance are hampering economic performance. Moreover, engineers are raising safety concerns, warning that many bridges are structurally unsound and that obsolete wastewater systems pose a threat to the public. Big data and related technologies can help America establish the foundation for future adoption of AI and robotics, enabling zero-failure, sustainable, and highly resilient infrastructure systems. Such systems can be developed using sensors, including wireless sensor networks, as components of the IoT. Wireless sensor networks will warn engineers when key elements of highways, buildings, and bridges are weak, so that immediate action can be taken. These sensors will generate huge amounts of data. Previously, organizing such large volumes of engineering data was not economically feasible, but today integrating big data with traditional engineering practices can improve the effectiveness and efficiency of engineering process systems at economical cost.
(Source : https://blogs.scientificamerican.com/observations/how-big-data-can-help-rebuild-americas-aging-infrastructure/ )
GigaSpaces Closes 2017 Poised for Continued Growth and Profitability. Businesswire.com, January 17, 2018.
GigaSpaces announced the launch of its open source InsightEdge platform, a hybrid transactional and analytical processing platform that accelerates intelligent application innovation by leveraging real-time analytics on transactional data. InsightEdge uses the Apache Spark big data framework to power real-time insights for data-driven decision making in areas such as machine learning, customer 360, IoT, and transactional processing. GigaSpaces saw increased demand for its products, expanded important partnerships, and launched several initiatives that its customers can use to simplify and speed up big data analytics innovation.
(Source : https://www.businesswire.com/news/home/20180117005194/en/GigaSpaces-Closes-2017-Poised-Continued-Growth-Profitability )
Can precision medicine break chokehold on healthcare big data? Siliconangle.com, January 22, 2018.
Big data is positioning medicine on the brink of a revolution through person-specific treatment using precision medicine. Precision medicine will help healthcare experts treat diseases based on an individual patient's genetics. The global precision medicine market is anticipated to reach $88.64 billion by the end of 2022. Platform services such as Apache Spark clusters or Amazon EMR speed up the aggregation and analysis of large healthcare datasets. The power of these big data platforms makes analysis much faster for data scientists, who can answer research questions sooner and discover insights that give a clear picture of a person's lifestyle and health.
(Source : https://siliconangle.com/blog/2018/01/22/can-precision-medicine-break-chokehold-on-healthcare-big-data-reinvent-womenintech/ )