Recap of Hadoop News for September 2018


Big Data Hadoop News for September 2018

Hadoop-as-a-Service: The Need Of The Hour For Superior Business, September 7, 2018

Hadoop is the cornerstone of the big data industry; however, the challenges involved in maintaining a Hadoop cluster have led to the development and growth of the Hadoop-as-a-Service (HaaS) market. Industry research anticipates that the global Hadoop-as-a-Service market will reach $16.2 billion by 2020, growing at a compound annual growth rate (CAGR) of 70.8% from 2014 to 2020. With market leaders like Microsoft and SAP expanding their horizons in the end-user industry, HaaS is likely to witness rapid growth over the next seven years. Organizations like Commerzbank have already launched new platforms based on HaaS solutions, demonstrating that HaaS is a promising approach for building and managing big data clusters. HaaS will encourage more organizations to consider Hadoop as a solution to their big data challenges.

(Source - )
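As a quick sanity check on those figures, the standard CAGR relation lets us back out the implied 2014 market size from the quoted 2020 forecast. The $16.2 billion and 70.8% numbers come from the article; the arithmetic below is our own illustration.

```python
# Illustrative only: the $16.2B forecast and 70.8% CAGR come from the
# article; the arithmetic is ours. CAGR relates a start value V0 and an
# end value V1 over n years by V1 = V0 * (1 + r) ** n.

end_value = 16.2        # USD billions, 2020 forecast (from the article)
cagr = 0.708            # 70.8% compound annual growth rate
years = 2020 - 2014     # 6-year horizon

# Back out the implied 2014 market size.
implied_start = end_value / (1 + cagr) ** years
print(f"Implied 2014 market size: ${implied_start:.2f}B")
```

A 70.8% CAGR means the market multiplies roughly 25-fold over six years, so the implied 2014 base is well under $1 billion.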

Online Hadoop Training

Hortonworks unveils roadmap to make Hadoop cloud-native, September 10, 2018

Recognizing the importance of the cloud, Hortonworks is partnering with Red Hat and IBM to transform Hadoop into a cloud-native platform. Today Hadoop can run in the cloud, but it cannot exploit the capabilities of cloud architecture to the fullest. The push to make Hadoop cloud-native is not a mere matter of buzzword compliance; the goal is to make it more fleet-footed. About 25% of workloads from the Hadoop incumbents MapR, Hortonworks, and Cloudera are running in the cloud today; however, by next year it is anticipated that half of all new big data workloads will be deployed in the cloud. Hortonworks is unveiling the Open Hybrid Architecture initiative to transform Hadoop into a cloud-native platform that will address containerization, support Kubernetes, and include a roadmap for separating compute from data.

(Source - )

Master Hadoop Skills by working on interesting Hadoop Projects

LinkedIn open-sources a tool to run TensorFlow on Hadoop, September 13, 2018.

LinkedIn’s open-source project TonY (TensorFlow on YARN) aims at scaling and managing deep learning jobs in TensorFlow using Hadoop’s YARN scheduler. TonY uses YARN’s resource and task scheduling system to run TensorFlow jobs on a Hadoop cluster. It can also schedule GPU-based TensorFlow jobs through Hadoop, allocate memory separately for TensorFlow nodes, request different types of resources (CPUs vs. GPUs), and ensure that job state is saved to HDFS at regular intervals so that interrupted or crashed jobs can resume from where they left off. LinkedIn claims that TonY adds no overhead to TensorFlow jobs because it sits at the layer that orchestrates distributed TensorFlow and does not interfere with the execution of the TensorFlow jobs themselves. TonY is also used for visualizing, optimizing, and debugging TensorFlow applications.

(Source - )
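The checkpoint-and-resume behavior described above is a general pattern, not specific to TonY. A minimal sketch of the idea, assuming a made-up training loop and a local checkpoint file standing in for an hdfs:// path and real TensorFlow checkpoints:

```python
import os
import pickle
import tempfile

# Hypothetical illustration of the checkpoint/resume pattern the article
# describes: save job state at regular intervals so a crashed or
# preempted job can pick up where it left off. A real TonY job would
# write TensorFlow checkpoints to an hdfs:// path; here we use a local
# pickle file to keep the sketch self-contained.

CHECKPOINT_EVERY = 100  # save state every 100 steps (illustrative value)

def train(total_steps, checkpoint_path):
    # Resume from the last checkpoint if one exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "loss": float("inf")}

    while state["step"] < total_steps:
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # stand-in for a training update

        if state["step"] % CHECKPOINT_EVERY == 0:
            with open(checkpoint_path, "wb") as f:
                pickle.dump(state, f)
    return state

path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
final = train(250, path)
```

If this job dies after step 200, rerunning `train(250, path)` reloads the step-200 checkpoint and finishes the remaining 50 steps instead of starting over.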

Big Data Hadoop Projects

Microsoft’s SQL Server gets built-in support for Spark and Hadoop, September 24, 2018.

Microsoft has announced new connectors that will allow businesses to use SQL Server to query other databases such as MongoDB, Oracle, and Teradata. This turns Microsoft SQL Server into a virtual integration layer in which data never has to be replicated or moved into SQL Server itself. SQL Server 2019 will ship with built-in support for Hadoop and Spark, and will support big data clusters through the Google-incubated Kubernetes container orchestration system. Every big data cluster will include SQL Server, Spark, and the Hadoop file system.

(Source - )
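The "virtual integration layer" idea is that one query layer fans out to several independent databases and combines results without copying data into a central store. A hedged sketch of that pattern, using two in-memory SQLite databases as stand-ins for the external sources (the real feature uses SQL Server connectors to MongoDB, Oracle, Teradata, etc.; the tables and names here are invented):

```python
import sqlite3

# Two independent databases stand in for external data sources that a
# federated query layer would reach through connectors.
orders = sqlite3.connect(":memory:")
orders.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
orders.executemany("INSERT INTO orders VALUES (?, ?)",
                   [("acme", 120.0), ("globex", 75.5)])

customers = sqlite3.connect(":memory:")
customers.execute("CREATE TABLE customers (name TEXT, region TEXT)")
customers.executemany("INSERT INTO customers VALUES (?, ?)",
                      [("acme", "EMEA"), ("globex", "NA")])

def query_source(db, sql):
    """Run a query against one source; nothing is copied between sources."""
    return db.execute(sql).fetchall()

# The "join" happens at the integration layer, not by moving either
# table into the other database.
regions = dict(query_source(customers, "SELECT name, region FROM customers"))
report = [(cust, amt, regions.get(cust))
          for cust, amt in query_source(orders,
                                        "SELECT customer, amount FROM orders")]
```

In the SQL Server feature this federation is expressed in plain T-SQL and pushed down to the remote systems; the sketch only shows the shape of the idea.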

Big-data project aims to transform farming in world’s poorest countries, September 24, 2018

Big data is changing the way agriculture uses data. The FAO, the Bill & Melinda Gates Foundation, and national governments have launched a US$500-million effort to help developing countries collect data on small-scale farmers in order to fight hunger and promote rural development. Collecting accurate information about seed varieties, farmers’ technological capacity, and farmers’ incomes will help coalition members understand how ongoing agricultural investments are making an impact. This data will also enable governments to tailor policies to help farmers.

(Source - )

Mining equipment-maker uses BI on Hadoop to dig for, September 26, 2018.

Komatsu Mining Corp., a Milwaukee-based maker of mining equipment, is looking to churn more data in place and share BI analytics on that data both within and outside the organization. To improve efficiency, Komatsu has combined several big data tools from Cloudera, including Spark, Hadoop, Kafka, Kudu, and Impala, along with on-cluster analytics software from BI-on-Hadoop toolmaker Arcadia Data. The platform was assembled to analyze sensor data collected by equipment in the field, tracking the wear and tear of massive shovels and earth movers. The company foresees a future in which the platform will use IoT application data for better predictive and prescriptive equipment maintenance.

(Source - )
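A predictive-maintenance pipeline of this kind typically applies rolling-window rules to streamed sensor readings. A minimal sketch of such a rule, assuming made-up vibration values, window size, and threshold (the article's stack would ingest these readings via Kafka and analyze them with Spark or Impala):

```python
from collections import deque
from statistics import mean

# Illustrative wear-detection rule: flag a reading when the trailing
# window's average vibration level exceeds a threshold. Window size and
# threshold are invented values for the sketch.

def wear_alerts(readings, window=5, threshold=0.8):
    """Return the index of each reading whose trailing-window average
    exceeds the threshold."""
    recent = deque(maxlen=window)  # keeps only the last `window` readings
    alerts = []
    for i, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window and mean(recent) > threshold:
            alerts.append(i)
    return alerts

vibration = [0.2, 0.3, 0.4, 0.5, 0.9, 1.0, 1.1, 1.2]
alerts = wear_alerts(vibration)
```

Averaging over a window rather than alerting on single spikes is what makes the rule useful for slow wear-and-tear trends; a production system would tune the window and threshold per component.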



Relevant Projects

Explore features of Spark SQL in practice on Spark 2.0
The goal of this spark project for students is to explore the features of Spark SQL in practice on the latest version of Spark i.e. Spark 2.0.

Finding Unique URLs using Hadoop Hive
Hive Project - Learn to write a Hive program to find the first unique URL, given 'n' URLs.

Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis
In this PySpark project, you will simulate a complex real-world data pipeline based on messaging. This project is deployed using the following tech stack - NiFi, PySpark, Hive, HDFS, Kafka, Airflow, Tableau and AWS QuickSight.

Data Mining Project on Yelp Dataset using Hadoop Hive
Use the Hadoop ecosystem to glean valuable insights from the Yelp dataset. You will be analyzing the different patterns that can be found in the Yelp data set, to come up with various approaches in solving a business problem.

Online Hadoop Projects - Solving the small file problem in Hadoop
In this hadoop project, we are going to be continuing the series on data engineering by discussing and implementing various ways to solve the hadoop small file problem.

Tough engineering choices with large datasets in Hive Part - 1
Explore efficient Hive usage in this Hadoop Hive project using various file formats such as JSON, CSV, ORC, and AVRO, and compare their relative performance.

Web Server Log Processing using Hadoop
In this Hadoop project, you will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline.

Tough engineering choices with large datasets in Hive Part - 2
This is in continuation of the previous Hive project "Tough engineering choices with large datasets in Hive Part - 1", where we will work on processing big data sets using Hive.

Data processing with Spark SQL
In this Apache Spark SQL project, we will go through provisioning data for retrieval using Spark SQL.

Movielens dataset analysis for movie recommendations using Spark in Azure
In this Databricks Azure tutorial project, you will use Spark SQL to analyse the MovieLens dataset to provide movie recommendations. As part of this you will deploy Azure Data Factory and data pipelines and visualise the analysis.