Tutorial: Hadoop Multinode Cluster Setup on Ubuntu

This tutorial is a step-by-step guide to installing a Hadoop multinode cluster on Ubuntu 12.04. It covers setting up the cluster, configuring Hadoop MapReduce and YARN, and creating HDFS storage directories on multiple nodes.

Data Visualization Tools in R

This tutorial contains installation guidelines, getting-started instructions, and examples for data visualization packages in R. The packages covered in this tutorial are GGPlot, GGVis, and Lattice.

R Statistical Language Tutorial

R is a programming language and software environment for statistical computing and graphics. It offers many built-in functions and also supports functional programming, so a task can be done either way. R is freely available under the GNU General Public License. It provides a wide variety of statistical and graphical techniques, including linear and non-linear models, time series analysis, classification, clustering, forecasting, classical statistical tests, and many more.

Nowadays R has become a data mining tool, used by many data miners. Out of the box, R produces only static graphics; dynamic graphics require additional packages to be installed.

Introduction to Data Science with R

Data science is a multidisciplinary field that draws on software engineering, data engineering, business intelligence, scientific methods, visualization, statistics, and many other disciplines. R is a statistical programming language that helps us analyze data in fine detail. R now plays a major role in data science and offers a lot of scope for exploration. This tutorial series explains how to build data science applications using the R programming language.

Apache Pig Tutorial: User Defined Function Example

This Apache Pig case study covers how to write a user-defined function (UDF). A student grades database is used to illustrate writing and registering custom Python scripts for Apache Pig. The aim of the example is to analyze student performance. The database contains student, subject, and score details. The custom Python script presented in this case study calculates each student's weighted average, or grade point average.
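
The weighted-average calculation at the heart of such a script can be sketched in plain Python. This is only an illustration, not the script from the case study: the function name `weighted_average` and the (score, weight) record layout are assumptions, and the step that registers the function with Pig is left out.

```python
def weighted_average(records):
    """Return the weighted average of (score, weight) pairs.

    `records` is a hypothetical list of (score, weight) tuples for one
    student, for example a subject score and its credit weight.
    Returns None when the total weight is zero (including an empty list).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for score, weight in records:
        weighted_sum += float(score) * float(weight)
        total_weight += float(weight)
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


# Example data (made up for illustration): three subjects with credit weights.
grades = [(90, 3), (80, 4), (70, 3)]
print(weighted_average(grades))
```

Returning None for a zero total weight avoids a division error for students with no recorded subjects; a real script might prefer a different convention.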

Apache Pig Tutorial Example: Web Log Server Analytics

This case study contains examples of Apache Pig commands for querying and analyzing web server log reports. The log reports used in this example were generated by various web servers. They contain time-stamped details of requested links, IP addresses, request types, server responses, and other data. The same data set is used for analysis in the MapReduce case studies, and this case study illustrates how much simpler processing and analytics are in Apache Pig than in Hadoop MapReduce.

The analysis in this case study reveals visits by a specific user, visits per unit time, and failed requests. It is carried out by executing queries on the web server log data.
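
To make the three measures concrete, here is a small Python sketch of the same kind of analysis. It is not the Pig code from the case study. It assumes the logs follow the Common Log Format, treats an IP address as a "user", counts visits per hour, and treats any status code of 400 or above as a failed request; the names `LOG_PATTERN` and `summarize` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout (Common Log Format):
# ip ident user [timestamp] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def summarize(lines):
    """Count visits per IP, visits per hour, and failed requests."""
    visits_by_ip = Counter()
    visits_by_hour = Counter()
    failed = 0
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        visits_by_ip[match.group("ip")] += 1
        # Keep "dd/Mon/yyyy:HH" from the timestamp as the hourly bucket.
        visits_by_hour[match.group("time")[:14]] += 1
        if int(match.group("status")) >= 400:
            failed += 1
    return visits_by_ip, visits_by_hour, failed


sample = [
    '192.168.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.168.0.1 - - [10/Oct/2000:13:58:02 -0700] "GET /missing.html HTTP/1.0" 404 209',
    '10.0.0.7 - - [10/Oct/2000:14:01:10 -0700] "POST /login HTTP/1.0" 500 -',
]
by_ip, by_hour, failed_count = summarize(sample)
print(by_ip["192.168.0.1"], by_hour["10/Oct/2000:13"], failed_count)
```

In Pig, each of these counters would typically correspond to a grouping and counting step over the loaded log relation.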

Impala Case Study: Web Traffic

Storing Internet traffic data and processing it for useful insights helps us understand customer behavior and serve users better. The volume of data generated over the network is increasing exponentially, but existing data warehouse systems do not scale well at low cost with high performance. Instead of using costly warehouse systems, we can use commodity hardware and distributed processing to serve customers at any scale. Even if the data volume grows tenfold, Hadoop can scale to handle it: we just need to add a few more nodes to increase the size of the cluster, because storage is cheaper than processing power.

Impala Case Study: Flight Data Analysis

In this use case we work with airport information systems data, which provides information on flight delays, the reasons flights were delayed, times in different formats, and source and destination details, including diverted routes. The data set is large and still growing, and processing it repeatedly is time-consuming. Visualization tools need to fetch the data in real time, and the graphs or charts built on top of it need to be updated quickly.

Hadoop Impala Tutorial

Impala is an open source, massively parallel processing (MPP) query engine that runs on clustered systems such as Apache Hadoop. Its design is based on Google’s Dremel paper. It is an interactive, SQL-like query engine that uses the Hadoop Distributed File System (HDFS) as its underlying storage.

Apache Hive Tutorial: Tables

There are two types of tables in Hive: internal and external. This case study describes creating those tables, loading data, partitioning, querying, and dropping tables, using weather data.

Flume Hadoop Tutorial: Twitter Data Extraction

In this case study, a Flume agent is configured to retrieve data from Twitter. Twitter is a huge source of data on people's opinions and preferences. The data can be used to analyze public opinion or reviews on a specific topic or product. Various types of analysis can be done based on the tweet data and location.

Flume Hadoop Tutorial: Website Log Aggregation

This case study focuses on a multi-hop Flume agent that aggregates log reports from various web servers for analysis with Hadoop. Consider a scenario in which multiple servers in different data centers serve requests from various locations. The objective is to distribute the log files based on device type and store a backup of all logs.
