How to do group by in R using dplyr?

This recipe helps you do group by in R using dplyr


Recipe Objective

Aggregation is one of the fundamental data manipulation techniques a data scientist should know. In R, dplyr is a widely used add-on package for data manipulation tasks. For aggregation, dplyr provides the group_by() function.

The group_by() function groups the rows of a dataframe by the values of one or more columns, typically categorical ones. Combined with the summarise() function, it lets us calculate the mean, sum, count, minimum or maximum of the specified variables for each group using built-in functions.
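As a minimal sketch of the idea above, the following groups a small dataframe and computes several summaries per group. The `sales` data, its column names and the output column names are invented for illustration and are not part of this recipe's dataset.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration only
sales <- data.frame(
  region = c("East", "East", "West", "West", "West"),
  amount = c(10, 20, 5, 15, 25)
)

# Group the rows by region, then compute one summary row per group
region_summary <- sales %>%
  group_by(region) %>%
  summarise(
    mean_amount  = mean(amount),  # average within each region
    total_amount = sum(amount),   # sum within each region
    n_rows       = n(),           # number of rows in each region
    min_amount   = min(amount),   # smallest value in each region
    max_amount   = max(amount)    # largest value in each region
  )

print(region_summary)
```

Each group collapses to a single row, so the result here has one row per region.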

There are two ways to write the group_by() call in this recipe:

  1. Chaining the steps with the dplyr pipe operator (%>%)
  2. Passing the grouped dataframe directly to summarise_at()

In this recipe, we will learn how to use the group_by() function from the dplyr package in R.

Step 1: Loading the required library and Creating a DataFrame

Creating a STUDENT dataframe with each student's Name and their marks in two subjects across 3 Trimester exams.

# data manipulation
library(dplyr)
library(tidyverse)

STUDENT <- data.frame(
  Name = c("Ram", "Ram", "Ram", "Shyam", "Shyam", "Shyam", "Jessica", "Jessica", "Jessica"),
  Science_Marks = c(55, 60, 65, 80, 70, 75, 45, 65, 70),
  Math_Marks = c(70, 75, 73, 50, 53, 55, 65, 78, 75),
  Trimester = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
)

glimpse(STUDENT)
Rows: 9
Columns: 4
$ Name           Ram, Ram, Ram, Shyam, Shyam, Shyam, Jessica, Jessica,...
$ Science_Marks  55, 60, 65, 80, 70, 75, 45, 65, 70
$ Math_Marks     70, 75, 73, 50, 53, 55, 65, 78, 75
$ Trimester      1, 2, 3, 1, 2, 3, 1, 2, 3

Step 2: Application of group_by Function

Syntax: group_by(.data, ...)

where:

  1. .data = dataframe
  2. ... = variables by which grouping needs to take place

# to check the various arguments of the function
?group_by
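To make the syntax concrete, here is a small sketch showing that group_by() on its own only records the grouping and does not change the rows. The `scores` dataframe and its columns are hypothetical and used purely for illustration; group_vars(), n_groups() and ungroup() are dplyr helpers for inspecting and removing the grouping.

```r
library(dplyr)

# Small illustrative dataframe (hypothetical, not the STUDENT data)
scores <- data.frame(
  name  = c("A", "A", "B", "B"),
  term  = c(1, 2, 1, 2),
  marks = c(50, 60, 70, 80)
)

# Group by a single variable
by_name <- group_by(scores, name)
group_vars(by_name)   # names of the grouping columns
n_groups(by_name)     # number of distinct groups

# Group by two variables: one group per name/term combination
by_name_term <- group_by(scores, name, term)
n_groups(by_name_term)

# Remove the grouping again; the data itself is unchanged
ungrouped <- ungroup(by_name)
```

Grouping by more than one variable creates a group for every combination that occurs in the data, which is why the second call yields more groups than the first.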

Query 1: Find each student's average marks over the year (Trimesters 1, 2 and 3)

Approach 1: Using pipe operator (%>%)

# group the rows by student name, then average each marks column within each group
STUDENT %>%
  group_by(Name) %>%
  summarise_at(vars(Science_Marks, Math_Marks), mean)
Name	Science_Marks	Math_Marks
Jessica	60		72.66667
Ram	60		72.66667
Shyam	75		52.66667

Approach 2: Using summarise_at() without the pipe

# pass the grouped dataframe directly as the first argument of summarise_at()
summarise_at(group_by(STUDENT, Name), vars(Science_Marks, Math_Marks), mean)
Name	Science_Marks	Math_Marks
Jessica	60		72.66667
Ram	60		72.66667
Shyam	75		52.66667
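For readers on dplyr 1.0.0 or later, the same per-student averages can also be written with summarise() and across(). This is an alternative sketch rather than the approach this recipe uses; the result name `avg_marks` is an assumption introduced here, and the STUDENT data is repeated so the block runs on its own.

```r
library(dplyr)

# Same STUDENT data as in Step 1, repeated so this block is self-contained
STUDENT <- data.frame(
  Name = c("Ram", "Ram", "Ram", "Shyam", "Shyam", "Shyam", "Jessica", "Jessica", "Jessica"),
  Science_Marks = c(55, 60, 65, 80, 70, 75, 45, 65, 70),
  Math_Marks = c(70, 75, 73, 50, 53, 55, 65, 78, 75),
  Trimester = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
)

# Group by student, then apply mean() to both marks columns via across()
avg_marks <- STUDENT %>%
  group_by(Name) %>%
  summarise(across(c(Science_Marks, Math_Marks), mean))

print(avg_marks)
```

The output should match the tables shown for Approach 1 and Approach 2, with one row per student.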
