How to delete duplicates from a Pandas DataFrame?

This recipe helps you delete duplicates from a Pandas DataFrame

Recipe Objective

Many datasets contain duplicate rows or repeated values, and these usually need to be removed before analysis.

This recipe shows how to find and delete duplicates in a Pandas DataFrame.

Step 1 - Importing Library

import pandas as pd

pandas is the only library this recipe needs.

Step 2 - Creating DataFrame

We create a DataFrame that contains duplicate rows, which we will then remove. Rows 0 and 1 are identical in every column; row 2 has the same name but a different age.

raw_data = {"first_name": ["Jason", "Jason", "Jason", "Tina", "Jake", "Amy"],
            "last_name": ["Miller", "Miller", "Miller", "Ali", "Milner", "Cooze"],
            "age": [42, 42, 1111111, 36, 24, 73],
            "preTestScore": [4, 4, 4, 31, 2, 3],
            "postTestScore": [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=["first_name", "last_name", "age",
                                     "preTestScore", "postTestScore"])
print(df)

Step 3 - Removing Duplicate Values

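First, df.duplicated() flags every row that is an exact copy of an earlier row. Next, drop_duplicates(keep="first") removes those copies and keeps the first occurrence of each row. Finally, passing ["first_name"] with keep="last" compares only the first_name column and keeps the last row for each name.

# True for each row that repeats an earlier row in every column
print(df.duplicated())
# Drop fully identical rows, keeping the first occurrence
print(df.drop_duplicates(keep="first"))
# Compare only first_name and keep the last row for each name
print(df.drop_duplicates(["first_name"], keep="last"))

The output is shown below: the original DataFrame, the duplicated() flags, and then the two de-duplicated results.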

  first_name last_name      age  preTestScore  postTestScore
0      Jason    Miller       42             4             25
1      Jason    Miller       42             4             25
2      Jason    Miller  1111111             4             25
3       Tina       Ali       36            31             57
4       Jake    Milner       24             2             62
5        Amy     Cooze       73             3             70

0    False
1     True
2    False
3    False
4    False
5    False
dtype: bool

  first_name last_name      age  preTestScore  postTestScore
0      Jason    Miller       42             4             25
2      Jason    Miller  1111111             4             25
3       Tina       Ali       36            31             57
4       Jake    Milner       24             2             62
5        Amy     Cooze       73             3             70

  first_name last_name      age  preTestScore  postTestScore
2      Jason    Miller  1111111             4             25
3       Tina       Ali       36            31             57
4       Jake    Milner       24             2             62
5        Amy     Cooze       73             3             70
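
The same calls accept a few other options that are often useful. The sketch below is an illustration and not part of the original recipe: it rebuilds the DataFrame from Step 2, counts the fully duplicated rows, uses keep=False to discard every row that has a duplicate, and de-duplicates on two columns while renumbering the index. The variable names n_dupes, no_repeats and by_name are made up for this example, and ignore_index assumes pandas 1.0 or later.

```python
import pandas as pd

# Same data as in Step 2
raw_data = {"first_name": ["Jason", "Jason", "Jason", "Tina", "Jake", "Amy"],
            "last_name": ["Miller", "Miller", "Miller", "Ali", "Milner", "Cooze"],
            "age": [42, 42, 1111111, 36, 24, 73],
            "preTestScore": [4, 4, 4, 31, 2, 3],
            "postTestScore": [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data)

# Number of rows that exactly repeat an earlier row
n_dupes = int(df.duplicated().sum())

# keep=False drops every member of a duplicate group,
# so both row 0 and row 1 are removed here
no_repeats = df.drop_duplicates(keep=False)

# Compare on two columns only, keep the first match,
# and renumber the result from 0 (assumes pandas >= 1.0)
by_name = df.drop_duplicates(subset=["first_name", "last_name"],
                             keep="first", ignore_index=True)

print(n_dupes)
print(no_repeats)
print(by_name)
```

Note that drop_duplicates returns a new DataFrame and leaves df unchanged, so assign the result to a variable if you want to keep it.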
