How to delete duplicates from a Pandas DataFrame?
DATA MUNGING DATA CLEANING PYTHON MACHINE LEARNING RECIPES PANDAS CHEATSHEET     ALL TAGS

How to delete duplicates from a Pandas DataFrame?

How to delete duplicates from a Pandas DataFrame?

This recipe helps you delete duplicates from a Pandas DataFrame

0

Recipe Objective

In many dataset we find many duplicate values so how to remove that.

So this is the recipe on how we can delete duplicates from a Pandas DataFrame.

Step 1 - Importing Library

import pandas as pd

We have only imported pandas which is needed.

Step 2 - Creating DataFrame

We have created a dataframe of which we will delete duplicate values. raw_data = {"first_name": ["Jason", "Jason", "Jason","Tina", "Jake", "Amy"], "last_name": ["Miller", "Miller", "Miller","Ali", "Milner", "Cooze"], "age": [42, 42, 1111111, 36, 24, 73], "preTestScore": [4, 4, 4, 31, 2, 3], "postTestScore": [25, 25, 25, 57, 62, 70]} df = pd.DataFrame(raw_data, columns = ["first_name", "last_name", "age", "preTestScore", "postTestScore"]) print(); print(df)

Step 3 - Removing Duplicate Values

We will drop duplicate values which will come after the first in the feature first name we will keep the last duplicate item. print(df.duplicated()) print(df.drop_duplicates(keep="first")) print(df.drop_duplicates(["first_name"], keep="last")) So the output comes as

  first_name last_name      age  preTestScore  postTestScore
0      Jason    Miller       42             4             25
1      Jason    Miller       42             4             25
2      Jason    Miller  1111111             4             25
3       Tina       Ali       36            31             57
4       Jake    Milner       24             2             62
5        Amy     Cooze       73             3             70

0    False
1     True
2    False
3    False
4    False
5    False
dtype: bool

  first_name last_name      age  preTestScore  postTestScore
0      Jason    Miller       42             4             25
2      Jason    Miller  1111111             4             25
3       Tina       Ali       36            31             57
4       Jake    Milner       24             2             62
5        Amy     Cooze       73             3             70

  first_name last_name      age  preTestScore  postTestScore
2      Jason    Miller  1111111             4             25
3       Tina       Ali       36            31             57
4       Jake    Milner       24             2             62
5        Amy     Cooze       73             3             70

Relevant Projects

Human Activity Recognition Using Smartphones Data Set
In this deep learning project, you will build a classification system where to precisely identify human fitness activities.

Resume parsing with Machine learning - NLP with Python OCR and Spacy
In this machine learning resume parser example we use the popular Spacy NLP python library for OCR and text classification.

Forecast Inventory demand using historical sales data in R
In this machine learning project, you will develop a machine learning model to accurately forecast inventory demand based on historical sales data.

Identifying Product Bundles from Sales Data Using R Language
In this data science project in R, we are going to talk about subjective segmentation which is a clustering technique to find out product bundles in sales data.

Deep Learning with Keras in R to Predict Customer Churn
In this deep learning project, we will predict customer churn using Artificial Neural Networks and learn how to model an ANN in R with the keras deep learning package.

Time Series Forecasting with LSTM Neural Network Python
Deep Learning Project- Learn to apply deep learning paradigm to forecast univariate time series data.

PySpark Tutorial - Learn to use Apache Spark with Python
PySpark Project-Get a handle on using Python with Spark through this hands-on data processing spark python tutorial.

Customer Market Basket Analysis using Apriori and Fpgrowth algorithms
In this data science project, you will learn how to perform market basket analysis with the application of Apriori and FP growth algorithms based on the concept of association rule learning.

Ecommerce product reviews - Pairwise ranking and sentiment analysis
This project analyzes a dataset containing ecommerce product reviews. The goal is to use machine learning models to perform sentiment analysis on product reviews and rank them based on relevance. Reviews play a key role in product recommendation systems.

German Credit Dataset Analysis to Classify Loan Applications
In this data science project, you will work with German credit dataset using classification techniques like Decision Tree, Neural Networks etc to classify loan applications using R.