How to preprocess string data within a Pandas DataFrame?

This recipe helps you preprocess string data within a Pandas DataFrame
Last Updated: 06 Jul 2022

Get access to Data Science projects View all Data Science projects

DATA MUNGING DATA CLEANING PYTHON MACHINE LEARNING RECIPES PANDAS CHEATSHEET ALL TAGS

Recipe Objective

While working on a dataset have you ever come across a situation where many data is filled in a single column and you need to seperate them in different column, so how you deal with this type of situation ? We seperate values by preprocessing the dataframe.

So this is the recipe on how we can preprocess string data within a Pandas DataFrame.

Master the Art of Data Cleaning in Machine Learning

Recipe Objective

Step 1 - Import the library

import pandas as pd

We have only imported pandas which is needed.

Step 2 - Setting up the Data

We have created a dictionary of data having one array in which all the data is stored and we are passing it through pd.DataFrame to create a dataframe. data = {'stringData': ['Sheldon 1 2019-09-29 3242.0', 'Copper 1 2020-12-25 3413.7', 'Raj 0 2014-05-25 2123.8', 'Howard 0 2017-09-24 1173.6', 'Leonard 1 2013-01-15 9134.0', 'Penny 0 2012-07-24 2755.6']} df = pd.DataFrame(data, columns = ['stringData']) print(df)

Step 3 - Preprocessing the string data

To preprocess this type of data we can use df.str.extract function and we can pass the type of values we want to extract. So here we are extracting Boolean, strings, date, and numbers.

First we are extracting boolean values and making a new column to store it.

df['Boolean'] = df['stringData'].str.extract('(\d)', expand=True) print(df['Boolean'])

Now we are extracting the values which are in the form of dates ie. yyyy-mm-dd and storing in a new column.

df['date'] = df['stringData'].str.extract('(....-..-..)', expand=True) print(df['date'])

Now we are extracting the values which are numbers and storing in a new column.

df['score'] = df['stringData'].str.extract('(\d\d\d\d\.\d)', expand=True) print(df['score'])

Finally we are extracting the values which are in string form i.e having aplhabets and storing in a new column.

df['state'] = df['stringData'].str.extract('([A-Z]\w{0,})', expand=True) print(df['state'])

So finally printing the dataset and the output comes as

                         stringData
0    Sheldon 1 2019-09-29    3242.0
1  Copper 1 2020-12-25       3413.7
2       Raj 0 2014-05-25     2123.8
3      Howard 0 2017-09-24   1173.6
4    Leonard 1 2013-01-15    9134.0
5      Penny 0 2012-07-24    2755.6

0    1
1    1
2    0
3    0
4    1
5    0
Name: Boolean, dtype: object

0    2019-09-29
1    2020-12-25
2    2014-05-25
3    2017-09-24
4    2013-01-15
5    2012-07-24
Name: date, dtype: object

0    3242.0
1    3413.7
2    2123.8
3    1173.6
4    9134.0
5    2755.6
Name: score, dtype: object

0    Sheldon
1     Copper
2        Raj
3     Howard
4    Leonard
5      Penny
Name: state, dtype: object

                         stringData Boolean        date   score    state
0    Sheldon 1 2019-09-29    3242.0       1  2019-09-29  3242.0  Sheldon
1  Copper 1 2020-12-25       3413.7       1  2020-12-25  3413.7   Copper
2       Raj 0 2014-05-25     2123.8       0  2014-05-25  2123.8      Raj
3      Howard 0 2017-09-24   1173.6       0  2017-09-24  1173.6   Howard
4    Leonard 1 2013-01-15    9134.0       1  2013-01-15  9134.0  Leonard
5      Penny 0 2012-07-24    2755.6       0  2012-07-24  2755.6    Penny

Download Materials

iPython Notebook

What Users are saying..

Ameeruddin Mohammed

ETL (Abintio) developer at IBM

I come from a background in Marketing and Analytics and when I developed an interest in Machine Learning algorithms, I did multiple in-class courses from reputed institutions though I got good... Read More

Relevant Projects

Machine Learning Projects

Data Science Projects

Python Projects for Data Science

Data Science Projects in R

Machine Learning Projects for Beginners

Deep Learning Projects

Neural Network Projects

Tensorflow Projects

NLP Projects

Kaggle Projects

IoT Projects

Big Data Projects

Hadoop Real-Time Projects Examples

Spark Projects

Data Analytics Projects for Students

Relevant Projects

Build an AI Chatbot from Scratch using Keras Sequential Model

In this NLP Project, you will learn how to build an AI Chatbot from Scratch using Keras Sequential Model.

View Project Details

Mastering A/B Testing: A Practical Guide for Production

In this A/B Testing for Machine Learning Project, you will gain hands-on experience in conducting A/B tests, analyzing statistical significance, and understanding the challenges of building a solution for A/B testing in a production environment.

View Project Details

CycleGAN Implementation for Image-To-Image Translation

In this GAN Deep Learning Project, you will learn how to build an image to image translation model in PyTorch with Cycle GAN.

View Project Details

Digit Recognition using CNN for MNIST Dataset in Python

In this deep learning project, you will build a convolutional neural network using MNIST dataset for handwritten digit recognition.

View Project Details

ML Model Deployment on AWS for Customer Churn Prediction

MLOps Project-Deploy Machine Learning Model to Production Python on AWS for Customer Churn Prediction

View Project Details

Build a Multi Class Image Classification Model Python using CNN

This project explains How to build a Sequential Model that can perform Multi Class Image Classification in Python using CNN

View Project Details

PyCaret Project to Build and Deploy an ML App using Streamlit

In this PyCaret Project, you will build a customer segmentation model with PyCaret and deploy the machine learning application using Streamlit.

View Project Details

Build an Image Segmentation Model using Amazon SageMaker

In this Machine Learning Project, you will learn to implement the UNet Architecture and build an Image Segmentation Model using Amazon SageMaker

View Project Details

Learn Object Tracking (SOT, MOT) using OpenCV and Python

Get Started with Object Tracking using OpenCV and Python - Learn to implement Multiple Instance Learning Tracker (MIL) algorithm, Generic Object Tracking Using Regression Networks Tracker (GOTURN) algorithm, Kernelized Correlation Filters Tracker (KCF) algorithm, Tracking, Learning, Detection Tracker (TLD) algorithm for single and multiple object tracking from various video clips.

View Project Details

Model Deployment on GCP using Streamlit for Resume Parsing

Perform model deployment on GCP for resume parsing model using Streamlit App.

View Project Details

How to preprocess string data within a Pandas DataFrame?

Recipe Objective

Table of Contents

Step 1 - Import the library

Step 2 - Setting up the Data

Step 3 - Preprocessing the string data

Ameeruddin Mohammed

Relevant Projects

You might also like

Relevant Projects