How to preprocess string data within a Pandas DataFrame?

This recipe helps you preprocess string data within a Pandas DataFrame

Recipe Objective

While working on a dataset have you ever come across a situation where many data is filled in a single column and you need to seperate them in different column, so how you deal with this type of situation ? We seperate values by preprocessing the dataframe.

So this is the recipe on how we can preprocess string data within a Pandas DataFrame.

Master the Art of Data Cleaning in Machine Learning

Step 1 - Import the library

import pandas as pd

We have only imported pandas which is needed.

Step 2 - Setting up the Data

We have created a dictionary of data having one array in which all the data is stored and we are passing it through pd.DataFrame to create a dataframe. data = {'stringData': ['Sheldon 1 2019-09-29 3242.0', 'Copper 1 2020-12-25 3413.7', 'Raj 0 2014-05-25 2123.8', 'Howard 0 2017-09-24 1173.6', 'Leonard 1 2013-01-15 9134.0', 'Penny 0 2012-07-24 2755.6']} df = pd.DataFrame(data, columns = ['stringData']) print(df)

Step 3 - Preprocessing the string data

To preprocess this type of data we can use df.str.extract function and we can pass the type of values we want to extract. So here we are extracting Boolean, strings, date, and numbers.

    • First we are extracting boolean values and making a new column to store it.

df['Boolean'] = df['stringData'].str.extract('(\d)', expand=True) print(df['Boolean'])

    • Now we are extracting the values which are in the form of dates ie. yyyy-mm-dd and storing in a new column.

df['date'] = df['stringData'].str.extract('(....-..-..)', expand=True) print(df['date'])

    • Now we are extracting the values which are numbers and storing in a new column.

df['score'] = df['stringData'].str.extract('(\d\d\d\d\.\d)', expand=True) print(df['score'])

    • Finally we are extracting the values which are in string form i.e having aplhabets and storing in a new column.

df['state'] = df['stringData'].str.extract('([A-Z]\w{0,})', expand=True) print(df['state'])

So finally printing the dataset and the output comes as

                         stringData
0    Sheldon 1 2019-09-29    3242.0
1  Copper 1 2020-12-25       3413.7
2       Raj 0 2014-05-25     2123.8
3      Howard 0 2017-09-24   1173.6
4    Leonard 1 2013-01-15    9134.0
5      Penny 0 2012-07-24    2755.6

0    1
1    1
2    0
3    0
4    1
5    0
Name: Boolean, dtype: object

0    2019-09-29
1    2020-12-25
2    2014-05-25
3    2017-09-24
4    2013-01-15
5    2012-07-24
Name: date, dtype: object

0    3242.0
1    3413.7
2    2123.8
3    1173.6
4    9134.0
5    2755.6
Name: score, dtype: object

0    Sheldon
1     Copper
2        Raj
3     Howard
4    Leonard
5      Penny
Name: state, dtype: object

                         stringData Boolean        date   score    state
0    Sheldon 1 2019-09-29    3242.0       1  2019-09-29  3242.0  Sheldon
1  Copper 1 2020-12-25       3413.7       1  2020-12-25  3413.7   Copper
2       Raj 0 2014-05-25     2123.8       0  2014-05-25  2123.8      Raj
3      Howard 0 2017-09-24   1173.6       0  2017-09-24  1173.6   Howard
4    Leonard 1 2013-01-15    9134.0       1  2013-01-15  9134.0  Leonard
5      Penny 0 2012-07-24    2755.6       0  2012-07-24  2755.6    Penny

Download Materials

What Users are saying..

profile image

Ameeruddin Mohammed

ETL (Abintio) developer at IBM
linkedin profile url

I come from a background in Marketing and Analytics and when I developed an interest in Machine Learning algorithms, I did multiple in-class courses from reputed institutions though I got good... Read More

Relevant Projects

Build an AI Chatbot from Scratch using Keras Sequential Model
In this NLP Project, you will learn how to build an AI Chatbot from Scratch using Keras Sequential Model.

Mastering A/B Testing: A Practical Guide for Production
In this A/B Testing for Machine Learning Project, you will gain hands-on experience in conducting A/B tests, analyzing statistical significance, and understanding the challenges of building a solution for A/B testing in a production environment.

CycleGAN Implementation for Image-To-Image Translation
In this GAN Deep Learning Project, you will learn how to build an image to image translation model in PyTorch with Cycle GAN.

Digit Recognition using CNN for MNIST Dataset in Python
In this deep learning project, you will build a convolutional neural network using MNIST dataset for handwritten digit recognition.

ML Model Deployment on AWS for Customer Churn Prediction
MLOps Project-Deploy Machine Learning Model to Production Python on AWS for Customer Churn Prediction

Build a Multi Class Image Classification Model Python using CNN
This project explains How to build a Sequential Model that can perform Multi Class Image Classification in Python using CNN

PyCaret Project to Build and Deploy an ML App using Streamlit
In this PyCaret Project, you will build a customer segmentation model with PyCaret and deploy the machine learning application using Streamlit.

Build an Image Segmentation Model using Amazon SageMaker
In this Machine Learning Project, you will learn to implement the UNet Architecture and build an Image Segmentation Model using Amazon SageMaker

Learn Object Tracking (SOT, MOT) using OpenCV and Python
Get Started with Object Tracking using OpenCV and Python - Learn to implement Multiple Instance Learning Tracker (MIL) algorithm, Generic Object Tracking Using Regression Networks Tracker (GOTURN) algorithm, Kernelized Correlation Filters Tracker (KCF) algorithm, Tracking, Learning, Detection Tracker (TLD) algorithm for single and multiple object tracking from various video clips.

Model Deployment on GCP using Streamlit for Resume Parsing
Perform model deployment on GCP for resume parsing model using Streamlit App.