How to preprocess string data within a Pandas DataFrame?

This recipe helps you preprocess string data within a Pandas DataFrame

Recipe Objective

While working on a dataset, have you ever come across a column where several values are packed into a single string and need to be separated into different columns? How do you deal with this type of situation?
We separate the values by preprocessing the dataframe.

So this is the recipe on how we can preprocess string data within a Pandas DataFrame.

Step 1 - Import the library

import pandas as pd

We have imported only pandas, which is all this recipe needs.

Step 2 - Setting up the Data

We have created a dictionary with a single key, 'stringData', whose list holds all the raw strings, and we pass it to pd.DataFrame to create a dataframe.

data = {'stringData': ['Sheldon 1 2019-09-29 3242.0',
                       'Copper 1 2020-12-25 3413.7',
                       'Raj 0 2014-05-25 2123.8',
                       'Howard 0 2017-09-24 1173.6',
                       'Leonard 1 2013-01-15 9134.0',
                       'Penny 0 2012-07-24 2755.6']}
df = pd.DataFrame(data, columns=['stringData'])
print(df)

Step 3 - Preprocessing the string data

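To preprocess this type of data we can use the str.extract method of a string column (called here as df['stringData'].str.extract) and pass it a regular expression whose capture group describes the values we want to extract. So here we are extracting Boolean, strings, date, and numbers.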
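As a minimal sketch of how str.extract behaves: it returns the first match of the capture group for each row, and NaN where nothing matches. The series s below is made up for illustration and is not part of the recipe's data.

```python
import pandas as pd

# Hypothetical sample values, not taken from the recipe's dataset
s = pd.Series(['Amy 1 2015-03-02 1200.5', 'no digits here'])

# One capture group with expand=False gives a Series of first matches
first_digit = s.str.extract(r'(\d)', expand=False)

# expand=True gives a one-column DataFrame instead
as_frame = s.str.extract(r'(\d)', expand=True)

print(first_digit)
print(as_frame.shape)
```

The second row contains no digit, so its extracted value is missing rather than an empty string.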

  • First we extract the boolean flag and store it in a new column. The pattern (\d) captures the first digit in each string, which in this data is the 0/1 flag.
    df['Boolean'] = df['stringData'].str.extract(r'(\d)', expand=True)
    print(df['Boolean'])
  • Next we extract the values that are in the form of dates, i.e. yyyy-mm-dd, and store them in a new column. Each dot matches any single character, so the pattern relies on the two hyphens being in the right positions.
    df['date'] = df['stringData'].str.extract(r'(....-..-..)', expand=True)
    print(df['date'])
  • Then we extract the values that are numbers (four digits, a decimal point, and one digit) and store them in a new column.
    df['score'] = df['stringData'].str.extract(r'(\d\d\d\d\.\d)', expand=True)
    print(df['score'])
  • Finally we extract the values that are in string form, i.e. made of alphabetic characters and starting with an uppercase letter, and store them in a new column.
    df['state'] = df['stringData'].str.extract(r'([A-Z]\w{0,})', expand=True)
    print(df['state'])
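The four extractions above could also be combined into a single call. The following is only a sketch of one alternative, not the recipe's own method: it assumes the fields always appear in the same order and are separated by whitespace, and it uses named capture groups so that str.extract labels the resulting columns.

```python
import pandas as pd

data = {'stringData': ['Sheldon 1 2019-09-29 3242.0',
                       'Copper 1 2020-12-25 3413.7',
                       'Raj 0 2014-05-25 2123.8']}
df = pd.DataFrame(data)

# One pattern with a named group per field.
# Assumption: fixed field order, whitespace between fields.
pattern = (r'(?P<state>[A-Z]\w*)\s+'
           r'(?P<Boolean>\d)\s+'
           r'(?P<date>\d{4}-\d{2}-\d{2})\s+'
           r'(?P<score>\d+\.\d+)')

# Named groups become the column names of the returned DataFrame
parts = df['stringData'].str.extract(pattern)
df = df.join(parts)
print(df)
```

A row that does not match the whole pattern gets NaN in every extracted column, which makes malformed rows easy to spot.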
Finally, we print the complete dataframe with print(df). The output of all the print statements is:

                         stringData
0    Sheldon 1 2019-09-29    3242.0
1  Copper 1 2020-12-25       3413.7
2       Raj 0 2014-05-25     2123.8
3      Howard 0 2017-09-24   1173.6
4    Leonard 1 2013-01-15    9134.0
5      Penny 0 2012-07-24    2755.6

0    1
1    1
2    0
3    0
4    1
5    0
Name: Boolean, dtype: object

0    2019-09-29
1    2020-12-25
2    2014-05-25
3    2017-09-24
4    2013-01-15
5    2012-07-24
Name: date, dtype: object

0    3242.0
1    3413.7
2    2123.8
3    1173.6
4    9134.0
5    2755.6
Name: score, dtype: object

0    Sheldon
1     Copper
2        Raj
3     Howard
4    Leonard
5      Penny
Name: state, dtype: object

                         stringData Boolean        date   score    state
0    Sheldon 1 2019-09-29    3242.0       1  2019-09-29  3242.0  Sheldon
1  Copper 1 2020-12-25       3413.7       1  2020-12-25  3413.7   Copper
2       Raj 0 2014-05-25     2123.8       0  2014-05-25  2123.8      Raj
3      Howard 0 2017-09-24   1173.6       0  2017-09-24  1173.6   Howard
4    Leonard 1 2013-01-15    9134.0       1  2013-01-15  9134.0  Leonard
5      Penny 0 2012-07-24    2755.6       0  2012-07-24  2755.6    Penny
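Every extracted column in the output above has dtype object, meaning the values are still strings. A possible follow-up step, sketched here on a small stand-in dataframe rather than the recipe's full data, is to convert each column to a more suitable type. The choice of target types is an assumption about how the columns will be used.

```python
import pandas as pd

# Stand-in for the extracted columns (all values are strings)
df = pd.DataFrame({
    'Boolean': ['1', '0'],
    'date': ['2019-09-29', '2014-05-25'],
    'score': ['3242.0', '2123.8'],
})

# '1'/'0' -> 1/0 -> True/False
df['Boolean'] = df['Boolean'].astype(int).astype(bool)
# yyyy-mm-dd strings -> datetime values
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
# numeric strings -> floats
df['score'] = df['score'].astype(float)

print(df.dtypes)
```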
