How to preprocess string data within a Pandas DataFrame?

This recipe helps you preprocess string data within a Pandas DataFrame

Recipe Objective

While working on a dataset, have you ever come across a situation where several pieces of data are packed into a single column and you need to separate them into different columns? How do you deal with this type of situation?
We separate the values by preprocessing the DataFrame.

So this is the recipe on how we can preprocess string data within a Pandas DataFrame.

Step 1 - Import the library

import pandas as pd

We have imported only pandas, which is all this recipe needs.

Step 2 - Setting up the Data

We have created a dictionary with a single key whose list holds all of the data, and we pass it to pd.DataFrame to create a dataframe.

data = {'stringData': ['Sheldon 1 2019-09-29 3242.0',
                       'Copper 1 2020-12-25 3413.7',
                       'Raj 0 2014-05-25 2123.8',
                       'Howard 0 2017-09-24 1173.6',
                       'Leonard 1 2013-01-15 9134.0',
                       'Penny 0 2012-07-24 2755.6']}
df = pd.DataFrame(data, columns = ['stringData'])
print(df)

Step 3 - Preprocessing the string data

To preprocess this type of data we can use the str.extract method of a string column (here df['stringData'].str.extract) and pass it a regular expression whose capture group describes the values we want to extract. Here we extract a Boolean flag, a string, a date, and a number.
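As a small illustration of how str.extract behaves on a string column, the sketch below runs it on a toy Series. The values and the names s and digits are made up for illustration and are not part of the recipe's data. With one capture group and expand=False the result is a Series, and rows where the pattern does not match come back as missing values.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
s = pd.Series(['apple 12', 'banana 7', 'no digits here'])

# One capture group: the first run of digits found in each string.
# expand=False returns a Series instead of a one-column DataFrame.
digits = s.str.extract(r'(\d+)', expand=False)

print(digits)
```

The extracted values are still strings, so any numeric use needs an explicit conversion afterwards.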

  • First we extract the Boolean flag (the first single digit in each string) and store it in a new column.
        df['Boolean'] = df['stringData'].str.extract(r'(\d)', expand=True)
        print(df['Boolean'])
  • Next we extract the values that have the form of a date, i.e. yyyy-mm-dd, and store them in a new column.
        df['date'] = df['stringData'].str.extract(r'(\d{4}-\d{2}-\d{2})', expand=True)
        print(df['date'])
  • Then we extract the numeric values (four digits, a decimal point, and one digit) and store them in a new column.
        df['score'] = df['stringData'].str.extract(r'(\d{4}\.\d)', expand=True)
        print(df['score'])
  • Finally we extract the values that are in string form, i.e. a capital letter followed by word characters, and store them in a new column (named state in this recipe).
        df['state'] = df['stringData'].str.extract(r'([A-Z]\w*)', expand=True)
        print(df['state'])
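The four extractions above can also be sketched as a single call, assuming every row follows the same "name flag date number" layout. This is an alternative to the step-by-step approach, not the recipe's own method: one regular expression with named groups is passed to str.extract, and the group names become the column names. The variable names pattern and parts are illustrative, and only three of the rows are repeated here to keep the example short.

```python
import pandas as pd

data = {'stringData': ['Sheldon 1 2019-09-29 3242.0',
                       'Copper 1 2020-12-25 3413.7',
                       'Raj 0 2014-05-25 2123.8']}
df = pd.DataFrame(data)

# One pattern with four named groups, separated by whitespace.
pattern = (r'(?P<state>[A-Z]\w*)\s+'
           r'(?P<Boolean>\d)\s+'
           r'(?P<date>\d{4}-\d{2}-\d{2})\s+'
           r'(?P<score>\d+\.\d+)')

# With several groups, str.extract returns a DataFrame, one column per group.
parts = df['stringData'].str.extract(pattern)

# Attach the new columns next to the original string column.
df = df.join(parts)
print(df)
```

A row that does not match the whole pattern gets missing values in all four columns, which makes malformed rows easy to spot.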
Finally, we print the whole dataset with print(df). The output of each step is shown below.

                         stringData
0    Sheldon 1 2019-09-29    3242.0
1  Copper 1 2020-12-25       3413.7
2       Raj 0 2014-05-25     2123.8
3      Howard 0 2017-09-24   1173.6
4    Leonard 1 2013-01-15    9134.0
5      Penny 0 2012-07-24    2755.6

0    1
1    1
2    0
3    0
4    1
5    0
Name: Boolean, dtype: object

0    2019-09-29
1    2020-12-25
2    2014-05-25
3    2017-09-24
4    2013-01-15
5    2012-07-24
Name: date, dtype: object

0    3242.0
1    3413.7
2    2123.8
3    1173.6
4    9134.0
5    2755.6
Name: score, dtype: object

0    Sheldon
1     Copper
2        Raj
3     Howard
4    Leonard
5      Penny
Name: state, dtype: object

                         stringData Boolean        date   score    state
0    Sheldon 1 2019-09-29    3242.0       1  2019-09-29  3242.0  Sheldon
1  Copper 1 2020-12-25       3413.7       1  2020-12-25  3413.7   Copper
2       Raj 0 2014-05-25     2123.8       0  2014-05-25  2123.8      Raj
3      Howard 0 2017-09-24   1173.6       0  2017-09-24  1173.6   Howard
4    Leonard 1 2013-01-15    9134.0       1  2013-01-15  9134.0  Leonard
5      Penny 0 2012-07-24    2755.6       0  2012-07-24  2755.6    Penny
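Every extracted column in the output above has dtype object, because str.extract returns strings. A possible follow-up step, not part of the original recipe, is to convert each column to a more useful type. The sketch below uses a small frame shaped like the result, with two rows copied from the output.

```python
import pandas as pd

# A small frame shaped like the recipe's result; extracted values are strings.
df = pd.DataFrame({
    'Boolean': ['1', '0'],
    'date': ['2019-09-29', '2014-05-25'],
    'score': ['3242.0', '2123.8'],
})

# '1'/'0' -> 1/0 -> True/False
df['Boolean'] = df['Boolean'].astype(int).astype(bool)
# 'yyyy-mm-dd' strings -> datetime64 values
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
# numeric strings -> floats
df['score'] = df['score'].astype(float)

print(df.dtypes)
```

After the conversion, the date column supports accessors such as df['date'].dt.year, and the score column can be used in arithmetic.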
