How to preprocess string data within a Pandas DataFrame?

This recipe helps you preprocess string data within a Pandas DataFrame

Recipe Objective

While working on a dataset, have you ever come across a situation where several values are packed into a single column and you need to separate them into different columns? How do you deal with this type of situation?
We separate the values by preprocessing the DataFrame.

So this is the recipe on how we can preprocess string data within a Pandas DataFrame.

Step 1 - Import the library

import pandas as pd

pandas is the only library this recipe needs.

Step 2 - Setting up the Data

We create a dictionary with a single key, stringData, whose value is a list of strings, and pass it to pd.DataFrame to create a DataFrame. Each string holds a name, a 0/1 flag, a date, and a score, separated by spaces.

data = {'stringData': ['Sheldon 1 2019-09-29 3242.0',
                       'Copper 1 2020-12-25 3413.7',
                       'Raj 0 2014-05-25 2123.8',
                       'Howard 0 2017-09-24 1173.6',
                       'Leonard 1 2013-01-15 9134.0',
                       'Penny 0 2012-07-24 2755.6']}
df = pd.DataFrame(data, columns=['stringData'])
print(df)

Step 3 - Preprocessing the string data

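To preprocess this type of data we can use the str.extract method, which is called on a string column (for example df['stringData'].str.extract), and pass it a regular expression whose capture group describes the values we want to extract. It returns the first match in each row. Here we extract a Boolean flag, a date, a number, and a string.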
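As a minimal sketch of how str.extract behaves, the snippet below uses a made-up two-row frame (the name demo and the column raw are illustrative, not part of the recipe). It shows that expand=True gives back a DataFrame, expand=False with a single group gives back a Series, and rows without a match come back as missing values.

```python
import pandas as pd

# Hypothetical two-row example; the second row has no digit to extract.
demo = pd.DataFrame({'raw': ['Amy 1 2019-09-29 3242.0', 'no digits here']})

# expand=True returns a DataFrame with one column per capture group.
as_frame = demo['raw'].str.extract(r'(\d)', expand=True)

# expand=False with a single capture group returns a Series instead.
as_series = demo['raw'].str.extract(r'(\d)', expand=False)

print(type(as_frame).__name__)     # DataFrame
print(type(as_series).__name__)    # Series
print(as_series.isna().tolist())   # [False, True]
```

Either form can be assigned to a new column; the recipe uses expand=True throughout.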

  • First we extract the Boolean flag and store it in a new column. The pattern takes the first digit in each string, which works here because the names contain no digits.
    df['Boolean'] = df['stringData'].str.extract(r'(\d)', expand=True)
    print(df['Boolean'])
  • Next we extract the values that are in the form of dates, i.e. yyyy-mm-dd, and store them in a new column.
    df['date'] = df['stringData'].str.extract(r'(\d{4}-\d{2}-\d{2})', expand=True)
    print(df['date'])
  • Then we extract the values that are decimal numbers (four digits, a point, and one digit) and store them in a new column.
    df['score'] = df['stringData'].str.extract(r'(\d\d\d\d\.\d)', expand=True)
    print(df['score'])
  • Finally we extract the values that are in string form, i.e. a capital letter followed by word characters, and store them in a new column.
    df['state'] = df['stringData'].str.extract(r'([A-Z]\w*)', expand=True)
    print(df['state'])
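The four extractions can also be done in one pass. The following is only a sketch of an alternative, not the recipe's own method: it assumes every row follows the layout "name flag yyyy-mm-dd score" and uses named capture groups, which str.extract turns into column names. The variable names pattern and parts are illustrative, and only two of the recipe's rows are used to keep it short.

```python
import pandas as pd

# Two rows taken from the recipe's data.
data = {'stringData': ['Sheldon 1 2019-09-29 3242.0',
                       'Raj 0 2014-05-25 2123.8']}
df = pd.DataFrame(data)

# One pattern with four named groups; each group name becomes a column name.
# Assumption: fields always appear in this order, separated by whitespace.
pattern = (r'(?P<state>[A-Z]\w*)\s+'
           r'(?P<Boolean>\d)\s+'
           r'(?P<date>\d{4}-\d{2}-\d{2})\s+'
           r'(?P<score>\d+\.\d+)')

# With several groups, str.extract returns a DataFrame of strings.
parts = df['stringData'].str.extract(pattern)

# Attach the extracted columns to the original frame by index.
df = df.join(parts)
print(df)
```

A single pattern fails as a whole when a row deviates from the layout, leaving all four columns empty for that row, whereas separate patterns fail independently. Which behaviour is preferable depends on how regular the data is.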
The print calls in Steps 2 and 3, followed by a final print(df), produce the following output:

                    stringData
0  Sheldon 1 2019-09-29 3242.0
1   Copper 1 2020-12-25 3413.7
2      Raj 0 2014-05-25 2123.8
3   Howard 0 2017-09-24 1173.6
4  Leonard 1 2013-01-15 9134.0
5    Penny 0 2012-07-24 2755.6

0    1
1    1
2    0
3    0
4    1
5    0
Name: Boolean, dtype: object

0    2019-09-29
1    2020-12-25
2    2014-05-25
3    2017-09-24
4    2013-01-15
5    2012-07-24
Name: date, dtype: object

0    3242.0
1    3413.7
2    2123.8
3    1173.6
4    9134.0
5    2755.6
Name: score, dtype: object

0    Sheldon
1     Copper
2        Raj
3     Howard
4    Leonard
5      Penny
Name: state, dtype: object

                    stringData  Boolean        date   score    state
0  Sheldon 1 2019-09-29 3242.0        1  2019-09-29  3242.0  Sheldon
1   Copper 1 2020-12-25 3413.7        1  2020-12-25  3413.7   Copper
2      Raj 0 2014-05-25 2123.8        0  2014-05-25  2123.8      Raj
3   Howard 0 2017-09-24 1173.6        0  2017-09-24  1173.6   Howard
4  Leonard 1 2013-01-15 9134.0        1  2013-01-15  9134.0  Leonard
5    Penny 0 2012-07-24 2755.6        0  2012-07-24  2755.6    Penny
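Every extracted column above is reported as dtype object, because str.extract always returns strings. As a hedged follow-up sketch that is not part of the original recipe, the columns could be converted to proper types as shown below. The small frame extracted stands in for the recipe's columns, and the name typed is illustrative.

```python
import pandas as pd

# Two rows shaped like the extracted columns above (all values are strings).
extracted = pd.DataFrame({
    'Boolean': ['1', '0'],
    'date': ['2019-09-29', '2014-05-25'],
    'score': ['3242.0', '2123.8'],
})

typed = extracted.assign(
    # '1'/'0' -> 1/0 -> True/False
    Boolean=extracted['Boolean'].astype(int).astype(bool),
    # Parse yyyy-mm-dd strings into datetime values.
    date=pd.to_datetime(extracted['date'], format='%Y-%m-%d'),
    # Parse the decimal strings into floats.
    score=extracted['score'].astype(float),
)
print(typed.dtypes)
```

With typed columns, operations such as sorting by date or averaging the scores work on real values rather than on text.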
