How to preprocess string data within a Pandas DataFrame?

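This recipe shows how to preprocess string data within a Pandas DataFrame: it searches a column of raw strings for a pattern, then uses regular expressions to extract a flag, a date, a score, and a state name into separate columns.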

## How to preprocess string data within a Pandas DataFrame
def Kickstarter_Example_74():
    print()
    print(format('How to preprocess string data within a Pandas DataFrame','*^82'))
    import warnings
    warnings.filterwarnings("ignore")

    # load libraries
    import pandas as pd
    # Create a dataframe with a single column of strings
    data = {'stringData': ['Arizona 1 2014-12-23    3242.0',
                           'Iowa 1 2010-02-23       3453.7',
                           'Oregon 0 2014-06-20     2123.0',
                           'Maryland 0 2014-03-14   1123.6',
                           'Florida 1 2013-01-15    2134.0',
                           'Georgia 0 2012-07-14    2345.6']}
    df = pd.DataFrame(data, columns = ['stringData'])
    print(); print(df)

    # Search a column of strings for a pattern
    # Which rows of df['stringData'] contain a date of the form YYYY-MM-DD?
    # \d matches digits only; a bare '.' would match any character
    print(); print(df['stringData'].str.contains(r'\d{4}-\d{2}-\d{2}', regex=True))

    # Extract the column of single digits
    # From 'stringData', extract the first single digit (the 0/1 flag) as a string
    # A raw string (r'...') keeps '\d' from being read as an invalid escape
    df['Boolean'] = df['stringData'].str.extract(r'(\d)', expand=True)
    print(); print(df['Boolean'])

    # Extract the column of dates
    # From 'stringData', extract the YYYY-MM-DD date in each string
    df['date'] = df['stringData'].str.extract(r'(\d{4}-\d{2}-\d{2})', expand=True)
    print(); print(df['date'])

    # Extract the column of scores
    # From 'stringData', extract ####.# (four digits, a decimal point, one digit)
    df['score'] = df['stringData'].str.extract(r'(\d{4}\.\d)', expand=True)
    print(); print(df['score'])

    # Extract the column of words
    # From 'stringData', extract the first capitalized word (the state name)
    df['state'] = df['stringData'].str.extract(r'([A-Z]\w*)', expand=True)
    print(); print(df['state'])

    # View the final dataframe
    print(); print(df)
Kickstarter_Example_74()
*************How to preprocess string data within a Pandas DataFrame**************

                       stringData
0  Arizona 1 2014-12-23    3242.0
1  Iowa 1 2010-02-23       3453.7
2  Oregon 0 2014-06-20     2123.0
3  Maryland 0 2014-03-14   1123.6
4  Florida 1 2013-01-15    2134.0
5  Georgia 0 2012-07-14    2345.6

0    True
1    True
2    True
3    True
4    True
5    True
Name: stringData, dtype: bool

0    1
1    1
2    0
3    0
4    1
5    0
Name: Boolean, dtype: object

0    2014-12-23
1    2010-02-23
2    2014-06-20
3    2014-03-14
4    2013-01-15
5    2012-07-14
Name: date, dtype: object

0    3242.0
1    3453.7
2    2123.0
3    1123.6
4    2134.0
5    2345.6
Name: score, dtype: object

0     Arizona
1        Iowa
2      Oregon
3    Maryland
4     Florida
5     Georgia
Name: state, dtype: object

                       stringData Boolean        date   score     state
0  Arizona 1 2014-12-23    3242.0       1  2014-12-23  3242.0   Arizona
1  Iowa 1 2010-02-23       3453.7       1  2010-02-23  3453.7      Iowa
2  Oregon 0 2014-06-20     2123.0       0  2014-06-20  2123.0    Oregon
3  Maryland 0 2014-03-14   1123.6       0  2014-03-14  1123.6  Maryland
4  Florida 1 2013-01-15    2134.0       1  2013-01-15  2134.0   Florida
5  Georgia 0 2012-07-14    2345.6       0  2012-07-14  2345.6   Georgia
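
Every extracted column above reports dtype: object, meaning the values are still strings. As a possible follow-up (a minimal sketch, not part of the original recipe), the same fields can be pulled out in a single pass with named capture groups and then converted to bool, datetime, and float. The names raw, pattern, parsed, and flag are illustrative assumptions, and the pattern assumes every row follows the "State flag date score" layout shown in the sample data.

```python
import pandas as pd

# Three of the sample strings from the recipe above
raw = pd.Series(['Arizona 1 2014-12-23    3242.0',
                 'Iowa 1 2010-02-23       3453.7',
                 'Oregon 0 2014-06-20     2123.0'],
                name='stringData')

# One pattern with named groups; each group becomes a column named after it.
# Assumes the fields always appear in this order, separated by whitespace.
pattern = (r'(?P<state>[A-Z]\w*)\s+'
           r'(?P<flag>\d)\s+'
           r'(?P<date>\d{4}-\d{2}-\d{2})\s+'
           r'(?P<score>\d+\.\d+)')
parsed = raw.str.extract(pattern)

# str.extract returns strings, so convert each column to a proper dtype
parsed['flag'] = parsed['flag'].astype(int).astype(bool)
parsed['date'] = pd.to_datetime(parsed['date'], format='%Y-%m-%d')
parsed['score'] = parsed['score'].astype(float)

print(parsed)
print(parsed.dtypes)
```

With typed columns, operations such as parsed['score'].mean() or filtering on parsed['date'] behave numerically and chronologically rather than as string comparisons. Rows that do not match the pattern would come back as NaN in every column, so the conversions above would need missing-value handling on messier data.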
