How to do string munging in Pandas?


This recipe helps you do string munging in Pandas


Recipe Objective

Have you ever tried string munging? That is, selecting part of a string, or building a new string from the strings available in a DataFrame.

This recipe shows how to do string munging in Pandas.

Step 1 - Import the library

import pandas as pd
import numpy as np
import re

We import only the libraries we need: pandas, numpy and re.

Step 2 - Creating DataFrame

We create a dictionary and pass it to pd.DataFrame to create a DataFrame.

raw_data = {"first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
            "last_name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze"],
            "email": ["", "", np.nan, "", ""]}
df = pd.DataFrame(raw_data, columns=["first_name", "last_name", "email"])
print()
print(df)
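The email strings in the dictionary above are empty, so the string operations in the next step have nothing to match against. As a minimal sketch, the same DataFrame can be built with placeholder addresses. The addresses below are hypothetical and are not the recipe's original data; the n_missing variable is also an illustrative addition.

```python
import numpy as np
import pandas as pd

# Placeholder addresses (hypothetical, not the recipe's original data).
raw_data = {
    "first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
    "last_name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze"],
    "email": [
        "jason@gmail.com",
        "molly@gmail.com",
        np.nan,  # one missing value, as in the recipe
        "jake@example.org",
        "amy@example.net",
    ],
}
df = pd.DataFrame(raw_data, columns=["first_name", "last_name", "email"])

# Count missing emails before applying any string operation.
n_missing = int(df["email"].isna().sum())

print(df)
print("missing emails:", n_missing)
```

Checking for missing values first is useful because the .str methods pass them through rather than raising an error, which is easy to overlook in the results.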

Step 3 - Applying Different Munging Operation

Say we first want to check which entries in the "email" column contain the string "gmail".

print(df["email"].str.contains("gmail"))

Next, say we want to separate each email into three parts: the characters before "@", the characters between "@" and the last ".", and the characters after that ".".

pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
print(df["email"].str.findall(pattern, flags=re.IGNORECASE))

The output is as follows:

  first_name last_name                email  preTestScore  postTestScore
0      Jason    Miller             4             25
1      Molly  Jacobson            24             94
2       Tina       Ali                  NaN            31             57
3       Jake    Milner             2             62
4        Amy     Cooze             3             70

0     True
1     True
2      NaN
3    False
4    False
Name: email, dtype: object

0       [(jas203, gmail, com)]
1    [(momomolly, gmail, com)]
2                          NaN
3     [(battler, milner, com)]
4     [(Ames1234, yahoo, com)]
Name: email, dtype: object

0    True
1    True
2     NaN
3    True
4    True
Name: email, dtype: object
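The operations above can be extended in a few ways, sketched below under stated assumptions. The email addresses are hypothetical placeholders, and the group names user, domain and tld are illustrative choices, not part of the recipe. The sketch shows str.contains with na=False to get a clean boolean mask, str.extract with named groups to put each part in its own column, and str.match to test each value against the pattern. The last output block above (True for every non-missing row) is consistent with such a pattern check, but the recipe's code does not show that call, so treat this as an assumption.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical placeholder addresses, not the recipe's original data.
emails = pd.Series(
    ["jason@gmail.com", "molly@gmail.com", np.nan,
     "jake@example.org", "amy@example.net"],
    name="email",
)

# na=False treats the missing entry as "does not contain".
is_gmail = emails.str.contains("gmail", na=False)

# Named groups (illustrative names) become the column names
# of the DataFrame returned by str.extract.
pattern = (
    r"(?P<user>[A-Z0-9._%+-]+)@"
    r"(?P<domain>[A-Z0-9.-]+)\."
    r"(?P<tld>[A-Z]{2,4})"
)
parts = emails.str.extract(pattern, flags=re.IGNORECASE)

# str.match checks whether each value matches the pattern
# starting from the beginning of the string.
looks_valid = emails.str.match(pattern, flags=re.IGNORECASE)

print(is_gmail)
print(parts)
print(looks_valid)
```

Compared with str.findall, which returns a list of tuples per row, str.extract returns one column per group, which is usually easier to join back onto the original DataFrame.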
