Python Pandas Dataframe Tutorials

What is a pandas dataframe?
Pandas is a Python library used for data analysis. It provides data structures and tools for understanding and analysing data. 

The simplest way to understand a dataframe is to think of it as an MS Excel spreadsheet inside Python. Just as Excel stores data in rows and columns and lets you perform operations on that data, a dataframe lets you do the same in code. 

There are many ways to deal with data in Python, including series, lists and dictionaries, but the dataframe is the structure of choice for data scientists. Dataframes can deal with large amounts of data and support powerful functions to manipulate the data. 

Creating dataframes from a csv file, dictionary or list, adding rows and columns, using dataframe indexes and working with missing data are all part of the EDA (exploratory data analysis) stage of a data science project. 

By convention, a dataframe is stored in a variable named ‘df’ in Python code. Dataframe operations are then called as ‘df.[operation name]’.


Where can dataframes be created from?
Dataframes can be created from data sources such as dictionaries, lists, arrays, series, csv files and database connections (for example MySQL). 


What is a pandas series vs dataframe?
A series is a 1-dimensional representation of data and hence has only one column, while a dataframe is a 2-dimensional table. 
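
To make the difference concrete, here is a minimal sketch. The column names and values are made up for illustration and do not come from any real dataset.

```python
import pandas as pd

# Made-up values, used only to illustrate the two structures.
s = pd.Series([25, 94, 57], name="coverage")
df = pd.DataFrame({"coverage": [25, 94, 57], "reports": [4, 24, 31]})

print(s.ndim)    # number of dimensions of the series
print(df.ndim)   # number of dimensions of the dataframe

# Selecting a single column of a dataframe gives back a series.
col = df["coverage"]
print(type(col).__name__)
```

In other words, a dataframe can be thought of as several series that share the same index.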


NumPy versus Pandas 
NumPy is another popular library used for data manipulation, but it is largely used for numerical data. Dataframes, however, provide powerful functions to work across tables containing multiple data types. 

 

1. How to create a Dataframe
Code that uses dataframes typically starts with the following line:

import pandas as pd

Once you have identified where your data is coming from and have stored it in an object, for example “data”, you can create your dataframe with the following command. This converts the data stored in the “data” object into a 2-dimensional dataframe. 

df = pd.DataFrame(data)
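
As a small sketch of the command above, assume “data” is a dictionary of lists. The names and numbers here are hypothetical sample values.

```python
import pandas as pd

# Hypothetical sample data: each dictionary key becomes a column name,
# and each list holds the values of that column.
data = {
    "name": ["Jason", "Molly", "Tina"],
    "reports": [4, 24, 31],
    "coverage": [25, 94, 57],
}

df = pd.DataFrame(data)

print(df)
print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names, in insertion order
```

The same pd.DataFrame() call also accepts other inputs such as a list of dictionaries, where each dictionary is one row.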

Example Tutorial:
Check out the first few lines of this pandas dataframe example to see how a dataframe is created. 



2. How to sort rows within a pandas dataframe
Many times in data analysis you will need to get a sense of the data and its magnitude. Sorting rows helps with this. The df.sort_values() function sorts the rows by the columns that are passed as parameters to the function. 

For example, the following command sorts the dataframe by the “reports” column in descending order. 

df.sort_values(by='reports', ascending=False)

The following command sorts the dataframe by the “reports” column in ascending order. 

df.sort_values(by='reports', ascending=True)

The following command sorts the dataframe first by the “coverage” column and then by the “reports” column
df.sort_values(by=['coverage', 'reports'])
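
The sorting commands above can be tried on a small dataframe. This is only a sketch with invented values. Note that sort_values() returns a new sorted dataframe and leaves the original untouched unless you assign the result back.

```python
import pandas as pd

# Invented sample values for illustration.
df = pd.DataFrame({
    "name": ["Jason", "Molly", "Tina", "Jake"],
    "reports": [4, 24, 31, 2],
    "coverage": [25, 94, 57, 25],
})

# Largest number of reports first.
by_reports_desc = df.sort_values(by="reports", ascending=False)

# Sort by coverage first, then break ties using reports.
by_two = df.sort_values(by=["coverage", "reports"])

print(by_reports_desc)
print(by_two)
```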

Example Tutorial:
Check out this pandas dataframe example to see the various ways to sort rows inside a dataframe. 


 

3. How to find the largest value in a pandas dataframe 
In the exploratory stage of analytics, you will occasionally want to get a sense of the largest values in your dataset. This tells you directionally the shape of your data, what operations to perform on the data and what a visualisation might look like. 

The idxmax() function returns the index of the row that holds the highest value, and idxmin() returns the index of the row that holds the lowest value. Called on a single column they return one index label; called on a whole dataframe they return one index label per column. 

When used like this - df['preTestScore'].idxmax() - the command returns the index of the row that contains the maximum value for the column “preTestScore” in your dataframe (df).
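
A short sketch of idxmax() and idxmin() on a single column follows. The scores are hypothetical. Passing the returned index label to df.loc is one way to get the whole row back.

```python
import pandas as pd

# Hypothetical scores for illustration.
df = pd.DataFrame({
    "name": ["Jason", "Molly", "Tina"],
    "preTestScore": [4, 24, 31],
})

top = df["preTestScore"].idxmax()   # index label of the largest value
low = df["preTestScore"].idxmin()   # index label of the smallest value

# Use the label to look up the full row.
top_row = df.loc[top]

print(top, low)
print(top_row)
```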

Example Tutorial:
Check out this pandas dataframe example to see how to find the largest value in a dataframe. 


 

4. How to list unique values in a pandas dataframe
Finding unique values in a dataset is useful in many scenarios - to categorize the number of rows belonging to a specific entity, to find the most popular and least popular entities etc. 

The following command lists the unique values in the “name” column of the dataframe. 

df.name.unique()
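
Here is a minimal sketch with made-up names. Besides unique(), it uses nunique() and value_counts(), two related pandas functions that count the distinct values and show how often each one appears.

```python
import pandas as pd

# Made-up names for illustration.
df = pd.DataFrame({"name": ["Jason", "Molly", "Tina", "Molly", "Jason"]})

names = df["name"].unique()          # distinct values, in order of first appearance
how_many = df["name"].nunique()      # number of distinct values
counts = df["name"].value_counts()   # occurrences of each value

print(names)
print(how_many)
print(counts)
```

The bracket form df["name"] is used here instead of df.name because it also works for column names that contain spaces or clash with dataframe attributes.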

Example Tutorial:
Check out this pandas dataframe example to see how to find unique values in a dataframe. 



5. How to delete duplicates from a pandas dataframe
Deleting duplicate values largely serves the purpose of reducing the memory usage of your dataset. It can also be used if you don’t want a specific value to be over-represented in your dataset.

drop_duplicates() returns a dataframe with duplicate rows removed. By default a row counts as a duplicate only if every column matches. To compare on a subset of columns, pass those column names in the “subset” parameter. The “keep” parameter decides which occurrence survives: “first” (the default) or “last”. To control which row that is, sort the dataframe first with sort_values(colname). 

The example linked below demonstrates drop_duplicates() both with retaining the first and with retaining the last values.  
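
As a rough sketch of the “subset” and “keep” parameters described above, using invented rows:

```python
import pandas as pd

# Invented rows: row 1 repeats row 0 exactly, and "Tina" appears twice
# with different scores.
df = pd.DataFrame({
    "name": ["Jason", "Jason", "Tina", "Tina"],
    "score": [4, 4, 31, 57],
})

# Compare whole rows: only the exact repeat is removed.
whole_rows = df.drop_duplicates()

# Compare on "name" only, keeping the first or the last occurrence.
keep_first = df.drop_duplicates(subset=["name"], keep="first")
keep_last = df.drop_duplicates(subset=["name"], keep="last")

print(whole_rows)
print(keep_first)
print(keep_last)
```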

Example Tutorial:
Check out this data science tutorial to see how to delete duplicates from a dataframe. 



6. Rename column header in a pandas dataframe
Pandas dataframes are grids of rows and columns where data can be stored and easily manipulated with functions. A dataframe column contains values of a similar kind for a specific variable or feature. 

The most common way to rename a column header is by using the df.rename() function.

To rename a single column, the following command renames a column titled “General” to the new title “Admiral”.
df.rename(columns={'General': 'Admiral'}, inplace=True)

To rename multiple columns, the following code renames every column listed in “header”, which is assumed here to be a dictionary that maps old column names to new ones.
df = df.rename(columns = header)
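
A small sketch of both renaming styles follows. The column names other than “General” and “Admiral”, and the contents of the “header” dictionary, are hypothetical.

```python
import pandas as pd

# Hypothetical columns and values for illustration.
df = pd.DataFrame({"General": ["Grant"], "Unit": ["1st"], "Year": [1864]})

# Rename a single column.
df = df.rename(columns={"General": "Admiral"})

# Rename several columns at once with a mapping of old name -> new name.
header = {"Unit": "Regiment", "Year": "Date"}
df = df.rename(columns=header)

print(df.columns.tolist())
```

Column names that are not keys in the dictionary are left as they are.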

Example Tutorial:
Check out this data science tutorial and this one to see an example of how to rename column headers.



7. Search pandas dataframe for a value
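The following code looks up the values of Age where Salary is greater than 50,000. The .where() function keeps the values for which the condition is true and replaces the rest with NaN, so the result has the same length as the original column. 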

print(df['Age'].where(df['Salary'] > 50000))
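
To see what .where() returns, here is a sketch with made-up ages and salaries. It also shows boolean indexing with df.loc, an alternative that returns only the matching rows instead of masking the others.

```python
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({
    "Age": [25, 41, 36],
    "Salary": [40000, 72000, 55000],
})

# Same length as the column; rows that fail the condition become NaN.
masked = df["Age"].where(df["Salary"] > 50000)

# Alternative: keep only the rows that satisfy the condition.
matches = df.loc[df["Salary"] > 50000, "Age"]

print(masked)
print(matches)
```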

Example Tutorial:
Check out this data science tutorial to see an example of how to search for a value in a pandas dataframe.



8. Drop row and column in a pandas dataframe
Many times in data analysis you will have to delete rows and columns that don’t fit your modelling needs. The df.drop() function helps achieve this. 

df.drop('reports', axis=1)

will drop a column named “reports”. axis=1 indicates that we are referring to a column and not a row. 

You can also drop rows based on conditions

df.drop(df[df.name != 'Tina'].index)

will drop every row where the value of ‘name’ is not ‘Tina’.
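
The following sketch tries both kinds of drop on invented data. Like most dataframe functions, drop() returns a new dataframe, so the original stays unchanged unless you assign the result back.

```python
import pandas as pd

# Invented sample values for illustration.
df = pd.DataFrame({
    "name": ["Jason", "Tina", "Molly"],
    "reports": [4, 31, 24],
    "coverage": [25, 57, 94],
})

# Drop a column (axis=1 means "columns").
no_reports = df.drop("reports", axis=1)

# Drop rows by condition: select the index labels of the unwanted rows
# and pass them to drop().
only_tina = df.drop(df[df["name"] != "Tina"].index)

print(no_reports)
print(only_tina)
```

The second result is the same as filtering with df[df["name"] == "Tina"], which many people find easier to read.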

Example Tutorial:
Check out this code recipe to see an example of how to drop rows and columns in a pandas dataframe



9. Replace multiple values in a pandas dataframe 
While data munging, you might inherit a dataset with lots of null values, junk values, duplicate values etc. In such instances you will need to replace these values in bulk. 

The df.replace() function helps to replace values in a pandas dataframe. This function can be used to replace a string, regex, list, dictionary, series, number etc. in a dataframe

df.replace(-999, np.nan)

will replace all occurrences of -999 with NaN (null) values. This assumes NumPy has been imported as np. 

df.replace(to_replace=["Tennis", "Cricket"], value="Sports")

will replace the values ‘Tennis’ and ‘Cricket’ with the value ‘Sports’.
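
Both replacements can be sketched on a tiny dataframe. The values, including the use of -999 as a placeholder for missing data, are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values; -999 stands in for a missing score.
df = pd.DataFrame({
    "sport": ["Tennis", "Cricket", "Chess"],
    "score": [-999, 12, 7],
})

# Replace the placeholder with a real missing value.
cleaned = df.replace(-999, np.nan)

# Replace several values with one value in a single call.
grouped = cleaned.replace(to_replace=["Tennis", "Cricket"], value="Sports")

print(cleaned)
print(grouped)
```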

Example Tutorial:
Check out this code recipe to see an example of how to replace multiple values in a pandas dataframe.



10. Save pandas dataframe as a .csv file
As you must have noticed from the above functions, pandas is a very powerful library for data cleaning and preparation. 

Once you are done with the various data manipulations using the above commands, you will often want to save your dataframe as a .csv file. One common use is to store the cleaned data before splitting it into training and test sets for model building and accuracy checking.

The df.to_csv() function writes a pandas dataframe out in .csv file format. 

df.to_csv(r'C:\Users\Admin\Desktop\file3.csv', index=False) 

will store the .csv in a specific location. 
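
Here is a sketch that writes a dataframe to a csv file and reads it back to confirm nothing was lost. The data is invented, and a temporary folder is used instead of a fixed path so the example runs on any machine.

```python
import os
import tempfile

import pandas as pd

# Invented sample values for illustration.
df = pd.DataFrame({"name": ["Jason", "Tina"], "reports": [4, 31]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a throwaway folder.
    out_path = os.path.join(tmp_dir, "example.csv")

    # index=False leaves the row index out of the file.
    df.to_csv(out_path, index=False)

    # Read the file back while it still exists.
    restored = pd.read_csv(out_path)

print(restored)
```

Without index=False, the row index is written as an extra first column, which then shows up as an unnamed column when the file is read back.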

Example Tutorial:
Check out this code recipe to see an example of how to save a pandas dataframe as a .csv file

11. Randomly sample a pandas dataframe
Trying to understand a dataset involves getting a quick insight into what type and range of data it contains. Pandas provides functions to pick random values from the dataset. 

df.take(np.random.permutation(len(df))[:2])
this code snippet picks 2 rows at random

df.take(np.random.permutation(len(df))[:4])
this code snippet picks 4 rows at random
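
The sketch below runs the take() approach on invented data and also shows df.sample(), the built-in pandas function for the same job. The random_state value is an arbitrary seed chosen here so that the sample is repeatable.

```python
import numpy as np
import pandas as pd

# Invented sample values for illustration.
df = pd.DataFrame({
    "name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
    "reports": [4, 24, 31, 2, 3],
})

# Shuffle the row positions, keep the first 2, and take those rows.
two_rows = df.take(np.random.permutation(len(df))[:2])

# Built-in alternative: sample 4 rows without replacement.
four_rows = df.sample(n=4, random_state=0)

print(two_rows)
print(four_rows)
```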

Example Tutorial:
Check out this data science tutorial on how to randomly sample a pandas dataframe

 

12. How to filter in a pandas dataframe
Filtering a dataframe enables you to view specific rows and columns either based on order or matching specific conditions. 

print(df[:2])
will print the first 2 rows in the dataframe.

print(df[(df['coverage']  > 50) & (df['reports'] < 4)])
will print rows where the column ‘coverage’ is greater than 50 and the column ‘reports’ is less than 4. 
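
A quick sketch of both filters on invented data. Each condition needs its own parentheses, and conditions are combined with & (and) or | (or) rather than the Python keywords.

```python
import pandas as pd

# Invented sample values for illustration.
df = pd.DataFrame({
    "name": ["Jason", "Molly", "Tina", "Jake"],
    "coverage": [25, 94, 57, 62],
    "reports": [4, 24, 31, 2],
})

# Filter by position: the first 2 rows.
first_two = df[:2]

# Filter by condition: both comparisons must hold for a row to be kept.
matching = df[(df["coverage"] > 50) & (df["reports"] < 4)]

print(first_two)
print(matching)
```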

Example Tutorial:
Check out this data science tutorial on how to filter in a pandas dataframe

 

13. How to calculate moving average in a pandas dataframe
As part of data munging, you have to try to understand the trends in your dataset. But when your data values are very spiky it’s tough to spot trends.

Calculating a moving average like a 7-day average helps to smooth out the data variability and gives you a directional trend. 

The dataframe.rolling() function provides the rolling window, and chaining .mean() onto it calculates the average within each window.

df1 = df[['preTestScore','postTestScore']].rolling(window=2).mean()
this calculates a moving average with a window of 2 on the columns ‘preTestScore’ and ‘postTestScore’. A window of 2 means that each row is averaged with the row before it, and this happens for the entire dataframe. The first row has no earlier row, so its result is NaN. 
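
The rolling average can be checked by hand on a few invented scores:

```python
import pandas as pd

# Invented scores for illustration.
df = pd.DataFrame({
    "preTestScore": [4, 24, 31, 2],
    "postTestScore": [25, 94, 57, 62],
})

# Each result is the mean of the current row and the row before it.
df1 = df[["preTestScore", "postTestScore"]].rolling(window=2).mean()

print(df1)
```

For ‘preTestScore’ the second row is (4 + 24) / 2 = 14, the third is (24 + 31) / 2 = 27.5, and so on, while the first row is NaN because its window is incomplete.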

Example Tutorial:
Check out this data science tutorial on how to calculate moving average in a pandas dataframe



14. How to normalise a column in a pandas dataframe
In the data munging step of your data science project, you will often get data with wide variability across positive and negative values. Normalisation is done to reduce the data range when data of different scales are involved. 

Normalising a dataset (234, 24, 14) by its largest value would result in approximately (1, 0.10, 0.06). Using 234 as the anchor value, all other values are represented relative to 234. 
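
As a sketch, the calculation above divides each value in a column by the column’s maximum. The column names are hypothetical. Min-max scaling, a different and also common normalisation that maps the smallest value to 0 and the largest to 1, is included for comparison and is not the method described above.

```python
import pandas as pd

# The three values from the text, in a hypothetical "score" column.
df = pd.DataFrame({"score": [234, 24, 14]})

# Scale relative to the largest value (234 becomes 1).
df["score_max_scaled"] = df["score"] / df["score"].max()

# For comparison: min-max scaling to the range 0..1.
score_range = df["score"].max() - df["score"].min()
df["score_min_max"] = (df["score"] - df["score"].min()) / score_range

print(df.round(2))
```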

Example Tutorial:
Check out this data science tutorial on how to normalise a column in a pandas dataframe

 

15. How to assign new columns in a pandas dataframe
There are a couple of reasons why you might want to add new columns during data processing. You might have data in 2 different dataframes that you want to bring into a single dataframe. Or you might want to add a new column that is the result of a function on 2 or more other columns. 

There are multiple ways to add new columns in a pandas dataframe - by declaring a new list as a column, by using dataframe.insert(), by using dataframe.assign(), by using a dictionary. 

The dataframe.assign() function returns a new dataframe with the column added at the end. You cannot specify in which position to add this column. For that you will need to use dataframe.insert().

df = df.assign(Marks=[71, 82, 89])

will add a new column “Marks” with the values 71, 82, 89 as the last column in the dataframe. 
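
Three of the approaches mentioned above are sketched here on invented data. The “RollNo” and “Passed” columns and the pass mark of 75 are hypothetical.

```python
import pandas as pd

# Invented names for illustration.
df = pd.DataFrame({"name": ["Jason", "Molly", "Tina"]})

# assign() returns a new dataframe with the column added at the end.
df = df.assign(Marks=[71, 82, 89])

# insert() modifies the dataframe in place and takes a position
# (0 = first column).
df.insert(0, "RollNo", [1, 2, 3])

# Direct assignment can compute a column from existing columns.
df["Passed"] = df["Marks"] > 75

print(df)
```

The list passed in must have the same number of items as the dataframe has rows.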

Example Tutorial:
Check out this data science recipe on how to assign new columns in a pandas dataframe



16. How to rank a pandas dataframe in ascending and descending order
By now you must have realised that Python is an excellent language for data analysis. This is primarily because of the powerful data analysis packages, like pandas, that are available for Python. 

Ranking returns a rank for every index (row) in the series passed to the function. Both numeric and string values can be ranked by rank(). 

df['coverageRanked'] = df['coverage'].rank(ascending=True)

this will create a new column ‘coverageRanked’ and assign to it the ascending ranks of the values in the ‘coverage’ column. 
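
Here is a sketch of ascending and descending ranks on invented data. Two rows share the same coverage on purpose, to show that by default tied values each receive the average of the ranks they would otherwise occupy. The ‘coverageRankedDesc’ column name is hypothetical.

```python
import pandas as pd

# Invented sample values; Jason and Jake are tied on coverage.
df = pd.DataFrame({
    "name": ["Jason", "Molly", "Tina", "Jake"],
    "coverage": [25, 94, 57, 25],
})

# Rank 1 goes to the smallest value.
df["coverageRanked"] = df["coverage"].rank(ascending=True)

# Rank 1 goes to the largest value.
df["coverageRankedDesc"] = df["coverage"].rank(ascending=False)

print(df)
```

The two tied rows would take ranks 1 and 2 in ascending order, so each is given 1.5.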

Example Tutorial:
Check out this data science tutorial on how to rank a pandas dataframe 



Check back on this blog soon. We add new functions every couple of days. 
 

References:
https://pandas.pydata.org/pandas-docs/stable/reference/frame.html