Python Pandas Dataframe Tutorial for Beginners

What is a pandas dataframe? Where can dataframes be created from? What is a pandas series vs dataframe? How to create a Dataframe?

Last Updated: 11 Apr 2024 | BY ProjectPro

What is a pandas dataframe?
Pandas is a software programming library in Python used for data analysis. Pandas provides data structures and tools for understanding and analysing data.

The simplest way to understand a dataframe is to think of it as an MS Excel spreadsheet inside Python. Just as Excel stores data in rows and columns and lets you perform operations on that data, a dataframe lets you do the same in code.

There are many ways to deal with data in python including series, lists and dictionaries, but dataframe is the structure of choice used by data scientists. Dataframes can deal with large amounts of data and support powerful functions to manipulate the data.

Creating dataframes from a csv file, dictionary or list, adding rows and columns, using dataframe indexes and working with missing data are all part of the EDA (exploratory data analysis) stage of a data science project.

By convention, a dataframe is assigned to a variable named ‘df’ in Python code, and dataframe operations are written as ‘df.[operation name]’.

Where can dataframes be created from?
Dataframes can be created from the following data sources: dictionaries, lists, arrays, series, csv files, a MySQL connection to a database etc.
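As a minimal sketch of a few of these sources, the snippet below builds a dataframe from a NumPy array, from two series and from CSV text. The column names and values are made up for illustration, and io.StringIO stands in for a real csv file on disk.

```python
import io

import numpy as np
import pandas as pd

# From a 2-D NumPy array (the column names are illustrative)
df_from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["a", "b"])

# From a dict of Series: each Series becomes one column
df_from_series = pd.DataFrame({
    "month": pd.Series(["January", "February"]),
    "days": pd.Series([31, 28]),
})

# From CSV text; pass a file path to pd.read_csv() to read a real file
csv_text = "month,days\nJanuary,31\nFebruary,28\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))
```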

What is a pandas series vs dataframe?
A series is a 1-dimensional representation of data and hence has only one column, while a dataframe is a 2-dimensional table.
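The difference can be checked directly. In this small sketch (the data is illustrative), selecting a single column of a dataframe gives back a series:

```python
import pandas as pd

df = pd.DataFrame({"Month": ["January", "February"], "Days": [31, 28]})

# Selecting one column returns a Series, which is 1-dimensional
days = df["Days"]

print(type(days).__name__, days.ndim)  # Series 1
print(type(df).__name__, df.ndim)      # DataFrame 2
```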

NumPy versus Pandas
NumPy is another popular library used for data manipulation, but it is largely used for numerical data. Dataframes, however, provide powerful functions to work across tables containing multiple data types.

Python Pandas Dataframe Basics

1. How to create a Dataframe

Every dataframe usage will have the following line at the beginning of your code:

import pandas as pd

Once you have identified where your data is coming from and have stored it in an object, for example “data”, you can create your dataframe with the following command. This converts the data stored in the “data” object into a 2-dimensional dataframe.

df = pd.DataFrame(data)

Example Tutorial:
Check out the first few lines of this pandas dataframe example to see how a dataframe is created.

Here are some of the ways to create a DataFrame in Python Pandas:

Creating a DataFrame from a list of lists

# first import the pandas library
import pandas as pd

# create a list of lists
list_of_lists = [['January', 31], ['February', 28], ['March', 31]]

# creating the Pandas DataFrame
df = pd.DataFrame(list_of_lists, columns = ['Month', 'Days'])

# to display the DataFrame.
df

The above code snippet generates the following DataFrame

	Month	Days
0	January	31
1	February	28
2	March	31

Creating a DataFrame from a dict of lists:

While creating a DataFrame in Pandas from a dictionary of lists, all the lists within the dictionary have to be of the same length. If the index is also passed while creating the DataFrame, then the length of the index should also be equal to the length of the lists in the dictionary. If the index is not passed, the index of the DataFrame will be range(n) by default, where n is the length of each list in the dictionary.

The keys of the dictionary become the column names of the DataFrame, and their values, which are lists, form the rows of each column.

import pandas as pd

# create dict of lists
dict_of_lists = {'Students': ['Alan', 'Vivian', 'Alister', 'Jade'],
                 'Age': [24, 26, 32, 29]}

# creating the DataFrame
df = pd.DataFrame(dict_of_lists)

df contains the following data:

	Students	Age
0	Alan	24
1	Vivian	26
2	Alister	32
3	Jade	29

Creating an indexed DataFrame from a dict of lists:

Indices of a DataFrame are not restricted to numbering and can be specified as follows:

import pandas as pd

# create dict of lists
dict_of_lists = {'Students': ['Alan', 'Vivian', 'Alister', 'Jade'],
                 'Age': [24, 26, 32, 29]}

# creating the DataFrame
df = pd.DataFrame(dict_of_lists,
                  index = ['Student1', 'Student2', 'Student3', 'Student4'])

In such a case, the DataFrame looks like:

	Students	Age
Student1	Alan	24
Student2	Vivian	26
Student3	Alister	32
Student4	Jade	29

Creating a DataFrame from a list of dicts

DataFrames in Pandas can be created with a list of dictionaries. The keys of the dictionaries are taken as the column names by default.

import pandas as pd

# create a list of dictionaries
list_of_dicts = [{'column_a': 1, 'column_b': 2, 'column_c':3}, {'column_a':10, 'column_b': 20, 'column_c': 30}]

# creating the DataFrame.
df = pd.DataFrame(list_of_dicts)

The above snippet generates the following DataFrame

	column_a	column_b	column_c
0	1	2	3
1	10	20	30

If some of the values are missing in the dictionary, like in the code snippet below:

import pandas as pd

# create a list of dictionaries
list_of_dicts = [{'column_a': 1,'column_c':3}, {'column_a':10, 'column_b': 20, 'column_c': 30}]

# creating the DataFrame.
df = pd.DataFrame(list_of_dicts)

Then df will contain the following DataFrame.

	column_a	column_b	column_c
0	1	NaN	3
1	10	20.0	30

Creating a DataFrame from a list of dicts and specifying the row indices.

import pandas as pd

# create a list of dictionaries
list_of_dicts = [{'column_a': 1, 'column_b': 2, 'column_c': 3}, {'column_a': 10, 'column_b': 20, 'column_c': 30}]

# creating the DataFrame.
df = pd.DataFrame(list_of_dicts, index = ['row_1', 'row_2'])

Then df will contain the following DataFrame.

	column_a	column_b	column_c
row_1	1	2	3
row_2	10	20	30

Creating a DataFrame from a list of dicts and specifying both the row indices and the column indices

The names specified in the columns list have to match the keys of the dictionaries. If a name has no matching key, every row of that column will contain NaN.

import pandas as pd

# create a list of dictionaries
list_of_dicts = [{'column_a': 1, 'column_c':3}, {'column_a':10, 'column_b': 20, 'column_c': 30}]

# creating the DataFrame.
df = pd.DataFrame(list_of_dicts, index = ['row_1', 'row_2'], columns = ['column_a', 'column_c'])

Then df will contain the following DataFrame.

	column_a	column_c
row_1	1	3
row_2	10	30

‘column_b’ here does not get added to the DataFrame since it is not mentioned in the column list while creating the DataFrame.

Consider the following code:

import pandas as pd

# create a list of dictionaries
list_of_dicts = [{'column_a': 1, 'column_c':3}, {'column_a':10, 'column_b': 20, 'column_c': 30}]

# creating the DataFrame.
df = pd.DataFrame(list_of_dicts, index = ['row_1', 'row_2'], columns = ['column_a', 'column_d'])

Since column_d is not a key in either of the dictionaries, the DataFrame generated looks like:

	column_a	column_d
row_1	1	NaN
row_2	10	NaN

Creating a DataFrame from a list of tuples:

import pandas as pd

# create a list of tuples
list_of_tuples = [(8, 'August', 1998), (2, 'January', 1987), (17, 'July', 2021), (24, 'June', 1932)]

# creating the DataFrame.
df = pd.DataFrame(list_of_tuples, columns = ['Date', 'Month', 'Year'])

Will generate the DataFrame df:

	Date	Month	Year
0	8	August	1998
1	2	January	1987
2	17	July	2021
3	24	June	1932

Creating a DataFrame using the zip() function:

In Python, the zip() function can be used to pair up the items of two lists. It returns a zip object, which is an iterator of tuples: the first item of the first iterable is paired with the first item of the second iterable, the second item with the second item, and so on. If the iterables passed to zip() vary in length, the zip object stops at the end of the shortest one.

import pandas as pd

# list 1
students = ['Alan', 'Vivian', 'Alister', 'Jade']

# list 2
age = [24, 26, 32, 29]

# using zip to merge the two lists 
list_of_tuples = list(zip(students, age))

# list(zip(students, age)) will return
# [('Alan', 24), ('Vivian', 26), ('Alister', 32), ('Jade', 29)]

# Converting the list of tuples into a pandas Dataframe.
df = pd.DataFrame(list_of_tuples,
                  columns = ['Students', 'Age'])

Here, df will contain:

	Students	Age
0	Alan	24
1	Vivian	26
2	Alister	32
3	Jade	29

Creating an empty DataFrame

import pandas as pd
df  = pd.DataFrame()

The above code will create an empty DataFrame in Python Pandas.

To create an empty DataFrame with the column headers:

import pandas as pd
df = pd.DataFrame(columns = ['column1', 'column2', 'column3'])

2. How to sort rows within a pandas dataframe

Many times in data analysis you will need to get a sense of the data and its magnitude. Sorting rows helps with this. The df.sort_values() function sorts by the columns that are passed as parameters to the function.

For example the following command sorts the dataframe by the “reports” column in descending order.

df.sort_values(by='reports', ascending=False)

The following command sorts the dataframe by the “reports” column in ascending order.

df.sort_values(by='reports', ascending=True)

The following command sorts the dataframe first by the “coverage” column and then by the “reports” column
df.sort_values(by=['coverage', 'reports'])
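To try these commands out, here is a small self-contained sketch. The dataframe is hypothetical; only the column names “coverage” and “reports” come from the commands above. Note that sort_values() returns a new dataframe and leaves df itself unchanged.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "name": ["Jason", "Molly", "Tina", "Jake"],
    "coverage": [25, 94, 57, 62],
    "reports": [4, 24, 31, 2],
})

# Highest 'reports' value first
by_reports_desc = df.sort_values(by="reports", ascending=False)

# Sort by 'coverage' ascending, breaking ties by 'reports' descending
by_both = df.sort_values(by=["coverage", "reports"], ascending=[True, False])

print(by_reports_desc)
```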

Example Tutorial:
Check out this pandas dataframe example to see various ways to sort rows inside a dataframe.

3. How to find the largest value in a pandas dataframe

In the exploratory stage of data analysis, you will occasionally want to get a sense of the largest values in your dataset. This gives you a rough idea of the shape of your data, what operations to perform on it and what a visualisation might look like.

The idxmax() function returns the index label of the row holding the highest value in a column, and the idxmin() function returns the index label of the row holding the lowest value.

For example, df['preTestScore'].idxmax() returns the index of the row that contains the maximum value of the column “preTestScore” in your dataframe (df).
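A short sketch of idxmax() and idxmin() on made-up scores follows. The names used as index labels and the score values are illustrative only. Because idxmax() returns a label rather than the value itself, it is typically combined with df.loc to fetch the full row.

```python
import pandas as pd

# Hypothetical scores, indexed by student name
df = pd.DataFrame(
    {"preTestScore": [4, 24, 31, 2], "postTestScore": [25, 94, 57, 62]},
    index=["Jason", "Molly", "Tina", "Jake"],
)

top_label = df["preTestScore"].idxmax()   # label of the row with the largest value
low_label = df["preTestScore"].idxmin()   # label of the row with the smallest value

top_row = df.loc[top_label]               # the full row for that label
top_value = df["preTestScore"].max()      # the largest value itself
```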

Example Tutorial:
Check out this pandas dataframe example to see how to find the largest value in a dataframe.

4. How to list unique values in a pandas dataframe

Finding unique values in a dataset is useful in many scenarios - to categorize the number of rows belonging to a specific entity, to find the most popular and least popular entities etc.

The following command lists the unique values in the “name” column of the dataframe.

df.name.unique()
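The following sketch, on a made-up “name” column, shows unique() alongside two related methods that cover the use cases mentioned above: nunique() counts the distinct values and value_counts() counts the rows per value.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"name": ["Tina", "Jake", "Tina", "Amy", "Jake", "Tina"]})

unique_names = df["name"].unique()    # distinct values, in order of first appearance
n_unique = df["name"].nunique()       # number of distinct values
counts = df["name"].value_counts()    # rows per value, most frequent first
```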

Example Tutorial:
Check out this pandas dataframe example to see how to find unique values in a dataframe.

5. How to delete duplicates from a pandas dataframe

Deleting duplicate values largely serves the purpose of reducing memory usage of your dataset. It could also be used if you don’t want a specific value to be overrepresented in your dataset.

drop_duplicates() returns the dataframe with duplicate rows removed. To look for duplicates in only a subset of columns, pass those column names in the “subset” parameter. To control which duplicate survives based on a column’s value, you can first sort_values(colname) and then set “keep” to either 'first' or 'last'.

In the example below the remove duplicates function is demonstrated both with retaining the first and last values.
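A minimal sketch of these options on made-up data (the names and scores are illustrative only):

```python
import pandas as pd

# Hypothetical data: the last two rows are exact duplicates
df = pd.DataFrame({
    "name": ["Tina", "Tina", "Jake", "Jake"],
    "score": [10, 30, 20, 20],
})

# Drop rows that are identical across all columns
no_exact_dupes = df.drop_duplicates()

# Keep one row per name: the first occurrence, or the last
first_per_name = df.drop_duplicates(subset=["name"], keep="first")
last_per_name = df.drop_duplicates(subset=["name"], keep="last")

# Keep the highest-scoring row per name: sort first, then keep the last
best_per_name = df.sort_values("score").drop_duplicates(subset=["name"], keep="last")
```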

Example Tutorial:
Check out this data science tutorial to see how to delete duplicates from a dataframe.

6. Rename column header in a pandas dataframe

Pandas dataframes are grids of rows and columns where data can be stored and easily manipulated with functions. A dataframe column contains values of a similar kind for a specific variable or feature.

The most common way to rename a column header is by using the df.rename() function.

To rename a single column, pass a one-entry mapping. The following command renames the column titled “General” to “Admiral”:
df.rename(columns={'General': 'Admiral'}, inplace=True)

To rename multiple columns, pass a dict-like mapping of old column names to new ones. In the following code, header is assumed to be such a mapping:
df = df.rename(columns = header)
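Here is a runnable sketch of both forms. The dataframe and the contents of header are hypothetical; header is written as a plain dict that maps old column names to new ones.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"General": ["A", "B"], "Unit": [1, 2], "Size": [10, 20]})

# Rename a single column; rename() returns a new DataFrame
df = df.rename(columns={"General": "Admiral"})

# Rename several columns at once with an old-name -> new-name mapping
header = {"Unit": "Squadron", "Size": "Headcount"}
df = df.rename(columns=header)

print(list(df.columns))
```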

Example Tutorial:
Check out this data science tutorial and this one to see an example of how to rename column headers.

7. Search pandas dataframe for a value

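The following code finds all values of Age where Salary is greater than 50,000. The .where() method helps to search a pandas dataframe for a value: it returns a result of the same length as the input, in which rows that meet the condition keep their value and all other rows are replaced with NaN.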

print(df['Age'].where(df['Salary'] > 50000))
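The sketch below, on made-up ages and salaries, contrasts .where() with plain boolean indexing, which returns only the matching rows instead of filling the rest with NaN.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"Age": [25, 41, 37], "Salary": [40000, 72000, 55000]})

# where() keeps the original length; rows failing the condition become NaN
masked = df["Age"].where(df["Salary"] > 50000)

# Boolean indexing returns only the rows that meet the condition
matching = df.loc[df["Salary"] > 50000, "Age"]

print(masked)
print(matching)
```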

Example Tutorial:
Check out this data science tutorial to see an example of how to search for a value in a pandas dataframe.

8. Drop row and column in a pandas dataframe

Many times in data analysis you will have to delete rows and columns that don’t fit your modelling needs. The df.drop() function helps achieve this.

df.drop('reports', axis=1)

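will drop the column named “reports”. axis=1 indicates that we are referring to a column and not a row. Note that drop() returns a new dataframe and leaves df unchanged unless you assign the result back.

You can also drop rows based on conditions

df[df.name != 'Tina']

will keep only the rows where the value of ‘name’ is not ‘Tina’, which drops every row where ‘name’ is ‘Tina’.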
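A self-contained sketch of dropping a column, a row by index label and rows by condition, using hypothetical data:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "name": ["Jason", "Tina", "Jake"],
    "reports": [4, 31, 2],
    "coverage": [25, 57, 62],
})

no_reports = df.drop(columns=["reports"])   # same as df.drop('reports', axis=1)
no_first_row = df.drop(index=[0])           # drop a row by its index label
no_tina = df[df["name"] != "Tina"]          # drop rows by condition
```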

Example Tutorial:
Check out this code recipe to see an example of how to drop rows and columns in a pandas dataframe.

9. Replace multiple values in a pandas dataframe

While data munging, you might inherit a dataset with lots of null values, junk values, duplicate values etc. In such instances you will need to replace these values in bulk.

The df.replace() function helps to replace values in a pandas dataframe. This function can be used to replace a string, regex, list, dictionary, series, number etc. in a dataframe.

df.replace(-999, np.nan)

will replace all occurrences of -999 with NaN (null) values. Here np refers to NumPy, imported with “import numpy as np”.

df.replace(to_replace =["Tennis", "Cricket"],value ="Sports")

will replace the values ‘Tennis’ and ‘Cricket’ with the value ‘Sports’.
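Both commands can be tried on a small hypothetical dataframe, as sketched below. The last line additionally shows the nested dictionary form, which limits the replacement to one column; the “Board game” value is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, with -999 used as a placeholder for a missing score
df = pd.DataFrame({
    "sport": ["Tennis", "Cricket", "Chess"],
    "score": [-999, 12, 7],
})

# Replace the placeholder with NaN
cleaned = df.replace(-999, np.nan)

# Replace several values with a single value
grouped = cleaned.replace(to_replace=["Tennis", "Cricket"], value="Sports")

# Nested dict form: {column: {old_value: new_value}}
mapped = df.replace({"sport": {"Chess": "Board game"}})
```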

Example Tutorial:
Check out this code recipe to see an example of how to replace multiple values in a pandas dataframe.

10. Save pandas dataframe as a .csv file

As you must have noticed from the above functions, pandas is a very powerful library for data cleaning and preparation.

Once you are done with the various data manipulations using the above commands, you will often want to save your dataframe as a .csv file, for example so that the cleaned data can later be split into training and test data for model building and accuracy checking.

The df.to_csv() function writes a pandas dataframe out in .csv file format.

df.to_csv(r'C:\Users\Admin\Desktop\file3.csv', index=False)

will store the .csv file at the specified location. index=False leaves the row index out of the file.
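The sketch below writes a hypothetical dataframe to a temporary directory, so that it runs on any machine, and reads it back with pd.read_csv() to confirm the round trip. The file name students.csv is made up.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data
df = pd.DataFrame({"Student": ["Alan", "Vivian"], "Age": [24, 26]})

# Write to a temporary directory so the example runs anywhere
out_path = os.path.join(tempfile.mkdtemp(), "students.csv")
df.to_csv(out_path, index=False)   # index=False omits the row labels

# Read the file back into a new DataFrame
restored = pd.read_csv(out_path)
print(restored)
```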

Example Tutorial:
Check out this code recipe to see an example of how to save a pandas dataframe as a .csv file.

11. Randomly sample a pandas dataframe

Trying to understand a dataset involves getting a quick insight into what type and range of data it contains. Pandas provides functions to pick random values from the dataset.

df.take(np.random.permutation(len(df))[:2])
this code snippet picks 2 rows at random

df.take(np.random.permutation(len(df))[:4])
this code snippet picks 4 rows at random
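pandas also has a built-in df.sample() method for the same task. The sketch below uses hypothetical data; random_state is set only to make the result repeatable.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
    "reports": [4, 24, 31, 2, 3],
})

# Pick 2 rows at random
two_rows = df.sample(n=2, random_state=0)

# frac=1 returns every row in random order, i.e. a shuffle
shuffled = df.sample(frac=1, random_state=0)
```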

Example Tutorial:
Check out this Pandas tutorial on how to randomly sample a pandas dataframe.

12. How to filter in a pandas dataframe

Filtering a dataframe enables you to view specific rows and columns either based on order or matching specific conditions.

print(df[:2])
will print the first 2 rows in the dataframe.

print(df[(df['coverage'] > 50) & (df['reports'] < 4)])
will print rows where the column ‘coverage’ is greater than 50 and the column ‘reports’ is less than 4.
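The filters above can be run on a small hypothetical dataframe as follows. Each condition needs its own parentheses; & means “and” and | means “or”. df.query() is an alternative way to write the same filter as a string.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "name": ["Jason", "Molly", "Tina", "Jake"],
    "coverage": [25, 94, 57, 62],
    "reports": [4, 24, 31, 2],
})

first_two = df.head(2)                                     # same rows as df[:2]
both = df[(df["coverage"] > 50) & (df["reports"] < 4)]     # both conditions hold
either = df[(df["coverage"] > 90) | (df["reports"] < 4)]   # at least one holds
by_query = df.query("coverage > 50 and reports < 4")       # string form of 'both'
```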

Example Tutorial:
Check out this data science tutorial on how to filter in a pandas dataframe

13. How to calculate moving average in a pandas dataframe

As part of data munging, you have to try to understand the trends in your dataset. But when your data values are very spiky, it is tough to spot trends.

Calculating a moving average, such as a 7-day average, helps to smooth out the variability in the data and gives you a directional trend.

dataframe.rolling() provides the rolling window calculation, and calling .mean() on its result calculates the average over each window.

df1 = df[['preTestScore','postTestScore']].rolling(window=2).mean()
this calculates a moving average with a window of 2 on the columns ‘preTestScore’ and ‘postTestScore’. A window of 2 means that each value is averaged with the value in the row before it, and this happens for the entire dataframe. The first row has no preceding value, so its result is NaN.
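A worked sketch with made-up scores shows what the window does. The min_periods argument in the second call is an optional extra that lets the first row be averaged on its own instead of returning NaN.

```python
import pandas as pd

# Hypothetical scores
df = pd.DataFrame({
    "preTestScore": [4, 24, 31, 2],
    "postTestScore": [25, 94, 57, 62],
})

# Each row is averaged with the row before it; the first row is NaN
moving_avg = df[["preTestScore", "postTestScore"]].rolling(window=2).mean()

# min_periods=1 allows a result as soon as one value is available
partial = df["preTestScore"].rolling(window=2, min_periods=1).mean()

print(moving_avg)
```

For ‘preTestScore’ the second row is (4 + 24) / 2 = 14.0, the third is (24 + 31) / 2 = 27.5 and the fourth is (31 + 2) / 2 = 16.5.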

Example Tutorial:
Check out this data science tutorial on how to calculate moving average in a pandas dataframe

14. How to normalise a column in a pandas dataframe

In the data munging step of your data science project, you will often get data whose values vary widely in range. Normalisation rescales the values to a common range when data of different scales are involved.

For example, normalising the dataset (234, 24, 14) by its maximum value would result in approximately (1, 0.10, 0.06). 234 is used as the anchor value and all other values are represented relative to it.
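The sketch below applies two common rescaling formulas to a single column: dividing by the maximum, and min-max scaling, which maps the smallest value to 0 and the largest to 1. The column name “score” is hypothetical, and which formula is appropriate depends on your data.

```python
import pandas as pd

df = pd.DataFrame({"score": [234, 24, 14]})

# Scale relative to the largest value
df["score_max_scaled"] = df["score"] / df["score"].max()

# Min-max scaling: (x - min) / (max - min), giving values between 0 and 1
col_min, col_max = df["score"].min(), df["score"].max()
df["score_min_max"] = (df["score"] - col_min) / (col_max - col_min)

print(df.round(2))
```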

Example Tutorial:
Check out this data science tutorial on how to normalise a column in a pandas dataframe

15. How to assign new columns in a pandas dataframe

There are a couple of reasons why you might want to add new columns during data processing. You might have data in 2 different data frames that you want to bring into a single data frame. Or you might want to add a new column that is the result of a function on 2 or more other columns.

There are multiple ways to add new columns in a pandas dataframe - by declaring a new list as a column, by using dataframe.insert(), by using dataframe.assign(), by using a dictionary.

The dataframe.assign() function adds a new column at the end of the dataframe and returns a new dataframe. You cannot specify the position at which the column is added; for that you will need to use dataframe.insert().

df = df.assign(Marks = [71, 82, 89])

will add a new column “Marks” with the values 71, 82, 89 as the last column in the dataframe. The list must contain one value per row, so this example assumes a dataframe with three rows.
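The different ways of adding a column can be compared in one short sketch. The data, the pass mark of 75 and the “Major” values are all hypothetical.

```python
import pandas as pd

# Hypothetical data with three rows
df = pd.DataFrame({"Student": ["Alan", "Vivian", "Jade"], "Age": [24, 26, 29]})

# assign() returns a new DataFrame with the column added at the end
df = df.assign(Marks=[71, 82, 89])

# A new column computed from an existing one
df["Passed"] = df["Marks"] >= 75

# insert() modifies df in place and places the column at position 1
df.insert(1, "Major", ["Physics", "Biology", "Chemistry"])

print(list(df.columns))
```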

Example Tutorial:
Check out this data science recipe on how to assign new columns in a pandas dataframe

16. How to rank a pandas dataframe in ascending and descending order

By now you must have realised that Python is an excellent language to do data analysis. This is primarily because of the powerful data analysis packages, like pandas, that are available for Python.

Ranking a pandas dataframe returns a rank for every index (row) in the series passed to the function. Both numeric and string values can be ranked by df.rank().

df['coverageRanked'] = df['coverage'].rank(ascending=True)

this statement will create a new column ‘coverageRanked’ and assign to it the ascending ranks of the values in the ‘coverage’ column.
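The sketch below ranks a hypothetical ‘coverage’ column in both directions. By default tied values share the average of their ranks; the optional method="min" argument gives ties the lowest rank of the group instead. The column name ‘coverageRankedDesc’ is made up for this example.

```python
import pandas as pd

# Hypothetical data with one tie (57 appears twice)
df = pd.DataFrame({
    "name": ["Jason", "Molly", "Tina", "Jake"],
    "coverage": [25, 94, 57, 57],
})

# Ascending: the smallest value gets rank 1; ties get the average rank
df["coverageRanked"] = df["coverage"].rank(ascending=True)

# Descending: the largest value gets rank 1; ties get the lowest rank
df["coverageRankedDesc"] = df["coverage"].rank(ascending=False, method="min")

print(df)
```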

Example Tutorial:
Check out this data science tutorial on how to rank a pandas dataframe

17. Add row to a DataFrame

There are several ways to add a row or rows to an existing DataFrame in Python Pandas.

Adding a single row using the DataFrame.loc indexer.

To add the row at the end of the DataFrame, the length of the DataFrame has to be found to determine the position at which the new row is to be added.

import pandas as pd

# named 'data' rather than 'dict' to avoid shadowing the built-in dict type
data = {'Student': ['Peter', 'James', 'Ella', 'Charlotte'],
        'Age': [28, 24, 35, 27],
        'Major': ['Chemistry', 'Biology', 'Physics', 'Chemistry']
       }
# creating a DataFrame from the dict of lists
df = pd.DataFrame(data)

Here, df would look like this:

	Student	Age	Major
0	Peter	28	Chemistry
1	James	24	Biology
2	Ella	35	Physics
3	Charlotte	27	Chemistry

#adding a new row
df.loc[len(df.index)] = ['Mike', 33, 'Physics']

Now, df would look like:

	Student	Age	Major
0	Peter	28	Chemistry
1	James	24	Biology
2	Ella	35	Physics
3	Charlotte	27	Chemistry
4	Mike	33	Physics

Using the DataFrame.append() function

The DataFrame.append() function in Python Pandas may be used to append a single row, or multiple rows belonging to another DataFrame, to the end of a particular DataFrame, returning a new DataFrame object in the process. Any columns which are not present in the original DataFrame are added as new columns, and the new cells created are populated with NaN. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the examples in this section only run on older versions; on current versions, pd.concat() does the same job.

The syntax for the append() function is as follows:

DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)

where:

other : the list of rows to be appended, or a DataFrame object or dictionary object of the rows to be appended.

ignore_index : takes True or False; default is false. If set to True, the index labels are not used.

verify_integrity : takes True or False; default is false. If set to True, ValueError gets raised on creating indexes with duplicates.

sort : sorts the columns if the columns of the original DataFrame and the new rows are not aligned. sort=True is used to silence the warning and sort. sort=False results in silencing the warning and not sorting.

Returns: DataFrame object with appended rows.

Using append() to add a single row:

import pandas as pd

data = {'Student': ['Peter', 'James', 'Ella', 'Charlotte'],
        'Age': [28, 24, 35, 27],
        'Major': ['Chemistry', 'Biology', 'Physics', 'Chemistry']
       }
# creating a DataFrame from the dict of lists
df = pd.DataFrame(data)
new_row = {'Student': 'Mike', 'Age': 29, 'Major': 'Biology'}
# pass the new row itself; requires pandas < 2.0, where append() still exists
df = df.append(new_row, ignore_index = True)

Using append() to add the rows from a new DataFrame to an existing DataFrame.

import pandas as pd
# first DataFrame
df1 = pd.DataFrame({"foo":[1, 2, 3, 4],
                         "bar":[5, 6, 7, 8]})
  
# second DataFrame
df2 = pd.DataFrame({"foo":[9, 8, 7],
                    "bar":[5, 4, 3]})

df1:

	foo	bar
0	1	5
1	2	6
2	3	7
3	4	8

df2:

	foo	bar
0	9	5
1	8	4
2	7	3

df1.append(df2)

Will return

	foo	bar
0	1	5
1	2	6
2	3	7
3	4	8
0	9	5
1	8	4
2	7	3
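DataFrame.append() was removed in pandas 2.0, so on current versions the same results can be obtained with pd.concat(). The sketch below reuses df1 and df2 from above; the values in new_row are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"foo": [1, 2, 3, 4], "bar": [5, 6, 7, 8]})
df2 = pd.DataFrame({"foo": [9, 8, 7], "bar": [5, 4, 3]})

# Equivalent of df1.append(df2): the original index labels are kept
stacked = pd.concat([df1, df2])

# Equivalent of df1.append(df2, ignore_index=True): rows are renumbered
renumbered = pd.concat([df1, df2], ignore_index=True)

# To add a single row, wrap the dict in a one-row DataFrame first
new_row = pd.DataFrame([{"foo": 10, "bar": 2}])
with_row = pd.concat([df1, new_row], ignore_index=True)

print(renumbered)
```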
