How to get descriptive statistics of a Pandas DataFrame?

This recipe helps you get descriptive statistics of a Pandas DataFrame
Last Updated: 06 Jul 2022


Recipe Objective

Before building a model, we need to analyse the data, and for that we need to calculate different statistics of the features.

This data science Python source code does the following:
1. Creates data dictionary and converts it into pandas dataframe
2. Uses describe function on dataframe
3. Performs statistical analysis on the dataset

So this is the recipe on how we can get descriptive statistics of a Pandas DataFrame.


Step 1 - Import the library

import pandas as pd

We have imported pandas, which will be needed for the dataset.

Step 2 - Setting up the Data

We have created a dictionary of data and passed it in pd.DataFrame to make a dataframe with columns 'first_name', 'last_name', 'age', 'Comedy_Score' and 'Rating_Score'.

raw_data = {'first_name': ['Sheldon', 'Raj', 'Leonard', 'Howard', 'Amy'],
            'last_name': ['Copper', 'Koothrappali', 'Hofstadter', 'Wolowitz', 'Fowler'],
            'age': [42, 38, 36, 41, 35],
            'Comedy_Score': [9, 7, 8, 8, 5],
            'Rating_Score': [25, 25, 49, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'Comedy_Score', 'Rating_Score'])
print(df)
print(df.info())

Step 3 - Finding different statistics

So we will be finding different statistics of the features.

First, sum of all the ages

print(df['age'].sum())

Mean of Rating_Score

print(df['Rating_Score'].mean())

Cumulative sum of Rating_Score

print(df['Rating_Score'].cumsum())

Summary statistics on Rating_Score

print(df['Rating_Score'].describe())
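The same describe function can also be called on the whole dataframe rather than on one column. The following is a minimal sketch that rebuilds the Step 2 data so it runs on its own; the variable names numeric_summary and full_summary are only illustrative.

```python
import pandas as pd

# Same data as in Step 2, rebuilt here so the snippet is self-contained
raw_data = {'first_name': ['Sheldon', 'Raj', 'Leonard', 'Howard', 'Amy'],
            'last_name': ['Copper', 'Koothrappali', 'Hofstadter', 'Wolowitz', 'Fowler'],
            'age': [42, 38, 36, 41, 35],
            'Comedy_Score': [9, 7, 8, 8, 5],
            'Rating_Score': [25, 25, 49, 62, 70]}
df = pd.DataFrame(raw_data)

# On a mixed dataframe, describe() summarises only the numeric columns by default
numeric_summary = df.describe()
print(numeric_summary)

# include='all' adds count/unique/top/freq rows for the text columns as well
full_summary = df.describe(include='all')
print(full_summary)
```

With include='all', statistics that do not apply to a column (for example the mean of first_name) are reported as NaN.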

Counting the number of non-NA values

print(df['Rating_Score'].count())
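Because the recipe's data has no missing values, count() returns the same number as the row count. As a small illustration, the sketch below uses a hypothetical series with one missing score (not part of the recipe's dataset) to show where the two differ.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value, for illustration only
scores = pd.Series([25, 25, np.nan, 62, 70], name='Rating_Score')

n_rows = len(scores)             # counts every row, including the missing one
n_valid = scores.count()         # counts only non-NA values
n_missing = scores.isna().sum()  # counts the missing values

print(n_rows, n_valid, n_missing)
```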

Minimum value of Rating_Score

print(df['Rating_Score'].min())

Maximum value of Rating_Score

print(df['Rating_Score'].max())

Median value of Rating_Score

print(df['Rating_Score'].median())

Sample variance of Rating_Score values

print(df['Rating_Score'].var())

Sample standard deviation of Rating_Score values

print(df['Rating_Score'].std())
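The word "sample" matters here: by default var() and std() divide by n - 1 rather than n (ddof=1). The sketch below recomputes the variance by hand for the Rating_Score values to make that explicit; the intermediate names are only illustrative.

```python
import pandas as pd

# Rating_Score values from Step 2
scores = pd.Series([25, 25, 49, 62, 70], name='Rating_Score')

n = len(scores)
# Sum of squared deviations from the mean
squared_dev = ((scores - scores.mean()) ** 2).sum()

sample_var = squared_dev / (n - 1)   # matches scores.var(), i.e. ddof=1
population_var = squared_dev / n     # matches scores.var(ddof=0)
sample_std = sample_var ** 0.5       # matches scores.std()

print(sample_var, population_var, sample_std)
```

Pass ddof=0 to var() or std() if the population versions are wanted instead.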

Skewness of Rating_Score values

print(df['Rating_Score'].skew())

Kurtosis of Rating_Score values

print(df['Rating_Score'].kurt())
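The single-value statistics above can also be collected in one call with agg, which accepts a list of method names. This is an optional shortcut, sketched here on the Rating_Score values; the name stats is only illustrative.

```python
import pandas as pd

# Rating_Score values from Step 2
scores = pd.Series([25, 25, 49, 62, 70], name='Rating_Score')

# Each string is the name of a Series method; the result is a Series
# indexed by those names
stats = scores.agg(['min', 'max', 'median', 'var', 'std', 'skew', 'kurt'])
print(stats)
```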

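Correlation Matrix Of Values

Recent pandas versions raise an error when corr() or cov() is called on a dataframe that still contains text columns, so we select the numeric columns first.

print(df.select_dtypes(include='number').corr())

Finally, Covariance Matrix Of Values

print(df.select_dtypes(include='number').cov())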
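The two matrices are closely related: each correlation is the covariance divided by the product of the two columns' standard deviations. The sketch below checks this on the numeric columns from Step 2; corr_from_cov is an illustrative name, not part of the recipe.

```python
import numpy as np
import pandas as pd

# Numeric columns from Step 2
df = pd.DataFrame({'age': [42, 38, 36, 41, 35],
                   'Comedy_Score': [9, 7, 8, 8, 5],
                   'Rating_Score': [25, 25, 49, 62, 70]})

cov = df.cov()
std = df.std()

# Divide each covariance by std_i * std_j to obtain the correlation
corr_from_cov = cov / np.outer(std, std)

print(corr_from_cov)
print(np.allclose(corr_from_cov.values, df.corr().values))
```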

So the output comes as:

 first_name     last_name  age  Comedy_Score  Rating_Score
0    Sheldon        Copper   42             9            25
1        Raj  Koothrappali   38             7            25
2    Leonard    Hofstadter   36             8            49
3     Howard      Wolowitz   41             8            62
4        Amy        Fowler   35             5            70

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
first_name      5 non-null object
last_name       5 non-null object
age             5 non-null int64
Comedy_Score    5 non-null int64
Rating_Score    5 non-null int64
dtypes: int64(3), object(2)
memory usage: 280.0+ bytes
None

192

46.2

0     25
1     50
2     99
3    161
4    231
Name: Rating_Score, dtype: int64

count     5.000000
mean     46.200000
std      20.753313
min      25.000000
25%      25.000000
50%      49.000000
75%      62.000000
max      70.000000
Name: Rating_Score, dtype: float64

5

25

70

49.0

430.7

20.7533129885327

-0.07499061439128718

-2.6952969741807777

                   age  Comedy_Score  Rating_Score
age           1.000000      0.767579     -0.451895
Comedy_Score  0.767579      1.000000     -0.567136
Rating_Score -0.451895     -0.567136      1.000000

                age  Comedy_Score  Rating_Score
age            9.30          3.55        -28.60
Comedy_Score   3.55          2.30        -17.85
Rating_Score -28.60        -17.85        430.70
