How to get descriptive statistics of a Pandas DataFrame?

This recipe helps you get descriptive statistics of a Pandas DataFrame

Recipe Objective

Before building a model we need to analyse the data, and for that we need to calculate different statistics of the features.

This data science Python source code does the following:
1. Creates data dictionary and converts it into pandas dataframe
2. Uses describe function on dataframe
3. Performs statistical analysis on the dataset

So this is the recipe on how we can get descriptive statistics of a Pandas DataFrame


Step 1 - Import the library

import pandas as pd

We have imported pandas, which will be needed for the dataset.

Step 2 - Setting up the Data

We have created a dictionary of data and passed it in pd.DataFrame to make a dataframe with columns 'first_name', 'last_name', 'age', 'Comedy_Score' and 'Rating_Score'.

raw_data = {'first_name': ['Sheldon', 'Raj', 'Leonard', 'Howard', 'Amy'],
            'last_name': ['Copper', 'Koothrappali', 'Hofstadter', 'Wolowitz', 'Fowler'],
            'age': [42, 38, 36, 41, 35],
            'Comedy_Score': [9, 7, 8, 8, 5],
            'Rating_Score': [25, 25, 49, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'Comedy_Score', 'Rating_Score'])
print(df)
print(df.info())

Step 3 - Finding different statistics

Now we will compute different statistics of the features.

    • First, sum of all the ages

print(df['age'].sum())

    • Mean of Rating_Score

print(df['Rating_Score'].mean())

    • Cumulative sum of Rating_Score

print(df['Rating_Score'].cumsum())

    • Summary statistics on Rating_Score

print(df['Rating_Score'].describe())

    • Counting the number of non-NA values

print(df['Rating_Score'].count())

    • Minimum value of Rating_Score

print(df['Rating_Score'].min())

    • Maximum value of Rating_Score

print(df['Rating_Score'].max())

    • Median value of Rating_Score

print(df['Rating_Score'].median())

    • Sample variance of Rating_Score values

print(df['Rating_Score'].var())

    • Sample standard deviation of Rating_Score values

print(df['Rating_Score'].std())

    • Skewness of Rating_Score values

print(df['Rating_Score'].skew())

    • Kurtosis of Rating_Score values

print(df['Rating_Score'].kurt())

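    • Correlation matrix of the numeric columns (the name columns are left out, because recent pandas versions raise an error when corr() is given non-numeric data)

print(df.select_dtypes(include='number').corr())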

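    • Finally, covariance matrix of the numeric columns (the name columns are left out for the same reason: recent pandas versions raise an error when cov() is given non-numeric data)

print(df.select_dtypes(include='number').cov())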

So the output comes as:

  first_name     last_name  age  Comedy_Score  Rating_Score
0    Sheldon        Copper   42             9            25
1        Raj  Koothrappali   38             7            25
2    Leonard    Hofstadter   36             8            49
3     Howard      Wolowitz   41             8            62
4        Amy        Fowler   35             5            70

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
first_name      5 non-null object
last_name       5 non-null object
age             5 non-null int64
Comedy_Score    5 non-null int64
Rating_Score    5 non-null int64
dtypes: int64(3), object(2)
memory usage: 280.0+ bytes
None

192

46.2

0     25
1     50
2     99
3    161
4    231
Name: Rating_Score, dtype: int64

count     5.000000
mean     46.200000
std      20.753313
min      25.000000
25%      25.000000
50%      49.000000
75%      62.000000
max      70.000000
Name: Rating_Score, dtype: float64

5

25

70

49.0

430.7

20.7533129885327

-0.07499061439128718

-2.6952969741807777

                   age  Comedy_Score  Rating_Score
age           1.000000      0.767579     -0.451895
Comedy_Score  0.767579      1.000000     -0.567136
Rating_Score -0.451895     -0.567136      1.000000

                age  Comedy_Score  Rating_Score
age            9.30          3.55        -28.60
Comedy_Score   3.55          2.30        -17.85
Rating_Score -28.60        -17.85        430.70
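
The recipe calls each statistic one at a time on a single column. As a minimal sketch, assuming the same data as in Step 2, the same numbers can be obtained for all numeric columns at once with describe() and agg(). The variable names numeric, summary, stats and manual_sample_var are illustrative and not part of the original recipe. The last lines also check by hand that var() uses the sample formula (division by n - 1) by default.

```python
import pandas as pd

# Same data as in Step 2 of the recipe.
raw_data = {
    'first_name': ['Sheldon', 'Raj', 'Leonard', 'Howard', 'Amy'],
    'last_name': ['Copper', 'Koothrappali', 'Hofstadter', 'Wolowitz', 'Fowler'],
    'age': [42, 38, 36, 41, 35],
    'Comedy_Score': [9, 7, 8, 8, 5],
    'Rating_Score': [25, 25, 49, 62, 70],
}
df = pd.DataFrame(raw_data)

# Keep only the numeric columns so every statistic below is well defined.
numeric = df.select_dtypes(include='number')

# describe() on a DataFrame summarises every numeric column at once.
summary = numeric.describe()
print(summary)

# agg() with a list of method names returns one row per statistic.
stats = numeric.agg(['sum', 'mean', 'median', 'var', 'std'])
print(stats)

# var() and std() default to ddof=1, i.e. sample statistics (divide by n - 1).
scores = df['Rating_Score']
sample_var = scores.var()
population_var = scores.var(ddof=0)  # divide by n instead
manual_sample_var = ((scores - scores.mean()) ** 2).sum() / (len(scores) - 1)
print(sample_var, population_var, manual_sample_var)
```

The sample variance computed by hand agrees with the 430.7 printed for Rating_Score in the output above, while ddof=0 gives the smaller population variance. Which of the two is appropriate depends on whether the rows are treated as a sample or as the whole population.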
