How to reduce dimensionality using PCA in Python?

This recipe helps you reduce dimensionality using PCA in Python.

Recipe Objective

Many datasets have a very large number of features, which makes training a model computationally expensive. To reduce the number of features we can use Principal Component Analysis (PCA). PCA reduces dimensionality by projecting the data onto a smaller set of new axes, the principal components, chosen as the directions along which the data has the most variance.

So this recipe is a short example of how we can reduce dimensionality using PCA in Python.
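To make "directions with the most variance" concrete, here is a minimal sketch of the idea using only NumPy. The toy data, the variable names and the eigendecomposition route are assumptions made for illustration. scikit-learn's PCA computes the same kind of projection through an SVD rather than this exact code.

```python
import numpy as np

# Hypothetical toy data: 200 samples, 3 features.
# The third feature is almost a copy of the first, so it adds little new information.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.05 * rng.normal(size=200)])

# Center the data and compute the covariance matrix of the features.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors of the covariance matrix are the principal directions;
# eigenvalues are the variance of the data along each direction.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort from largest to smallest variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

variance_share = eigvals / eigvals.sum()
print("Share of variance per direction:", variance_share)

# Keep the two directions with the most variance and project the data onto them.
X_reduced = Xc @ eigvecs[:, :2]
print("Reduced shape:", X_reduced.shape)
```

Because the third feature is nearly redundant, almost all of the variance sits in the first two directions, so dropping the last one loses very little information.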

Step 1 - Import the library

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Here we have imported the datasets module, StandardScaler and PCA from scikit-learn. We will see what each of them does when we use it in the code snippets below. For now, just take a look at these imports.

Step 2 - Setup the Data

Here we have used datasets to load the built-in digits dataset.

digits = datasets.load_digits()
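Before scaling, it can help to look at what was loaded. The short sketch below prints a few properties of the dataset; the choice of what to print is ours and is not part of the original recipe. Each sample is an 8x8 image flattened into 64 features, which is where the "Original number of features: 64" in the output further down comes from.

```python
from sklearn import datasets

digits = datasets.load_digits()

# digits.data is the flattened feature matrix: one row per image, one column per pixel.
print("Feature matrix shape:", digits.data.shape)
# digits.images holds the same pixels in their original 2D layout.
print("Image array shape:", digits.images.shape)
print("Pixel value range:", digits.data.min(), "to", digits.data.max())
print("Number of classes:", len(digits.target_names))
```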

Step 3 - Using StandardScaler

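StandardScaler standardizes each feature so that it has a mean of 0 and a standard deviation of 1. It does not remove outliers. PCA is sensitive to the scale of the features, so we standardize the data first. Here we pass the feature matrix digits.data to fit_transform.

X = StandardScaler().fit_transform(digits.data)
print()
print(X)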
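A quick way to check the effect of the scaler is to look at the per-column mean and standard deviation afterwards. This is a minimal sketch, assuming the scaler is applied to digits.data. One detail worth knowing: a column that is constant in the raw data has zero variance, so it stays at 0 after scaling rather than ending up with a standard deviation of 1.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
X = StandardScaler().fit_transform(digits.data)

col_means = X.mean(axis=0)
col_stds = X.std(axis=0)

# Columns that never change in the raw data (for example, always-blank pixels).
constant_cols = np.flatnonzero(digits.data.std(axis=0) == 0)

print("Largest absolute column mean:", np.abs(col_means).max())
print("Distinct column standard deviations:", np.unique(col_stds.round(6)))
print("Constant columns in the raw data:", constant_cols)
```

The constant columns explain why the first value of every row in the scaled output below is exactly 0.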

Step 4 - Using PCA

We now use Principal Component Analysis (PCA), which reduces the number of features by creating new features that retain most of the variance of the original data. We have passed the parameter n_components as 0.85. When n_components is a float between 0 and 1, it is not the fraction of features to keep: PCA keeps the smallest number of components whose combined explained variance exceeds that fraction, here 85%. Setting whiten=True rescales the resulting components to unit variance. We have also printed the number of features in the initial and final datasets.

pca = PCA(n_components=0.85, whiten=True)
X_pca = pca.fit_transform(X)
print(X_pca)
print("Original number of features:", X.shape[1])
print("Reduced number of features:", X_pca.shape[1])

For better understanding we apply PCA again. This time we have passed the parameter n_components as 2. An integer value sets the exact number of components to keep. We have again printed the number of features in the initial and final datasets.

pca = PCA(n_components=2, whiten=True)
X_pca = pca.fit_transform(X)
print(X_pca)
print("Original number of features:", X.shape[1])
print("Reduced number of features:", X_pca.shape[1])

As an output we get:

[[ 0.         -0.33501649 -0.04308102 ... -1.14664746 -0.5056698
 [ 0.         -0.33501649 -1.09493684 ...  0.54856067 -0.5056698
 [ 0.         -0.33501649 -1.09493684 ...  1.56568555  1.6951369
 [ 0.         -0.33501649 -0.88456568 ... -0.12952258 -0.5056698
 [ 0.         -0.33501649 -0.67419451 ...  0.8876023  -0.5056698
 [ 0.         -0.33501649  1.00877481 ...  0.8876023  -0.26113572

[[ 0.70631939 -0.39512814 -1.73816236 ...  0.60320435 -0.94455291
 [ 0.21732591  0.38276482  1.72878893 ... -0.56722002  0.61131544
 [ 0.4804351  -0.13130437  1.33172761 ... -1.51284419 -0.48470912
 [ 0.37732433 -0.0612296   1.0879821  ...  0.04925597  0.29271531
 [ 0.39705007 -0.15768102 -1.08160094 ...  1.31785641  0.38883981
 [-0.46407544 -0.92213976  0.12493334 ... -1.27242756 -0.34190284
Original number of features: 64
Reduced number of features: 25

[[ 0.70634542 -0.39504744]
 [ 0.21730901  0.38270788]
 [ 0.48044955 -0.13126596]
 [ 0.37733004 -0.06120936]
 [ 0.39703595 -0.15774013]
 [-0.46406594 -0.92210953]]
Original number of features: 64
Reduced number of features: 2
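To see why the first run keeps the number of components it does, we can inspect the fitted PCA object. The sketch below is an illustrative addition, assuming the same scaled digits data as above. It uses the fitted attributes explained_variance_ratio_ and n_components_ and also checks what whiten=True does to the output.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(datasets.load_digits().data)

pca = PCA(n_components=0.85, whiten=True)
X_pca = pca.fit_transform(X)

# Running total of the variance explained by the kept components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Components kept:", pca.n_components_)
print("Variance retained by the kept components:", cumulative[-1])
print("Variance retained with one component fewer:", cumulative[-2])

# With whiten=True every output component is rescaled to unit variance.
print("Std of first five components:", X_pca.std(axis=0, ddof=1)[:5])
```

The retained variance crosses 0.85 only when the last kept component is included, which is how a float n_components is turned into a number of components.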
