How to extract features using PCA in Python?

This recipe helps you extract features using PCA in Python

Recipe Objective

Many datasets contain a very large number of features, which makes training a model computationally expensive. To decrease the number of features we can use principal component analysis (PCA). PCA reduces the number of features by projecting the data onto a small number of new axes (principal components), chosen as the directions along which the data has the most variance.

So this recipe is a short example of how to extract features using PCA in Python.
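To make the idea of "directions with the most variance" concrete, here is a minimal NumPy sketch of what PCA computes. It is an illustration only, not the code used in this recipe: the synthetic data, the seed and the names X_demo, X_centered and X_reduced are assumptions made for the example.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# where the second feature is almost a copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X_demo = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

# Center each feature, then take the SVD of the centered matrix.
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions. The variance captured by each
# direction is S**2 / (n_samples - 1), sorted from largest to smallest.
explained_variance = S**2 / (X_centered.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Keep only the first two directions: 3 features become 2 new features.
X_reduced = X_centered @ Vt[:2].T

print(explained_ratio)
print(X_reduced.shape)  # (200, 2)
```

Because two of the three features carry nearly the same information, the last component should account for only a small share of the variance, which is why dropping it loses little.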

Step 1 - Import the library

from sklearn import decomposition, datasets
from sklearn.preprocessing import StandardScaler

Here we have imported the decomposition and datasets modules and the StandardScaler class from scikit-learn. We will see what each of them does when we use it in the code snippets below.
For now, just take a look at these imports.

Step 2 - Setup the Data

Here we have used datasets to load the built-in breast cancer dataset, and we have created objects X and y to store the data and the target values respectively.

dataset = datasets.load_breast_cancer()
X = dataset.data
y = dataset.target
print(X.shape)
print(X)
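Before reducing the features it can help to see what they are. The following is a small sketch, assuming scikit-learn is installed; it uses the feature_names and target_names attributes of the object returned by load_breast_cancer.

```python
from sklearn import datasets

dataset = datasets.load_breast_cancer()
X = dataset.data      # feature matrix
y = dataset.target    # class labels (0 or 1)

print(X.shape)                    # (569, 30): 569 samples, 30 features
print(y.shape)                    # (569,)
print(dataset.feature_names[:5])  # names of the first five features
print(dataset.target_names)       # names of the two classes
```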

Step 3 - Using StandardScaler and PCA

StandardScaler scales each feature so that it has a mean of 0 and a standard deviation of 1. It does not remove outliers; it only puts all features on a comparable scale, which matters because PCA is sensitive to the scale of the features. So we are creating an object std_slc to use StandardScaler.

std_slc = StandardScaler()
X_std = std_slc.fit_transform(X)
print(X_std.shape)
print(X_std)
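A quick way to confirm what the scaler does is to check the per-feature mean and standard deviation after scaling. This is a minimal sketch, assuming scikit-learn and NumPy are installed; the rounding to 6 decimals is only for readable printing.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_breast_cancer().data
X_std = StandardScaler().fit_transform(X)

# Per-feature mean and standard deviation after scaling (first five features).
print(np.round(X_std.mean(axis=0)[:5], 6))
print(np.round(X_std.std(axis=0)[:5], 6))

# The number of rows is unchanged: scaling rescales values, it drops nothing.
print(X.shape, X_std.shape)
```

The means should be 0 and the standard deviations 1 up to floating-point error, and both arrays keep the same shape.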

We are also using Principal Component Analysis (PCA), which reduces the number of features by creating new features that retain most of the variance of the original data. We have passed the parameter n_components as 4, which is the number of features in the final dataset.

pca = decomposition.PCA(n_components=4)
X_std_pca = pca.fit_transform(X_std)
print(X_std_pca.shape)
print(X_std_pca)

As an output we get the shape and values of X, then of X_std, then of X_std_pca:

(569, 30)

[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]

(569, 30)

[[ 1.09706398 -2.07333501  1.26993369 ...  2.29607613  2.75062224
 [ 1.82982061 -0.35363241  1.68595471 ...  1.0870843  -0.24388967
 [ 1.57988811  0.45618695  1.56650313 ...  1.95500035  1.152255
 [ 0.70228425  2.0455738   0.67267578 ...  0.41406869 -1.10454895
 [ 1.83834103  2.33645719  1.98252415 ...  2.28998549  1.91908301
 [-1.80840125  1.22179204 -1.81438851 ... -1.74506282 -0.04813821

(569, 4)

[[ 9.19283682  1.94858315 -1.12316659  3.63373524]
 [ 2.3878018  -3.76817178 -0.52929307  1.1182629 ]
 [ 5.73389628 -1.07517381 -0.55174687  0.91208083]
 [ 1.25617928 -1.90229673  0.56273054 -2.0892281 ]
 [10.37479406  1.67201009 -1.87702907 -2.35603254]
 [-5.4752433  -0.67063675  1.49044361 -2.29915639]]
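To judge how much information the 4 components keep, one option is to look at the explained_variance_ratio_ attribute of the fitted PCA object. The sketch below assumes scikit-learn is installed; the second part, which passes a fraction as n_components, is an alternative not used in this recipe, and the 0.95 threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn import datasets, decomposition
from sklearn.preprocessing import StandardScaler

X = datasets.load_breast_cancer().data
X_std = StandardScaler().fit_transform(X)

# Same setting as the recipe: keep 4 components.
pca = decomposition.PCA(n_components=4)
X_std_pca = pca.fit_transform(X_std)

# Fraction of the total variance captured by each component, and in total.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())

# Alternative: keep as many components as needed to retain 95% of the variance.
# A float n_components between 0 and 1 is used with svd_solver="full".
pca_95 = decomposition.PCA(n_components=0.95, svd_solver="full")
X_std_pca_95 = pca_95.fit_transform(X_std)
print(pca_95.n_components_)
```

If the total for 4 components is lower than you need, increasing n_components or using the fractional form is a simple way to trade more features for more retained variance.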
