How to reduce dimensionality on a Sparse Matrix in Python?


This recipe helps you reduce dimensionality on a sparse matrix in Python.


Recipe Objective

When working on a large dataset with many features, building a sparse matrix and training a model on it carries a high computational cost. Managing and visualizing such a matrix is also difficult. So we need to reduce the dimensionality of the matrix.

So this recipe is a short example of how to reduce dimensionality on a sparse matrix in Python.

Step 1 - Import the libraries

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
from sklearn import datasets

Here we have imported StandardScaler, datasets, TruncatedSVD and csr_matrix from different libraries. We will see how each one is used in the code snippets below; for now, just take note of the imports.

Step 2 - Setup the Data

Here we have used datasets to load the built-in digits dataset, and StandardScaler to scale the data so that each feature has mean 0 and standard deviation 1. We have also built a sparse matrix from the scaled data with the csr_matrix function.

digits = datasets.load_digits()
X = StandardScaler().fit_transform(digits.data)
print(X)
X_sparse = csr_matrix(X)
print(X_sparse)
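As a side note, the point of the CSR format is memory: only nonzero entries are stored. The sketch below (an addition to this recipe, not part of it) compares the bytes used by a dense array and its CSR counterpart on synthetic data that is roughly 90% zeros; the scaled digits data itself is mostly nonzero, so the benefit shows up only for genuinely sparse data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Synthetic data with ~90% zeros, to make the sparsity benefit visible.
rng = np.random.default_rng(0)
dense = rng.random((1000, 64))
dense[dense < 0.9] = 0
sparse = csr_matrix(dense)

# CSR stores three arrays: values, column indices, and row pointers.
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print("Dense bytes: ", dense_bytes)
print("Sparse bytes:", sparse_bytes)
```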

Step 3 - Using TruncatedSVD

We can truncate the matrix, that is, reduce its dimensionality, with the TruncatedSVD function. Its n_components parameter sets the final number of features we want. We fit the model on the sparse matrix and then transform it to get the truncated matrix.

tsvd = TruncatedSVD(n_components=10)
X_sparse_tsvd = tsvd.fit(X_sparse).transform(X_sparse)
print(); print(X_sparse_tsvd)
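To get a sense of what truncation discards, TruncatedSVD also offers inverse_transform, which maps the 10-component representation back into the original 64-dimensional space. This is a small self-contained sketch added here for illustration, not part of the original recipe:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix

# Same setup as the recipe: scale the digits data, store it as CSR.
digits = datasets.load_digits()
X = StandardScaler().fit_transform(digits.data)
X_sparse = csr_matrix(X)

# Reduce 64 features to 10, then project back to 64 dimensions.
tsvd = TruncatedSVD(n_components=10)
X_reduced = tsvd.fit_transform(X_sparse)
X_approx = tsvd.inverse_transform(X_reduced)

print("Reduced shape:", X_reduced.shape)
# The reconstruction error measures how much information truncation lost.
print("Reconstruction error:", np.linalg.norm(X - X_approx))
```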

Step 4 - Printing Results

Now we use print statements to show the results: the original and reduced number of features, and the share of variance explained by the first six components.

print("Original number of features:", X_sparse.shape[1])
print("Reduced number of features:", X_sparse_tsvd.shape[1])
print(); print(tsvd.explained_variance_ratio_[0:6].sum())

As an output we get:


[[ 0.         -0.33501649 -0.04308102 ... -1.14664746 -0.5056698
  -0.19600752]
 [ 0.         -0.33501649 -1.09493684 ...  0.54856067 -0.5056698
  -0.19600752]
 [ 0.         -0.33501649 -1.09493684 ...  1.56568555  1.6951369
  -0.19600752]
 ...
 [ 0.         -0.33501649 -0.88456568 ... -0.12952258 -0.5056698
  -0.19600752]
 [ 0.         -0.33501649 -0.67419451 ...  0.8876023  -0.5056698
  -0.19600752]
 [ 0.         -0.33501649  1.00877481 ...  0.8876023  -0.26113572
  -0.19600752]]

  (0, 1)	-0.3350164872543856
  (0, 2)	-0.04308101770538793
  (0, 3)	0.2740715207154218
  (0, 4)	-0.6644775126361527
  (0, 5)	-0.8441293865949171
  (0, 6)	-0.40972392088346243
  (0, 7)	-0.1250229232970408
  (0, 8)	-0.05907755711884675
  (0, 9)	-0.6240092623290964
  (0, 10)	0.4829744992519545
  (0, 11)	0.7596224512649244
  (0, 12)	-0.05842586308220443
  (0, 13)	1.1277211297338117
  (0, 14)	0.8795830595483867
  (0, 15)	-0.13043338063115095
  (0, 16)	-0.04462507326885248
  (0, 17)	0.11144272449970435
  (0, 18)	0.8958804382797294
  (0, 19)	-0.8606663175537699
  (0, 20)	-1.1496484601880896
  (0, 21)	0.5154718747277965
  (0, 22)	1.905963466976408
  (0, 23)	-0.11422184388584329
  (0, 24)	-0.03337972630405602
  (0, 25)	0.48648927722411006
  :	:
  (1796, 38)	-0.8226945146290309
  (1796, 40)	-0.061343668908253476
  (1796, 41)	0.8105536026095989
  (1796, 42)	1.3950951873625397
  (1796, 43)	-0.19072005925701047
  (1796, 44)	-0.5868275383619802
  (1796, 45)	1.3634658076459107
  (1796, 46)	0.5874903313016945
  (1796, 47)	-0.08874161717060432
  (1796, 48)	-0.035433262605025426
  (1796, 49)	4.179200682513991
  (1796, 50)	1.505078217025183
  (1796, 51)	0.0881769306516768
  (1796, 52)	-0.26718796251356636
  (1796, 53)	1.2010187221077009
  (1796, 54)	0.8692294429227895
  (1796, 55)	-0.2097851269640334
  (1796, 56)	-0.023596458909150665
  (1796, 57)	0.7715345500122912
  (1796, 58)	0.47875261517372414
  (1796, 59)	-0.020358468129093202
  (1796, 60)	0.4441643511677691
  (1796, 61)	0.8876022965425754
  (1796, 62)	-0.26113572420685327
  (1796, 63)	-0.1960075186604789

[[ 1.91421562 -0.95449937 -3.94604425 ...  1.4963196   0.1160377
  -0.80839011]
 [ 0.58898173  0.9246434   3.92476559 ...  0.55743317  1.08360629
   0.07914133]
 [ 1.30203646 -0.31719139  3.02334129 ...  1.15547162  0.78332798
  -1.12203121]
 ...
 [ 1.02259528 -0.14791152  2.46997819 ...  0.52912028  2.04799351
  -2.0550423 ]
 [ 1.07605482 -0.38090797 -2.45549106 ...  0.76221796  1.07481616
  -0.33991093]
 [-1.25770756 -2.22760395  0.28362814 ... -1.20258084  0.80783614
  -1.84480729]]

Original number of features: 64
Reduced number of features: 10

0.4561203224142434
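The first six components explain only about 46% of the variance, which raises the practical question of how to choose n_components. One common approach (a hedged sketch building on the recipe above, not part of the original) is to fit TruncatedSVD once with the maximum allowed number of components and read off the cumulative explained variance, here targeting 90%:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix

digits = datasets.load_digits()
X_sparse = csr_matrix(StandardScaler().fit_transform(digits.data))

# TruncatedSVD requires n_components strictly less than the number
# of features, so 63 is the maximum for the 64-feature digits data.
tsvd = TruncatedSVD(n_components=X_sparse.shape[1] - 1)
tsvd.fit(X_sparse)

# Smallest number of components whose cumulative variance ratio >= 0.90.
cumulative = np.cumsum(tsvd.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print("Components needed for 90% variance:", n_components)
```

With the chosen count in hand, one would refit TruncatedSVD(n_components=n_components) as in Step 3.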
