How to reduce dimentionality on Sparse Matrix in Python?

This recipe helps you reduce dimentionality on Sparse Matrix in Python
Last Updated: 06 Jul 2022

Get access to Data Science projects View all Data Science projects

DATA MUNGING DATA CLEANING PYTHON MACHINE LEARNING RECIPES PANDAS CHEATSHEET ALL TAGS

Recipe Objective

While working on a large dataset with many feature and after creating a Sparse Matrix and training a model it takes a high computational cost. Managing and vizualizing the matrix is also very difficult. So we need to reduce the dimension of the matrix.

So this recipe is a short example of how can reduce dimentionality on Sparse Matrix in Python.

Master the Art of Data Cleaning in Machine Learning

Step 1 - Import the library - GridSearchCv

from sklearn.preprocessing import StandardScaler from sklearn.decomposition import TruncatedSVD from scipy.sparse import csr_matrix from sklearn import datasets

Here we have imported various modules like StandardardScaler, datasets, TruncatedSVD and csr_matrix from differnt libraries. We will understand the use of these later while using it in the in the code snipet.
For now just have a look on these imports.

Step 2 - Setup the Data

Here we have used datasets to load the inbuilt digits dataset. We have used standardscaler to scale the data such that the mean becomes 0 and standard deviation to 1. We have also made a sparse matrix of the data by the function csr_matrix. digits = datasets.load_digits() X = StandardScaler().fit_transform(digits.data) print(X) X_sparse = csr_matrix(X) print(X_sparse)

Step 3 - Using GridSearchCV

We can truncate the marix that is we can reduce the dimension of the matrix by using the function TruncatedSVD with a parameter n_components which shows the final number of fetures we want. So we have fit and transform the matrix in the function to get the truncated matrix. tsvd = TruncatedSVD(n_components=10) X_sparse_tsvd = tsvd.fit(X_sparse).transform(X_sparse) print(); print(X_sparse_tsvd)

Step 4 - Printing Results

Now we are using print statements to print the results. print("Original number of features:", X_sparse.shape[1]) print("Reduced number of features:", X_sparse_tsvd.shape[1]) print(); print(tsvd.explained_variance_ratio_[0:6].sum()) As an output we get:

[[ 0.         -0.33501649 -0.04308102 ... -1.14664746 -0.5056698
  -0.19600752]
 [ 0.         -0.33501649 -1.09493684 ...  0.54856067 -0.5056698
  -0.19600752]
 [ 0.         -0.33501649 -1.09493684 ...  1.56568555  1.6951369
  -0.19600752]
 ...
 [ 0.         -0.33501649 -0.88456568 ... -0.12952258 -0.5056698
  -0.19600752]
 [ 0.         -0.33501649 -0.67419451 ...  0.8876023  -0.5056698
  -0.19600752]
 [ 0.         -0.33501649  1.00877481 ...  0.8876023  -0.26113572
  -0.19600752]]

  (0, 1)	-0.3350164872543856
  (0, 2)	-0.04308101770538793
  (0, 3)	0.2740715207154218
  (0, 4)	-0.6644775126361527
  (0, 5)	-0.8441293865949171
  (0, 6)	-0.40972392088346243
  (0, 7)	-0.1250229232970408
  (0, 8)	-0.05907755711884675
  (0, 9)	-0.6240092623290964
  (0, 10)	0.4829744992519545
  (0, 11)	0.7596224512649244
  (0, 12)	-0.05842586308220443
  (0, 13)	1.1277211297338117
  (0, 14)	0.8795830595483867
  (0, 15)	-0.13043338063115095
  (0, 16)	-0.04462507326885248
  (0, 17)	0.11144272449970435
  (0, 18)	0.8958804382797294
  (0, 19)	-0.8606663175537699
  (0, 20)	-1.1496484601880896
  (0, 21)	0.5154718747277965
  (0, 22)	1.905963466976408
  (0, 23)	-0.11422184388584329
  (0, 24)	-0.03337972630405602
  (0, 25)	0.48648927722411006
  :	:
  (1796, 38)	-0.8226945146290309
  (1796, 40)	-0.061343668908253476
  (1796, 41)	0.8105536026095989
  (1796, 42)	1.3950951873625397
  (1796, 43)	-0.19072005925701047
  (1796, 44)	-0.5868275383619802
  (1796, 45)	1.3634658076459107
  (1796, 46)	0.5874903313016945
  (1796, 47)	-0.08874161717060432
  (1796, 48)	-0.035433262605025426
  (1796, 49)	4.179200682513991
  (1796, 50)	1.505078217025183
  (1796, 51)	0.0881769306516768
  (1796, 52)	-0.26718796251356636
  (1796, 53)	1.2010187221077009
  (1796, 54)	0.8692294429227895
  (1796, 55)	-0.2097851269640334
  (1796, 56)	-0.023596458909150665
  (1796, 57)	0.7715345500122912
  (1796, 58)	0.47875261517372414
  (1796, 59)	-0.020358468129093202
  (1796, 60)	0.4441643511677691
  (1796, 61)	0.8876022965425754
  (1796, 62)	-0.26113572420685327
  (1796, 63)	-0.1960075186604789

[[ 1.91421562 -0.95449937 -3.94604425 ...  1.4963196   0.1160377
  -0.80839011]
 [ 0.58898173  0.9246434   3.92476559 ...  0.55743317  1.08360629
   0.07914133]
 [ 1.30203646 -0.31719139  3.02334129 ...  1.15547162  0.78332798
  -1.12203121]
 ...
 [ 1.02259528 -0.14791152  2.46997819 ...  0.52912028  2.04799351
  -2.0550423 ]
 [ 1.07605482 -0.38090797 -2.45549106 ...  0.76221796  1.07481616
  -0.33991093]
 [-1.25770756 -2.22760395  0.28362814 ... -1.20258084  0.80783614
  -1.84480729]]

Original number of features: 64
Reduced number of features: 10

0.4561203224142434

Download Materials

iPython Notebook

What Users are saying..

Savvy Sahai

Data Science Intern, Capgemini

As a student looking to break into the field of data engineering and data science, one can get really confused as to which path to take. Very few ways to do it are Google, YouTube, etc. I was one of... Read More

Relevant Projects

Machine Learning Projects

Data Science Projects

Python Projects for Data Science

Data Science Projects in R

Machine Learning Projects for Beginners

Deep Learning Projects

Neural Network Projects

Tensorflow Projects

NLP Projects

Kaggle Projects

IoT Projects

Big Data Projects

Hadoop Real-Time Projects Examples

Spark Projects

Data Analytics Projects for Students

Relevant Projects

End-to-End Snowflake Healthcare Analytics Project on AWS-1

In this Snowflake Healthcare Analytics Project, you will leverage Snowflake on AWS to predict patient length of stay (LOS) in hospitals. The prediction of LOS can help in efficient resource allocation, lower the risk of staff/visitor infections, and improve overall hospital functioning.

View Project Details

How to reduce dimentionality on Sparse Matrix in Python?

Recipe Objective

Step 1 - Import the library - GridSearchCv

Step 2 - Setup the Data

Step 3 - Using GridSearchCV

Step 4 - Printing Results

Savvy Sahai

Relevant Projects

You might also like

Relevant Projects