How to reduce dimensionality on a Sparse Matrix in Python?

This recipe helps you reduce the dimensionality of a sparse matrix in Python.

Recipe Objective

Training a model on a large dataset with many features is computationally expensive, even when the data is stored as a sparse matrix. A matrix with many columns is also difficult to manage and visualize. So we need to reduce the dimension of the matrix.

This recipe is a short example of how to reduce the dimensionality of a sparse matrix in Python.

Step 1 - Import the libraries

```
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
from sklearn import datasets
```

Here we have imported StandardScaler, datasets, TruncatedSVD and csr_matrix from their respective libraries. Each one is explained where it is used in the code snippets that follow.

Step 2 - Setup the Data

Here we use datasets to load the built-in digits dataset and StandardScaler to scale each feature to a mean of 0 and a standard deviation of 1. We then convert the scaled data to a sparse matrix with the function csr_matrix.

```
digits = datasets.load_digits()
X = StandardScaler().fit_transform(digits.data)
print(X)

X_sparse = csr_matrix(X)
print(X_sparse)
```
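
One caveat is worth checking here: the default StandardScaler subtracts each column's mean, which turns most zero entries into nonzero values, so the result is sparse in storage format only. The sketch below is not part of the original recipe. It compares the fraction of stored nonzero entries for the raw data, the centered data, and data scaled with with_mean=False, which leaves zeros untouched. The density helper and the variable names are illustrative assumptions.

```python
from scipy.sparse import csr_matrix
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()


def density(matrix):
    """Fraction of entries that are stored as nonzero values (hypothetical helper)."""
    return matrix.nnz / (matrix.shape[0] * matrix.shape[1])


# Raw pixel data: csr_matrix keeps only the nonzero entries.
raw_sparse = csr_matrix(digits.data)

# Default scaling subtracts the column mean, so most zeros become nonzero.
centered_sparse = csr_matrix(StandardScaler().fit_transform(digits.data))

# with_mean=False only divides by the standard deviation, so zeros stay zeros
# and the scaler can work directly on the sparse input.
scaled_sparse = StandardScaler(with_mean=False).fit_transform(raw_sparse)

print("Raw density:              ", round(density(raw_sparse), 3))
print("Centered density:         ", round(density(centered_sparse), 3))
print("Scaled (no centering):    ", round(density(scaled_sparse), 3))
```

If preserving sparsity matters for memory use, the with_mean=False variant may be the better fit. TruncatedSVD does not center its input, so it accepts either version.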

Step 3 - Using TruncatedSVD

We can truncate the matrix, that is, reduce its dimension, by using TruncatedSVD. Its n_components parameter sets the final number of features we want. We fit the model on the sparse matrix and then transform the matrix to get the truncated version.

```
tsvd = TruncatedSVD(n_components=10)
X_sparse_tsvd = tsvd.fit(X_sparse).transform(X_sparse)
print()
print(X_sparse_tsvd)
```
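
The value n_components=10 above is a fixed choice. As a rough sketch of one alternative, and not the recipe's own method, the number of components can be picked from the cumulative explained variance. The 0.90 target, the random_state value and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
X_sparse = csr_matrix(StandardScaler().fit_transform(digits.data))
n_features = X_sparse.shape[1]

# Fit once with n_features - 1 components, a value that is valid for
# both the 'randomized' and the 'arpack' solver.
tsvd_full = TruncatedSVD(n_components=n_features - 1, random_state=0)
tsvd_full.fit(X_sparse)

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(tsvd_full.explained_variance_ratio_)

# Assumed target: keep enough components to explain 90% of the variance.
target = 0.90
n_components = int(np.searchsorted(cumulative, target) + 1)
# Guard in case the target is never reached.
n_components = min(n_components, n_features - 1)

# Refit with the selected number of components and reduce the matrix.
tsvd = TruncatedSVD(n_components=n_components, random_state=0)
X_reduced = tsvd.fit_transform(X_sparse)

print("Components selected:", n_components)
print("Reduced shape:", X_reduced.shape)
```

Setting random_state makes the run repeatable, since the default solver in TruncatedSVD is randomized.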

Step 4 - Printing Results

Now we print the original and reduced number of features, followed by the share of variance explained by the first six components.

```
print("Original number of features:", X_sparse.shape[1])
print("Reduced number of features:", X_sparse_tsvd.shape[1])
print()
print(tsvd.explained_variance_ratio_[0:6].sum())
```

As an output we get:

```
[[ 0.         -0.33501649 -0.04308102 ... -1.14664746 -0.5056698
-0.19600752]
[ 0.         -0.33501649 -1.09493684 ...  0.54856067 -0.5056698
-0.19600752]
[ 0.         -0.33501649 -1.09493684 ...  1.56568555  1.6951369
-0.19600752]
...
[ 0.         -0.33501649 -0.88456568 ... -0.12952258 -0.5056698
-0.19600752]
[ 0.         -0.33501649 -0.67419451 ...  0.8876023  -0.5056698
-0.19600752]
[ 0.         -0.33501649  1.00877481 ...  0.8876023  -0.26113572
-0.19600752]]

(0, 1)	-0.3350164872543856
(0, 2)	-0.04308101770538793
(0, 3)	0.2740715207154218
(0, 4)	-0.6644775126361527
(0, 5)	-0.8441293865949171
(0, 6)	-0.40972392088346243
(0, 7)	-0.1250229232970408
(0, 8)	-0.05907755711884675
(0, 9)	-0.6240092623290964
(0, 10)	0.4829744992519545
(0, 11)	0.7596224512649244
(0, 12)	-0.05842586308220443
(0, 13)	1.1277211297338117
(0, 14)	0.8795830595483867
(0, 15)	-0.13043338063115095
(0, 16)	-0.04462507326885248
(0, 17)	0.11144272449970435
(0, 18)	0.8958804382797294
(0, 19)	-0.8606663175537699
(0, 20)	-1.1496484601880896
(0, 21)	0.5154718747277965
(0, 22)	1.905963466976408
(0, 23)	-0.11422184388584329
(0, 24)	-0.03337972630405602
(0, 25)	0.48648927722411006
:	:
(1796, 38)	-0.8226945146290309
(1796, 40)	-0.061343668908253476
(1796, 41)	0.8105536026095989
(1796, 42)	1.3950951873625397
(1796, 43)	-0.19072005925701047
(1796, 44)	-0.5868275383619802
(1796, 45)	1.3634658076459107
(1796, 46)	0.5874903313016945
(1796, 47)	-0.08874161717060432
(1796, 48)	-0.035433262605025426
(1796, 49)	4.179200682513991
(1796, 50)	1.505078217025183
(1796, 51)	0.0881769306516768
(1796, 52)	-0.26718796251356636
(1796, 53)	1.2010187221077009
(1796, 54)	0.8692294429227895
(1796, 55)	-0.2097851269640334
(1796, 56)	-0.023596458909150665
(1796, 57)	0.7715345500122912
(1796, 58)	0.47875261517372414
(1796, 59)	-0.020358468129093202
(1796, 60)	0.4441643511677691
(1796, 61)	0.8876022965425754
(1796, 62)	-0.26113572420685327
(1796, 63)	-0.1960075186604789

[[ 1.91421562 -0.95449937 -3.94604425 ...  1.4963196   0.1160377
-0.80839011]
[ 0.58898173  0.9246434   3.92476559 ...  0.55743317  1.08360629
0.07914133]
[ 1.30203646 -0.31719139  3.02334129 ...  1.15547162  0.78332798
-1.12203121]
...
[ 1.02259528 -0.14791152  2.46997819 ...  0.52912028  2.04799351
-2.0550423 ]
[ 1.07605482 -0.38090797 -2.45549106 ...  0.76221796  1.07481616
-0.33991093]
[-1.25770756 -2.22760395  0.28362814 ... -1.20258084  0.80783614
-1.84480729]]

Original number of features: 64
Reduced number of features: 10

0.4561203224142434
```
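
Note that the last print statement sums only the first six explained variance ratios, not all ten. A minimal sketch, assuming the same data setup as in the steps above plus a fixed random_state (an addition that is not in the original code), shows the per-component ratios and both totals:

```python
from scipy.sparse import csr_matrix
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
X_sparse = csr_matrix(StandardScaler().fit_transform(digits.data))

# random_state is an assumption added here for repeatable results.
tsvd = TruncatedSVD(n_components=10, random_state=0)
tsvd.fit(X_sparse)

# Share of the total variance captured by each of the ten components.
ratios = tsvd.explained_variance_ratio_
for index, ratio in enumerate(ratios, start=1):
    print(f"Component {index}: {ratio:.4f}")

first_six = ratios[:6].sum()  # what the recipe prints
all_ten = ratios.sum()        # variance retained by the full reduced matrix
print("First six components:", round(first_six, 4))
print("All ten components:  ", round(all_ten, 4))
```

The total over all ten components is the figure to look at when judging how much information the reduced matrix retains.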
