HANDS-ON-LAB

Song Popularity Prediction Machine Learning Project

Problem Statement

People have long had a close relationship with songs and music. Music can improve mood, decrease pain and anxiety, and facilitate opportunities for emotional expression. Research suggests that it can benefit our physical and mental health in numerous ways. Lately, multiple studies have been carried out to understand songs and their popularity based on certain factors. In these studies, song samples are broken down and their parameters are recorded in tabular form. The main aim of this project is to predict song popularity.

The project is simple yet challenging: predict a song's popularity from features such as energy, acoustics, instrumentalness, liveness, and danceability. The dataset is large, and its complexity arises from strong multicollinearity among the features. Can you overcome these obstacles and build a decent predictive model?

Dataset

Kindly download the data from here.

Tasks

Check the count and percentage of missing values in each column. Drop columns which have more than 80% null values.
EDA:

- Find the top 3 numerical features most highly correlated with the target variable. Plot the correlation matrix.
- How do you check for outliers in each column? Create a function that takes in a column and caps its values at bounds derived from the interquartile range (IQR), which limits the effect of outliers.
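
The missing-value check and the correlation ranking described above could be sketched roughly as follows. This is a minimal illustration with pandas, not a reference solution: the helper names, the toy columns, and the target name `song_popularity` are assumptions made for the example, since the actual dataset schema is not given here.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column."""
    counts = df.isna().sum()
    percent = 100 * counts / len(df)
    return pd.DataFrame({"missing_count": counts, "missing_pct": percent})


def drop_sparse_columns(df: pd.DataFrame, threshold_pct: float = 80.0) -> pd.DataFrame:
    """Drop columns whose share of missing values exceeds threshold_pct."""
    summary = missing_summary(df)
    to_drop = summary.index[summary["missing_pct"] > threshold_pct]
    return df.drop(columns=to_drop)


def top_correlated(df: pd.DataFrame, target: str, k: int = 3) -> pd.Series:
    """k numeric features with the largest absolute Pearson correlation with target."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(k)


# Toy data standing in for the real dataset (column names are hypothetical).
toy = pd.DataFrame({
    "energy": [0.1, 0.4, 0.5, 0.9, 0.7, 0.3],
    "liveness": [0.3, 0.1, 0.4, 0.2, 0.5, 0.25],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
    "song_popularity": [12, 40, 55, 88, 70, 30],
})

print(missing_summary(toy))
cleaned = drop_sparse_columns(toy)          # removes "mostly_empty" (5 of 6 missing)
print(top_correlated(cleaned, "song_popularity", k=2))
```

The full matrix returned by `cleaned.corr()` is what a heatmap of the correlation matrix would be drawn from, using whichever plotting library you prefer.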

Is there any multicollinearity among the features? If so, how do you find and remove those variables? (Hint: VIF)
Build Regression models to predict the final song popularity:

- Build a linear regression model on the final data (don’t forget to scale the features).
- Build an XGBoost Regressor on the data (XGBR - 1).
- Tune the hyperparameters of the XGBoost Regressor and select the best model from the experiment (XGBR - 2).
- Compare and contrast the evaluation metrics of the 3 models - RMSE, MAE, R2.
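
One possible way to set up the modelling steps above is sketched below, assuming scikit-learn and the `xgboost` package are installed. The synthetic data, the helper names, and the hyperparameter ranges are illustrative assumptions, not values taken from this lab; the tuned ranges in particular should be adapted to the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_linear_model():
    """Scaler and regressor in one pipeline, so scaling is fitted on training data only."""
    return make_pipeline(StandardScaler(), LinearRegression())


def build_xgb_search(random_state: int = 42):
    """Randomized search over a small, illustrative XGBoost search space (XGBR - 2)."""
    from xgboost import XGBRegressor  # imported lazily; needs the xgboost package

    param_distributions = {
        "n_estimators": [200, 400, 800],
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.05, 0.1],
        "subsample": [0.7, 0.85, 1.0],
    }
    return RandomizedSearchCV(
        XGBRegressor(random_state=random_state),
        param_distributions=param_distributions,
        n_iter=10,
        scoring="neg_root_mean_squared_error",
        cv=3,
        random_state=random_state,
    )


# Synthetic stand-in for the cleaned feature matrix and the popularity target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 50 + X @ np.array([5.0, -3.0, 0.0, 2.0]) + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

linear_model = build_linear_model().fit(X_train, y_train)
linear_r2 = linear_model.score(X_test, y_test)  # R2 on the held-out split
print(f"Linear regression R2: {linear_r2:.3f}")
```

The baseline XGBR - 1 would simply be an `XGBRegressor` with default settings fitted on the same training split, and calling `build_xgb_search().fit(X_train, y_train)` would run the tuning step, with the selected model available as `best_estimator_`. Tree-based models do not need feature scaling, which is why the scaler appears only in the linear pipeline.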

Discover the correlation between musical features and song popularity. Get started with data analysis now!

FAQs

Q1. How do you check for outliers in each column? Create a function that caps values using the interquartile range (IQR).

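Outliers can be detected with the IQR rule: compute the first and third quartiles (Q1 and Q3), take IQR = Q3 - Q1, and flag values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR (1.5 is the conventional multiplier). A function can then cap flagged values at these bounds, which limits the influence of outliers without dropping any rows.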
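
A capping function along these lines might look like the sketch below. The function name `cap_outliers_iqr` and the multiplier default of 1.5 are assumptions for illustration; the lab does not prescribe either.

```python
import pandas as pd


def cap_outliers_iqr(column: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # clip() replaces values below `lower` or above `upper` with the bound itself.
    return column.clip(lower=lower, upper=upper)


values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
capped = cap_outliers_iqr(values)
print(capped.tolist())
```

To apply it to every numeric column of a DataFrame `df`, one option is `df[numeric_cols].apply(cap_outliers_iqr)`, where `numeric_cols` is the list of numeric column names.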

Q2. Is there any multicollinearity among the features? If so, how do you find and remove those variables? (Hint: VIF)

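Multicollinearity can be detected using the Variance Inflation Factor (VIF), which measures how well each feature is explained by the other features. A VIF of 1 means no linear relationship with the other features; values above roughly 5 to 10 are commonly treated as a sign of strong multicollinearity. Features with high VIF can be removed one at a time, recomputing VIF after each removal, until the remaining values are acceptable.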
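
The iterative removal can be sketched with NumPy alone, using the identity that the VIF of each feature equals the corresponding diagonal entry of the inverse of the feature correlation matrix. The function names, the threshold of 10, and the toy columns below are assumptions for illustration; the sketch also assumes no two columns are perfectly collinear, since the correlation matrix would then be singular. statsmodels offers a `variance_inflation_factor` function that is commonly used for the same purpose.

```python
import numpy as np
import pandas as pd


def vif_table(features: pd.DataFrame) -> pd.Series:
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(features.to_numpy(dtype=float), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF")


def drop_high_vif(features: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Repeatedly drop the highest-VIF column until all VIFs are <= threshold."""
    kept = features.copy()
    while kept.shape[1] > 1:
        vif = vif_table(kept)
        if vif.max() <= threshold:
            break
        kept = kept.drop(columns=vif.idxmax())
    return kept


# Toy features: "loudness" is almost a linear function of "energy".
rng = np.random.default_rng(0)
energy = rng.normal(size=200)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": 2 * energy + rng.normal(scale=0.05, size=200),
    "liveness": rng.normal(size=200),
})

print(vif_table(toy))
reduced = drop_high_vif(toy)
print(list(reduced.columns))
```

Dropping one column at a time matters because removing a single feature usually lowers the VIF of the features it was correlated with, so removing every high-VIF column in one pass can discard more information than necessary.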

Q3. How do the evaluation metrics (RMSE, MAE, R2) compare among the three models?

The three models can be compared on the same held-out data using these metrics. RMSE is the square root of the mean squared error, so it penalizes large errors more heavily; MAE is the mean absolute error; and R2 is the proportion of the variance in the target variable explained by the model. Lower RMSE and MAE and higher R2 indicate a better model for song popularity prediction.
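
As a small worked illustration of the three metrics, the sketch below computes them directly from their definitions with NumPy. The function name and the sample values are made up for the example; scikit-learn provides equivalent metric functions if you prefer not to write your own.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """RMSE, MAE and R2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    mae = float(np.mean(np.abs(errors)))
    ss_res = float(np.sum(errors ** 2))                     # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": rmse, "MAE": mae, "R2": r2}


# Hypothetical popularity scores and predictions.
metrics = regression_metrics([50, 60, 70, 80], [52, 58, 73, 80])
print(metrics)
```

Calling the same function on the test-set predictions of the linear model, XGBR - 1, and XGBR - 2 gives three rows that can be placed side by side for the comparison.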