HANDS-ON-LAB

Ship Listing Price Prediction Data Science Project in Python

Problem Statement

ABC is an online marketplace for buying and selling ships, primarily in Europe. The company wants to improve its revenue and customer experience by accurately predicting the listing price of each ship from its metadata and the description provided by the seller. The objective is to build a machine-learning model that predicts the listing price as accurately as possible. This will help the company make data-driven pricing decisions, increase profitability, and attract more customers to its platform.

Dataset

The dataset mainly contains European ship listings. It consists of 15 variables; the target variable is the listing price of the ship. The variables include the ship's length, fuel type, hull material, and the seller's description. Check out the complete data dictionary and download the Online Ship Listing Dataset.

Tasks

  1. Explore the questions below and summarize your insights.

    • What is the distribution of prices, and what is the average ship price in the data?

    • Is there a difference in average price between ships built before 2000 and ships built in 2000 or later?

    • How do the ship's length and hull material affect its price?
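The exploratory questions above could be approached with a few pandas aggregations. The following is a minimal sketch on made-up rows; the column names `PRICE`, `YEAR_BUILT`, `LENGTH`, and `HULL_MATERIAL` are assumptions and should be replaced with the names from the data dictionary.

```python
import pandas as pd

# Tiny made-up frame; the column names PRICE, YEAR_BUILT, LENGTH and
# HULL_MATERIAL are assumptions, not the dataset's documented schema.
df = pd.DataFrame({
    "PRICE": [50_000, 120_000, 80_000, 300_000, 45_000, 210_000],
    "YEAR_BUILT": [1985, 2005, 1998, 2015, 1990, 2010],
    "LENGTH": [8.0, 12.5, 9.5, 18.0, 7.5, 15.0],
    "HULL_MATERIAL": ["GRP", "Steel", "GRP", "Aluminium", "Wood", "Steel"],
})

# Distribution and average of the listing price.
price_summary = df["PRICE"].describe()
average_price = df["PRICE"].mean()

# Average price for ships built before 2000 versus 2000 or later.
era = df["YEAR_BUILT"].ge(2000).map({False: "before_2000", True: "2000_or_later"})
price_by_era = df.groupby(era)["PRICE"].mean()

# Length versus price (Pearson correlation) and mean price per hull material.
length_price_corr = df["LENGTH"].corr(df["PRICE"])
price_by_hull = (
    df.groupby("HULL_MATERIAL")["PRICE"].mean().sort_values(ascending=False)
)

print(price_summary, price_by_era, price_by_hull, sep="\n\n")
print(f"average price: {average_price:.0f}")
print(f"length/price correlation: {length_price_corr:.2f}")
```

Listing prices are often right-skewed, so it may be worth comparing the median with the mean, or plotting a histogram of the log price, before drawing conclusions from averages.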

  2. Data preprocessing

    • Remove rows with null values in the description and price columns

    • Check the average word count in the description column (Hint: Regex)

    • Count the words in each description and remove records whose description has fewer than 20 words. (Hint: Regex)
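The preprocessing steps above might look like the sketch below. The column names `DESCRIPTION` and `PRICE`, the sample rows, and the regex definition of a "word" (a run of letters, digits, or apostrophes) are all assumptions made for illustration.

```python
import re

import pandas as pd

# Hypothetical column names DESCRIPTION and PRICE; the rows are made up.
df = pd.DataFrame({
    "DESCRIPTION": [
        "Well kept sailing yacht",
        None,
        " ".join(["word"] * 25),
        "Spacious motor boat with two cabins, new engine fitted in 2019, "
        "full service history, teak deck, bow thruster and a berth in Kiel included.",
    ],
    "PRICE": [40_000, 55_000, None, 90_000],
})

# 1. Drop rows with a missing description or a missing price.
df = df.dropna(subset=["DESCRIPTION", "PRICE"]).copy()

# 2. Count words with a regex: a word is a run of letters, digits or apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9']+")
df["WORD_COUNT"] = df["DESCRIPTION"].map(lambda text: len(WORD_RE.findall(text)))
average_word_count = df["WORD_COUNT"].mean()

# 3. Keep only descriptions with at least 20 words.
df = df[df["WORD_COUNT"] >= 20].reset_index(drop=True)

print(f"average word count: {average_word_count:.1f}")
print(df[["WORD_COUNT", "PRICE"]])
```

Computing the average before filtering shows how short the typical description is, which helps judge how many records the 20-word threshold will discard.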

  3. Text processing

    • Remove all the stopwords from the description column

    • Create a new feature named ADJECTIVE_COUNT that counts the adjectives in each description. (Adjectives are words such as beautiful, wonderful, and luxury, which sellers use to describe the ship's facilities.) (Hint: Regex & POS tagging)
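The two text-processing steps above can be illustrated with the standard library alone. This is a simplified sketch: the stopword list and the adjective lexicon are tiny hypothetical stand-ins, and the lexicon lookup replaces real POS tagging. In an actual run, a tagger such as NLTK's `nltk.pos_tag` (keeping tokens tagged `JJ`, `JJR`, or `JJS`) and a full stopword list would take their place.

```python
import re

# Tiny illustrative word lists; both are assumptions. A real run would use a
# full stopword list and a POS tagger (for example nltk.pos_tag, keeping
# tokens tagged JJ, JJR or JJS) instead of this adjective lexicon.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "with", "this", "to"}
ADJECTIVES = {"beautiful", "wonderful", "luxury", "spacious", "new"}

# A token is a run of lowercase letters or apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def remove_stopwords(text: str) -> str:
    """Lowercase the text, tokenize with a regex, and drop stopwords."""
    tokens = TOKEN_RE.findall(text.lower())
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)


def adjective_count(text: str) -> int:
    """Count tokens found in the adjective lexicon (stand-in for POS tagging)."""
    return sum(tok in ADJECTIVES for tok in TOKEN_RE.findall(text.lower()))


description = "This is a beautiful and spacious yacht with a new luxury interior"
cleaned = remove_stopwords(description)
count = adjective_count(description)

print(cleaned)  # beautiful spacious yacht new luxury interior
print(count)    # 4
```

If a context-sensitive POS tagger is used, it is usually safer to compute ADJECTIVE_COUNT on the original description and remove stopwords afterwards, because taggers rely on the surrounding words.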

  4. Vectorize the description column; one-hot-encode the shipping category, fuel type, and hull material columns; and drop the ID and name columns to create the model-ready dataset.
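One way to assemble the model-ready data described above is scikit-learn's `ColumnTransformer`. The sketch below assumes TF-IDF as the vectorizer (a plain `CountVectorizer` would also fit the task) and uses hypothetical column names and made-up rows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows; every column name here is an assumption about the schema.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "NAME": ["Aurora", "Seagull", "Nordwind"],
    "DESCRIPTION": [
        "spacious steel cruiser",
        "classic wooden sailboat",
        "fast steel cruiser",
    ],
    "CATEGORY": ["Motor Yacht", "Sailing Boat", "Motor Yacht"],
    "FUEL_TYPE": ["Diesel", "Unleaded", "Diesel"],
    "HULL_MATERIAL": ["Steel", "Wood", "Steel"],
    "LENGTH": [14.0, 9.5, 11.0],
    "PRICE": [150_000, 40_000, 95_000],
})

# Separate the target and drop the identifier columns.
y = df["PRICE"]
X = df.drop(columns=["ID", "NAME", "PRICE"])

preprocess = ColumnTransformer(
    transformers=[
        # A text vectorizer takes a single column name (a string, not a list).
        ("text", TfidfVectorizer(), "DESCRIPTION"),
        # handle_unknown="ignore" avoids errors on categories unseen in training.
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["CATEGORY", "FUEL_TYPE", "HULL_MATERIAL"]),
    ],
    remainder="passthrough",  # numeric columns such as LENGTH pass through unchanged
)

X_ready = preprocess.fit_transform(X)
print(X_ready.shape)
```

Fitting the transformer on the training split only, and reusing it to transform the test split, keeps vocabulary and category information from leaking across the split.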

  5. Build a linear regression model that predicts the ship listing price and reports a confidence interval for its predictions.
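The task does not say how the confidence interval should be obtained. As one possible approach, the sketch below fits scikit-learn's `LinearRegression` and estimates a 95% interval with a percentile bootstrap; the bootstrap is a technique chosen here for illustration, and the data is synthetic (price as a noisy linear function of length).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the model-ready data: price grows with length plus noise.
length = rng.uniform(5, 25, size=200)
price = 10_000 * length + rng.normal(0, 20_000, size=200)
X = length.reshape(-1, 1)

# Fit on the full sample and predict the price of a hypothetical 12 m ship.
model = LinearRegression().fit(X, price)
X_new = np.array([[12.0]])
point_estimate = model.predict(X_new)[0]

# Bootstrap: refit on rows resampled with replacement and collect the
# prediction for the same ship each time.
n_boot = 500
boot_preds = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, len(price), size=len(price))
    boot_model = LinearRegression().fit(X[idx], price[idx])
    boot_preds[i] = boot_model.predict(X_new)[0]

# 95% percentile interval for the expected price of a 12 m ship.
lower, upper = np.percentile(boot_preds, [2.5, 97.5])
print(f"prediction: {point_estimate:,.0f}  95% CI: [{lower:,.0f}, {upper:,.0f}]")
```

This interval describes uncertainty in the expected (mean) price for a given ship profile, not the spread of individual listings, which would need a wider prediction interval. A statistics library such as statsmodels can report analytic intervals from an ordinary least squares fit as an alternative.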

Learnings

  • Learn how to preprocess and clean data before building models.

  • Use text-processing techniques such as stopword removal and POS tagging to extract useful features from text data.

  • Learn to vectorize text and one-hot-encode categorical variables for use in machine learning models.

  • Gain hands-on experience building a simple linear regression model.

FAQs

Q1. What is the significance of performing text processing in this project?

Text processing matters in this project because the seller's description of the ship is an important input for predicting the listing price. Removing stopwords, counting adjectives, and vectorizing the description column turn free text into numeric features that the linear regression model can use, which can improve its performance.

Q2. Can we use other regression models instead of linear regression?

Yes. Other regression models, such as Random Forest Regression, Support Vector Regression, or Gradient Boosting Regression, can also be used to predict the ship listing price. However, it is recommended to start with a simple model like linear regression to establish a baseline before moving to more complex models.
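The baseline idea can be made concrete by scoring every candidate model with the same cross-validation scheme. The sketch below is illustrative only: the data is synthetic, and the choice of a random forest as the second candidate, 5 folds, and R^2 as the metric are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic features and a mostly linear target, purely for illustration.
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 0.5]) + rng.normal(0, 0.5, size=300)

candidates = {
    "linear_regression (baseline)": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Score every candidate with the same 5-fold cross-validation and metric (R^2).
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

A more complex model is worth adopting only if it clearly beats the baseline under the same validation scheme and metric.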