How to do cross validation for time series?
MACHINE LEARNING RECIPES DATA CLEANING PYTHON DATA MUNGING PANDAS CHEATSHEET     ALL TAGS

# How to do cross validation for time series?

This recipe helps you do cross validation for time series

## Recipe Objective

While fitting our model, we might get lucky enough and get the best test dataset while splitting. It might even overfit or underfit our model. It is therefore suggested to perform cross validation i.e. splitting several times and there after taking mean of our accuracy.

So this recipe is a short example on how to do cross validation on time series . Let's get started.

## Step 1 - Import the library

``` import numpy as np import pandas as pd from statsmodels.tsa.arima_model import ARMA from sklearn.model_selection import TimeSeriesSplit from sklearn.metrics import mean_squared_error ```

Let's pause and look at these imports. Numpy and pandas are general ones. Here statsmodels.tsa.arima_model is used to import ARMA library for building of model. TimeSeriesSplit will help us in easy and random splitting while performing cross validation.

## Step 2 - Setup the Data

``` df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date']) df.head() ```

Here, we have used one time series data from github.

## Step 3 - Splitting Data

``` tscv = TimeSeriesSplit(n_splits = 4) rmse = [] for train_index, test_index in tscv.split(df): cv_train, cv_test = df.iloc[train_index], df.iloc[test_index] model = ARMA(cv_train.value, order=(0, 1)).fit() predictions = model.predict(cv_test.index.values, cv_test.index.values[-1]) true_values = cv_test.value rmse.append(np.sqrt(mean_squared_error(true_values, predictions))) ```

Firstly, we have set number of splitting to be 4. Then we have loop for our cross validation. Each time, dataset is spliited to train and test datset; model is fitted on it, prediction are made and RMSE(accuracy) is calculated for each split.

## Step 4 - Printing the results

``` print(np.mean(rmse)) ```

Here, we have printed the coeffiecient of model and the predicted values.

## Step 5 - Lets look at our dataset now

Once we run the above code snippet, we will see:

```6.577393548356742
```

You might get different result but it will be close to given due to limited splitting.

#### Relevant Projects

##### House Price Prediction Project using Machine Learning
Use the Zillow dataset to follow a test-driven approach and build a regression machine learning model to predict the price of the house based on other variables.

##### Medical Image Segmentation Deep Learning Project
In this deep learning project, you will learn to implement Unet++ models for medical image segmentation to detect and classify colorectal polyps.

##### Natural language processing Chatbot application using NLTK for text classification
In this NLP AI application, we build the core conversational engine for a chatbot. We use the popular NLTK text classification library to achieve this.

##### Data Science Project in Python on BigMart Sales Prediction
The goal of this data science project is to build a predictive model and find out the sales of each product at a given Big Mart store.

##### Avocado Machine Learning Project Python for Price Prediction
In this ML Project, you will use the Avocado dataset to build a machine learning model to predict the average price of avocado which is continuous in nature based on region and varieties of avocado.

##### Predict Churn for a Telecom company using Logistic Regression
Machine Learning Project in R- Predict the customer churn of telecom sector and find out the key drivers that lead to churn. Learn how the logistic regression model using R can be used to identify the customer churn in telecom dataset.

##### Ola Bike Rides Request Demand Forecast
Given big data at taxi service (ride-hailing) i.e. OLA, you will learn multi-step time series forecasting and clustering with Mini-Batch K-means Algorithm on geospatial data to predict future ride requests for a particular region at a given time.

##### Demand prediction of driver availability using multistep time series analysis
In this supervised learning machine learning project, you will predict the availability of a driver in a specific area by using multi step time series analysis.

##### Data Science Project-TalkingData AdTracking Fraud Detection
Machine Learning Project in R-Detect fraudulent click traffic for mobile app ads using R data science programming language.

##### Time Series Python Project using Greykite and Neural Prophet
In this time series project, you will forecast Walmart sales over time using the powerful, fast, and flexible time series forecasting library Greykite that helps automate time series problems.