While fitting our model, we might get lucky enough and get the best test dataset while splitting. It might even overfit or underfit our model. It is therefore suggested to perform cross validation i.e. splitting several times and there after taking mean of our accuracy.
So this recipe is a short example on how to do cross validation on time series . Let's get started.
import numpy as np import pandas as pd from statsmodels.tsa.arima_model import ARMA from sklearn.model_selection import TimeSeriesSplit from sklearn.metrics import mean_squared_error
Let's pause and look at these imports. Numpy and pandas are general ones. Here statsmodels.tsa.arima_model is used to import ARMA library for building of model. TimeSeriesSplit will help us in easy and random splitting while performing cross validation.
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date']) df.head()
Here, we have used one time series data from github.
Now our dataset is ready.
tscv = TimeSeriesSplit(n_splits = 4) rmse =  for train_index, test_index in tscv.split(df): cv_train, cv_test = df.iloc[train_index], df.iloc[test_index] model = ARMA(cv_train.value, order=(0, 1)).fit() predictions = model.predict(cv_test.index.values, cv_test.index.values[-1]) true_values = cv_test.value rmse.append(np.sqrt(mean_squared_error(true_values, predictions)))
Firstly, we have set number of splitting to be 4. Then we have loop for our cross validation. Each time, dataset is spliited to train and test datset; model is fitted on it, prediction are made and RMSE(accuracy) is calculated for each split.
Here, we have printed the coeffiecient of model and the predicted values.
Once we run the above code snippet, we will see:
You might get different result but it will be close to given due to limited splitting.