Today we learned about cross-validation, a technique for estimating how a model will perform on new data using only the data we already have. A limitation every data scientist runs into is that a model’s performance on unseen data stays uncertain until the model is actually tested on it, so more often than not we end up judging a model on the same data that was available when we built it. Cross-validation is one way to work around this limitation.
Cross-validation involves partitioning the dataset into two or more subsets: a training set and a validation set (or multiple validation sets). The basic steps are:
- Split the data into K roughly equal-sized folds or subsets.
- Iteratively use K-1 folds for training and the remaining fold for validation.
- Repeat this process K times, each time with a different fold as the validation set.
- Compute performance metrics on each validation fold, then average them across the K folds to get an overall estimate.
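
The steps above can be sketched in plain Python. This is only a minimal illustration, not a reference implementation: the function names `k_fold_indices` and `cross_validate_mean_model` are hypothetical, the toy values in `y` are made up, and the "model" is a stand-in baseline that always predicts the mean of the training targets, so that the example needs nothing beyond the standard library.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for K roughly equal-sized folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle once so the folds do not depend on the original row order.
    random.Random(seed).shuffle(indices)
    # When n_samples is not divisible by k, the first few folds get one extra sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validate_mean_model(y, k=5, seed=0):
    """Return the per-fold mean squared error of a predict-the-training-mean baseline."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), k, seed):
        # "Training": the stand-in model just memorises the mean of the training targets.
        prediction = mean(y[i] for i in train_idx)
        # "Validation": score that prediction on the held-out fold.
        mse = mean((y[i] - prediction) ** 2 for i in val_idx)
        scores.append(mse)
    return scores


# Toy target values, made up purely for illustration.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
fold_scores = cross_validate_mean_model(y, k=5)
print("per-fold MSE:", fold_scores)
print("mean MSE across folds:", mean(fold_scores))
```

In practice a library routine would usually replace the hand-written splitter; the point here is only to show that every sample lands in exactly one validation fold and that the final number is an average over the K folds.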
While there are several cross-validation methods, we focused on K-Fold Cross-Validation, where the dataset is divided into K subsets and the process is repeated K times, with each subset serving as the validation set exactly once.
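
One common way to write the K-fold estimate down is as an average of the per-fold errors. The notation here is an assumption introduced for illustration, not something from the lecture: $D_k$ is the $k$-th fold, $\hat{f}^{(-k)}$ is the model trained on every fold except $D_k$, and $L$ is whatever loss or metric is being used.

```latex
E_k = \frac{1}{|D_k|} \sum_{(x_i,\, y_i) \in D_k} L\!\left(y_i,\ \hat{f}^{(-k)}(x_i)\right),
\qquad
\mathrm{CV}_K = \frac{1}{K} \sum_{k=1}^{K} E_k
```

Each $E_k$ is computed only on data the corresponding model never saw during training, which is what makes $\mathrm{CV}_K$ a stand-in for performance on new data.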
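Cross-validation lets us evaluate a model’s performance without setting aside a separate, fixed validation dataset. It helps detect overfitting, since a model that has merely memorised its training folds will score poorly on the held-out fold, and it gives a more reliable estimate of performance than a single train/validation split because every observation is used for validation once.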