Cross-validation iterators in scikit-learn are simply iterable objects, that is, Python objects that implement the __iter__ method as a generator: each iteration returns (or more precisely, yields) the indices or a boolean mask for the train and test sets. With this in mind, implementing new cross-validation iterators that behave like the ones in scikit-learn is easy. Here is a small code snippet that implements a holdout cross-validation generator following the scikit-learn API.
    import numpy as np
    from sklearn.utils import check_random_state


    class HoldOut:
        '''
        Hold-out cross-validator generator. In the hold-out, the
        data is split only once into a train set and a test set.
        Unlike in other cross-validation schemes, the hold-out
        consists of only one iteration.

        Parameters
        ----------
        n : total number of samples

        test_size : 0 < float < 1
            Fraction of samples to use as test set. Must be a
            number between 0 and 1.

        random_state : int
            Seed for the random number generator.
        '''

        def __init__(self, n, test_size=0.2, random_state=0):
            self.n = n
            self.test_size = test_size
            self.random_state = random_state

        def __iter__(self):
            n_test = int(np.ceil(self.test_size * self.n))
            n_train = self.n - n_test
            rng = check_random_state(self.random_state)
            permutation = rng.permutation(self.n)
            ind_test = permutation[:n_test]
            ind_train = permutation[n_test:n_test + n_train]
            yield ind_train, ind_test
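To see the generator in action, here is a minimal usage sketch. The class is restated compactly so that the snippet runs on its own, and np.random.RandomState(seed) stands in for check_random_state, which is an assumption that only holds for integer seeds. The toy array X and the chosen sizes are made up for illustration.

```python
import numpy as np


class HoldOut:
    """Compact restatement of the hold-out generator above (NumPy only)."""

    def __init__(self, n, test_size=0.2, random_state=0):
        self.n = n
        self.test_size = test_size
        self.random_state = random_state

    def __iter__(self):
        n_test = int(np.ceil(self.test_size * self.n))
        # Assumption: an integer seed, so RandomState(seed) can stand in
        # for sklearn.utils.check_random_state(seed).
        rng = np.random.RandomState(self.random_state)
        permutation = rng.permutation(self.n)
        # Single (train, test) pair: first n_test indices form the test set.
        yield permutation[n_test:], permutation[:n_test]


# Toy data, made up for illustration: 8 samples, 2 features.
X = np.arange(16).reshape(8, 2)

cv = HoldOut(n=len(X), test_size=0.25, random_state=0)
splits = list(cv)  # iterating yields exactly one (train, test) pair
ind_train, ind_test = splits[0]
X_train, X_test = X[ind_train], X[ind_test]

# The same split expressed as a boolean mask instead of indices.
test_mask = np.zeros(len(X), dtype=bool)
test_mask[ind_test] = True

print("train indices:", ind_train)
print("test indices: ", ind_test)
```

Because the seed is fixed, iterating over the same object again reproduces the same split. In recent scikit-learn releases the built-in splitters expose a split method rather than __iter__, but functions such as cross_val_score also accept an iterable of (train, test) index pairs as their cv argument, so an object like this one should still be usable there.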
Contrary to other cross-validation schemes, holdout relies on a single split of the data. It is well known that in practice holdout performs much worse than KFold or LeaveOneOut schemes. However, holdout has the advantage that its theoretical properties are easier to derive. For examples of this, see e.g. Section 8.7 of "Theory of classification: a survey of some recent advances" and the very recent "The reusable holdout".
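One way to read "performs worse" is that the error estimate obtained from a single split is noisier than one averaged over several folds. The simulation below is only a sketch under that reading: the synthetic Gaussian data, the trivial "predict the training mean" model, and the sizes (50 samples, 5 folds, 200 repetitions) are all assumptions chosen for illustration, not taken from the references.

```python
import numpy as np

rng = np.random.RandomState(0)
n, n_folds, n_repeats = 50, 5, 200  # arbitrary sizes for the sketch


def mse_of_mean_predictor(y, ind_train, ind_test):
    """Test MSE of a 'model' that always predicts the training mean."""
    return np.mean((y[ind_test] - y[ind_train].mean()) ** 2)


holdout_estimates, kfold_estimates = [], []
for _ in range(n_repeats):
    y = rng.normal(size=n)  # fresh synthetic data set
    folds = np.array_split(rng.permutation(n), n_folds)

    # Hold-out: a single split, using the first fold as the test set.
    ind_test = folds[0]
    ind_train = np.concatenate(folds[1:])
    holdout_estimates.append(mse_of_mean_predictor(y, ind_train, ind_test))

    # K-fold: every fold serves once as the test set; average the errors.
    fold_errors = []
    for k in range(n_folds):
        ind_test = folds[k]
        ind_train = np.concatenate(folds[:k] + folds[k + 1:])
        fold_errors.append(mse_of_mean_predictor(y, ind_train, ind_test))
    kfold_estimates.append(np.mean(fold_errors))

# Spread of the error estimates across repeated experiments.
holdout_std = np.std(holdout_estimates)
kfold_std = np.std(kfold_estimates)
print("std of hold-out estimates:", holdout_std)
print("std of 5-fold estimates:  ", kfold_std)
```

In this toy setting the hold-out estimate uses the same first fold that K-fold uses, so the only difference is the averaging over the remaining folds; a smaller spread for the K-fold estimates is what the averaging would lead one to expect.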