Training, Validation, Test, and Data Leakage

ML best practice dictates that model performance should be reported on the test set, and that training, validation, and test sets shouldn't intersect. However, while this is necessary to avoid data leakage, it is not sufficient.
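To make the three-way split concrete, here is a minimal sketch using scikit-learn's `train_test_split` called twice, with a hypothetical dataset and arbitrary split ratios; the sets are disjoint by construction.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset; any (X, y) pair works here.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

# First split off the test set, then carve the validation set
# out of the remainder, so the three sets never intersect.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)  # 0.25 of 0.8 = 0.2 of the total
```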

Consider a researcher who splits the training, validation, and test sets properly, develops a model, and checks that it generalizes well on the validation set, but is unhappy with the results on the test set. They then tweak some hyper-parameters and re-train the model, perhaps several times, finally obtaining satisfactory results on the test set.

This is an indirect form of data leakage: the model is effectively "seeing" the test set through the researcher, whose hyper-parameter choices are guided by test-set performance.

Ideally, the test set should be set aside and performance on it measured only once, right before reporting. Only the validation set (including in the context of cross-validation) should be used to assess generalization or over-fitting during the training process.
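The sketch below illustrates this workflow under the same assumptions, continuing from the split above: all hyper-parameter selection (here, a hypothetical grid over the regularization strength `C` of a logistic regression) is driven by the validation set, and the test set is evaluated exactly once, at the very end.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Tune hyper-parameters against the validation set only.
best_model, best_val_acc = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):  # hypothetical search grid
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_model, best_val_acc = model, val_acc

# The test set is touched exactly once, after all tuning is finished.
test_acc = accuracy_score(y_test, best_model.predict(X_test))
print(f"validation accuracy: {best_val_acc:.3f}, test accuracy: {test_acc:.3f}")
```

If the test accuracy printed here prompts another round of tuning, the test set has effectively become a second validation set, which is exactly the indirect leakage described above.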