No. of Recommendations: 13
ML evaluates a model differently. You first define an algorithm that takes any data and predicts the future
(like a neural network or an ensemble of trees). Then you break the historical data into shorter subsets. For
each subset you train the predictive model on (typically) 60 to 70% of the subset and test it on the remainder.
The model is judged on the combined results of all the train-test sequences, where no train-test sequence uses any
future data. A model that does well did so while looking only at the data it had available at that point in time.
This is how all data mining is done, when it's done correctly. Same with "classic" MI screens.
(terminology clarification: data mining is a good thing, overmining is a problem)
The problem is this: the process you have described, including seeing which models worked on the withheld out-of-sample validation subset and killing those that didn't, is itself another layer of data mining, another step in a single larger and more complex model-building process. That "greater" process has no out-of-sample validation. Once you have culled your set of models by looking at their effectiveness on the validation data set you held back, that validation data set is contaminated and is now in sample, not out of sample.
This might be seen as "OK" if you only did it once (depending on your strictness), but nobody does this just once. There just isn't enough financial history. In effect all history gets used, and it's all in sample. The only out-of-sample data is the stuff that actually happened after you stopped modelling, and (to be strict) after you stopped culling your set of models.
Combine that with a machine learning model that has a lot of parameters, and the ability of your final model to have memorized the training set AND the validation set is nigh unbounded. That's not to say it's impossible to find a new and useful insight this way, but it's a very thorny patch in which to be hunting.
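A toy illustration of why the culling step is the killer, with completely made-up numbers: generate a few hundred "strategies" that are pure noise, keep the one that did best on the held-back half of history, and it looks like a winner right up until the data it was culled on runs out.

```python
# Selection on the "held back" data: 500 candidate models with zero true skill.
# The one that wins the cull shows an impressive edge on the validation window,
# because that window has now been mined; afterwards it reverts to nothing.
import numpy as np

rng = np.random.default_rng(1)
n_models, n_periods = 500, 120                       # 500 candidates, 10 years of months
returns = rng.normal(0.0, 0.04, size=(n_models, n_periods))  # no real edge anywhere

validation = returns[:, :60]            # the data you "held back", then culled on
truly_out_of_sample = returns[:, 60:]   # what happens after you stop culling

best = validation.mean(axis=1).argmax() # keep the best performer on the validation set
print("winner's mean monthly return during the culling window:",
      round(validation[best].mean(), 4))             # looks like a genuine edge
print("same model after the culling stopped:",
      round(truly_out_of_sample[best].mean(), 4))    # back to roughly zero
```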
Jim