I know this is pretty late in the competition, but how are you all locally validating your metadata models, especially those of you at the top of the leaderboard? My local and LB scores don't correlate at all.
Just doing stratified k-fold. My CV score (~0.67) is below the LB (~0.76).
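For reference, my setup looks roughly like the sketch below (the file name, the `severity` target column, and the model are placeholders, not my actual pipeline):

```python
# Minimal sketch of a stratified k-fold CV loop; the file/column names
# and the model are assumptions, not the actual competition pipeline.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold

train = pd.read_csv("train.csv")      # hypothetical file name
X = train.drop(columns=["severity"])  # "severity" target is an assumption
y = train["severity"]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rmses = []
for tr_idx, val_idx in skf.split(X, y):  # stratify on the integer label
    model = RandomForestRegressor(random_state=42)
    model.fit(X.iloc[tr_idx], y.iloc[tr_idx])
    preds = model.predict(X.iloc[val_idx])
    fold_rmses.append(np.sqrt(mean_squared_error(y.iloc[val_idx], preds)))
print("CV RMSE:", np.mean(fold_rmses))
```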
@Loki_K The test data used for final scores does have a similar distribution to the train data, but scores won’t be exactly the same. If you are evaluating your model on data that was used in training, that may also be part of why it performs differently on new, unseen data.
I hope that’s helpful, feel free to follow up with any other questions!
Ohh, thank you for sharing.
Sorry if this is a really dumb question or if I’m missing something; I’m still a student.
But as mentioned here, is doing stratified k-fold still a valid way to go?
Yeah, thank you.
The val_set I carved out has coordinates (lat, lng) that also appear in the train_set, but the actual test_set shares no coordinates with train. That could be the reason for the discrepancy between my local and LB error.
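One way to test this is to split by location instead, e.g. with scikit-learn's `GroupKFold`, so that no (lat, lng) pair appears in both train and validation. A sketch, with assumed column names:

```python
# Group samples by rounded (lat, lng) so that each location lands entirely
# on either the train or the validation side of a fold. The column names
# ("latitude", "longitude") are assumptions.
import pandas as pd
from sklearn.model_selection import GroupKFold

train = pd.read_csv("train.csv")  # hypothetical file name
groups = (
    train["latitude"].round(4).astype(str)
    + "_"
    + train["longitude"].round(4).astype(str)
)

gkf = GroupKFold(n_splits=5)
for tr_idx, val_idx in gkf.split(train, groups=groups):
    # Sanity check: no location appears on both sides of the split.
    assert not set(groups.iloc[tr_idx]) & set(groups.iloc[val_idx])
```

If the grouped CV score moves toward the LB score, the coordinate overlap was likely the source of the gap.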
I’ve taken that to mean that for a given sample you cannot use imagery/climate data from dates that are after the date of the sample, not that you can’t include future samples as part of your training set.
Here @kwetstone mentioned that for a given sample, you can only use information that was already available at the time the sample was taken.
@kwetstone can you please clarify:
The earliest test sample is on 2013-01-08, and there are only 5 train samples taken before that date. Is it true that when making predictions for that test sample we can only use a model trained on just those 5 samples?
My understanding is that we are allowed to train a model on the entire dataset at once, but for the test sample on 2013-01-08 all of the features/images/data for that sample must be from on or before that date.
@Loki_K @BrandenKMurray These are great questions!
> My understanding is that we are allowed to train a model on the entire dataset at once, but for the test sample on 2013-01-08 all of the features/images/data for that sample must be from on or before that date.
This is correct. You can train a model on the full dataset, but when running inference on a given sample you can only use data that was available at the time the sample was taken. This means that in training, you’ll also want the features for any given sample to be derived from data that was available when the sample was taken (so that the training setup is an accurate reflection of inference). In other words, during training the samples should be treated independently.
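In code terms, that rule might look something like the sketch below: features for each sample are built only from auxiliary records dated on or before the sample's own date. The DataFrame layout and column names here are hypothetical:

```python
# Sketch of "no future data" feature building: for a given sample, only use
# auxiliary (e.g. climate) records dated on or before the sample date.
# All DataFrame and column names are hypothetical.
import pandas as pd

def features_for_sample(sample_row: pd.Series, climate_df: pd.DataFrame) -> dict:
    """Build features for one sample using only data available at its date."""
    available = climate_df[
        (climate_df["location_id"] == sample_row["location_id"])
        & (climate_df["date"] <= sample_row["date"])
    ]
    # Example feature: mean temperature over the 30 days before the sample.
    window_start = sample_row["date"] - pd.Timedelta(days=30)
    recent = available[available["date"] >= window_start]
    return {"temp_30d_mean": recent["temperature"].mean()}

# Applying this per row keeps each sample independent, as described above.
```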
Thanks for the clarification.