Leakage in file metadata

Dear Organizers,

It seems that there is leakage in the data you’ve provided for this competition.
I’ve run a simple experiment to check this:

  • downloaded the micro dataset
  • extracted each file’s last-modification time and file size
  • trained a simple multiclass classifier on these two features
    This model scored 0.061752 on the leaderboard. For comparison, a sample submission with average class probabilities scores 0.090614.
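
For anyone who wants to reproduce the feature extraction, here is a minimal sketch of the second step using only the Python standard library. The function names and the directory argument are my own placeholders, not anything from the competition materials, and the classifier itself is left out since any multiclass model can take these two columns.

```python
import os
from pathlib import Path


def metadata_features(path):
    """Return (last-modification time, size in bytes) for one file."""
    info = os.stat(path)
    return info.st_mtime, info.st_size


def build_feature_table(data_dir):
    """Map each file name in data_dir to its two metadata features.

    data_dir is a hypothetical path to the unpacked micro dataset.
    """
    return {
        p.name: metadata_features(p)
        for p in sorted(Path(data_dir).iterdir())
        if p.is_file()
    }
```

Note that modification times only survive if the archive is unpacked in a way that preserves them; copying the files afterwards may reset them.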

Let us know what you think about this issue.


This may not be leakage but a property of the video encoder. Static videos generally compress to smaller files, so videos with the ‘blank’ label are more likely to have a small file size.
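
One rough way to check the encoder explanation is to compare the median file size per label. This is only a sketch: it assumes you already have two dictionaries keyed by file name, one with sizes in bytes and one with training labels, and both names are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_size_by_label(sizes, labels):
    """Group file sizes by label and return the median size per label.

    sizes:  {file name: size in bytes}   (assumed input)
    labels: {file name: label}           (assumed input)
    Files without a label are skipped.
    """
    grouped = defaultdict(list)
    for name, size in sizes.items():
        if name in labels:
            grouped[labels[name]].append(size)
    return {label: median(values) for label, values in grouped.items()}
```

If ‘blank’ videos show a clearly lower median than the other labels, that would be consistent with file size reflecting video content rather than a packaging artifact.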

Yes, you’re right. I’ve also tried a model that uses only the last-modification time, and it also outperforms the baseline with average probabilities.