Thank you DrivenData.org and ERS for the awesome competition! Here is my model documentation. I'm looking forward to seeing how everyone else tackled the problem as well.
Thank you for sharing your winning model! I’ve been interested in reading up on the hashing trick, and seeing how it benefited your model makes me all the more inclined to learn about it!
I see that you tracked the log loss for the random samples being trained against per epoch, but I’m curious if you had a holdout sample that you verified against prior to submission? Maybe you didn’t need one since you were training against random subsets of the full data set, and that’s sufficient to determine a generalized solution?
With regard to randomly selecting data to train against, you mention that your method might end up omitting some of the rows of data. Given your approach, one way you might address this is by doing a first pass over the entire data set and then doing random selection thereafter. I’m really curious how random selection actually affects the final model. Did you initially try training against the full data set, but then discover that random sampling improved the model?
Also, you ended up using 4 epochs in your final model. Did you experiment with more or fewer epochs? If so, what was the impact on the predictive power of the model? I can only assume that more epochs would lead to a loss of generality in your model, which would cause it to start overfitting.
Hi Jesse, no problem and I enjoyed reading your blog post!
I did use holdout cross validation, but I didn’t include it in the final code. One of the advantages of using a tool like Vowpal Wabbit for online learning is that the holdout testing is baked in, but I sort of limped along with semi-manual cross validation during this competition. I saw that tinrtgu also has holdout testing built into the latest incarnation of his online learner. (Btw, I didn’t use Vowpal Wabbit simply because I could not quite get it to perform as well, although the results were fairly close. That is something for me to look into after the competition, as the results should be very close or the same with the same tuning parameters.)
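For anyone following along, here is a minimal sketch of what holdout testing inside an online learner can look like. It is not the code from the competition: the every-N-th-row holdout rule, the `predict` and `update` callbacks, and all function names are assumptions made for illustration.

```python
import math


def log_loss(y, p, eps=1e-15):
    """Binary log loss for one example, with the prediction clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def is_holdout(row_index, holdout_every=10):
    """Hypothetical rule: every holdout_every-th row is reserved for validation."""
    return row_index % holdout_every == 0


def run_pass(rows, predict, update, holdout_every=10):
    """One pass over (features, label) pairs.

    Holdout rows are scored but never used for a weight update;
    all other rows are used for training only.
    Returns the mean log loss on the holdout rows.
    """
    total, count = 0.0, 0
    for i, (x, y) in enumerate(rows):
        p = predict(x)
        if is_holdout(i, holdout_every):
            total += log_loss(y, p)
            count += 1
        else:
            update(x, y, p)
    return total / count if count else float("nan")
```

Because the holdout rows are chosen by index, the same rows stay out of training on every epoch, so the returned loss can be compared from one epoch to the next.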
Good question on the effect of the randomization. My finding was that the effect was small for reasonable values (reasonable being epochs*use_example_probability ~ 1). I only added it for peace of mind, since I wanted to make sure I wasn’t adversely affected by a particular order. The effect was 0.0007 when I moved from a fixed 2-pass order to a random order with 4 epochs * 0.5 use_example_probability. I also saw a small increase going from 1 to 2 epochs using all training examples (maybe 0.005? Unfortunately I combined it with another change). There was not much of an effect for more than 2 epochs when using all training examples.
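To make the arithmetic behind the random selection concrete, here is a rough sketch. It assumes each row is kept independently with probability `use_example_probability` on every epoch; the function names are hypothetical and this is not taken from the released code.

```python
import random


def sample_passes(n_rows, epochs, use_example_probability, seed=0):
    """Yield (epoch, row_index) for every row that is kept for training.

    Each row is kept independently with the given probability on each epoch.
    """
    rng = random.Random(seed)
    for epoch in range(epochs):
        for i in range(n_rows):
            if rng.random() < use_example_probability:
                yield epoch, i


def expected_passes(epochs, use_example_probability):
    """Expected number of times a single row is used for training."""
    return epochs * use_example_probability


def prob_never_seen(epochs, use_example_probability):
    """Probability that a given row is skipped on every epoch."""
    return (1.0 - use_example_probability) ** epochs


# With 4 epochs and use_example_probability = 0.5:
# expected_passes(4, 0.5)  -> 2.0
# prob_never_seen(4, 0.5)  -> 0.0625
```

Under that assumption, 4 epochs at 0.5 use each row twice on average, the same amount of training as a fixed 2-pass order, while about 6% of rows are never seen at all.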
What happened to the notebook? It only says “Code to be released on github soon” now…
Hey @zygmunt, not sure which repo you were looking at before, but thanks to @quocnle’s quick code turnaround we just made the 1st place repo public:
https://github.com/drivendata/boxplots-for-education-1st-place