First Place Model Documentation

quocnle · January 11, 2015, 4:11pm

Thank you DrivenData.org and ERS for the awesome competition! Here is my model documentation, looking forward to seeing how everyone else tackled the problem as well.

http://nbviewer.ipython.org/url/machinelearner.net/boxplots-for-education-1st-place/BoxPlots_First_Place_Model.ipynb

JesseBuesking · January 11, 2015, 8:07pm

Thank you for sharing your winning model! I’ve been interested in reading up on the hashing trick, and seeing how it benefited your model makes me all the more inclined to learn about it!
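For readers new to the hashing trick mentioned here, the idea can be sketched as below. This is a minimal illustration, not the code from the winning notebook: the function name `hash_features`, the table size of 2**18, the use of `zlib.crc32`, and the example column names are all assumptions chosen for the sketch.

```python
import zlib

def hash_features(tokens, num_bits=18):
    """Map string tokens to indices in a fixed-size table of 2**num_bits slots."""
    size = 1 << num_bits
    # zlib.crc32 is deterministic across runs, unlike the built-in hash() on strings.
    return [zlib.crc32(tok.encode("utf-8")) % size for tok in tokens]

# Hypothetical row: each "column=value" pair becomes one token.
row = {"Function": "Teacher Compensation", "Object": "Salaries"}
tokens = [f"{col}={val}" for col, val in row.items()]
indices = hash_features(tokens)
```

Each index selects one weight in a fixed-length weight vector, so the model needs no dictionary of feature names and memory stays bounded no matter how many distinct values appear. The cost is that two different tokens can occasionally collide on the same index.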

I see that you tracked the log loss for the random samples being trained against per epoch, but I’m curious if you had a holdout sample that you verified against prior to submission? Maybe you didn’t need one since you were training against random subsets of the full data set, and that’s sufficient to determine a generalized solution?

Regarding randomly selecting data to train against, you mention that your method might end up omitting some of the rows of data. Given your approach, one way you might address this is by doing a first pass over the entire data set and then doing random selection thereafter. I’m really curious how random selection actually affects the final model. Did you initially try training against the full data set, but then discover that random sampling improved the model?

Also, you ended up using 4 epochs in your final model. Did you experiment with more or fewer epochs? If so, what was the impact on the predictive power of the model? I can only assume that more epochs would lead to a loss of generality, causing the model to start overfitting.

quocnle · January 11, 2015, 8:40pm

Hi Jesse, no problem and I enjoyed reading your blog post!

I did use holdout cross validation but I didn’t include it in the final code. One of the advantages of using a tool like Vowpal Wabbit for online learning is that the holdout testing is baked in, but I sort of limped along with semi-manual cross validation during this competition. I saw that tinrtgu also has holdout testing built into the latest incarnation of his online learner. (I didn’t use Vowpal Wabbit, btw, simply because I could not quite get it to perform as well, but the results were fairly close. Something for me to look into after the competition, as the result should be very close or the same with the same tuning parameters.)
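A rough sketch of what built-in holdout testing can look like for an online learner follows. It is not the validation code used in the competition (which was not published in the notebook). The names `train_with_holdout`, `predict`, `update`, and `holdout_period` are hypothetical, and the toy running-mean model only exists to make the example runnable.

```python
import math

def log_loss(y, p, eps=1e-15):
    """Binary log loss for one example, with the prediction clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def train_with_holdout(stream, predict, update, holdout_period=10):
    """Score every holdout_period-th example without training on it; train on the rest."""
    total, count = 0.0, 0
    for i, (x, y) in enumerate(stream):
        if i % holdout_period == 0:
            total += log_loss(y, predict(x))  # held out: never passed to update()
            count += 1
        else:
            update(x, y)
    return total / count if count else float("nan")

# Toy model (assumption for illustration): predicts the smoothed running mean of the labels.
state = {"pos": 1.0, "n": 2.0}

def predict(x):
    return state["pos"] / state["n"]

def update(x, y):
    state["pos"] += y
    state["n"] += 1

stream = [(None, 1)] * 50
score = train_with_holdout(stream, predict, update)
```

Because the held-out examples never reach `update`, the returned average is an estimate of out-of-sample log loss that costs no extra pass over the data.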

Good question on the effect of the randomization. My finding was that the effect was small for reasonable values (reasonable being epochs*use_example_probability ~ 1). I only added it for peace of mind: I wanted to make sure I wasn’t adversely affected by a particular order. The effect was 0.0007 when I moved from a fixed 2-pass order to a random order with 4 epochs * 0.5 use_example_probability. I also saw a small increase going from 1 to 2 epochs using all training examples (maybe 0.005? Unfortunately I combined it with another change). There was not much of an effect for more than 2 epochs when using all training examples.
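The arithmetic behind 4 epochs * 0.5 use_example_probability can be sketched as follows. This assumes each example is kept or skipped independently on every epoch, which may differ from the exact sampling in the released code; the function name `sampled_epochs` and the 1,000-row toy data set are made up for the illustration.

```python
import random

def sampled_epochs(examples, epochs=4, use_example_probability=0.5, seed=0):
    """Yield (epoch, index, example), keeping each example with the given probability."""
    rng = random.Random(seed)
    for epoch in range(epochs):
        for idx, example in enumerate(examples):
            if rng.random() < use_example_probability:
                yield epoch, idx, example

examples = list(range(1000))
seen = [idx for _, idx, _ in sampled_epochs(examples)]

# Expected number of updates: epochs * probability * rows = 4 * 0.5 * 1000 = 2000,
# i.e. about the same work as a fixed 2-pass run.
expected_uses = 4 * 0.5 * len(examples)

# Under this assumption a given row is skipped in all 4 epochs with
# probability 0.5 ** 4 = 0.0625, so a few rows are never used.
never_seen = len(examples) - len(set(seen))
```

Under that independence assumption roughly 6% of rows are never visited, which is the omission discussed earlier in the thread; a first full pass followed by sampled passes would remove it at the cost of one fixed-order epoch.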

zygmunt · January 19, 2015, 9:44pm

What happened to the notebook? It only says “Code to be released on github soon” now…

isms · January 21, 2015, 3:36pm

Hey @zygmunt, not sure which repo you were looking at before, but thanks to @quocnle’s quick code turnaround we just made the 1st place repo public:

https://github.com/drivendata/boxplots-for-education-1st-place
