This was a fun competition but unfortunately I didn’t have enough time to fully engage with it. What worked?
I’d have loved to have had time to try a recurrent neural network on this data but instead I only managed what I’d call a “benchmark” style solution. I used ranger in R (with default parameters) to build regression random forests for each target based on the provided features combined with those from T+2, T+1, T-1 and T-2. I rescaled the responses to sum to 1. I was surprised this did so well (rank 14).
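For anyone wanting to try the same idea, here is a minimal Python sketch of the two data steps described above: combining each time step's features with those from T-2 to T+2, and rescaling a row of predictions to sum to 1. The function names are hypothetical, the edge handling (reusing the nearest row) and the clipping of negative predictions are assumptions, and the random forest itself (ranger in the post) is left out.

```python
def stack_neighbours(rows, offsets=(-2, -1, 0, 1, 2)):
    """Concatenate each row's features with those of its neighbours in time.

    rows: list of equal-length feature lists, one per time step.
    Edge steps reuse the nearest available row (an assumption; the post
    does not say how the first and last steps were handled).
    """
    n = len(rows)
    stacked = []
    for t in range(n):
        combined = []
        for k in offsets:
            idx = min(max(t + k, 0), n - 1)  # clamp to the valid range
            combined.extend(rows[idx])
        stacked.append(combined)
    return stacked


def rescale_to_unit_sum(preds):
    """Clip negatives to zero, then rescale one row of predictions to sum to 1."""
    clipped = [max(p, 0.0) for p in preds]
    total = sum(clipped)
    if total == 0.0:
        # Nothing to rescale: fall back to a uniform distribution (assumption).
        return [1.0 / len(preds)] * len(preds)
    return [p / total for p in clipped]
```

Each stacked row would then be fed to one regressor per target column, and every predicted row passed through `rescale_to_unit_sum` before submission.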
Look forward to hearing more about this all at ECML. How does the ECML session work?
Man, we should have teamed up.
I generated features ranging from std, mean, median, etc. to root mean square on acceleration. A similar set of features from T-1 helped nicely. Rolling max, mean, etc. on acceleration.
Minmax features (max - abs(min)).
RMS and acceleration were nice.
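A rough sketch of what such features could look like in plain Python, in case it helps. The helper names, the trailing-window definition, and the use of the population standard deviation are all assumptions for illustration, not the exact features from the post.

```python
import math
import statistics


def magnitude(x, y, z):
    """Euclidean norm of one (x, y, z) acceleration sample."""
    return math.sqrt(x * x + y * y + z * z)


def window_features(values):
    """Summary statistics for one window of a single signal."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),  # population std (assumption)
        "rms": math.sqrt(sum(v * v for v in values) / len(values)),
        "minmax": max(values) - abs(min(values)),  # max - abs(min), as in the post
    }


def rolling(values, width, fn):
    """Apply fn over a trailing window of up to `width` samples."""
    return [fn(values[max(0, i - width + 1): i + 1]) for i in range(len(values))]
```

For example, `rolling(signal, 5, max)` gives a rolling max, and `rolling(signal, 5, statistics.fmean)` a rolling mean.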
I used features from an MLP/CNN layer, which helped, but I dropped them in the end because of overfitting…
Finally, a few tricks that helped:
Discard all probs less than 0.05 and add their residuals to the highest prob for that sample.
Take the 5-6 top submissions and average across them.
Both tricks gave an improvement of around 0.25.
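The two tricks above can be sketched roughly as follows. This is a minimal illustration with hypothetical function names; a submission is assumed to be a list of rows, each row a list of class probabilities.

```python
def absorb_small_probs(probs, threshold=0.05):
    """Zero out probabilities below threshold and add their mass to the largest one."""
    top = max(range(len(probs)), key=probs.__getitem__)
    out = list(probs)
    removed = 0.0
    for i, p in enumerate(probs):
        if i != top and p < threshold:
            removed += p
            out[i] = 0.0
    out[top] += removed  # the row still sums to the same total
    return out


def average_submissions(submissions):
    """Element-wise mean of several submissions with identical shape."""
    n = len(submissions)
    return [
        [sum(col) / n for col in zip(*rows)]  # average each class across submissions
        for rows in zip(*submissions)         # walk the submissions row by row
    ]
```

Since the small probabilities are moved rather than dropped, each row keeps its original sum, so no renormalisation is needed afterwards.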
All with ExtraTreesClassifier. Entropy helped vs Gini.
Couldn’t get XGBoost working till the end. Extra trees outperformed it. OnevsAll also helped.
Local cross-validation scores were off by 0.3.
Most of the work happened in the last week.
But I should have teamed up with the top rankers. Would have learned much more.
Did anyone use wavelet or frequency-domain features?
Discard all probs less than 0.05 and add their residuals to the highest prob for that sample.
How much did that improve the score?
My feature set was basic statistics (quantiles, min, max, mean) of measurements, or functions of measurements, of all sensors in one-second windows.
I did customise xgboost. Best single model gave 0.165. Then shifting those predictions by 1, 2, 3, 4 seconds and including them as features reduced it to 0.15ish. Stacking a couple of GBlinear (xgboost), GBtree (xgboost), ET, RF drove it to 0.142. That’s when I was trying thresholding the predictions and got to 0.140. (Public Score).
Looking forward to hearing the winning solutions.
Also, are there 2 sets of prizes: ECML workshop prizes and Datadriven/AARP prizes? If yes, will they be awarded to the same top 3 teams?
With a threshold of 0.05, a 0.0005 improvement on the LB score.
You mean 0.0025? (Post must be at least 20 characters. Lame.)
For the acceleration data I used a recurrent autoencoder (trained on both test and training data) to generate an embedding for every second. For the rest of the modalities I just took mean and std. I then stacked the vectors from each modality and used a bi-directional RNN with LSTM cells across the entire sequence.
The issue with this approach was clearly overfitting.
Nice. I couldn’t get a plain LSTM working.
How much did it score?
Just above 18 on the public set but around 12-14 on the test set, depending on the random split. So not very good, but I suspect more external data and perhaps some feature engineering would help. Alas, not enough time in the end.
Could someone share an example of the RF or xgboost setup used in this competition? I want to see how to deal with target.csv. The target is multi-column.
@bikash I used these custom functions for CV and train with xgboost - https://gist.github.com/aakansh9/14dc322ae72ff144a311de5a955ac6aa . Code is not so optimized but it did work.