Thought it would be useful to share the non-winning solutions as well, so here's my solution, done in collaboration with @MasterAwk. Link to the PDF of the report here (we chose to work on this for a class project, hence the pretty report). We didn't have much time to spend on this, so it's a fairly rough attempt. Anyway, a summary is given below:
Result: Rank 34, 0.261 public leaderboard score, 0.266 private leaderboard score
Models (didn’t do any ensembling!):
- Gradient boosting machine (this worked best; a sketch of both models follows this list)
- Random forest
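Neither model is specified beyond its name here (the details are in the report), so this is just a minimal sketch of both, assuming scikit-learn; the hyperparameters and toy data are placeholders, not our actual setup:

```python
# Minimal sketch of the two models, assuming scikit-learn; random_state and
# the toy data below are illustrative placeholders, not our actual setup.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))      # placeholder feature matrix
y_train = rng.integers(0, 2, size=100)   # placeholder labels

gbm = GradientBoostingClassifier(random_state=0)  # this one scored best for us
rf = RandomForestClassifier(random_state=0)

gbm.fit(X_train, y_train)
rf.fit(X_train, y_train)
```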
Feature engineering:
- Number of numeric features with missing values for each woman
- Number of ordinal features with missing values for each woman
- Number of categorical features with missing values for each woman (a sketch of these count features follows this list)
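These count features are straightforward to compute; here's a sketch assuming pandas, with hypothetical column names standing in for the real survey variables:

```python
# Count-of-missing features per variable type, assuming pandas; the column
# names here are hypothetical stand-ins for the real survey variables.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0],          # numeric
    "education_level": [2, 1, np.nan],    # ordinal
    "region": ["north", None, "south"],   # categorical
})
numeric_cols = ["age"]
ordinal_cols = ["education_level"]
categorical_cols = ["region"]

# One row-wise missing-value count per variable type.
df["n_missing_numeric"] = df[numeric_cols].isna().sum(axis=1)
df["n_missing_ordinal"] = df[ordinal_cols].isna().sum(axis=1)
df["n_missing_categorical"] = df[categorical_cols].isna().sum(axis=1)
```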
Feature selection:
- Features whose proportion of missing values in the training set exceeded a certain cut-off were dropped, since missing-value imputation is hardly meaningful for such features. A cut-off of 90% worked best; the rule is sketched below.
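In pandas terms, the cut-off rule might look something like this (the toy data is illustrative; only the 0.9 cut-off is from our report):

```python
# Drop features whose training-set missing proportion exceeds the cut-off;
# the toy DataFrame is illustrative, only the 0.9 cut-off is from our report.
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "mostly_missing": [np.nan] * 19 + [1.0],  # 95% missing -> dropped
    "mostly_present": np.arange(20.0),        # 0% missing  -> kept
})

cutoff = 0.9
missing_frac = train.isna().mean()                 # per-column NaN proportion
keep = missing_frac.index[missing_frac <= cutoff]  # columns to retain
train = train[keep]
```

The same column selection should of course be applied to the test set as well.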
Missing value imputation:
- For numeric features, missing values were set to 0.
- For ordinal features, missing values were set to -1, since the lowest category for each ordinal feature is coded as either 0 or 1 in the training set.
- For categorical features, a new category “missing” was introduced and missing values were set to this category instead (all three rules are sketched after this list).
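As a sketch of the three rules, again assuming pandas with hypothetical column names:

```python
# The three imputation rules, assuming pandas; column names are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [1200.0, np.nan],        # numeric
    "education_level": [3.0, np.nan],  # ordinal, lowest level coded 0 or 1
    "region": ["north", None],         # categorical
})

df["income"] = df["income"].fillna(0)                     # numeric -> 0
df["education_level"] = df["education_level"].fillna(-1)  # ordinal -> -1
df["region"] = df["region"].fillna("missing")             # categorical -> new level
```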
What we should have done:
- More models + ensembling
- Cross validation (a rough sketch combining both ideas is given below)
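For what it's worth, here's roughly what that could have looked like with scikit-learn: a soft-voting ensemble of the two models, evaluated with 5-fold cross-validation. All choices here are illustrative, not something we actually ran:

```python
# Hypothetical ensembling + cross-validation we should have done; assuming
# scikit-learn, with illustrative data, models, and fold count.
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))     # placeholder feature matrix
y = rng.integers(0, 2, size=200)  # placeholder labels

# Average the two models' predicted probabilities ("soft" voting).
ensemble = VotingClassifier(
    estimators=[("gbm", GradientBoostingClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0))],
    voting="soft",
)
scores = cross_val_score(ensemble, X, y, cv=5)  # 5-fold CV
print(scores.mean(), scores.std())
```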
Of course, sharing your solutions + any feedback would be welcome!