Other (non-winning) solutions

Thought it would be useful to share the non-winning solutions as well, so here’s my solution, done in collaboration with @MasterAwk. Link to PDF of report here (we chose to work on this for a class project, hence the pretty report). We didn’t have too much time to spend on this, so it’s a pretty rough attempt. Anyway, summary given below:

Result: Rank 34, 0.261 public leaderboard score, 0.266 private leaderboard score

Models (didn’t do any ensembling!):

  • Gradient boosting machine (this worked best)
  • Random forest
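Our actual solution was written in R, but as a rough Python sketch of the two models above (scikit-learn's `GradientBoostingClassifier` and `RandomForestClassifier` standing in for whatever packages you prefer, on toy data in place of the real competition data):

```python
# Illustrative sketch only: fit a GBM and a random forest on toy data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

gbm_acc = gbm.score(X_te, y_te)
rf_acc = rf.score(X_te, y_te)
print(f"GBM accuracy: {gbm_acc:.3f}, RF accuracy: {rf_acc:.3f}")
```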

Feature engineering:

  • Number of numeric features with missing values for each woman
  • Number of ordinal features with missing values for each woman
  • Number of categorical features with missing values for each woman
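A minimal pandas sketch of those three missingness-count features (the column names and the split into types here are made up for illustration; in the competition they came from the data dictionary):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real training set.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0],       # numeric
    "education": [1.0, 2.0, np.nan],   # ordinal
    "region": ["A", None, None],       # categorical
})
numeric_cols, ordinal_cols, categorical_cols = ["age"], ["education"], ["region"]

# One missing-value count per feature type, per row (i.e. per woman).
df["n_missing_numeric"] = df[numeric_cols].isna().sum(axis=1)
df["n_missing_ordinal"] = df[ordinal_cols].isna().sum(axis=1)
df["n_missing_categorical"] = df[categorical_cols].isna().sum(axis=1)

print(df[["n_missing_numeric", "n_missing_ordinal", "n_missing_categorical"]])
```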

Feature selection:

  • Features with a proportion of missing values exceeding a certain cut-off in the training set would be dropped, since missing value imputation is hardly meaningful for such features. A cut-off of 90% worked best.
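The cut-off rule above is straightforward to sketch in pandas (again a Python illustration, not our R code):

```python
import numpy as np
import pandas as pd

def drop_mostly_missing(train: pd.DataFrame, cutoff: float = 0.9) -> pd.DataFrame:
    """Drop columns whose proportion of missing values exceeds `cutoff`."""
    keep = train.columns[train.isna().mean() <= cutoff]
    return train[keep]

# Toy example: the 95%-missing column is dropped, the 5%-missing one is kept.
train = pd.DataFrame({
    "mostly_missing": [np.nan] * 19 + [1.0],
    "mostly_present": [1.0] * 19 + [np.nan],
})
print(drop_mostly_missing(train).columns.tolist())  # ['mostly_present']
```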

Missing value imputation:

  • For numeric features, missing values were set to 0.
  • For ordinal features, missing values were set to -1, since the lowest category for each ordinal feature is coded as either 0 or 1 in the training set.
  • For categorical features, a new category “missing” was introduced and missing values were set to this category instead.
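The three imputation rules can be sketched as one small function (Python/pandas illustration; column names are hypothetical):

```python
import numpy as np
import pandas as pd

def impute(df, numeric_cols, ordinal_cols, categorical_cols):
    out = df.copy()
    out[numeric_cols] = out[numeric_cols].fillna(0)       # numeric -> 0
    out[ordinal_cols] = out[ordinal_cols].fillna(-1)      # ordinal -> -1, below the lowest code (0 or 1)
    out[categorical_cols] = out[categorical_cols].fillna("missing")  # new "missing" category
    return out

df = pd.DataFrame({
    "age": [25.0, np.nan],
    "education": [2.0, np.nan],
    "region": ["A", None],
})
imputed = impute(df, ["age"], ["education"], ["region"])
print(imputed)
```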

What we should have done:

  • More models + ensembling
  • Cross validation
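For the cross validation we skipped, even a minimal setup would have helped with model selection; a scikit-learn sketch on toy data (in the competition you'd score with the leaderboard metric rather than accuracy):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(random_state=0), X, y, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```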

Of course, sharing your solutions + any feedback would be welcome!


Thanks, looks nice; it will take a while to digest… Wish it were Python, but oh well… I need to learn more R