I think you have to scale your data; then the results would be better than before.
This is my first post… my code is at https://github.com/Payback80/drivendata_blood_donation
My score is 0.4269 with very few lines of code.
Pre-processing: check for NAs, outliers, and multicollinearity
Feature engineering: some, check the code
Strategy: xgboost and H2O AutoML
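The pre-processing checks above (NAs, outliers, multicollinearity) can be sketched in a few lines of pandas. The column names and the tiny DataFrame below are made up for illustration, loosely modeled on the transfusion data, where total volume is a fixed multiple of the number of donations:

```python
import pandas as pd

# Hypothetical columns modeled on the blood-donation data;
# total_volume is 250 c.c. per donation, hence perfectly correlated
df = pd.DataFrame({
    "mo_last": [2, 0, 1, 2, 4],
    "n_donations": [50, 13, 16, 20, 4],
    "total_volume": [12500, 3250, 4000, 5000, 1000],
    "mo_first": [98, 28, 35, 45, 4],
})

# 1) Check for missing values
print(df.isna().sum())

# 2) Flag outliers with a simple z-score rule
z = (df - df.mean()) / df.std()
outliers = (z.abs() > 3).any(axis=1)

# 3) Check multicollinearity via the correlation matrix;
#    total_volume vs. n_donations should come out as r == 1
corr = df.corr()
print(corr.loc["n_donations", "total_volume"])
```

With a perfect correlation like that, dropping one of the two columns loses no information.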
My score: 0.4350
Model used: vanilla logistic regression with 10-fold cross validation using caret in R
Pre-processing: remove total volume (100% correlation with number of donations)
Feature engineering: added a new_donor variable (1 if months since last donation = months since first donation). I tried adding other variables, such as frequency (average months between donations) and interactions between the existing variables, but they didn't seem to improve performance much.
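The new_donor flag described above is a one-liner in pandas. This is just an illustrative sketch (the original work uses R/caret); the column names mirror the ones in the model formula:

```python
import pandas as pd

# mo_last = months since last donation, mo_first = months since first.
# A donor whose first donation is also their most recent one is "new".
df = pd.DataFrame({
    "mo_last": [2, 4, 4],
    "mo_first": [10, 4, 16],
})
df["new_donor"] = (df["mo_last"] == df["mo_first"]).astype(int)
print(df["new_donor"].tolist())  # [0, 1, 0]
```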
I have a question for anyone using the logLoss metric in caret. Do you get a negative logLoss? It's weird; I thought it should be greater than 0, but that doesn't seem to be the case.
bd_train <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                         savePredictions = TRUE, classProbs = TRUE,
                         summaryFunction = mnLogLoss)
model_bd1 <- train(donated ~ mo_last + no_donation + mo_first + new_donor,
                   data = blood_donation, method = "glm", family = "binomial",
                   trControl = bd_train, metric = "logLoss")
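On the negative logLoss question: log loss itself is non-negative by construction, since each term is minus the log of a probability. A quick numeric check (a minimal sketch in numpy, not the caret internals) confirms this, so a negative reported value usually means the tool is showing the negated metric for maximization; it may be worth checking the `maximize` argument of `train`:

```python
import numpy as np

def log_loss(y_true, p, eps=1e-15):
    """Binary log loss; each term -[y*log(p) + (1-y)*log(1-p)] is >= 0."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4])
print(log_loss(y, p))  # positive, a bit above 0.3
```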
Hi All! This is my first ever hands-on project since completing the DataCamp Data Scientist track :)
So… rank 109, score 0.4349.
Python (in PyCharm) with Keras; the deep learning model comprises a BatchNorm layer and three Dense layers.
The model reaches a loss of around 0.5 and roughly 0.72 accuracy.
Dropout layers did not help much.
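A minimal sketch of the architecture described above (BatchNorm followed by three Dense layers). The layer widths, optimizer, and the assumption of four input features are guesses, not the poster's actual settings:

```python
from tensorflow import keras
from tensorflow.keras import layers

# BatchNorm + three Dense layers; widths and optimizer are assumptions
model = keras.Sequential([
    layers.Input(shape=(4,)),               # four engineered features assumed
    layers.BatchNormalization(),
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # donation probability
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Trained with `binary_crossentropy`, the reported loss is directly comparable to the competition's log loss metric.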