What's your strategy?

We’d love to hear what methods are working well in this competition! This competition is just for fun, so we want to treat it as a learning opportunity. Sharing your process and tools helps community members who are launching their data science careers learn and improve.

  • Do you use Python or R? Julia or Java? Stata or SAS?
  • Are you preprocessing any of the features?
  • Are you using an ensemble of methods or leaning on something standard?
  • What features of the data help or hurt your solutions?
3 Likes

Current Rank: 22
Toolset: R, nnet ensemble
Variables: Use all variables except volume, due to its high correlation with the number of donations. Derived one new variable: average donations per donation period. Treat the target as a factor.
Preprocessing: Scaled all numeric variables. Outlier removal doesn’t work for me (so far).
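For anyone who wants to try the same derived variable, here is a minimal sketch in Python/pandas (the poster works in R; the file name and the guard against a zero-length period are assumptions):

import pandas as pd

df = pd.read_csv('BloodTRAINING.csv')  # assumed file name

# Donation period: months between first and last donation
period = df['Months since First Donation'] - df['Months since Last Donation']

# Average donations per donation period; clip guards against a zero-length period
df['avg_donations_per_period'] = df['Number of Donations'] / period.clip(lower=1)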

Any ideas on other derived variables? Other suggestions on how I can move from 0.4446 to 0.4223?

3 Likes

@BKR

Thanks for the info. I tried a GBM at first, but it seemed too keen on predicting everything as false.
After reading your post I tried the avNNet package in R (via caret), which averages models built with the nnet package.

As for feature engineering: the volume feature is of no use; all donations seem to be 250cc in size. I also derived a donations-per-period feature, which is quite useful according to randomForest’s importance measures. Another feature I use is the ratio between the months since last donation and the months since first donation.

To summarize, then:
Current Rank: 33, score = 0.4492
Toolset: R, avNNet package (via caret)
Feature engineering: Drop volume; add donations per period and the ratio between months since last and months since first donation.
Preprocessing: Centered and scaled
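For Python users, a rough analogue of what avNNet does is to average the predicted probabilities of several small nets trained from different seeds; below is a sketch with scikit-learn’s MLPClassifier standing in for nnet (X and y are assumed to hold the engineered features and target):

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

def averaged_nnet_probs(X, y, n_models=5):
    # Centre and scale, as in the post, then average several small nets,
    # which is roughly what caret's avNNet does with nnet models
    X_scaled = StandardScaler().fit_transform(X)
    probs = [
        MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=seed)
        .fit(X_scaled, y)
        .predict_proba(X_scaled)[:, 1]
        for seed in range(n_models)
    ]
    return np.mean(probs, axis=0)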

1 Like

I’ve tried the h2o deep learning package and random forest.
Since deep learning performed very well on the training set but poorly on the test set, I switched to the randomForest package. My score improved a lot when I tuned the sampsize parameter.

Current Rank: 5, score = 0.4325
Toolset: R, randomForest package
Feature engineering: Drop volume; derived new variable = Tenure / Frequency
Preprocessing: removed some outliers
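In R, sampsize controls how many rows (optionally per class) each tree is grown from. scikit-learn has no identical knob, but max_samples and class_weight='balanced_subsample' play similar roles; a hedged sketch of tuning the per-tree sample fraction, with X and y assumed:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# max_samples caps the bootstrap sample each tree sees (akin to sampsize);
# 'balanced_subsample' reweights classes within each bootstrap, a rough
# stand-in for passing a per-class vector to sampsize in R
for frac in (0.3, 0.5, 0.7, None):  # None = use all rows
    rf = RandomForestClassifier(n_estimators=500, max_samples=frac,
                                class_weight='balanced_subsample',
                                random_state=0)
    logloss = -cross_val_score(rf, X, y, cv=10, scoring='neg_log_loss').mean()
    print(frac, logloss)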

4 Likes

Without doing any special preprocessing or feature engineering yet, I achieved results using a generalized linear model (logistic regression). It worked better than random forest. Donated volume has a strong correlation with the number of donations, hence volume is of no use.
Current Rank: 31, score=0.4457
Toolset: R, packages: glm, randomForest

1 Like

My score moved down by 0.01 point when I added principal components to the data. Don’t know if that might help much.

Furthermore, average time between donations was also useful. I think further improvements might be achieved by working on it a little more.
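If anyone wants to check the effect of principal components quickly, here is a scikit-learn sketch (the pipeline and the choice of two components are assumptions; X and y are the features and target):

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Compare log loss with and without a PCA step in front of the classifier
with_pca = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
without = make_pipeline(StandardScaler(), LogisticRegression())
for name, pipe in (('with PCA', with_pca), ('without', without)):
    score = -cross_val_score(pipe, X, y, cv=10, scoring='neg_log_loss').mean()
    print(name, score)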

DrivenData’s Predict Blood Donations

Feature engineering did not improve scores in most cases. Scaling was used for algorithms that required it. Hyper-parameters were estimated by GridSearchCV, a brute-force stratified 10-fold cross-validated search.
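A sketch of what such a search looks like in scikit-learn (the grid values here are illustrative, not the ones actually used):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Brute-force stratified 10-fold search over an illustrative grid
search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={'n_estimators': [100, 300],
                'learning_rate': [0.01, 0.1],
                'max_depth': [2, 3]},
    scoring='neg_log_loss',
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)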

leaderboard_score is the contest score for predictions on the unknown test set; lower is better. Camel-case model names refer to scikit-learn models; lower-case names were hand-crafted in some way.

model leaderboard_score
bagged_nolearn 0.4313
ensemble of averages 0.4370
voting ensemble 0.4396
LogisticRegression 0.4411
bagged_logit 0.4442
GradientBoostingClassifier 0.4452
LogisticRegressionCV 0.4457
bagged_scikit_nn 0.4465
bagged_gbc 0.4527
nolearn 0.4566
ExtraTreesClassifier 0.4729
blending ensemble 0.4834
XGBClassifier 0.4851
BaggingClassifier 0.4885
scikit_nn 0.5020
boosted_svc 0.5334
SVC 0.5336
SGDClassifier 0.5670
cosine_similarity 0.5732
boosted_logit 0.5891
KMeans 0.6289
AdaBoostClassifier 0.6642
KNeighborsClassifier 1.1870
RandomForestClassifier 1.7907

Simple logistic regression did quite well; it seems odd that bagging and boosting both reduced its performance. In general, though, ensembling did improve performance.


A number of statistics were recorded for each model from 10-fold CV predictions of the training data (a sketch of how these might be computed follows the list):

  • accuracy: the proportion correctly predicted

  • logloss: the sklearn.metrics.log_loss

  • AUC: the area under the ROC curve

  • f1: the weighted average of precision and recall

  • mu: the average over 100 cross-validated scores with permutations

  • std: the stdev over 100 cross-validated scores with permutations
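Here is a sketch of how these might be computed (model, X, and y are assumed; the exact permutation scheme behind mu and std is not spelled out in the post, so repeated shuffled CV stands in for it):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
proba = cross_val_predict(model, X, y, cv=cv, method='predict_proba')[:, 1]
pred = (proba >= 0.5).astype(int)

stats = {
    'accuracy': accuracy_score(y, pred),
    'logloss': log_loss(y, proba),
    'AUC': roc_auc_score(y, proba),
    'f1': f1_score(y, pred, average='weighted'),
}

# 10 repeats x 10 folds = 100 cross-validated scores
scores = np.concatenate([
    cross_val_score(model, X, y, scoring='accuracy',
                    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=i))
    for i in range(10)
])
stats['mu'], stats['std'] = scores.mean(), scores.std()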

Starting with all the variables, R’s step function produced the following model:

Call:
lm(formula = leaderboard_score ~ mu + std, data = score_data,
    na.action = na.omit)

Residuals:
     Min       1Q   Median       3Q      Max
-0.18728 -0.05472 -0.03539  0.02082  0.42898

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   25.722      2.962   8.685 3.09e-07 ***
mu           -33.089      3.897  -8.490 4.11e-07 ***
std          -60.589      7.857  -7.711 1.35e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1499 on 15 degrees of freedom
  (8 observations deleted due to missingness)
Multiple R-squared:  0.8311,	Adjusted R-squared:  0.8086
F-statistic: 36.91 on 2 and 15 DF,  p-value: 1.61e-06
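For Python readers, the final model that step() settled on can be reproduced with statsmodels (assuming a score_data frame with leaderboard_score, mu, and std columns):

import statsmodels.formula.api as smf

# Equivalent of R's lm(leaderboard_score ~ mu + std), dropping missing rows
fit = smf.ols('leaderboard_score ~ mu + std', data=score_data, missing='drop').fit()
print(fit.summary())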

Possibly std is a stand-in for statistical-learning’s variance.


The work is available on GitHub and BitBucket. (Only GitHub permits the viewing of IPython notebooks).

I am very new to programming and Data Science; however, I am eager to learn! My background is in biology and education, but I am looking to make the shift to Data Science. I am currently taking a part-time class and considering further full-time courses.

I actually plan on using this data set as a final project for my class.

Toolset: Python, statsmodels (to start off)

I plan on performing a simple logistic regression with one variable to start off. My problem is deciding which one. I would love to apply more variables, but again, I’m just a beginner. The variable I plan on starting with is the total number of donations.

Any suggestions / advice are greatly appreciated!
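Since statsmodels is the stated toolset, a minimal single-variable logistic regression might look like the following sketch (the file name is assumed; ‘Number of Donations’ is taken to be the column holding the total number of donations):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('BloodTRAINING.csv')  # assumed file name
y = df['Made Donation in March 2007']
X = sm.add_constant(df['Number of Donations'])  # intercept plus one predictor

model = sm.Logit(y, X).fit()
print(model.summary())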

Current Rank: 107 (0.4416)
Toolset: R, glm, xgboost, caret
Variables: All variables, some feature engineering.
Preprocessing: Tried to balance classes, but no improvement at all.

Linear models worked better for me than ensemble methods (bagging or boosting). I’m surely missing something, but I haven’t figured out what yet.

1 Like

Score: 0.4415
Rank: 111
Toolset: R, party::cforest() and glm()
Preprocessing: Dropped volume and added average donation

I averaged the results from the random forest and the regression function, and changed predictions that came out negative to zero.
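In Python terms, the blending step amounts to something like this sketch (rf_pred and glm_pred are assumed arrays of predicted probabilities; capping at 1 is an extra guard the post does not mention):

import numpy as np

# Average the two models' predictions; floor negatives at zero as in the post
blended = (rf_pred + glm_pred) / 2
blended = np.clip(blended, 0.0, 1.0)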

Anyone using Python? I’ve encountered a problem using its log_loss metric to select features/classifiers. It seems to me that the log loss function scores in a way totally different from the one produced by R.
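One frequent source of confusion worth ruling out: sklearn.metrics.log_loss expects predicted probabilities, not hard 0/1 labels, and it clips extreme values before taking logs. A minimal check with made-up numbers:

from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]
p_hat = [0.1, 0.8, 0.6, 0.3]  # predicted probability of class 1 for each row

# Passing hard labels instead of probabilities is a common reason the
# Python and R numbers refuse to match
print(log_loss(y_true, p_hat))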

Current Rank: 105/2265 (top 5%), score = 0.4396
Toolset: Python, scikit-learn (LogisticRegression with CV)
Feature engineering: New variable = log(“Months since First Donation” - “Months since Last Donation”); drop volume (perfectly linearly correlated with number of donations)
Preprocessing: no outlier removal
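A sketch of that setup in scikit-learn (the file name is assumed; log1p is used here as a guard against a zero difference, which the post does not address):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV

df = pd.read_csv('BloodTRAINING.csv')  # assumed file name
span = df['Months since First Donation'] - df['Months since Last Donation']
df['log_span'] = np.log1p(span)  # log of the first-minus-last difference

X = df[['Months since Last Donation', 'Number of Donations',
        'Months since First Donation', 'log_span']]  # volume dropped
y = df['Made Donation in March 2007']

clf = LogisticRegressionCV(cv=10, scoring='neg_log_loss').fit(X, y)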

1 Like

Hi,
the best way to start is to look at the correlation between the target and the single feature you want to include.
To do so, just plot the distribution of each variable with matplotlib.pyplot.hist(), using one color for one modality of the target variable and another color for the other.
The top feature would be the one whose distributions are most separable across the two colors…

By way of an example, see below the histograms showing the distribution of each variable available in the training dataset:

The colors “blue” and “green” stand for the modalities 1 and 0 of the target variable (Made Donation in March 2007), respectively. We see that none of the variables is clearly separable in the 1D plane, AND this is generally the case in most “real life” datasets, BUT that does not mean the data is not separable in a higher-dimensional space!

A better idea may be to consider a combination of different features. Maybe you can try to divide the number of donations by the difference between months since first and last donation. See the distribution in the next post.

It is more separable, but we are still far from having two non-overlapping bumps far away from each other. You can try other combinations; the best one is whichever gives the best result, so you can search for it by iterating on the training dataset (best done with cross-validation).
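A minimal version of that plot in matplotlib (the file name is assumed; the ratio feature follows the suggestion above):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('BloodTRAINING.csv')  # assumed file name
target = df['Made Donation in March 2007']

# The suggested combination: donations divided by the first-to-last span
span = (df['Months since First Donation'] - df['Months since Last Donation']).clip(lower=1)
feature = df['Number of Donations'] / span

# One overlaid histogram per class; separable bumps suggest a useful feature
plt.hist(feature[target == 1], bins=30, alpha=0.5, color='blue', label='donated (1)')
plt.hist(feature[target == 0], bins=30, alpha=0.5, color='green', label='did not donate (0)')
plt.legend()
plt.show()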

Mathieu


Hey, thanks for sharing the approach.
I’ve implemented random forest too (via caret), but I don’t really understand how to tune sampsize here or the impact it can have. Could you please help me with how to go about it?

Hey, the approach really helped me, thanks!
Could you please share your code? As I’m relatively new to neural nets, it would help me understand better.

Hello. I am new to Data Science. I work with Python and I am trying to solve the problem, but I have not done well enough with the submissions. I even tried your method but couldn’t do well enough; maybe I have not done all the steps properly. This is my first real-world problem. Can you please help me?

Regards,

Mitul

Hi all, I’m a beginner. I tried out this code to do a simple test (it’s my first time doing this without DataCamp helping me along). Any comments are greatly appreciated!

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Raw string so the backslashes in the Windows path are not read as escapes
df = pd.read_csv(r'C:\Users\Roger.Hunt\Downloads\BloodTRAINING.csv')
y = df['Made Donation in March 2007']
# Double brackets keep X two-dimensional, which scikit-learn expects
X = df[['Months since First Donation']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

I forgot to mention that I was getting a whole bunch of errors… which is, of course, why I am posting :slight_smile:

I know it’s too late now, but I am using Python for this competition. What problem are you having with your log loss function?