We’d love to hear what methods are working well in this competition! This competition is just for fun, so we want to treat it as a learning opportunity. Sharing your process and tools helps community members who are launching their data science careers learn and improve.

Do you use Python or R? Julia or Java? Stata or SAS?

Are you preprocessing any of the features?

Are you using an ensemble of methods or leaning on something standard?

What features of the data help or hurt your solutions?

Current Rank: 22
Toolset: R, nnet ensemble
Variables: All variables except volume, due to its high correlation with number of donations. Derived one new variable: average donations per donation period. Treat the target as a factor.
Preprocessing: Scaled all numeric variables. Outlier removal doesn’t work for me (so far).
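For readers working in Python rather than R, the derived variable and scaling described above might look roughly like this (a sketch on stand-in data; column names are assumed from the competition dataset):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the competition training data.
df = pd.DataFrame({
    "Months since Last Donation": [2, 4, 11],
    "Months since First Donation": [50, 28, 35],
    "Number of Donations": [10, 5, 3],
})

# Derived variable: average donations per donation period.
df["Donations per Period"] = (
    df["Number of Donations"] / df["Months since First Donation"]
)

# Scale all numeric variables, as described in the post.
scaled = StandardScaler().fit_transform(df)
```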

Any ideas on other derived variables? Other suggestions on how I can move from 0.4446 to 0.4223?

Thanks for the info. I tried a GBM at first, but it seemed too keen on predicting everything as false.
After reading your post I tried the avNNet package in R (powered by caret), which averages models built with the nnet package.

As for feature engineering: the volume feature is of no use, since all donations appear to be 250 cc in size. I also derived a donations-per-period feature, which is quite useful according to randomForest’s importance measures. Another feature I use is the ratio between the months since last donation and the months since first donation.

To summarize:
Current Rank: 33, score = 0.4492
Toolset: R, avNNet package (via caret)
Feature engineering: Drop volume; add donations per period and the ratio between months since last and months since first donation.
Preprocessing: Centered and scaled
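avNNet has no direct scikit-learn equivalent, but bagging several small MLPs gives a similar model-averaging effect. A rough Python analogue on toy data (an assumed equivalence, not the author’s exact setup):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))             # stand-in features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in binary target

model = make_pipeline(
    StandardScaler(),                     # centered and scaled, as above
    BaggingClassifier(
        MLPClassifier(hidden_layer_sizes=(5,), max_iter=500, random_state=0),
        n_estimators=5,                   # average five small nets
        random_state=0,
    ),
)
model.fit(X, y)
proba = model.predict_proba(X)[:, 1]      # averaged class probabilities
```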

I’ve tried the h2o deep learning package and random forest.
Since deep learning performed very well on the training set but poorly on the test set, I switched to the randomForest package. My score improved a lot when I tuned the sampsize parameter.

Current Rank: 5, score = 0.4325
Toolset: R, randomForest package
Feature engineering: Drop volume; derived new variable = Tenure / Frequency
Preprocessing: Removed some outliers
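For anyone replicating this in Python: scikit-learn has no sampsize argument, but max_samples and class_weight play a similar role in limiting and rebalancing what each tree sees. A hedged sketch on synthetic, imbalanced toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (rng.random(300) < 0.25).astype(int)   # imbalanced, like the donations target

# R's randomForest sampsize limits how many rows (optionally per class)
# each tree draws. The closest scikit-learn levers are max_samples and
# class_weight="balanced_subsample".
rf = RandomForestClassifier(
    n_estimators=200,
    max_samples=0.5,                       # each tree sees half the rows
    class_weight="balanced_subsample",
    random_state=0,
)
rf.fit(X, y)
```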

Without doing any special preprocessing or feature engineering yet, I achieved results using a generalized linear model (logistic regression). It worked better than random forest. Donated volume is strongly correlated with the number of donations, hence volume is of no use.
Current Rank: 31, score = 0.4457
Toolset: R, packages: glm, randomForest

Feature engineering did not improve scores in most cases. Scaling was used for algorithms that required it. Hyper-parameters were estimated by GridSearchCV, a brute-force stratified 10-fold cross-validated search.
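For concreteness, a minimal version of that kind of grid search, on synthetic data and with an assumed parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Brute-force search over C with stratified 10-fold CV, scored on log loss
# (the competition metric; lower is better, hence neg_log_loss).
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="neg_log_loss",
)
search.fit(X, y)
best_C = search.best_params_["C"]
```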

leaderboard_score is the contest score for predictions of the unknown test-set; lower is better. Camel-case model names refer to scikit-learn models; lower-case were hand-crafted in some way.

model                        leaderboard_score
bagged_nolearn               0.4313
ensemble of averages         0.4370
voting ensemble              0.4396
LogisticRegression           0.4411
bagged_logit                 0.4442
GradientBoostingClassifier   0.4452
LogisticRegressionCV         0.4457
bagged_scikit_nn             0.4465
bagged_gbc                   0.4527
nolearn                      0.4566
ExtraTreesClassifier         0.4729
blending ensemble            0.4834
XGBClassifier                0.4851
BaggingClassifier            0.4885
scikit_nn                    0.5020
boosted_svc                  0.5334
SVC                          0.5336
SGDClassifier                0.5670
cosine_similarity            0.5732
boosted_logit                0.5891
KMeans                       0.6289
AdaBoostClassifier           0.6642
KNeighborsClassifier         1.1870
RandomForestClassifier       1.7907

Simple logistic regression did quite well; it seems odd that bagging and boosting both reduced its performance. In general though, ensembling did improve performances.

A number of statistics were recorded for each model from 10-fold CV predictions of the training data:

accuracy: the proportion correctly predicted
logloss: the sklearn.metrics.log_loss
AUC: the area under the ROC curve
f1: the weighted average of precision and recall
mu: the average over 100 cross-validated scores with permutations
std: the stdev over 100 cross-validated scores with permutations
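Those statistics can be computed from out-of-fold predictions like this (synthetic data; the permutation-based mu and std are omitted for brevity):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Out-of-fold probabilities from 10-fold CV, then the per-model statistics.
proba = cross_val_predict(clf, X, y, cv=10, method="predict_proba")[:, 1]
pred = (proba >= 0.5).astype(int)

stats = {
    "accuracy": accuracy_score(y, pred),
    "logloss": log_loss(y, proba),
    "AUC": roc_auc_score(y, proba),
    "f1": f1_score(y, pred, average="weighted"),
}
```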

Starting with all the variables, R’s step function produced the following model:

Call:
lm(formula = leaderboard_score ~ mu + std, data = score_data,
na.action = na.omit)
Residuals:
Min 1Q Median 3Q Max
-0.18728 -0.05472 -0.03539 0.02082 0.42898
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.722 2.962 8.685 3.09e-07 ***
mu -33.089 3.897 -8.490 4.11e-07 ***
std -60.589 7.857 -7.711 1.35e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1499 on 15 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 0.8311, Adjusted R-squared: 0.8086
F-statistic: 36.91 on 2 and 15 DF, p-value: 1.61e-06

Possibly std is a stand-in for statistical-learning’s variance.

The work is available on GitHub and BitBucket. (Only GitHub permits the viewing of IPython notebooks).

I am very new to programming and Data Science; however, I am eager to learn! My background is in biology and education, but I am looking to make the shift to Data Science. I am currently taking a part-time class and considering further full-time courses.

I actually plan on using this data set as a final project for my class.

Toolset: Python, statsmodels (to start off)

I plan on performing a simple logistic regression with one variable to start off. My problem is deciding which one. I would love to apply more variables, but again, just a beginner. The variable I plan on starting off with is total number of donations.

Current Rank: 107 (0.4416)
Toolset: R, glm, xgboost, caret
Variables: All variables, some feature engineering.
Preprocessing: Tried to balance classes, but no improvement at all.

Linear models worked better for me than ensemble methods (bagging or boosting). I’m surely missing something, but I haven’t figured out what yet.

Anyone using Python? I ran into a problem using its log_loss metric to select features/classifiers. It seems to me that its log loss function scores in a way totally different from the one produced by R.

Current Rank: 105/2265 (top 5%), score = 0.4396
Toolset: Python, scikit-learn (LogisticRegression with CV)
Feature engineering: New variable = log(“Months since First Donation” - “Months since Last Donation”); drop volume (perfectly linearly correlated with number of donations)
Preprocessing: No outlier removal
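On the earlier question about Python’s log_loss: for binary targets, scikit-learn interprets a 1-D probability vector as P(positive class), and a large mismatch with R usually means swapped probability columns or label order rather than a different metric. A quick check that the function agrees with the textbook formula:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 1, 0])
p = np.array([0.1, 0.8, 0.6, 0.3])   # predicted P(class = 1)

# scikit-learn's value vs. the hand-computed binary cross-entropy.
sk = log_loss(y_true, p)
by_hand = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```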

Hi,
the best way to start is to look at the correlation between the target and the single feature you want to include.
To do so, just plot the distribution of each variable with matplotlib.pyplot.hist(), setting one color for one modality of the target variable and another color for the other.
The top feature would be the one with the most separable distribution across the two colors.
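That plotting recipe, sketched in Python with synthetic data standing in for a real feature:

```python
import matplotlib
matplotlib.use("Agg")              # render off-screen (no display needed)
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=200)
target = (rng.random(200) < 0.3).astype(int)

# One histogram per target class, overlaid in different colors.
fig, ax = plt.subplots()
ax.hist(feature[target == 1], bins=20, alpha=0.5, color="blue", label="1")
ax.hist(feature[target == 0], bins=20, alpha=0.5, color="green", label="0")
ax.legend(title="Made Donation in March 2007")
fig.savefig("feature_hist.png")
```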

By way of an example, see below the histograms showing the distribution of each variable available in the training dataset:

The colors blue and green stand for modalities 1 and 0 of the target variable (Made Donation in March 2007), respectively. We see that none of the variables is clearly separable in the 1D plane. This is generally the case in most “real life” datasets, but it does not mean the data is not separable in a higher-dimensional space!

A better idea may be to consider a combination of different features. Maybe you can try to divide the number of donations by the difference between months since first and last donation. See the distribution in the next post.

It is more separable, but we are still far from having two non-overlapping bumps far away from each other. You can try other combinations; the best one is whichever gives the best result, so you can search for it by iterating on the training dataset (best done with cross-validation).
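That combined feature can be built in pandas like so (column names assumed from the competition data; the zero-span case is guarded):

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the training data.
df = pd.DataFrame({
    "Number of Donations": [10, 2, 6],
    "Months since First Donation": [50, 16, 35],
    "Months since Last Donation": [2, 14, 4],
})

# Donation rate over the active span; guard against a zero span, where
# the first and last donation fall in the same month.
span = df["Months since First Donation"] - df["Months since Last Donation"]
df["Donation Rate"] = df["Number of Donations"] / span.replace(0, np.nan)
```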

Hey, thanks for sharing the approach.
I’ve implemented random forest too (via caret), but I don’t really understand how to tune sampsize or the impact it can have. Could you please help me with how to go about it?

Hey, the approach really helped me, thanks!
Could you please share your code? As I’m relatively new to neural nets, it would help me understand better.

Hello. I am new to Data Science. I work with Python and I am trying to solve the problem, but I have not done well enough with my submissions. I even tried your method but couldn’t do well enough; maybe I have not done all the steps properly. This is my first real-world problem. Can you please help me?

Hi all, I’m a beginner. I tried out this code to do a simple test (it’s my first time doing this without DataCamp helping me along). Any comments are greatly appreciated!

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Raw string so the backslashes in the Windows path are not treated as escapes.
df = pd.read_csv(r'C:\Users\Roger.Hunt\Downloads\BloodTRAINING.csv')
y = df['Made Donation in March 2007']
# Double brackets keep X two-dimensional, as scikit-learn expects.
X = df[['Months since First Donation']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))