After checking this ten times, I can state that country C has a data leak. Here is the R code to reproduce it:
library(data.table)
library(caret)
library(h2o)
library(plyr)  # load plyr before dplyr so dplyr verbs are not masked
library(dplyr)
train_c_hhold <- fread("./data/C_hhold_train.csv", stringsAsFactors = TRUE)
test_c_hhold <- fread("./data/C_hhold_test.csv", stringsAsFactors = TRUE)
combi_c = bind_rows(train_c_hhold, test_c_hhold)
#split again
combi_c$poor <- as.numeric(combi_c$poor)
dtrain_c <- combi_c[1:nrow(train_c_hhold), ]                       # rows 1:6469
dtest_c  <- combi_c[(nrow(train_c_hhold) + 1):nrow(combi_c), ]     # rows 6470:9656
dtrain_c <- as.data.frame(dtrain_c)
# drop all character columns (id etc.), keeping numeric/factor predictors
dtrain_c[sapply(dtrain_c, is.character)] <- list(NULL)
dtrain_c$poor <- as.factor(dtrain_c$poor)
# Create a stratified random sample to build train and validation sets
trainIndex <- createDataPartition(dtrain_c$poor , p=0.85, list=FALSE, times=1)
dtrain_c.train <- dtrain_c[ trainIndex, ]
dtrain_c.test <- dtrain_c[-trainIndex, ]
# Identify predictors and response
y <- "poor"
x <- "DBjxSUvf"  # golden feature or leak?
h2o.init(nthreads = -1, max_mem_size = "16g")
df <- as.h2o(dtrain_c.train)
test <- as.h2o(dtrain_c.test)
################################################
splits <- h2o.splitFrame(
data = df,
ratios = c(0.8),
  destination_frames = c("train.hex", "valid.hex"), seed = 1234
)
train <- splits[[1]]
valid <- splits[[2]]
xgb <- h2o.xgboost(training_frame = train,
validation_frame = valid,
x=x,
y=y)
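Before trusting the H2O numbers, it is worth scoring the raw column directly: if one untransformed feature alone yields near-perfect AUC, that is a leak signal rather than a modelling achievement. A minimal sketch, assuming `DBjxSUvf` is numeric and the `pROC` package is installed:

```r
library(pROC)  # assumed installed; any AUC routine would do

# AUC of the raw feature against the label -- no model involved at all
roc_obj <- roc(dtrain_c$poor, dtrain_c$DBjxSUvf)
auc(roc_obj)

# How the feature is distributed within each class
tapply(dtrain_c$DBjxSUvf, dtrain_c$poor, summary)
```

If the two class-wise summaries barely overlap, the column encodes the target almost directly.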
Results:
Model Details:
H2OBinomialModel: xgboost
Model ID: XGBoost_model_R_1518962213119_32960
Model Summary:
number_of_trees
1 50
H2OBinomialMetrics: xgboost
** Reported on training data. **
** Metrics reported on training frame **
MSE: 0.003402182
RMSE: 0.05832822
LogLoss: 0.01268662
Mean Per-Class Error: 0.007296737
AUC: 0.9998939
Gini: 0.9997879
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 3730 5 0.001339 =5/3735
1 9 670 0.013255 =9/679
Totals 3739 675 0.003172 =14/4414
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.331701 0.989660 22
2 max f2 0.228894 0.989415 29
3 max f0point5 0.415105 0.993120 15
4 max accuracy 0.339843 0.996828 21
5 max precision 0.998890 1.000000 0
6 max recall 0.077338 1.000000 45
7 max specificity 0.998890 1.000000 0
8 max absolute_mcc 0.331701 0.987793 22
9 max min_per_class_accuracy 0.204525 0.994109 32
10 max mean_per_class_accuracy 0.204525 0.994109 32
Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>)
or h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)
H2OBinomialMetrics: xgboost
** Reported on validation data. **
** Metrics reported on validation frame **
MSE: 0.007106412
RMSE: 0.08429954
LogLoss: 0.03168671
Mean Per-Class Error: 0.02684564
AUC: 0.9957525
Gini: 0.9915051
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 937 0 0.000000 =0/937
1 8 141 0.053691 =8/149
Totals 945 141 0.007366 =8/1086
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.998890 0.972414 0
2 max f2 0.314007 0.965147 7
3 max f0point5 0.998890 0.988780 0
4 max accuracy 0.998890 0.992634 0
5 max precision 0.998890 1.000000 0
6 max recall 0.000887 1.000000 49
7 max specificity 0.998890 1.000000 0
8 max absolute_mcc 0.998890 0.968658 0
9 max min_per_class_accuracy 0.032651 0.975454 18
10 max mean_per_class_accuracy 0.314007 0.980020 7
Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>)
or h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)
perf_xgb_test <- h2o.performance(xgb, newdata = test)
Result:
H2OBinomialMetrics: xgboost
MSE: 0.008702358
RMSE: 0.09328643
LogLoss: 0.03661993
Mean Per-Class Error: 0.02474473
AUC: 0.9957399
Gini: 0.9914797
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 823 1 0.001214 =1/824
1 7 138 0.048276 =7/145
Totals 830 139 0.008256 =8/969
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.594046 0.971831 2
2 max f2 0.407805 0.959945 7
3 max f0point5 0.998890 0.988456 0
4 max accuracy 0.998890 0.991744 0
5 max precision 0.998890 1.000000 0
6 max recall 0.000887 1.000000 46
7 max specificity 0.998890 1.000000 0
8 max absolute_mcc 0.998890 0.967338 0
9 max min_per_class_accuracy 0.032236 0.963592 20
10 max mean_per_class_accuracy 0.407805 0.976276 7
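To rule out anything H2O-specific, the same check can be reproduced with a plain logistic regression on the single suspect column, using only base R (a sketch; the rank-sum identity computes AUC without extra packages):

```r
# Cross-check: base-R logistic regression on the one suspect feature
fit <- glm(poor ~ DBjxSUvf, data = dtrain_c.train, family = binomial)
p   <- predict(fit, newdata = dtrain_c.test, type = "response")

# AUC via the Mann-Whitney rank-sum identity
y  <- as.numeric(dtrain_c.test$poor) - 1   # factor levels "0"/"1" -> 0/1
r  <- rank(p)
n1 <- sum(y); n0 <- sum(1 - y)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```

If this also lands near 0.99, the signal is in the data, not in XGBoost.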
What do you think — is this a genuine leak, or a legitimate golden feature?