Data leak in country C

After checking it 10 times, I can state that country C has a data leak. Here is the R code to reproduce it:

library(data.table)
library(caret)
library(h2o)
library(dplyr)
library(plyr)

train_c_hhold <- fread("./data/C_hhold_train.csv", stringsAsFactors = TRUE)
test_c_hhold <- fread("./data/C_hhold_test.csv", stringsAsFactors = TRUE)
combi_c = bind_rows(train_c_hhold, test_c_hhold)

# split back into the original train and test rows
combi_c$poor <- as.numeric(combi_c$poor)
dtrain_c <- combi_c[1:6469, ]
dtest_c  <- combi_c[6470:9656, ]

dtrain_c <- as.data.frame(dtrain_c)
dtrain_c[sapply(dtrain_c, is.character)] <- list(NULL)

dtrain_c$poor <- as.factor(dtrain_c$poor)

Create a stratified random sample for the train and validation sets:

trainIndex <- createDataPartition(dtrain_c$poor , p=0.85, list=FALSE, times=1)
dtrain_c.train <- dtrain_c[ trainIndex, ]
dtrain_c.test <- dtrain_c[-trainIndex, ]
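Side note (my addition, not part of the original post): createDataPartition stratifies on the outcome, so the class proportions should be nearly identical across the split. A quick way to verify that with the real frames is prop.table(table(dtrain_c.train$poor)) versus prop.table(table(dtrain_c.test$poor)). Toy illustration in base R:

```r
# A stratified split preserves the class ratio; prop.table makes it visible.
# (made-up vector here; use dtrain_c.train$poor / dtrain_c.test$poor in practice)
poor <- factor(c(rep(0, 850), rep(1, 150)))
print(prop.table(table(poor)))
```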

Identify predictors and response

y <- "poor"
x <- "DBjxSUvf" # golden feature or leak?

h2o.init(nthreads = -1, max_mem_size = "16g")
df <- as.h2o(dtrain_c.train)
test <- as.h2o(dtrain_c.test)
################################################
splits <- h2o.splitFrame(
  data = df,
  ratios = c(0.8),
  destination_frames = c("train.hex", "valid.hex"),
  seed = 1234
)
train <- splits[[1]]
valid <- splits[[2]]

xgb <- h2o.xgboost(training_frame = train,
                   validation_frame = valid,
                   x = x,
                   y = y)
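To see why DBjxSUvf behaves like a leak rather than an ordinary predictor, a simple cross-tabulation against the target is telling: if one column nearly determines poor, its levels map almost one-to-one onto the classes. This sketch uses made-up values (the real check would be table(dtrain_c$DBjxSUvf, dtrain_c$poor)):

```r
# On toy data: a leaked column's levels line up almost perfectly with the
# target, which a cross-tab exposes immediately.
toy <- data.frame(
  DBjxSUvf = c("a", "a", "a", "b", "b", "c"),
  poor     = c(1, 1, 1, 0, 0, 0)
)
print(table(toy$DBjxSUvf, toy$poor))
```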

results:

Model Details:

H2OBinomialModel: xgboost
Model ID: XGBoost_model_R_1518962213119_32960
Model Summary:
number_of_trees
1 50

H2OBinomialMetrics: xgboost
** Reported on training data. **
** Metrics reported on training frame **

MSE: 0.003402182
RMSE: 0.05832822
LogLoss: 0.01268662
Mean Per-Class Error: 0.007296737
AUC: 0.9998939
Gini: 0.9997879

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 3730 5 0.001339 =5/3735
1 9 670 0.013255 =9/679
Totals 3739 675 0.003172 =14/4414

Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.331701 0.989660 22
2 max f2 0.228894 0.989415 29
3 max f0point5 0.415105 0.993120 15
4 max accuracy 0.339843 0.996828 21
5 max precision 0.998890 1.000000 0
6 max recall 0.077338 1.000000 45
7 max specificity 0.998890 1.000000 0
8 max absolute_mcc 0.331701 0.987793 22
9 max min_per_class_accuracy 0.204525 0.994109 32
10 max mean_per_class_accuracy 0.204525 0.994109 32

Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>) or h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)
H2OBinomialMetrics: xgboost
** Reported on validation data. **
** Metrics reported on validation frame **

MSE: 0.007106412
RMSE: 0.08429954
LogLoss: 0.03168671
Mean Per-Class Error: 0.02684564
AUC: 0.9957525
Gini: 0.9915051

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 937 0 0.000000 =0/937
1 8 141 0.053691 =8/149
Totals 945 141 0.007366 =8/1086

Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.998890 0.972414 0
2 max f2 0.314007 0.965147 7
3 max f0point5 0.998890 0.988780 0
4 max accuracy 0.998890 0.992634 0
5 max precision 0.998890 1.000000 0
6 max recall 0.000887 1.000000 49
7 max specificity 0.998890 1.000000 0
8 max absolute_mcc 0.998890 0.968658 0
9 max min_per_class_accuracy 0.032651 0.975454 18
10 max mean_per_class_accuracy 0.314007 0.980020 7

Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>) or h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)

perf_xgb_test <- h2o.performance(xgb, newdata = test)

result:

H2OBinomialMetrics: xgboost

MSE: 0.008702358
RMSE: 0.09328643
LogLoss: 0.03661993
Mean Per-Class Error: 0.02474473
AUC: 0.9957399
Gini: 0.9914797

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 823 1 0.001214 =1/824
1 7 138 0.048276 =7/145
Totals 830 139 0.008256 =8/969

Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.594046 0.971831 2
2 max f2 0.407805 0.959945 7
3 max f0point5 0.998890 0.988456 0
4 max accuracy 0.998890 0.991744 0
5 max precision 0.998890 1.000000 0
6 max recall 0.000887 1.000000 46
7 max specificity 0.998890 1.000000 0
8 max absolute_mcc 0.998890 0.967338 0
9 max min_per_class_accuracy 0.032236 0.963592 20
10 max mean_per_class_accuracy 0.407805 0.976276 7

What do you say about it?


It’s just a golden feature :) and it is not enough for a top prediction.

How is it not enough for a good prediction if it gets 0.99 AUC?

I just checked. Seems like 0.99 is not equal to 1.

Well, if I add a couple more features I get 1.

1? Really? Can you show?

# Well, not exactly 1, but very close to it:

train_c_hhold <- fread("./data/C_hhold_train.csv", stringsAsFactors = TRUE)
test_c_hhold <- fread("./data/C_hhold_test.csv", stringsAsFactors = TRUE)
combi_c = bind_rows(train_c_hhold, test_c_hhold)

# split back into the original train and test rows
combi_c$poor <- as.numeric(combi_c$poor)
dtrain_c <- combi_c[1:6469, ]
dtest_c  <- combi_c[6470:9656, ]

dtrain_c <- as.data.frame(dtrain_c)
dtrain_c[sapply(dtrain_c, is.character)] <- list(NULL)

dtrain_c$poor <- as.factor(dtrain_c$poor)

Create a stratified random sample for the train and validation sets:

trainIndex <- createDataPartition(dtrain_c$poor , p=0.85, list=FALSE, times=1)
dtrain_c.train <- dtrain_c[ trainIndex, ]
dtrain_c.test <- dtrain_c[-trainIndex, ]

Identify predictors and response

y <- "poor"
x <- c("DBjxSUvf", "kiAJBGqv", "GIwNbAsH")

h2o.init(nthreads = -1, max_mem_size = "16g")
df <- as.h2o(dtrain_c.train)
test <- as.h2o(dtrain_c.test)
################################################
splits <- h2o.splitFrame(
  data = df,
  ratios = c(0.8),
  destination_frames = c("train.hex", "valid.hex"),
  seed = 1234
)
train <- splits[[1]]
valid <- splits[[2]]

xgb <- h2o.xgboost(training_frame = train,
                   validation_frame = valid,
                   x = x,
                   y = y)

perf_xgb_test <- h2o.performance(xgb, newdata = test)

#results:

H2OBinomialMetrics: xgboost
** Reported on training data. **
** Metrics reported on training frame **

MSE: 0.001419271
RMSE: 0.03767322
LogLoss: 0.006654008
Mean Per-Class Error: 0.0001331203
AUC: 0.9999996
Gini: 0.9999992

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 3755 1 0.000266 =1/3756
1 0 658 0.000000 =0/658
Totals 3755 659 0.000227 =1/4414

Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.385806 0.999241 60

H2OBinomialMetrics: xgboost
** Reported on validation data. **
** Metrics reported on validation frame **

MSE: 0.009072222
RMSE: 0.09524821
LogLoss: 0.05251169
Mean Per-Class Error: 0.02701644
AUC: 0.9842731
Gini: 0.9685461

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 915 1 0.001092 =1/916
1 9 161 0.052941 =9/170
Totals 924 162 0.009208 =10/1086

Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.554915 0.969880 28

AUC: 0.9842731 - it’s a little bit far from 1 :wink: Moreover, you’ve forgotten about class weights.
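On the class-weights point: one way to account for the imbalance (roughly 5–6 non-poor rows per poor row in these confusion matrices) is a per-row weight column, which H2O supervised models accept through the weights_column argument. The inverse-class-frequency scheme below is my assumption, not something from the thread; the toy frame stands in for dtrain_c.train, and after as.h2o() you would pass weights_column = "w" to h2o.xgboost().

```r
# Toy stand-in for dtrain_c.train: up-weight minority-class rows by the
# inverse class-frequency ratio, stored in a column "w".
df <- data.frame(poor = factor(c(0, 0, 0, 0, 0, 1)))
n  <- table(df$poor)
df$w <- ifelse(df$poor == "1", as.numeric(n[["0"]] / n[["1"]]), 1)
print(df$w)
```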