
Need help, beginner

I’m a beginner and would appreciate some advice, since this is a learning set and there’s no prize involved. The best log loss I’ve been able to get was 0.8 with logistic regression and GridSearch, and 0.9 with neural networks. What should I be doing to get down to 0.5 or lower?

Are you using a label binarizer or one-hot encoding on the thal field?
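For readers unfamiliar with the term, one-hot encoding turns a text column such as thal into one 0/1 column per category. Here is a minimal sketch with pandas. The rows are invented for illustration; only the column name and the three category labels come from this thread.

```python
import pandas as pd

# Toy rows, invented for illustration; only the "thal" column name and its
# category labels are taken from the thread.
df = pd.DataFrame({
    "age": [45, 61, 52],
    "thal": ["normal", "reversible_defect", "fixed_defect"],
})

# get_dummies replaces the text column with one indicator column per category.
encoded = pd.get_dummies(df, columns=["thal"], dtype=int)
print(encoded.columns.tolist())
```

Each row ends up with exactly one of the thal_* columns set to 1, which is a form a linear model or neural network can consume directly.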


Hi, the next step may be to normalize the data. You can do that by subtracting the mean and dividing by the standard deviation (for each variable). Then try repeating the logistic regression.
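The normalization described here can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the `standardize` helper and the toy numbers are hypothetical, and the statistics are taken from the training rows only so the test rows are scaled the same way.

```python
import numpy as np

def standardize(train, test):
    """Scale each column to zero mean and unit variance using training statistics only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (train - mean) / std, (test - mean) / std

# Toy numbers, invented for illustration (columns: blood pressure, cholesterol).
train = np.array([[120.0, 200.0], [140.0, 260.0], [130.0, 230.0]])
test = np.array([[150.0, 215.0]])

train_scaled, test_scaled = standardize(train, test)
print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
```

scikit-learn's StandardScaler performs the same per-column calculation and remembers the training mean and standard deviation for later use on test data.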


Yeah, I’ve used the pandas get_dummies function. Here’s my code for the Keras neural network:

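import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

numerical_list = ['resting_blood_pressure', 'serum_cholesterol_mg_per_dl',
                  'oldpeak_eq_st_depression', 'age', 'max_heart_rate_achieved', 'slope_of_peak_exercise_st_segment',
                  'num_major_vessels', 'resting_ekg_results']

categorical_list = ['chest_pain_type', 'sex', 'exercise_induced_angina',
                    'fasting_blood_sugar_gt_120_mg_per_dl', 'thal']

dummies_list = ['chest_pain_type', 'sex', 'exercise_induced_angina',
                'fasting_blood_sugar_gt_120_mg_per_dl', 'thal_normal',
                'thal_reversible_defect', 'thal_fixed_defect']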

values_file = pd.read_csv('train_values.csv', header=0)
labels_file = pd.read_csv('train_labels.csv', header=0)
no_pt_labels = labels_file.drop('patient_id', axis=1)

# One-hot encode the text column (thal); the numeric categoricals pass through unchanged
categorical_values = values_file[categorical_list]
categorical_values = pd.get_dummies(categorical_values)

# Treat numerical values more than 3 standard deviations from the mean as outliers
numerical_values = values_file[numerical_list]
train_std = numerical_values.std()
train_mean = numerical_values.mean()
cut_off = train_std * 3

lower, upper = train_mean - cut_off, train_mean + cut_off

# Out-of-range cells become NaN here, and dropna() below removes those rows
trimmed_numerical_values = numerical_values[(numerical_values < upper) & (numerical_values > lower)]
numerical_df = pd.concat([trimmed_numerical_values, no_pt_labels, categorical_values], axis=1)
numerical_df = numerical_df.dropna()
numerical_df[dummies_list] = numerical_df[dummies_list].astype('category')

numerical_labels = numerical_df['heart_disease_present']
numerical_data = numerical_df.drop('heart_disease_present', axis=1)

train_x, test_x, train_y, test_y = train_test_split(numerical_data, numerical_labels, test_size=0.15, random_state=7)

# Fit the scaler on the training split only, then apply it to the test split
minmax = MinMaxScaler()
fitted = minmax.fit_transform(train_x)

fit_test_x = minmax.transform(test_x)

predictors = fitted
target = np.array(train_y)
test_pred = fit_test_x
test_target = np.array(test_y)

list_of_scores = []
val_scores = []
acc_scores = []
early_stop = EarlyStopping(patience=3)  # defined but never passed to fit()
epochs = range(1, 100, 5)
n_col = train_x.shape[1]
for number in epochs:
    # Build a fresh model for each epoch count so layers do not pile up across iterations
    model = Sequential()
    model.add(Dense(15, activation='relu', input_shape=(n_col,)))
    model.add(Dense(256, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(predictors, target, epochs=number, verbose=False, validation_split=0.2, batch_size=18)
    scores = model.evaluate(test_pred, test_target)  # [loss, accuracy]
    print(scores)
    list_of_scores.append(scores[0])
    meow = model.history
    # Newer Keras versions name these keys 'val_accuracy' and 'accuracy'
    val_scores.append(meow.history['val_acc'][-1])
    acc_scores.append(meow.history['acc'][-1])

history = model.history

# list_of_scores holds the held-out loss for each epoch count, so label the plot accordingly
plt.plot(list(epochs), list_of_scores)
plt.title('held-out loss')
plt.ylabel('loss')
plt.xlabel('epochs')
plt.show()

print(model.evaluate(test_pred, test_target))

And here is the code I used for logistic regression (I tried other algorithms as well, but this produced my best result). It has the same initial setup as the neural network:

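from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# liblinear supports both the l1 and l2 penalties searched below
lr_step = [('scaler', StandardScaler()),
           ('clf', LogisticRegression(solver='liblinear'))]
logreg = Pipeline(lr_step)
logreg.fit(x_train, y_train)
logreg_pred = logreg.predict(x_test)

lr_metrics = {'clf__penalty': ['l1', 'l2'],
              'clf__class_weight': ['balanced', None],
              'clf__C': np.logspace(-5, 8, 15)}
lr_grid = GridSearchCV(logreg, lr_metrics, cv=5, scoring='accuracy')
lr_grid.fit(x_train, y_train)
# predict() returns 0/1 class labels; predict_proba() returns probabilities
lr_grid_pred = lr_grid.predict(x_test)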

I think my biggest issue is that I can find plenty of resources explaining how to build a simple model, but I haven’t found anything that explains how an accomplished data scientist thinks through a problem. I would like to come up with a solution that at least gets to an overall log loss of 0.5.
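Since the target here is log loss (lower is better), it may help to see how that number is computed. The sketch below is plain Python with invented labels and predictions; the `log_loss` helper and its clipping value `eps` are assumptions for illustration, not the competition's exact scoring code.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Invented labels and predictions, for illustration only.
y_true = [1, 0, 1, 0, 1]
probabilities = [0.8, 0.3, 0.6, 0.4, 0.4]  # the last one is on the wrong side of 0.5
hard_labels = [1, 0, 1, 0, 0]              # the same predictions rounded to 0/1

print(log_loss(y_true, probabilities))  # moderate: the single miss costs -log(0.4)
print(log_loss(y_true, hard_labels))    # large: the single miss costs about -log(1e-15)
print(log_loss(y_true, [0.5] * 5))      # always predicting 0.5 scores log(2)
```

A confident wrong answer is punished far more than a hesitant one, so a log-loss metric is meant to be fed probabilities. In scikit-learn, `predict` returns class labels while `predict_proba` returns probabilities, which is worth checking in the logistic regression code above.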

I think the resources you are using that explain how to build a simple model are still overcomplicating it, because when I look over your code it takes me a while to figure out what’s going on.

Try doing a logistic regression with sklearn:
(I can’t send you the entire code without giving you the entire answer, but this is a great way to start)

y = the train labels
x_train = all of the train values. Don’t worry about the non-numerical values; probably just cut them out. This needs to be a numpy array.
x_test = the test values, in the same format as x_train

from sklearn.linear_model import LogisticRegression

# train_df and test_df are the train/test value files already loaded as DataFrames;
# y is the heart_disease_present column from the train labels.
features = ['slope_of_peak_exercise_st_segment',
            'resting_blood_pressure',
            'num_major_vessels',
            'fasting_blood_sugar_gt_120_mg_per_dl',
            'serum_cholesterol_mg_per_dl',
            'oldpeak_eq_st_depression',
            'sex',
            'age',
            'max_heart_rate_achieved',
            'exercise_induced_angina',
            'resting_ekg_results']

# Use the same columns, in the same order, for train and test
x_train = train_df[features].values.astype(float)
x_test = test_df[features].values.astype(float)

model = LogisticRegression()
model.fit(x_train, y)
# Column 1 of predict_proba is the probability of class 1 (heart disease present)
prediction = model.predict_proba(x_test)[:, 1]
# The original snippet wrote to an undefined test_models frame; test_df is used here instead
test_df['heart_disease_present'] = prediction
submission = test_df[['patient_id', 'heart_disease_present']]
submission.to_csv('submission.csv', index=False)

The most basic form of this gave me 0.34. Also, don’t even include thal; I think if you just throw all those numbers into the logistic regression you will get below 0.5.

Ok yeah I did what you had and got 0.4, which is by far my lowest. I wonder if I’m messing up the data with too much prep or something. I appreciate the help!

I actually have another question. The prediction = model.predict_proba(x_test)[:,1] line, does it have the model predict only off of the resting blood pressure? When I did x_test[:] the log-loss was 1.7. I guess I’m just curious why you only picked the resting bp?

Oh, the reason you have [:,1] is that when you predict probabilities with an sklearn model, it returns two values per row, one for each class, and the value we want is the second one. Try removing the [:,1] and printing the prediction to see an example.
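To make the shape of that output concrete, here is a small sketch on an invented one-feature dataset (the data is hypothetical; `predict_proba` and `classes_` are standard scikit-learn attributes).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny invented dataset: one feature, label is 1 for the larger values.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)

print(model.classes_)     # column order of predict_proba: [0 1]
print(proba.shape)        # (6, 2): one row per sample, one column per class
print(proba.sum(axis=1))  # each row sums to 1
positive = proba[:, 1]    # probability of class 1 for every row
```

Note that [:, 1] indexes the prediction output, not x_test, so every feature column is still used by the model; the index only picks which class probability to keep.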

Hi,

I am new to data science and learning through this project.
I am not able to understand what 0.34 is.
Is that the score you achieved after submission, or the accuracy of the model?

Please help me to understand it.

Thanks,
Kapil Sharma