Need help, beginner

broke_student · July 8, 2019, 4:11pm

I’m a beginner and would appreciate some advice, since this is a learning set and there’s no prize involved. The best results I’ve been able to get are 0.8 with logistic regression and GridSearch, and 0.9 with neural networks. What should I be doing to get the score down to at least 0.5?

rabiddeafguy · July 8, 2019, 8:11pm

Are you using a label binarizer or one-hot encoding on the thal field?

apalladi · July 9, 2019, 5:59pm

Hi, the next step may be to normalize the data. You can do that by subtracting the mean and dividing by the standard deviation for each variable. Then try the logistic regression again.
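A minimal sketch of that normalization, using made-up numbers rather than the competition data. It standardizes by hand with training-set statistics and checks that sklearn's StandardScaler gives the same result:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric features (made-up values): two columns on very different scales
X_train = np.array([[120.0, 200.0], [140.0, 260.0], [130.0, 320.0]])
X_test = np.array([[125.0, 240.0]])

# Manual standardization: subtract the mean, divide by the standard deviation.
# The statistics come from the training data only and are reused for the test data.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_manual = (X_train - mean) / std
X_test_manual = (X_test - mean) / std

# The same operation with StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(np.allclose(X_train_manual, X_train_scaled))
```

Fitting the scaler on the training rows only, then calling transform on the test rows, keeps test information out of the preprocessing step.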

broke_student · July 12, 2019, 12:54pm

Yeah, I’ve used the pandas get_dummies function. Here’s my code for the Keras neural network:

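# Imports assumed by the code below (they were not part of the original paste)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

numerical_list = ['resting_blood_pressure', 'serum_cholesterol_mg_per_dl',
                  'oldpeak_eq_st_depression', 'age', 'max_heart_rate_achieved',
                  'slope_of_peak_exercise_st_segment',
                  'num_major_vessels', 'resting_ekg_results']

categorical_list = ['chest_pain_type', 'sex', 'exercise_induced_angina',
                    'fasting_blood_sugar_gt_120_mg_per_dl', 'thal']

dummies_list = ['chest_pain_type', 'sex', 'exercise_induced_angina',
                'fasting_blood_sugar_gt_120_mg_per_dl', 'thal_normal',
                'thal_reversible_defect', 'thal_fixed_defect']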

values_file = pd.read_csv('train_values.csv', header=0)
labels_file = pd.read_csv('train_labels.csv', header=0)
no_pt_labels = labels_file.drop('patient_id', axis=1)

categorical_values = values_file[categorical_list]
categorical_values = pd.get_dummies(categorical_values)

numerical_values = values_file[numerical_list]
train_std = numerical_values.std()
train_mean = numerical_values.mean()
cut_off = train_std * 3

lower, upper = train_mean - cut_off, train_mean + cut_off

trimmed_numerical_values = numerical_values[(numerical_values < upper) & (numerical_values > lower)]
numerical_df = pd.concat([trimmed_numerical_values,no_pt_labels,categorical_values],axis=1)
numerical_df = numerical_df.dropna()
numerical_df[dummies_list] = numerical_df[dummies_list].astype('category')

numerical_labels = numerical_df['heart_disease_present']
numerical_data = numerical_df.drop('heart_disease_present', axis=1)

train_x, test_x, train_y, test_y = train_test_split(numerical_data,numerical_labels,test_size=0.15,random_state=7)

minmax = MinMaxScaler()
fitted = minmax.fit_transform(train_x)

fit_test_x = minmax.transform(test_x)

predictors = fitted
target = np.array(train_y)
test_pred = fit_test_x
test_target = np.array(test_y)

list_of_scores = []
val_scores = []
acc_scores = []
early_stop = EarlyStopping(patience=3)  # defined here but never passed to fit()
epochs = range(1, 100, 5)
n_col = train_x.shape[1]
for number in epochs:
    # Build a fresh model for each epoch count; otherwise the layers
    # would keep stacking onto a single model on every pass
    model = Sequential()
    model.add(Dense(15, activation='relu', input_shape=(n_col,)))
    model.add(Dense(256, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(predictors, target, epochs=number, verbose=False, validation_split=0.2, batch_size=18)
    # evaluate() returns [loss, accuracy]; keep the held-out loss
    loss = model.evaluate(test_pred, test_target)[0]
    print(model.evaluate(test_pred, test_target))
    list_of_scores.append(loss)
    meow = model.history
    val_scores.append(meow.history['val_acc'][-1])
    acc_scores.append(meow.history['acc'][-1])
history = model.history

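# list_of_scores holds the held-out loss for each epoch setting, not accuracy
plt.plot(list(epochs), list_of_scores)
plt.title('model loss on held-out data')
plt.ylabel('loss')
plt.xlabel('epochs')
plt.show()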

print(model.evaluate(test_pred,test_target))

And here is the code I used for logistic regression (I did other algorithms as well but this produced my best result). It has the same initial set up as the neural network:

lr_step = [('scaler', StandardScaler()),
           # liblinear supports both the l1 and l2 penalties searched below
           ('clf', LogisticRegression(solver='liblinear'))]
logreg = Pipeline(lr_step)
logreg.fit(x_train, y_train)
logreg_pred = logreg.predict(x_test)

lr_metrics = {'clf__penalty': ['l1', 'l2'],
              'clf__class_weight': ['balanced', None],
              'clf__C': np.logspace(-5, 8, 15)}
lr_grid = GridSearchCV(logreg, lr_metrics, cv=5, scoring='accuracy')
lr_grid.fit(x_train, y_train)
lr_grid_pred = lr_grid.predict(x_test)

I think my biggest issue is that I can find plenty of resources explaining how to build a simple model, but I haven’t found anything that explains how an accomplished data scientist thinks through a problem. I would like to be able to come up with a solution that at least gets to an overall log-loss of 0.5.
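One thing that may be worth checking: the grid search scores candidates by accuracy and the predictions come from predict, while the target mentioned here is log loss, which is computed from probabilities. A minimal sketch on synthetic data (make_classification stands in for the real files, and the C grid is only an illustration) that tunes and evaluates on log loss instead:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the competition data (assumption: binary target)
X, y = make_classification(n_samples=180, n_features=13, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.15, random_state=7, stratify=y)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression())])
param_grid = {'clf__C': np.logspace(-3, 3, 7)}

# 'neg_log_loss' makes the search pick the C with the best (lowest) log loss
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_log_loss')
grid.fit(X_tr, y_tr)

# Evaluate with probabilities, not hard 0/1 predictions
proba = grid.predict_proba(X_te)[:, 1]
held_out_log_loss = log_loss(y_te, proba)
print(grid.best_params_, held_out_log_loss)
```

Hard 0/1 predictions are penalized very heavily by log loss whenever they are wrong, which may explain part of the gap between a decent accuracy and a poor score.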

rabiddeafguy · July 12, 2019, 1:24pm

I think the resources you are using to learn how to build a simple model are still overcomplicating it, because when I look over your code it takes me a while to figure out what’s going on.

Try doing a logistic regression with sklearn:
(I can’t send you the entire code without giving you the entire answer, but this is a great way to start)

y = the train labels
x_train = all of the train values as a NumPy array; don’t worry about the non-numerical values, you can probably just cut them out
x_test = the test values, in the same format as x_train

from sklearn.linear_model import LogisticRegression

# Numeric columns only; 'thal' (text) is left out
features = ['slope_of_peak_exercise_st_segment',
            'resting_blood_pressure',
            'num_major_vessels',
            'fasting_blood_sugar_gt_120_mg_per_dl',
            'serum_cholesterol_mg_per_dl',
            'oldpeak_eq_st_depression',
            'sex',
            'age',
            'max_heart_rate_achieved',
            'exercise_induced_angina',
            'resting_ekg_results']

# Same columns, same order, for both train and test
x_train = train_df[features].values.astype(float)
x_test = test_df[features].values.astype(float)

model = LogisticRegression()
model.fit(x_train, y)
# Probability of class 1 (heart disease present) for each test row
prediction = model.predict_proba(x_test)[:, 1]
# test_models is a DataFrame that holds the test patient_id column
test_models['heart_disease_present'] = prediction
submission = test_models[['patient_id', 'heart_disease_present']]
submission.to_csv('submission.csv', index=False)

The most basic form of this gave me .34. Also, don’t even include “thal”; I think if you just throw all those numbers into the logistic regression you will get below .5.
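To get a rough idea of the score before submitting, one option is a cross-validated log loss on the training data. This is only a sketch: the data here is synthetic (make_classification with 11 features as a stand-in for the columns above), so the printed number says nothing about the real competition.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for x_train and y (assumption: 11 numeric features)
X, y = make_classification(n_samples=180, n_features=11, random_state=0)

model = LogisticRegression(max_iter=1000)

# sklearn reports the negative log loss, so flip the sign to read it as a loss
scores = cross_val_score(model, X, y, cv=5, scoring='neg_log_loss')
cv_log_loss = -scores.mean()
print(f"estimated log loss: {cv_log_loss:.3f}")
```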

broke_student · July 13, 2019, 12:22am

Ok yeah I did what you had and got 0.4, which is by far my lowest. I wonder if I’m messing up the data with too much prep or something. I appreciate the help!

broke_student · July 13, 2019, 12:27am

I actually have another question. The prediction = model.predict_proba(x_test)[:,1] line, does it have the model predict only off of the resting blood pressure? When I did x_test[:] the log-loss was 1.7. I guess I’m just curious why you only picked the resting bp?

rabiddeafguy · July 15, 2019, 3:06am

Oh, the reason for the [:,1] is that when you predict probabilities with an sklearn model, it returns two values for each row, one per class, and the value we want is the second one (the probability of class 1). It does not select a feature column. Try removing the [:,1] and printing the prediction to see an example.
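A small sketch of what predict_proba returns, using synthetic data in place of the competition files:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (stand-in for the real data)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X[:3])
print(model.classes_)  # column order of proba: one column per class
print(proba)           # shape (3, 2); each row sums to 1

# [:, 1] keeps the second column: the probability of class 1 for every row
p_class1 = proba[:, 1]

# predict() returns hard labels instead of probabilities
hard_labels = model.predict(X[:3])
print(p_class1, hard_labels)
```

So [:, 1] indexes the columns of the probability output, not the columns of x_test; every feature is still used by the model.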

Kapil_Sharma · August 14, 2019, 10:06am

Hi,

I am new to data science and learning through this project.
I am not able to understand what .34 is.
Is that the score you achieved after submission or the accuracy of the model?

Please help me to understand it.

Thanks,
Kapil Sharma
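For context: the thread does not say so outright, but the scores discussed here (.34, 0.4, 0.5) appear to be log loss rather than accuracy, since broke_student refers to an "overall log-loss of 0.5". With log loss, lower is better. A small sketch with made-up labels and probabilities, showing how it is computed and how it differs from accuracy:

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up true labels and predicted probabilities of class 1
y_true = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4])

# Binary log loss by hand: average negative log-probability of the true class
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
from_sklearn = log_loss(y_true, p)

# Accuracy only looks at which side of 0.5 each prediction falls on
accuracy = np.mean((p >= 0.5) == y_true)

# One confidently wrong prediction raises the log loss sharply
p_bad = np.array([0.9, 0.2, 0.6, 0.99])
loss_bad = log_loss(y_true, p_bad)

print(manual, from_sklearn, accuracy, loss_bad)
```

Here every prediction lands on the correct side of 0.5, so accuracy is perfect, yet the log loss is still above zero because the probabilities are not fully confident.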
