Hello. This is my first time entering a competition of this type, so please forgive my questions if they seem “obvious” to those of you who have done this before. I am trying to understand exactly how this competition works. Is this correct? I take the training data and create an equation for the response variable using the explanatory variables given (or some combination of them), choosing the equation that produces the best score on the training data when its predictions are entered into the log loss formula. I then apply the equation to the test data and submit my results so that they can be scored against the actual outcomes. Do I just send a list of the id numbers of the people who I predict will donate again? Also, the submission example assigns the prediction value of .5 to every id. Is that what I need to do, or do I calculate a prediction value for each of the id numbers I submit? Again, this is my first competition and I have never used the log loss formula before. I am planning on using SAS for my analysis, if that makes any difference to your responses. Thank you for your patience and responses.
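For readers who, like the poster, have not used log loss before, here is a minimal Python sketch of the usual binary log loss. It assumes the competition uses the standard definition (the competition page is the authoritative source), and the function name `log_loss`, the clipping constant `eps`, and the toy labels are illustrative assumptions rather than anything taken from the thread.

```python
import math

def log_loss(actuals, predictions, eps=1e-15):
    """Standard binary log loss (lower is better).

    actuals:     iterable of 0/1 outcomes
    predictions: iterable of predicted probabilities of outcome 1
    eps:         assumed clipping constant so log(0) is never evaluated
    """
    total = 0.0
    count = 0
    for y, p in zip(actuals, predictions):
        # Clip the probability away from exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
        count += 1
    return -total / count

# Toy outcomes, made up for illustration only.
toy_actuals = [1, 0, 0, 1]

# Predicting 0.5 for everyone, as in the example submission.
baseline = log_loss(toy_actuals, [0.5] * len(toy_actuals))

# Predictions that lean toward the true outcome score lower (better).
informed = log_loss(toy_actuals, [0.8, 0.3, 0.1, 0.6])

print(f"all 0.5: {baseline:.4f}")
print(f"informed: {informed:.4f}")
```

Predicting 0.5 for every id always scores ln(2), roughly 0.693, so that value is a natural baseline to beat.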

Hi @dkderden,

Yep, you’ve got the gist of the process right!

You want to submit a list of predictions in a format that exactly matches the Submission format. It should be a CSV file that:

- Has the exact same column headers
- Has the same IDs in the same order
- Has predicted values between 0 and 1

Basically, you want to change the prediction of `0.5` in the submission format file to whatever value between 0 (very unlikely to donate blood) and 1 (very likely to donate blood) your algorithm predicts for each individual.
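As a rough sketch of that step, the Python below copies a submission format file and swaps each placeholder for a model prediction, keeping the headers and the ID order untouched. The file names, the `id,prediction` header, the toy IDs, and the toy probabilities are all hypothetical stand-ins; the sketch also assumes the ID is the first column and the prediction is the second.

```python
import csv
import tempfile
from pathlib import Path

def fill_submission(format_path, output_path, predict):
    """Copy the format file, replacing each placeholder with predict(row_id)."""
    with open(format_path, newline="") as src, \
         open(output_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        # Reuse the header row exactly as given in the format file.
        writer.writerow(next(reader))
        for row in reader:
            row_id = row[0]  # assumption: ID is the first column
            probability = predict(row_id)
            if not 0.0 <= probability <= 1.0:
                raise ValueError(f"prediction for {row_id} is outside [0, 1]")
            # Rows are written in the order they were read, so IDs stay in order.
            writer.writerow([row_id, probability])

# Demo with a made-up format file (hypothetical header and IDs).
workdir = Path(tempfile.mkdtemp())
format_file = workdir / "submission_format.csv"
format_file.write_text("id,prediction\n659,0.5\n276,0.5\n263,0.5\n")

# Stand-in for a real model: a lookup table of made-up probabilities.
toy_predictions = {"659": 0.12, "276": 0.8, "263": 0.35}

output_file = workdir / "my_submission.csv"
fill_submission(format_file, output_file, toy_predictions.__getitem__)
print(output_file.read_text())
```

Reading the header from the format file, rather than typing it by hand, avoids mismatched column names.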

Hope that helps!

I have always had issues with the log function, so I can’t imagine what possessed me to do this competition first. I am still a little confused. I have several models using the individual variables and also combinations of them. When I use my model to predict whether the person will donate or not, I end up getting mostly negative numbers. For example, I used Months Since Last Donation as the only predictor and got the following model: y-hat = (-0.2773) - (0.113*MLD). All of my y-hats were negative, and one was even (-4.7973). I get the distinct feeling that there is something missing in my understanding of this process, since I do know that probabilities need to range from 0 to 1!

Hey @dkderden,

If you’re still looking at this, it looks like you fit a linear regression model, which predicts an unbounded real value (hence the negative predictions). To predict a probability between 0 and 1, you’ll want to fit something like a logistic regression model. Because of the functional form of this model, all the predictions will be bounded between 0 and 1.
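To illustrate what "functional form" means here, this small Python sketch passes a linear predictor through the logistic (sigmoid) function. It reuses the intercept and slope quoted earlier in the thread purely as illustrative numbers; a real logistic regression fit would estimate its own coefficients, so these outputs are not meant as valid predictions. The function names are assumptions made for this example.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    # Branch on the sign of z so math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Illustrative coefficients only, borrowed from the linear model in the thread.
INTERCEPT = -0.2773
SLOPE = -0.113

def linear_predictor(months_since_last_donation):
    """The unbounded linear part: intercept + slope * MLD."""
    return INTERCEPT + SLOPE * months_since_last_donation

def predict_probability(months_since_last_donation):
    """Squash the linear predictor into a probability."""
    return sigmoid(linear_predictor(months_since_last_donation))

for months in (0, 2, 12, 40):
    z = linear_predictor(months)
    p = predict_probability(months)
    print(f"MLD={months:>2}  linear={z:+.4f}  probability={p:.4f}")
```

However negative the linear part gets, the output stays strictly between 0 and 1, which is what the log loss metric needs.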

Hope that helps!

@bull Should I round the predicted values between 0 and 1 to either 0 or 1 based on my own logic, or should I just submit the predicted values as they are? If a value is 0.44, will it be considered 0 or 1?