There are duplicate rows in the test data

IDs, and in fact entire rows, are duplicated in the test data, e.g. 64, and there are even triplets, e.g. 466.

Why is this?

Hi @BKR, I’ll look into your question – in the meantime, the details of this (public) data set are here.

Hi @BKR, I just checked in on this issue and the duplicate IDs are an artifact of the preprocessing–just predict those rows as you normally would.
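For anyone who wants to check which IDs repeat before predicting, here is a minimal sketch using only the Python standard library. The `id` key and the toy rows are made up for illustration and are not the competition data; the helper name `duplicated_ids` is likewise hypothetical.

```python
from collections import Counter


def duplicated_ids(rows, id_key="id"):
    """Return {id: count} for every id that appears more than once."""
    counts = Counter(row[id_key] for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up toy rows (NOT the competition data); "id" and "x" are assumed names.
toy_test = [
    {"id": 1, "x": 2},
    {"id": 2, "x": 5},
    {"id": 1, "x": 2},
    {"id": 3, "x": 7},
    {"id": 1, "x": 2},
]

dupes = duplicated_ids(toy_test)
print(dupes)
```

Since the advice above is to predict those rows as usual, a check like this is only diagnostic: the submission should still contain one prediction per row of the test file, repeats included.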

Quick note on Blood Donations: the reason this competition is a ‘Warm Up’ is that it’s a small, approachable, and open data set. It lets beginners try out a basic method or two and learn how to use the site (joining a competition, uploading a submission, that kind of stuff). There’s a limit to the predictive power that can be squeezed out of it, so to some extent the differences among the best scores are due to chance rather than better predictors.

For people looking for a more challenging competition, the United Nations Millennium Development Goals should be an engaging problem to work on. And of course, we are working hard on the next batch of for-prize competitions!

Thanks @bull, I did as much with the duplicates; I was just curious whether you knew they existed. With regard to your note, I assumed as much. I use this competition to train the people in my team and upskill myself in “new” methods in R. I have done a few Kaggle competitions before, mainly with RapidMiner.

On another note, do you provide an option where we can host our own private internal “play/practise” competitions? I would love to use this as a training/awareness platform for my team.

Cheers
BKR

Hi @BKR,

Unfortunately, a mechanism for people to host/create their own private competitions is not on our todo list for the near future. We’re focused on building competitions for organizations that will benefit from machine learning. So, you can expect more competitions in the very near future, but they’ll all be public.

One option is to have your teammates sign up for a DrivenData competition. If you create a team, you’ll be able to see everyone’s submissions on the “Submissions” page. You could use this view as a sort of private leaderboard just for your team!

If that doesn’t work, you could look at Kaggle InClass. If you have any academic affiliation, you can use their platform for private competitions.

Peter

Hi, there are duplicate rows in the training data as well, with different results (last column value).
I am not using the ID column for training. Since the same parameter values come with different results, the learning is not improving. I posted a separate topic on this.
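One way to see how much of the training data is affected is to group rows by their feature values (ignoring the ID) and list the groups whose labels disagree. The sketch below uses only the Python standard library; the column names `id`, `a`, `b`, `label`, the helper `conflicting_groups`, and the toy rows are all hypothetical and are not taken from the competition files.

```python
from collections import defaultdict


def conflicting_groups(rows, label_key, ignore=("id",)):
    """Group rows by feature values and keep groups whose labels disagree.

    Returns {feature_tuple: (n_rows, mean_label)}.
    """
    groups = defaultdict(list)
    for row in rows:
        # Build a hashable key from every column except the label and ignored ones.
        features = tuple(sorted(
            (key, value) for key, value in row.items()
            if key != label_key and key not in ignore
        ))
        groups[features].append(row[label_key])
    return {
        feats: (len(labels), sum(labels) / len(labels))
        for feats, labels in groups.items()
        if len(set(labels)) > 1
    }


# Made-up toy rows (NOT the competition data); all column names are assumed.
toy_train = [
    {"id": 10, "a": 2, "b": 50, "label": 1},
    {"id": 11, "a": 2, "b": 50, "label": 0},
    {"id": 12, "a": 2, "b": 50, "label": 0},
    {"id": 13, "a": 4, "b": 20, "label": 1},
    {"id": 14, "a": 4, "b": 20, "label": 1},
]

conflicts = conflicting_groups(toy_train, label_key="label")
for feats, (n, mean_label) in conflicts.items():
    print(dict(feats), "rows:", n, "share of positives:", round(mean_label, 3))
```

Rows with identical features and different labels cannot all be classified correctly by any model, so some error on them is expected. If the metric rewards predicted probabilities (log loss, for example), the share of positives within such a group is the prediction that minimizes the loss on that group.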