There are duplicate rows in the test data

BKR · January 21, 2015, 4:29am

IDs, and in fact entire rows, are duplicated in the test data, e.g. 64, and there are even triplets, e.g. 466.

Why is this?

isms · January 21, 2015, 3:40pm

Hi @BKR, I’ll look into your question – in the meantime, the details of this (public) data set are here.

bull · January 23, 2015, 12:09am

Hi @BKR, I just checked in on this issue and the duplicate IDs are an artifact of the preprocessing–just predict those rows as you normally would.
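
For anyone who wants to confirm the duplicates before predicting, here is a minimal sketch using only the Python standard library. The rows, the `id` key, the feature names, and the `predict` function are made-up stand-ins, not the real test set or a real model. It only shows that duplicated IDs can be counted and that each row, duplicate or not, still gets its own prediction.

```python
from collections import Counter

def duplicated_ids(rows, id_key="id"):
    """Return {id: count} for every ID that appears more than once."""
    counts = Counter(row[id_key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}

# Toy stand-in for the test set; column names and values are hypothetical.
test_rows = [
    {"id": 64, "months_since_last": 2, "n_donations": 5},
    {"id": 64, "months_since_last": 2, "n_donations": 5},
    {"id": 466, "months_since_last": 4, "n_donations": 1},
    {"id": 466, "months_since_last": 4, "n_donations": 1},
    {"id": 466, "months_since_last": 4, "n_donations": 1},
    {"id": 7, "months_since_last": 11, "n_donations": 3},
]

dupes = duplicated_ids(test_rows)
print(dupes)  # {64: 2, 466: 3}

def predict(row):
    """Placeholder model: any function mapping a row to a probability."""
    return 0.5

# Predict every row as-is, so the output has one line per test row,
# duplicates included.
submission = [(row["id"], predict(row)) for row in test_rows]
print(len(submission) == len(test_rows))  # True
```

Dropping the duplicates before predicting would change the number of output rows, which is why the sketch keeps them.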

Quick note on Blood Donations–the reason this competition is a ‘Warm Up’ is that it’s a small, approachable, and open data set. It lets beginners try out a basic method or two and learn how to use the site (joining a competition, uploading a submission, that kind of stuff). There’s a limit to the predictive power that can be squeezed out of it, so to some extent the relative difference between the best scores is due to chance rather than better predictors.

For people looking for a more challenging competition, the United Nations Millennium Development Goals should be an engaging problem to work on. And of course, we are working hard on the next batch of for-prize competitions!

BKR · January 23, 2015, 3:11am

Thanks @bull, I did as much with the duplicates; I was just curious whether you knew they existed. With regard to your note, I assumed as much. I use this competition to train the people in my team and upskill myself in “new” methods in R. I have done a few Kaggle competitions before, mainly with RapidMiner.

On another note, do you provide an option where we can host our own private internal “play/practise” competitions? I would love to use this as a training/awareness platform for my team.

Cheers
BKR

bull · January 24, 2015, 10:08pm

Hi @BKR,

Unfortunately, a mechanism for people to host/create their own private competitions is not on our todo list for the near future. We’re focused on building competitions for organizations that will benefit from machine learning. So, you can expect more competitions in the very near future, but they’ll all be public.

One option is to have your teammates sign up for a DrivenData competition. If you create a team, you’ll be able to see everyone’s submissions on the “Submissions” page. You could use this view as a sort of private leaderboard just for your team!

If that doesn’t work, you could look at Kaggle InClass. If you have any academic affiliation, you can use their platform for private competitions.

Peter

kbsvm · March 16, 2018, 1:04pm

Hi, there are duplicate rows in the training data as well, with different results (last column value).
I am not using the ID column for training. Since the same parameter values map to different results, the learning is not improving. I posted a separate topic on this.
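
One way to see how many training rows are affected is to group rows by their feature values (ignoring the ID) and list the groups whose labels disagree. The sketch below is a rough illustration with made-up rows and hypothetical column names (`id`, `months`, `n`, `label`), not the real training data. It also computes the empirical positive rate per conflicting group, which is about the best a model can do for identical inputs if it outputs probabilities.

```python
from collections import defaultdict

def conflicting_groups(rows, label_key, ignore=("id",)):
    """Map each feature combination to its labels, keeping only
    combinations that occur with more than one distinct label."""
    groups = defaultdict(list)
    for row in rows:
        # Build a hashable key from every column except the label and ID.
        key = tuple(sorted(
            (k, v) for k, v in row.items()
            if k != label_key and k not in ignore
        ))
        groups[key].append(row[label_key])
    return {k: labels for k, labels in groups.items() if len(set(labels)) > 1}

# Toy stand-in for the training set; names and values are hypothetical.
train_rows = [
    {"id": 1, "months": 2, "n": 5, "label": 1},
    {"id": 2, "months": 2, "n": 5, "label": 0},
    {"id": 3, "months": 2, "n": 5, "label": 1},
    {"id": 4, "months": 9, "n": 1, "label": 0},
    {"id": 5, "months": 9, "n": 1, "label": 0},
]

conflicts = conflicting_groups(train_rows, label_key="label")
for features, labels in conflicts.items():
    rate = sum(labels) / len(labels)
    print(dict(features), labels, round(rate, 3))
```

Identical inputs with different outcomes are label noise that no model can separate, so training accuracy on those rows has a ceiling regardless of the method used.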
