Issue with data

Sachin · August 19, 2020, 11:25am

Hi, I see a problem with the data: in some rows, values that belong in the 3rd column are printed on the following rows.
Train data: row numbers 3066, 4495, 12441, 12446, 13072, 16589,…
Test data: row numbers 252, 1704, 1722, 2736, 3299, 3392,…

This may create problems when handling, manipulating, or modeling the data.

dylan89 · August 20, 2020, 7:45pm

I noticed the same thing too.

I also noticed the row below each “problematic row” is one column short.

So what I did was copy the data from Column B through Column AN of the “below row” into Column C through Column AO of the “above row”.

This also aligns the “Submission Format” sequence_id Column to the “Test_Values” sequence_id Column.

I think that’s the trick! If not, let me know…

cszc · August 20, 2020, 9:32pm

Thanks for flagging @Sachin and @dylan89! Can you tell us where you’re seeing this issue? Is this in pandas, a csv reader of some kind, or something else?

dylan89 · August 20, 2020, 10:07pm

Sure thing! I’m using Excel to read your CSV, and here’s an example of what the errors look like (from train_values.csv)…

[Image: CSV Error]

Hope that helps…

Sachin · August 21, 2020, 6:52am

I did the same thing, but I’m not sure about the sequence. Should it be from the same row or the row below that?

Sachin · August 21, 2020, 6:56am

Yes, I found this while going through the CSV file. I found at least 10 such cases in the train data, and the same in the test data. This will create problems when we try to summarize or model the data, because in these cases the sequence spilling over from the row above will be used as the sequence_id, there will be missing data, etc.

Anthrop · August 21, 2020, 7:46pm

Hi guys.
This could be an Excel-related bug. I use pandas and don’t see any of this.

When I check the unique values of the binary features, they are always (0, 1), too.
Also, when talking about a specific row, I’d suggest using sequence_id instead of row numbers, since it’s unique.
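For anyone who wants to run this kind of check themselves, here is a minimal sketch using only Python’s standard csv module, so it does not depend on Excel at all. It assumes the text column is named `sequence` and that every other column besides `sequence_id` is a binary feature; the function name and the tiny in-memory sample (with made-up `feat_a`/`feat_b` columns) are hypothetical and only there for illustration.

```python
import csv
import io

# Sequences can be far longer than the csv module's default field limit,
# so raise it before reading.
csv.field_size_limit(2**31 - 1)


def is_binary(value):
    """True if the cell parses as 0 or 1 (missing or misaligned cells return False)."""
    try:
        return float(value) in (0.0, 1.0)
    except (TypeError, ValueError):
        return False


def check_binary_columns(handle, id_col="sequence_id", text_col="sequence"):
    """Map each sequence_id to the feature columns whose value is not 0/1."""
    problems = {}
    for row in csv.DictReader(handle):
        offenders = [
            name
            for name, value in row.items()
            if name not in (id_col, text_col) and not is_binary(value)
        ]
        if offenders:
            problems[row[id_col]] = offenders
    return problems


# Hypothetical sample: the last row is one column short, like the rows
# described in this thread.
sample = io.StringIO(
    "sequence_id,sequence,feat_a,feat_b\n"
    "AAA11,ACGT,0,1\n"
    "BBB22,GGCA,1,0\n"
    "CCC33,TTAG,1\n"
)
problems = check_binary_columns(sample)
print(problems)
```

To check the real file, pass `open("train_values.csv", newline="")` instead of the sample. An empty result means every feature cell is 0 or 1, and any problems are reported by sequence_id rather than by row number.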

mike.icaza · August 21, 2020, 9:23pm

This is probably an Excel bug. If I had to guess, I would say that Excel is hitting the maximum number of characters in a cell and, for some reason, is truncating or splitting the line.

Using pandas, I did not have this issue. However, if you opened the file in Excel and then saved it, you likely corrupted the file.

If you’re going to be using Excel, I would recommend saving that column specifically into a separate text file.
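One way to do that split, sketched with Python’s standard csv module: write the long column to a tab-separated text file keyed by sequence_id, and write everything else to a smaller CSV that should be safe to open in Excel. The column name `sequence`, the function name, and the in-memory sample are assumptions for illustration, not something taken from the competition files.

```python
import csv
import io

# Allow very long fields when reading the original file.
csv.field_size_limit(2**31 - 1)


def split_sequence_column(src, features_out, sequences_out,
                          id_col="sequence_id", text_col="sequence"):
    """Write 'id<TAB>sequence' lines to one file and the remaining columns to another."""
    reader = csv.DictReader(src)
    kept = [name for name in reader.fieldnames if name != text_col]
    writer = csv.DictWriter(features_out, fieldnames=kept, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        sequences_out.write(f"{row[id_col]}\t{row[text_col]}\n")
        writer.writerow({name: row[name] for name in kept})


# Hypothetical two-row sample standing in for train_values.csv.
src = io.StringIO(
    "sequence_id,sequence,feat_a\n"
    "AAA11,ACGT,0\n"
    "BBB22,GGCA,1\n"
)
features_out = io.StringIO()
sequences_out = io.StringIO()
split_sequence_column(src, features_out, sequences_out)
print(features_out.getvalue())
print(sequences_out.getvalue())
```

With real files, open the source with `open(path, newline="")` and the two outputs with `open(path, "w", newline="")`. The sequence_id in both outputs lets you join them back together later.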

cszc · August 25, 2020, 6:28pm

@Anthrop and @mike.icaza are right - it appears to be an Excel issue. The maximum character limit per cell in Excel is 32,767 characters. There are definitely sequences longer than that in the dataset.

As others have mentioned, this limit does not exist in pandas. It also does not exist in LibreOffice Calc, but it seems like you may have already found other ways around it. I hope that helps @dylan89 and @Sachin ! Thanks again for flagging.
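To see which rows are affected by the 32,767-character limit, something like the following sketch could be used. It relies only on Python’s standard csv module and assumes the text column is called `sequence`; the function name and the sample data are hypothetical.

```python
import csv
import io

# The csv module has its own default field limit, which long sequences
# can exceed, so raise it first.
csv.field_size_limit(2**31 - 1)

EXCEL_CELL_LIMIT = 32767  # maximum characters Excel keeps in one cell


def ids_over_excel_limit(handle, id_col="sequence_id", text_col="sequence"):
    """Return the sequence_ids whose sequence is too long for one Excel cell."""
    return [
        row[id_col]
        for row in csv.DictReader(handle)
        if len(row[text_col]) > EXCEL_CELL_LIMIT
    ]


# Hypothetical sample: one short sequence and one 40,000-character sequence.
sample = io.StringIO(
    "sequence_id,sequence,feat_a\n"
    "SHORT,ACGT,0\n"
    "LONG1," + "A" * 40000 + ",1\n"
)
too_long = ids_over_excel_limit(sample)
print(too_long)
```

Running it on the real file with `open("train_values.csv", newline="")` gives the sequence_ids that Excel cannot display intact.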
