Issue with data

Hi, I see a problem in the data: in some rows, the observation from the 3rd column spills over into the following rows.
Train data: Row number 3066, 4495, 12441, 12446, 13072, 16589,…
Test data: Row number 252, 1704, 1722, 2736, 3299, 3392,…

This may cause problems when handling, manipulating, or modeling the data.


I noticed the same thing too.

I also noticed the row below each “problematic row” is one column short.

So what I did was copy the data in the “below row” from Column B through Column AN into the “above row” at Column C through Column AO.

This also aligns the “Submission Format” sequence_id Column to the “Test_Values” sequence_id Column.

I think that’s the trick! If not, let me know…


Thanks for flagging @Sachin and @dylan89! Can you tell us where you’re seeing this issue? Is this in pandas, a csv reader of some kind, or something else?

Sure thing! I’m using Excel to read your CSV, and here’s an example of what the errors look like (from train_values.csv)…

[Image: CSV Error]

Hope that helps…

I did the same thing, but I’m not sure about the sequence. Should it be from the same row or the row below that?

Yes, I found this while going through the CSV file. I found at least 10 such cases in the train data, and the same in the test data. When we try to summarize or model the data, this will cause problems, because in these cases the sequence from the row above will be read as the sequence_id, the remaining columns will show up as missing data, and so on.

Hi guys.
This could be an Excel-related bug. I use pandas and don’t see any of this. E.g.:
[image]
When I check the unique values of the binary features, they are always (0, 1), too.
Also, when talking about a specific row, I’d suggest using sequence_id instead of row numbers, since it’s unique.
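
A minimal sketch of that kind of check, assuming a tiny in-memory stand-in for train_values.csv. Only sequence_id comes from the thread; the `sequence`, `feature_a`, and `feature_b` column names and all the values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for train_values.csv; only sequence_id is a real column name.
csv_text = """sequence_id,sequence,feature_a,feature_b
AAA01,ACGTAC,0,1
BBB02,GGTACC,1,0
CCC03,TTGACA,0,0
"""

# Index by sequence_id so rows are addressed by their unique id, not by row number.
df = pd.read_csv(io.StringIO(csv_text), index_col="sequence_id")

# Assumption: every column except the raw sequence is a binary feature.
binary_cols = [c for c in df.columns if c != "sequence"]

# Collect any column that contains a value other than 0 or 1.
non_binary = [c for c in binary_cols if not set(df[c].unique()) <= {0, 1}]

print("Columns with non-binary values:", non_binary)
print("Sequence for BBB02:", df.loc["BBB02", "sequence"])
```

If the file had been shifted the way it looks in Excel, the binary columns would contain stray text or missing values and would show up in `non_binary`.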


This is probably an Excel bug. If I had to guess, I would say that Excel is hitting the maximum number of characters in a cell and, for some reason, is truncating the line or something similar.

Using pandas, I did not have this issue. However, if you opened the file in Excel and then saved it, you likely corrupted the file.

If you’re going to be using Excel, I would recommend saving the sequence column into a separate text file.
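
One possible way to do that split with pandas, as a rough sketch. The output file names, the `sequence` and `feature_a` columns, and the sample rows are all hypothetical; only sequence_id is taken from the thread.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the real CSV.
csv_text = "sequence_id,sequence,feature_a\nAAA01,ACGTAC,0\nBBB02,GGTACC,1\n"
df = pd.read_csv(io.StringIO(csv_text))

# Write into a temporary directory so the sketch does not touch real files.
out_dir = Path(tempfile.mkdtemp())
seq_path = out_dir / "sequences.txt"                   # hypothetical file name
rest_path = out_dir / "features_without_sequence.csv"  # hypothetical file name

# One "sequence_id<TAB>sequence" line per row, kept out of Excel entirely.
df[["sequence_id", "sequence"]].to_csv(seq_path, sep="\t", index=False, header=False)

# Everything else stays in a CSV whose cells are short enough for Excel.
df.drop(columns="sequence").to_csv(rest_path, index=False)

print(seq_path.read_text())
```

The two files can be joined back together later on sequence_id.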


@Anthrop and @mike.icaza are right - it appears to be an Excel issue. The maximum character limit per cell in Excel is 32,767 characters. There are definitely sequences longer than that in the dataset.
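
For anyone who wants to see which rows are affected, here is a small sketch that flags sequences over the Excel cell limit. It runs on made-up in-memory data; the `sequence` column name is an assumption, and with the real file you would point `pd.read_csv` at it instead.

```python
import io

import pandas as pd

EXCEL_CELL_LIMIT = 32_767  # maximum number of characters Excel keeps in one cell

# Hypothetical data: one short sequence and one 40,000-character sequence.
long_seq = "ACGT" * 10_000
csv_text = f"sequence_id,sequence\nAAA01,ACGTAC\nBBB02,{long_seq}\n"

df = pd.read_csv(io.StringIO(csv_text))

# Length of each sequence in characters.
lengths = df["sequence"].str.len()

# sequence_id values whose sequence would not fit in a single Excel cell.
too_long = df.loc[lengths > EXCEL_CELL_LIMIT, "sequence_id"].tolist()

print("Longest sequence:", lengths.max())
print("Over the Excel limit:", too_long)
```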

As others have mentioned, this limit does not exist in pandas. It also does not exist in LibreOffice Calc, but it seems like you may have already found other ways around it. I hope that helps, @dylan89 and @Sachin! Thanks again for flagging.
