Hi, I'm seeing a problem with the data: in some rows, the value from the 3rd column appears to spill over into the following rows.
Train data: Row number 3066, 4495, 12441, 12446, 13072, 16589,…
Test data: Row number 252, 1704, 1722, 2736, 3299, 3392,…
This may cause problems when handling, manipulating, or modeling the data.
Thanks for flagging @Sachin and @dylan89! Can you tell us where you’re seeing this issue? Is this in pandas, a csv reader of some kind, or something else?
Yes, I found this while going through the CSV file. I found at least 10 such cases in the train data, and the same in the test data. When we try to summarize or model the data, this will cause problems, because in the cases above the spilled-over sequence will be treated as the sequence_id, other fields will show up as missing data, etc.
Hi guys.
This could possibly be an Excel-related bug. I use pandas and don’t see any of this. E.g.:
When I check the unique values of the binary features, they are always (0, 1), too.
Also, when talking about a specific row, I’d suggest using sequence_id instead of row numbers, since it’s unique.
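For anyone who wants to run this kind of check without a spreadsheet, here is a minimal sketch using only Python's standard csv module (pandas' read_csv would do the same job). The sample data and the column names other than sequence_id are made up for illustration, and check_rows is a hypothetical helper, not something from the competition materials.

```python
import csv
import io

# Hypothetical miniature of the competition CSV. Only sequence_id is a
# column name taken from this thread; the others are assumptions.
SAMPLE = (
    "sequence_id,sequence,feature_a,feature_b\n"
    "AAA01,ACGT,0,1\n"
    "BBB02," + "ACGT" * 10000 + ",1,0\n"
)


def check_rows(handle, binary_columns):
    """Return the sequence_ids of rows that look broken.

    A row counts as broken if it has too few or too many fields, or if
    any of the given binary columns holds something other than 0 or 1.
    """
    # Raise the csv module's per-field size cap so long sequences load.
    csv.field_size_limit(10**7)
    bad_ids = []
    for row in csv.DictReader(handle):
        # DictReader stores extra fields under the key None and fills
        # missing fields with the value None.
        malformed = None in row or any(v is None for v in row.values())
        non_binary = any(row.get(col) not in ("0", "1") for col in binary_columns)
        if malformed or non_binary:
            bad_ids.append(row["sequence_id"])
    return bad_ids


bad = check_rows(io.StringIO(SAMPLE), ["feature_a", "feature_b"])
print(bad)  # empty list when every row is well formed
```

Reporting the sequence_id rather than a row number keeps the result meaningful no matter which tool opened the file.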
This is probably an Excel bug. If I had to guess, I would say that Excel is hitting the maximum number of characters in a cell and, for some reason, is truncating the line or something similar.
Using pandas, I did not have this issue. However, if you opened the file in Excel and then saved it, you likely corrupted the file.
If you’re going to be using Excel, I would recommend saving that column into a text file specifically.
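One way to do that export, as a rough sketch with the standard csv module: write each sequence_id and its sequence to a tab-separated text file, so the long column never has to pass through a spreadsheet cell. The column name "sequence", the sample rows, the output file name, and the export_column helper are all assumptions for illustration.

```python
import csv
import io
import os
import tempfile

# Hypothetical sample; column names other than sequence_id are assumed.
SAMPLE = "sequence_id,sequence,feature_a\nAAA01,ACGT,0\nBBB02,GGCC,1\n"


def export_column(handle, out_path, column="sequence"):
    """Write 'sequence_id<TAB>value' lines to out_path; return the row count."""
    # Raise the csv module's per-field size cap so long sequences load.
    csv.field_size_limit(10**7)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for row in csv.DictReader(handle):
            out.write(f"{row['sequence_id']}\t{row[column]}\n")
            count += 1
    return count


# Write into a fresh temporary directory so nothing existing is overwritten.
out_path = os.path.join(tempfile.mkdtemp(), "sequences.txt")
written = export_column(io.StringIO(SAMPLE), out_path)
print(written, "rows written to", out_path)
```

The remaining columns are short enough to open in Excel, and the text file can be joined back on sequence_id later.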
@Anthrop and @mike.icaza are right: it appears to be an Excel issue. Excel limits each cell to 32,767 characters, and there are definitely sequences longer than that in the dataset.
As others have mentioned, this limit does not exist in pandas. It also does not exist in LibreOffice Calc, but it seems like you may have already found other ways around it. I hope that helps, @dylan89 and @Sachin! Thanks again for flagging.
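To see which rows are affected by the 32,767-character limit mentioned above, a small sketch like this lists the sequence_ids whose sequence would not fit in one Excel cell. The column name "sequence", the sample data, and the ids_over_limit helper are assumptions for illustration only.

```python
import csv
import io

EXCEL_CELL_LIMIT = 32767  # per-cell character limit quoted in this thread

# Hypothetical sample: one short sequence and one of 40,000 characters.
SAMPLE = (
    "sequence_id,sequence\n"
    "AAA01,ACGT\n"
    "BBB02," + "ACGT" * 10000 + "\n"
)


def ids_over_limit(handle, column="sequence", limit=EXCEL_CELL_LIMIT):
    """Return sequence_ids whose value in `column` is longer than `limit`."""
    # Raise the csv module's per-field size cap so long sequences load.
    csv.field_size_limit(10**7)
    reader = csv.DictReader(handle)
    return [row["sequence_id"] for row in reader if len(row[column]) > limit]


too_long = ids_over_limit(io.StringIO(SAMPLE))
print(too_long)  # ['BBB02']
```

If the ids it reports for the real files line up with the rows flagged at the top of the thread, that would support the Excel explanation.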