
Just to make sure the data is correct

Hi!
Don’t want to be annoying or anything, just making sure the data is correct.
I spotted some places where a single hour has about 10x the consumption of every other hour of the same day in the same series.
e.g.:
RowID 108926
All hours except 2 PM on 18/8/2016 in series 103634 have a mean consumption of about 45K.
2 PM has a VERY high consumption of 440K,
which is ~10x more than all the other hours.
This behaviour does not show up on any other day in this series.

Sometimes competitions have a data issue that is found too late, and a lot of competitors' hours are wasted.
Could you please verify and let us know it's OK and not some kind of glitch in the matrix?

Thanks!


For everybody, let’s plot that:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

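# Load the training data and parse the timestamp column as datetimes
train = pd.read_csv('consumption_train.csv', parse_dates=['timestamp'])
# Line plot of consumption over time for the series in question
sns.relplot(x="timestamp", y="consumption", data=train[train.series_id == 103634], kind='line', aspect=2.5)
plt.legend(['id = 103_634'])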

plt.show()

You are talking about series_id = 103634 in the train file. Are we seeing something weird?
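To check this numerically rather than by eye, here is a rough sketch that flags hours far above the median of their own day. It assumes the `series_id`, `timestamp` and `consumption` columns used in the plot above; the function name `flag_hourly_spikes` and the default `ratio` of 5 are just my own choices for illustration, not anything from the competition.

```python
import pandas as pd


def flag_hourly_spikes(df, series_id, ratio=5.0):
    """Return rows of one series whose consumption exceeds `ratio` times
    the median consumption of the same calendar day.

    Hypothetical helper: the name and the default threshold are assumptions.
    Expects `timestamp` to already be a datetime column.
    """
    s = df.loc[df["series_id"] == series_id, ["timestamp", "consumption"]].copy()
    # Calendar day of each reading (time of day set to midnight)
    s["day"] = s["timestamp"].dt.normalize()
    # Median of each day, broadcast back onto every row of that day
    daily_median = s.groupby("day")["consumption"].transform("median")
    # Note: a daily median of 0 gives an infinite ratio for non-zero readings
    s["ratio_to_daily_median"] = s["consumption"] / daily_median
    return s[s["ratio_to_daily_median"] > ratio]
```

With `train` loaded as in the snippet above, `flag_hourly_spikes(train, 103634)` should list the hours that stand out against the rest of their day. If only the 2 PM reading on 18/8/2016 comes back, it is an isolated spike; if many days come back, it is more likely a regular pattern.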

I thought the same thing, but the scatter plot below should clear this up.

It’s obvious from the scatter plot that there is a daily pattern. Little power is consumed from 7 PM to 7 AM, but consumption jumps to huge values (over 1 million Wh) during working hours, that is, from 8 AM to 6 PM.

So I would say that this sudden jump in consumption is not a glitch, as values as high as 440K Wh seem to be fairly common in this distribution.
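If anyone wants to check the "fairly common" part without a plot, a small sketch like the one below summarises consumption by hour of day for one series. It assumes the same `series_id`, `timestamp` and `consumption` columns as the plotting code earlier in the thread; `hourly_profile` and its `threshold` parameter are hypothetical names I made up for this example.

```python
import pandas as pd


def hourly_profile(df, series_id, threshold=440_000):
    """Per hour of day: number of readings, median, max, and the share of
    readings at or above `threshold`.

    Hypothetical helper: the name and default threshold are assumptions.
    Expects `timestamp` to already be a datetime column.
    """
    s = df.loc[df["series_id"] == series_id, ["timestamp", "consumption"]].copy()
    s["hour"] = s["timestamp"].dt.hour
    # Boolean flag; its mean per group is the fraction of readings above threshold
    s["above"] = s["consumption"] >= threshold
    return s.groupby("hour").agg(
        n=("consumption", "size"),
        median=("consumption", "median"),
        max=("consumption", "max"),
        share_above=("above", "mean"),
    )
```

Calling `hourly_profile(train, 103634)` on the training data gives one row per hour. If `share_above` is well above zero for the daytime hours, then 440K at 2 PM is ordinary for this series rather than a one-off.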