Sampling issue in benchmark

While looking into the sampling horizon of solar_wind and the target values t0 and t1, I found that, relative to the last aggregated time, t0 is the record from an hour ago and t1 is the current one (strictly speaking, t0 is 59 min in the past and t1 is 1 min in the future). They are supposed to be the current hour and one hour in the future. My suggestion is to rewrite process_labels as

def process_labels(dst):
    y = dst.copy()
    y["dst"] = y.groupby("period").dst.shift(-1)  # <---- inserted line
    y["t1"] = y.groupby("period").dst.shift(-1)
    y.columns = YCOLS
    return y

But there is still a one-minute lag between the aggregated solar wind data and the targets. Any comments on this are welcome.

In task description:
Thus, your task is to build a model that can predict Dst in real-time for both the current hour and the next hour. For example, if the current timestep is 10:00 am, you must predict Dst for both 10:00 am and 11:00 am using data up until but not including 10:00 am.

So you get data up to 9:59 am (aggregated to … 9:00). The shifts must therefore be -1 and -2.

Hi @hklee, thanks for bringing this up.

I believe @leigh.plt is correct. In that example, data from 9am - 9:59am (aggregated to 9am) should be used to predict 10am and 11am.

Given this, I think the correct way to process the labels is this:

def process_labels(dst):
    y = dst.copy()
    y["t0"] = y.groupby("period").dst.shift(-1)
    y["t1"] = y.groupby("period").dst.shift(-2)
    return y[YCOLS]
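
As a quick sanity check, here is a minimal sketch of that function on toy data. It assumes YCOLS is ["t0", "t1"] and that dst is indexed by (period, timedelta), which may not match your setup exactly; the period names and Dst values are made up for illustration.

```python
import pandas as pd

YCOLS = ["t0", "t1"]  # assumed label column names


def process_labels(dst):
    y = dst.copy()
    y["t0"] = y.groupby("period").dst.shift(-1)
    y["t1"] = y.groupby("period").dst.shift(-2)
    return y[YCOLS]


# Toy data: two periods with three hourly Dst values each (made-up numbers)
index = pd.MultiIndex.from_product(
    [["train_a", "train_b"], pd.to_timedelta([0, 1, 2], unit="h")],
    names=["period", "timedelta"],
)
dst = pd.DataFrame({"dst": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]}, index=index)

labels = process_labels(dst)
print(labels)
```

The row at hour 0 (features aggregated from the first hour) gets the Dst of hour 1 as t0 and of hour 2 as t1. Because the shift is done per period, the last rows of each period end up as NaN rather than picking up values from the next period, so they need to be dropped before training.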

@hklee and @leigh.plt Let me know if that makes sense and you agree. If so I’ll update the blog today.

Thanks!

For my solution, I use -1 and -2. I don’t use the process_labels function from the blog, so I missed this from the beginning.

Is this bug also present in the way you make t0 and t1 during prediction on test data?

No. When you’re making predictions, you’re guaranteed to get feature data up until but not including t0 for timesteps t0 and t1 - you do not have to do any re-alignment. This only affects alignment for the training code. That said, if you trained your model with this bug in place, your model will be trying to predict t-1 and t0, not t0 and t1.
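
To make the off-by-one concrete, here is a small sketch comparing the two labelings on toy data. It assumes the original version left t0 unshifted and shifted t1 by -1, and that YCOLS is ["t0", "t1"]; the function names buggy_labels and fixed_labels are just for illustration.

```python
import pandas as pd

YCOLS = ["t0", "t1"]  # assumed label column names


def buggy_labels(dst):
    # Assumed original behaviour: t0 is the unshifted dst, t1 is shifted by -1
    y = dst.copy()
    y["t1"] = y.groupby("period").dst.shift(-1)
    y.columns = YCOLS
    return y


def fixed_labels(dst):
    y = dst.copy()
    y["t0"] = y.groupby("period").dst.shift(-1)
    y["t1"] = y.groupby("period").dst.shift(-2)
    return y[YCOLS]


# Toy data: one period, four hourly Dst values (made-up numbers)
index = pd.MultiIndex.from_product(
    [["train_a"], pd.to_timedelta([0, 1, 2, 3], unit="h")],
    names=["period", "timedelta"],
)
dst = pd.DataFrame({"dst": [-10.0, -20.0, -30.0, -40.0]}, index=index)

buggy = buggy_labels(dst)
fixed = fixed_labels(dst)

# The buggy "t1" column is exactly what the fixed version calls "t0"
same = buggy["t1"].dropna().tolist() == fixed["t0"].dropna().tolist()
print(same)
```

In other words, under these assumptions the buggy t1 target is the true t0, and the buggy t0 target is one hour earlier than it should be.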

@cszc, @leigh.pi, yes, both versions of the code are equivalent. But I am thinking about instead shifting solar_wind by -59 min and t1 by -1 hour, to keep the timing exact.

@hklee I’m not sure I completely follow your scheme, but in general, yes, I think it’s fine to keep t0 stationary and instead shift t1 and your solar_wind features. It all depends on how you process your features. The main thing you have to ensure is that you are only using data up until t0, not during or after.
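
One possible way to do that, as a rough sketch only: aggregate the minute-level features to hourly bins, then move each bin forward by one hour so that the row labelled t0 holds data from the hour before t0. The column name bt and the toy values are assumptions for illustration, and this handles a single period only.

```python
import numpy as np
import pandas as pd

# Toy minute-level feature for one period: the value equals minutes elapsed
minutes = pd.timedelta_range(start=pd.Timedelta(0), periods=180, freq="1min")
solar_wind = pd.DataFrame({"bt": np.arange(180, dtype=float)}, index=minutes)

# Hourly mean, labelled by the start of each hour (the 0h bin covers 0:00-0:59)
hourly = solar_wind.resample(pd.Timedelta(hours=1)).mean()

# Relabel each bin with the following hour, so the row at t0
# only contains data from [t0 - 1h, t0)
features = hourly.copy()
features.index = features.index + pd.Timedelta(hours=1)

print(features)
```

With the features relabelled this way, t0 can stay as the unshifted dst and t1 becomes a shift of -1. The row at 1h averages minutes 0 through 59 only, so nothing at or after t0 is used.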