We have a question regarding what date the naturalized monthly streamflow data is generated/published. To be specific, if I take the naturalized monthly streamflow data for a given basin for the month of March, on what day specifically did that data come out, and what days does it cover (is it calendar months, e.g. March 1 - March 31, or 30 days back from when the streamflow data was published)? Thank you!
To be specific, if I take the naturalized monthly streamflow data for a given basin for the month of March, on what day specifically did that data come out, and what days does it cover (is it calendar months, e.g. March 1 - March 31, or 30 days back from when the streamflow data was published)?
The monthly naturalized streamflow is the total naturalized streamflow for that calendar month. So the March value is the sum over March 1–March 31, the April value is the sum over April 1–April 30, etc.
what date the naturalized monthly streamflow data is generated/published
Unfortunately, we do not have any record of when the data is published or modified. For the Hindcast Stage, we are making the simplifying assumption that the data is available once the month is over, e.g., the March data becomes available on April 1. In the Forecast Stage, the data will become available whenever it is published by the data providers. We recommend that you set up your model in a way that will work flexibly using whatever latest data is available.
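For anyone wondering what "work flexibly using whatever latest data is available" could look like in practice, here is a minimal sketch using only the standard library. It assumes the file has columns named site_id, year, month, and volume (check your copy of the file; these names are an assumption here), and the helper names last_complete_month and latest_observation are hypothetical. For each site it picks the most recent non-empty month at or before the last complete calendar month, rather than assuming the previous month is always present.

```python
import csv
import io
from datetime import date


def last_complete_month(issue_date: date) -> tuple:
    """Return (year, month) of the last full calendar month before issue_date."""
    if issue_date.month == 1:
        return (issue_date.year - 1, 12)
    return (issue_date.year, issue_date.month - 1)


def latest_observation(csv_text: str, issue_date: date) -> dict:
    """Map each site_id to ((year, month), volume) for its most recent
    non-missing month that is not later than the last complete month."""
    cutoff = last_complete_month(issue_date)
    latest = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row["volume"].strip()
        if raw == "":
            continue  # value not published yet for this site and month
        key = (int(row["year"]), int(row["month"]))
        if key > cutoff:
            continue  # ignore anything beyond the last complete month
        site = row["site_id"]
        if site not in latest or key > latest[site][0]:
            latest[site] = (key, float(raw))
    return latest


# Made-up sample data, for illustration only.
SAMPLE = """site_id,forecast_year,year,month,volume
site_a,2024,2023,10,12.5
site_a,2024,2023,11,10.0
site_a,2024,2023,12,9.1
site_a,2024,2024,1,
site_b,2024,2023,10,3.2
site_b,2024,2023,11,2.9
"""

result = latest_observation(SAMPLE, date(2024, 2, 1))
```

With this approach a feature such as "months since last observation" can be derived from the returned (year, month) key, so the model degrades gracefully when the January value has not been published yet on February 1.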
Hi again @jayqi,
Am I right in assuming that throughout the forecast period, “test_monthly_naturalized_flow.csv” will – in the same exact manner and format as during the hindcast stage – always have antecedent flow values for the 23 sites for which it’s available?
That is to say, when our code is executed on Feb 1st 2024 for a prediction issue_date of Feb 1st 2024, test_monthly_naturalized_flow.csv from the data_dir directory (i.e. /code_execution/data/test_monthly_naturalized_flow.csv) will contain antecedent flow values from Oct 2023 through Jan 2024 for the same 23 sites as the hindcast stage (within the limitation of that data not being released yet)?
I’m just double checking to be sure I don’t need to modify that for the Forecast stage, and can leave it as-is from the Hindcast stage.
Yes, that is generally correct. For each issue date, the mounted data drive will contain a test_monthly_naturalized_flow.csv file in the same format with whatever data is available as of that date. Note that we don’t have control over when the data is available from NRCS, so some or all sites might not have January 2024 available immediately on February 1, 2024.
Hi @jayqi, does that mean that if data is not available for some sites in a particular month, the file will not have exactly 23 records? Or will there be 23 records with some values missing?
Currently, for sites where some months have observations but others do not, the months that are missing values will be empty (NA if you’re reading with pandas). Sites that have no observations at all will not have any rows included.
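Since both cases can occur (empty values for some months, and no rows at all for some sites), one option is to reindex the file onto a fixed grid of expected sites and months so downstream code always sees the same shape. The sketch below is only an illustration: the column names site_id, year, month, and volume are assumed, the helper load_flow_table is hypothetical, and the sample data is made up.

```python
import csv
import io


def load_flow_table(csv_text: str, expected_sites: list, months: list) -> dict:
    """Return {site_id: {(year, month): volume or None}} covering every
    expected site and month, whether or not the file has a row for it."""
    # Start from a grid where everything is missing.
    table = {site: {m: None for m in months} for site in expected_sites}
    for row in csv.DictReader(io.StringIO(csv_text)):
        site = row["site_id"]
        key = (int(row["year"]), int(row["month"]))
        if site in table and key in table[site]:
            raw = row["volume"].strip()
            # An empty field means the value is missing for that month.
            table[site][key] = float(raw) if raw else None
    return table


# Made-up sample: site_a has an empty month, site_b lacks a row for
# November, and site_c has no rows at all.
SAMPLE = """site_id,forecast_year,year,month,volume
site_a,2024,2023,10,12.5
site_a,2024,2023,11,
site_b,2024,2023,10,3.2
"""

MONTHS = [(2023, 10), (2023, 11)]
table = load_flow_table(SAMPLE, ["site_a", "site_b", "site_c"], MONTHS)
```

After this step every site has an entry for every month, with None marking anything that was empty or absent, so the same imputation or fallback logic can handle both cases.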
Could you please clarify whether test_monthly_naturalized_flow.csv will only contain data for the 2024 forecast year? Will it also include historical observations like in the hindcast stage?
That is correct. It only contains data expected to be used at inference time for the current water year.
For the Hindcast Stage, it only contained data for the water years in the test set.
If you need data from other years, e.g., to have a longer lookback window when deriving features, you should explicitly make a request with details (e.g., how far back). See the blue info box under “Time and data use” for the Forecast Stage.
Thanks for the clarification @jayqi. I’ve made a request in this thread.
Btw, does this also refer to files in the teleconnections folder - will they also contain observations relevant to the current water year only?
No, the teleconnections files contain all historical data available. We don’t do anything to subset those datasets.