Forecast Data Question

jimking100 · December 22, 2023, 10:58pm

Hi,
You state that the feature data during the Forecast stage will be downloaded the same way as in the Hindcast stage (locally available). If some data is not yet available when you run the download, what should we expect to see?

[Edit 2] - Also, do you plan on providing an updated supplementary_nrcs_train_monthly_naturized_flow.csv file that includes the test year data? It’s not currently in the forecast download.

jayqi · December 31, 2023, 5:57pm

Hi @jimking100,

For each issue date, we will run the data download bulk command to download whatever data is available from each data source. Depending on how the particular data source is written out, if some data for some date isn’t available, there will either be missing rows in a CSV file or missing files in a directory.
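To illustrate the two cases described above (a file missing from a directory versus rows missing from a CSV), here is a minimal sketch of defensive loading. The helper names `load_csv_if_present` and `missing_keys`, the `site_id` column, and the toy data are assumptions made for illustration; they are not part of the runtime or of `wsfr_read`.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_csv_if_present(path) -> Optional[pd.DataFrame]:
    """Return the CSV as a DataFrame, or None if the file was never written."""
    path = Path(path)
    if not path.is_file():
        return None
    return pd.read_csv(path)


def missing_keys(df: pd.DataFrame, key_col: str, expected) -> list:
    """List the expected keys (e.g. site IDs) that have no rows at all in df."""
    present = set(df[key_col].dropna().unique())
    return [key for key in expected if key not in present]


# Toy frame standing in for a downloaded file; "site_id" is an assumed column name.
toy = pd.DataFrame({"site_id": ["site_a", "site_a", "site_c"], "value": [1.0, 2.0, 3.0]})
absent = missing_keys(toy, "site_id", ["site_a", "site_b", "site_c"])
```

Checking for both cases up front lets a submission fall back to older data or a default instead of failing partway through a prediction.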

Re: updated supplementary_nrcs_train_monthly_naturized_flow.csv—we will work on making this available the week after New Year’s.

motoki · January 1, 2024, 9:31pm

When running predictions for 2024, I got an error “ValueError: No objects to concatenate”.

The error comes from this line:

drivendataorg/water-supply-forecast-rodeo-runtime/blob/33ee7e954d23148e217111a7866b0bccba7a1188/data_reading/wsfr_read/climate/cpc_outlooks.py#L191


      
        columns = PRECIP_COLUMNS
        if year in {2004, 2006}:
            widths = PRECIP_ALT_WIDTHS
        else:
            widths = PRECIP_WIDTHS
    dfs = {}
    for issue_date, buffer in table_gen:
        dfs[pd.to_datetime(issue_date)] = pd.read_fwf(
            buffer, header=None, names=columns, widths=widths
        ).set_index(["YEAR", "MN", "LEAD", "CD"])
    return pd.concat(dfs, names=["issue_date"])


def read_cpc_outlooks_temp(
    issue_date: str, site_id: str | None = None, fy_start_month: int = 10
) -> pd.DataFrame:
    """Read CPC Seasonal Temperature Outlooks available as of a given issue_date. By default,
    this loads data for the water year of that issue_date (starting prior Oct 1) up to the day
    before that issue_date. See documentation from CPC for additional explanation on what columns
    in the loaded data represent.
    https://www.cpc.ncep.noaa.gov/pacdir/NFORdir/HUGEdir2/explanation_fdf.html

Could you kindly check whether the CPC data is there?

jimking100 · January 1, 2024, 11:28pm

Hi,
I received two errors when the automatic 1/1/24 forecast was run, and I am hoping you can provide some more insight on them. One was for the monthly naturalized flow and the other was for the SOI data. In both cases, I assumed the files would exist and would at least contain the past data, since those files and the past data are already in your possession, but that does not seem to be the case. Can you shed some light on what actually exists in the data directory on 1/1/24? Do you expect us to load the past data ourselves?

On a broader note, it would be very helpful if you could provide us with access to the 1/1/24 - 1/11/24 data directory during this test period so we can better understand what is or is not being provided. I suppose we could make repeated submissions to print this data in our logs, but submissions are limited and this does not seem like a very efficient method.

mmiron · January 2, 2024, 9:00am

Hi @jimking100,

I just thought I’d chime in and mention that the SOI file (/code_execution/data/teleconnections/soi.txt) appears to have been loaded and parsed without issue by my submission. That piece of my code is unchanged from the hindcast stage. Hope that helps you narrow down where to look, if nothing else.

jayqi · January 2, 2024, 6:23pm

Hi @motoki,

Thanks for flagging this. It looks like there is actually no 2024 data available yet. The download pipeline saved a file that did not contain real data, and wsfr_read.climate.cpc_outlooks did not handle this case (it never came up in Hindcast). The data download drive and the runtime image have both been updated with a fix.

It is now expected that you should get these errors if you try to load the files directly:

FileNotFoundError: [Errno 2] No such file or directory: '/code_execution/data/cpc_outlooks/cpcllftd.2024.dat'

FileNotFoundError: [Errno 2] No such file or directory: '/code_execution/data/cpc_outlooks/cpcllfpd.2024.dat'

The wsfr-read package functions have been updated to skip those files and log an expected warning that looks something like this:

2024-01-02 11:28:40.527 | WARNING | wsfr_read.climate.cpc_outlooks:read_cpc_outlooks_precip:292 - No CPC outlooks available for calender year 2024. Only data from calender year 2023 loaded.

I just reran your submission after these fixes were implemented, and it still failed (you should have received an email). It looks like you have your own copies of the data reading functions, rather than using the installed wsfr_read that is included in the runtime environment, so the errors that you got are expected.
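For anyone who keeps their own copies of the reading functions, here is a minimal sketch of the two guards involved: skipping yearly files that are missing or empty, and avoiding `ValueError: No objects to concatenate`, which `pd.concat` raises when it is given nothing to combine. The helper names `existing_nonempty` and `concat_available` are hypothetical, and this is not how `wsfr_read` itself implements the fix.

```python
from pathlib import Path

import pandas as pd


def existing_nonempty(paths) -> list:
    """Keep only paths that point to a real file with at least one byte in it."""
    return [p for p in map(Path, paths) if p.is_file() and p.stat().st_size > 0]


def concat_available(frames: dict) -> pd.DataFrame:
    """Concatenate per-issue-date frames, returning an empty frame if there are none.

    pd.concat raises "ValueError: No objects to concatenate" on an empty
    mapping, so the empty case is handled explicitly.
    """
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, names=["issue_date"])


# Toy usage: one parsed table keyed by its issue date.
frames = {pd.Timestamp("2023-10-18"): pd.DataFrame({"x": [1, 2]})}
combined = concat_available(frames)
nothing = concat_available({})
```

Returning an empty DataFrame is only one possible choice; callers still need to decide what a prediction should do when no outlooks exist for the year.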

jimking100 · January 3, 2024, 2:39am

Hi,
So I’m using the logs to try to answer my previous questions, and I’ve resolved the soi.txt issue (an issue with my code). I also see the issues in the monthly naturalized flow data, but I have questions:

It appears the sweetwater data is missing entirely from the monthly naturalized flow. I would expect it to at least have some Oct or Nov data for 2023, or NaNs or zeros. Can you explain?
There are zeros in many of the Dec entries. Is zero an actual value, or does it mean there is no data for that month? Are NaNs ever used to show no data?

jayqi · January 3, 2024, 3:41pm

Hi @jimking100,

sweetwater_r_nr_alcova indeed has no data available. Due to the way the data processing is set up, these rows show up as missing instead of NA—these rows are also missing from the raw CSV from NRCS.

I don’t see any zeros in the data, but there are a lot of missing values. Are you sure you’re not turning missing values into zero on your end?
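One common way missing values turn into zeros without an explicit `fillna(0)` is aggregation: by default, pandas `sum` skips NaN and returns 0 for a group that is entirely missing. The sketch below uses made-up data and assumed column names (`site_id`, `volume`) to show the difference that `min_count=1` makes.

```python
import numpy as np
import pandas as pd

# Toy data: site "b" has only missing values.
df = pd.DataFrame(
    {
        "site_id": ["a", "a", "b", "b"],
        "volume": [1.0, 2.0, np.nan, np.nan],
    }
)

# Default behaviour: NaNs are skipped, so an all-missing group sums to 0.0.
default_sum = df.groupby("site_id")["volume"].sum()

# With min_count=1, a group needs at least one real value; otherwise the result is NaN.
strict_sum = df.groupby("site_id")["volume"].sum(min_count=1)
```

If zeros appear only after a step like this, they most likely stand for "no data" rather than an observed value of zero.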

Regarding your request for access to the mounted data, we will look into making this available.

motoki · January 3, 2024, 5:04pm

Thanks @jayqi. I managed to adapt my code and it works fine now.

jayqi · January 4, 2024, 10:39pm

Hi @jimking100,

Please see the latest announcement about access to mounted data and about the supplementary training data.

jayqi · January 8, 2024, 10:54pm

Hi everyone,

We’ve made an update to how test_monthly_naturalized_flow.csv is produced so that all 23 sites should show up with rows for every month since October 2023, even if there is no data. This should be reflected as of the 2024-01-08 issue date.

To confirm, for 2024-01-08, we still have no data available for three sites: pueblo_reservoir_inflow, sweetwater_r_nr_alcova, and ruedi_reservoir_inflow. You will see empty values for them for all three of the 2023-10, 2023-11, and 2023-12 rows.
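A quick way to see which sites have rows but no values is to group on the site column and check whether any value is present. This is a minimal sketch: the column names `site_id` and `volume`, the helper `sites_without_data`, and the inline sample are assumptions for illustration, so adjust them to the actual header of test_monthly_naturalized_flow.csv.

```python
import io

import pandas as pd


def sites_without_data(df: pd.DataFrame, site_col: str = "site_id", value_col: str = "volume") -> list:
    """Return the sites whose rows exist but whose values are all empty."""
    has_data = df[value_col].notna().groupby(df[site_col]).any()
    return sorted(has_data.index[~has_data])


# Made-up sample in which empty fields are parsed by pandas as NaN.
sample = io.StringIO(
    "site_id,year,month,volume\n"
    "site_a,2023,10,12.5\n"
    "site_a,2023,11,\n"
    "site_b,2023,10,\n"
    "site_b,2023,11,\n"
)
flow = pd.read_csv(sample)
empty_sites = sites_without_data(flow)
```

Here `site_a` is kept because it has one real value, while `site_b` is reported because every one of its rows is empty.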

motoki · January 23, 2024, 5:24am

@jayqi : I think my submission hit the same issue? Could you recheck if the cpc data is empty? I handle the missing file but not the empty file.

jayqi · January 23, 2024, 3:25pm

Hi @motoki,

The data looks fine to me. I believe the issue is with your code.

Here’s an example submission that I ran that loaded the data successfully using wsfr_read.climate.cpc_outlooks:

from loguru import logger
from wsfr_read.climate import cpc_outlooks


def predict(
    site_id,
    issue_date,
    assets,
    src_dir,
    data_dir,
    preprocessed_dir,
) -> tuple[float, float, float]:
    logger.info("site_id is {}", site_id)
    logger.info("issue_date is: {}", issue_date)

    df_precip = cpc_outlooks.read_cpc_outlooks_precip(issue_date, site_id)
    logger.info("df_precip.head():\n{}", df_precip.head())
    logger.info("df_precip.tail():\n{}", df_precip.tail())

    df_temp = cpc_outlooks.read_cpc_outlooks_temp(issue_date, site_id)
    logger.info("df_temp.head():\n{}", df_temp.head())
    logger.info("df_temp.tail():\n{}", df_temp.tail())
    raise Exception("Stop.")

Logs:

2024-01-23 15:16:28.315 | INFO     | src.solution:predict:13 - site_id is hungry_horse_reservoir_inflow
2024-01-23 15:16:28.315 | INFO     | src.solution:predict:14 - issue_date is: 2024-01-22
2024-01-23 15:16:29.584 | INFO     | src.solution:predict:17 - df_precip.head():
                               R   98.   95.   90.  ...  C MEAN    F SD  C SD  POWER
issue_date YEAR MN LEAD CD                          ...                             
2023-10-18 2023 10 1    20  0.34  0.85  0.98  1.12  ...    1.88  0.1036  0.11   0.29
                        21  0.37  1.86  2.15  2.43  ...    4.04  0.1303  0.14   0.30
                   2    20  0.22  0.73  0.87  1.01  ...    1.82  0.1169  0.12   0.29
                        21  0.22  1.72  2.01  2.29  ...    3.72  0.2733  0.28   0.52
                   3    20  0.07  0.99  1.14  1.29  ...    2.06  0.1197  0.12   0.33

[5 rows x 19 columns]
2024-01-23 15:16:29.599 | INFO     | src.solution:predict:18 - df_precip.tail():
                              R   98.   95.   90.  ...  C MEAN  F SD  C SD  POWER
issue_date YEAR MN LEAD CD                         ...                           
2024-01-17 2024 1  11   21  0.0  1.93  2.24  2.53  ...    3.72  0.28  0.28   0.52
                   12   20  0.0  1.07  1.24  1.40  ...    2.06  0.12  0.12   0.33
                        21  0.0  1.99  2.25  2.49  ...    3.49  0.13  0.13   0.34
                   13   20  0.0  1.48  1.82  2.12  ...    3.18  0.86  0.86   1.02
                        21  0.0  2.20  2.47  2.71  ...    3.70  0.16  0.16   0.41

[5 rows x 19 columns]
2024-01-23 15:16:30.428 | INFO     | src.solution:predict:21 - df_temp.head():
                               R    98.    95.  ...  C MEAN    F SD  C SD
issue_date YEAR MN LEAD CD                      ...                      
2023-10-18 2023 10 1    20  0.22  22.72  23.89  ...   27.40  2.8515  2.92
                        21  0.19  22.79  23.63  ...   25.93  2.0515  2.09
                   2    20  0.13  19.10  20.51  ...   24.95  3.4292  3.46
                        21  0.22  20.07  21.13  ...   24.11  2.5781  2.64
                   3    20  0.27  22.67  24.00  ...   28.45  3.2490  3.37

[5 rows x 18 columns]
2024-01-23 15:16:30.442 | INFO     | src.solution:predict:22 - df_temp.tail():
                              R    98.    95.  ...  C MEAN  F SD  C SD
issue_date YEAR MN LEAD CD                     ...                    
2024-01-17 2024 1  11   21  0.0  18.70  19.78  ...   24.11  2.64  2.64
                   12   20  0.0  21.54  22.92  ...   28.45  3.37  3.37
                        21  0.0  22.92  23.96  ...   28.11  2.53  2.53
                   13   20  0.0  27.52  28.91  ...   34.49  3.40  3.40
                        21  0.0  28.86  29.92  ...   34.13  2.57  2.57

[5 rows x 18 columns]

motoki · January 23, 2024, 7:24pm

Many thanks for providing this example. I just used your functions and everything seems fine now.
