Hi,
I have a question about the evaluation phase. I have training and test features compiled and formatted from the approved sources for all sites and all years. Can I simply load these files into the preprocessed_dir along with my models to run inference, or is there a reason I would need to rebuild them from scratch? This would seem to work for the Hindcast Evaluation Stage, but it would need to change in the Forecast Stage, since we will only receive test data as the year progresses.
Hi @jimking100,
No, you should not upload precalculated features for the test data. Feature processing is considered to be part of the inference workflow, and you should calculate features from the raw data provided in the runtime environment (or from downloaded raw data for data sources that are approved for direct API access) during the runtime execution.
The Hindcast Stage and the Forecast Stage should conceptually work in the same way—if something won’t work for the Forecast Stage, that’s a good sign that it may not be correct for the Hindcast Stage.
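For example, feature computation can live inside preprocess() so that it runs against the raw data in the runtime environment. Here is a minimal sketch: the snotel subdirectory does exist in the runtime data_dir, but the per-file feature logic and the output file name are hypothetical placeholders for your own pipeline.

```python
from collections.abc import Hashable
from pathlib import Path
from typing import Any

import pandas as pd


def preprocess(src_dir: Path, data_dir: Path, preprocessed_dir: Path) -> dict[Hashable, Any]:
    # Compute features from the raw data provided in the runtime environment,
    # rather than uploading precalculated test features.
    frames = []
    for csv_path in sorted((data_dir / "snotel").glob("**/*.csv")):
        df = pd.read_csv(csv_path)
        frames.append(df)  # derive your actual features from the raw observations here
    if frames:
        # Write derived features for later steps of the inference workflow to read
        pd.concat(frames).to_csv(preprocessed_dir / "features.csv", index=False)
    return {}
```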
Got it, thanks! It wasn’t explicitly mentioned in the evaluation arena, so I thought I would ask.
Hi @jayqi,
What about storing precalculated static features? Is it permitted?
These features will not change, so there is no need to use downloaded data in the Hindcast and Forecast Stages.
Some examples are:
- Min/max/avg/std statistics (precomputed using an sklearn preprocessor)
- Precomputed elevation and other static basin metadata
Thanks
Hi @jayqi,
I’m having some difficulty with the runtime environment. Specifically, there doesn’t appear to be any data available at all in the data_dir directory being passed to the preprocess() function:
2023-12-05 17:58:06,168:DEBUG:flowcast.src.solution:listing of data_dir /code_execution/data/*: [] [solution.py:128]
My solution looks in the data_dir passed to preprocess() for the same file hierarchy that the python -m wsfr_download bulk command creates locally. What’s the correct way of doing that?
Hi @mmiron,
I was able to reproduce your issue. It appears that this is simply an issue with glob in Python: the files were actually there and could be read given the path to one, but glob was just not finding them.
We’ve just pushed a fix, so your code should successfully list files if you submit again. Let me know if you have any further issues.
> What about storing precalculated static features? Is it permitted?
> These features will not change, so there is no need to use downloaded data in the Hindcast and Forecast Stages.
You can upload precalculated parameters based on the training data (e.g., fitted versions of sklearn’s preprocessors). For features that use variables corresponding to the test data, you should calculate those within the code execution runtime using the provided raw data, or data that you download during code execution for data sources where direct API access is permitted.
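For instance, here is a minimal sketch of that pattern, assuming a scikit-learn StandardScaler; the file names are hypothetical.

```python
import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Offline, on your own machine: fit on training features and save the fitted object.
train_features = pd.read_csv("train_features.csv")  # hypothetical training feature file
scaler = StandardScaler().fit(train_features)
joblib.dump(scaler, "scaler.joblib")  # upload this alongside your model weights

# Inside the code execution runtime: load the fitted preprocessor and apply it
# to features computed from the raw test data during execution.
scaler = joblib.load("scaler.joblib")
# test_features = ...computed in the runtime from raw data...
# scaled = scaler.transform(test_features)
```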
Hi @jayqi,
Both glob.glob(os.path.join(data_dir, "*")) and os.listdir(data_dir) return empty lists.
Hi @mmiron,
We’ll try to debug this issue further, but as a workaround that you can apply in your own code, you can try something like import time; time.sleep(5) to delay your code for a short amount of time, like 5 seconds. There seems to be some transient effect when mounting the data drive that makes Python unable to properly iterate over the files (even though the files are actually there) until after some delay.
For example, the following preprocess function works for me:
```python
from collections.abc import Hashable
import glob
import os
from pathlib import Path
import time
from typing import Any

from loguru import logger


def preprocess(src_dir: Path, data_dir: Path, preprocessed_dir: Path) -> dict[Hashable, Any]:
    # Give the data drive mount a moment to settle before listing files
    time.sleep(5)
    glob_results = glob.glob(os.path.join(data_dir, "*"))
    logger.info("glob.glob: {}", glob_results)
    list_dir_results = os.listdir(data_dir)
    logger.info("os.listdir: {}", list_dir_results)
    # Raise deliberately so execution stops here and the log lines above are easy to find
    raise Exception
    return {}
```
Resulting logs:
```
2023-12-05 22:04:52.740 | INFO | __main__:main:51 - Running function 'preprocess'
2023-12-05 22:04:57.841 | INFO | src.solution:preprocess:14 - glob.glob: ['/code_execution/data/cdec_snow_stations.csv', '/code_execution/data/cpc_climate_divisions.gpkg', '/code_execution/data/geospatial.gpkg', '/code_execution/data/metadata.csv', '/code_execution/data/smoke_submission_format.csv', '/code_execution/data/submission_format.csv', '/code_execution/data/test_monthly_naturalized_flow.csv', '/code_execution/data/cdec', '/code_execution/data/cpc_outlooks', '/code_execution/data/grace_indicators', '/code_execution/data/modis_vegetation', '/code_execution/data/pdsi', '/code_execution/data/snodas', '/code_execution/data/snotel', '/code_execution/data/teleconnections', '/code_execution/data/usgs_streamflow']
2023-12-05 22:04:57.895 | INFO | src.solution:preprocess:16 - os.listdir: ['cdec_snow_stations.csv', 'cpc_climate_divisions.gpkg', 'geospatial.gpkg', 'metadata.csv', 'smoke_submission_format.csv', 'submission_format.csv', 'test_monthly_naturalized_flow.csv', 'cdec', 'cpc_outlooks', 'grace_indicators', 'modis_vegetation', 'pdsi', 'snodas', 'snotel', 'teleconnections', 'usgs_streamflow']
```
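If a fixed delay ever proves too short, a polling loop is a more robust variant of the same workaround. This is just a sketch, not part of the official runtime; the function name and timeout values are my own.

```python
import os
import time
from pathlib import Path


def wait_for_data(data_dir: Path, timeout: float = 60.0, interval: float = 1.0) -> None:
    # Block until data_dir lists at least one entry, or give up after timeout seconds
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.listdir(data_dir):
            return
        time.sleep(interval)
    raise TimeoutError(f"{data_dir} still appears empty after {timeout} seconds")
```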
Yes, I had the same issue - no data! The 5-second fix seems to work and I have a successful smoke test! Thanks!