Downloading CDEC data fails every time

I tried to download the CDEC data using the code in wsfr_download, but I always get a ConnectionError.

On the CLI, I use the command below.

python -m wsfr_download bulk data_download/hindcast_test_config.yml

Then I get this:
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Could someone tell me how to solve this problem?

Hi @RyoyaKatafuchi,

When you run the download, are you able to partially download the CDEC data (i.e., it runs partway before you get a ConnectionError), or does it happen immediately and you aren’t able to download any of the CDEC data?

Hi @jayqi,

Thank you for your reply.

I can partially download the CDEC data.
As you say, it runs partway before I get a ConnectionError.

Here is one example. This time it stopped at 37%, but the point where it stops changes every time I run the code.

2023-11-21 01:47:40.046 | INFO     | wsfr_download.cdec:find_nearby_cdec_stations:189 - 373 nearby CDEC stations identified
2023-11-21 01:47:40.047 | INFO     | wsfr_download.cdec:download_cdec:229 - Downloading forecast year 2005 (2004-10-01 to 2005-07-21)
 37%|██████████████████████████████████▉                                                            | 137/373 [00:13<00:23, 10.21it/

Then, the error happens.

requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Thanks.

I'm hitting this issue too. I've tried dozens of times over the last 4 days, but every attempt failed. I wish the hosts could share the data in the download tab.

Hi @RyoyaKatafuchi, @hydantess,

I just pushed an update so that the download code retries up to 5 times, with a delay, when it encounters a ConnectionError. This has solved this kind of problem for me in other situations. Please update your repositories and try again, and let me know if you still encounter any issues. You should make sure that you have this commit.
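For anyone curious what "retry up to 5 times with a delay" looks like in general, here is a minimal sketch. It is not the code from the commit: `retry_call`, its parameters, and the fixed 2-second delay are all hypothetical names and values chosen for illustration. Note that `requests.exceptions.ConnectionError` is not a subclass of Python's built-in `ConnectionError`, so with requests you would pass it explicitly via `retry_on`.

```python
import time


def retry_call(func, max_attempts=5, delay_seconds=2.0,
               retry_on=(ConnectionError,), sleep=time.sleep):
    """Call func() and retry when it raises one of the retry_on exceptions.

    Hypothetical helper for illustration only. The last failure is re-raised
    once max_attempts calls have been made.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts, surface the original error
            sleep(delay_seconds)  # wait before the next attempt


# Usage with a simulated flaky call that fails twice and then succeeds.
state = {"calls": 0}


def flaky_download():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("Remote end closed connection without response")
    return "data"


result = retry_call(flaky_download, sleep=lambda seconds: None)  # skip real waiting
print(result, state["calls"])
```

With requests, the call would be something like `retry_call(download, retry_on=(requests.exceptions.ConnectionError,))`, where `download` is your own function.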

Hi, thanks! I was finally able to download the CDEC data.
But I see 501 rows with only 3 unique site_id values in sites_to_cdec_stations.csv. Is this right?

I also have a similar issue with USGS streamflow. I only got 18 files for 18 sites each year, while the description says we should have data for 25 (out of 26) sites.

@jayqi: Could you kindly let us know how to fix this issue? Or could you just upload the files somewhere (as they are not very big)?

Hi @hydantess,

That is correct. CDEC has snow monitoring stations in California, and only 3 of the forecast sites are in California: san_joaquin_river_millerton_reservoir, american_river_folsom_lake, and merced_river_yosemite_at_pohono_bridge. This is intended to be a supplement to the SNOTEL snow monitoring stations, which do not have as much coverage in California.
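If you want to check the mapping file yourself, here is a rough sketch that counts rows per site_id using only the standard library. The `site_id` column comes from the file discussed above; the `station_id` column name and the station codes in the sample are made up for illustration, and `rows_per_site` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def rows_per_site(csv_file):
    """Count how many rows (stations) each site_id has in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["site_id"] for row in reader)


# Tiny in-memory sample; the station codes are invented placeholders.
sample = io.StringIO(
    "site_id,station_id\n"
    "american_river_folsom_lake,AAA\n"
    "american_river_folsom_lake,BBB\n"
    "merced_river_yosemite_at_pohono_bridge,CCC\n"
    "san_joaquin_river_millerton_reservoir,DDD\n"
)

counts = rows_per_site(sample)
print(len(counts), "unique site_id values")
for site_id, n_rows in sorted(counts.items()):
    print(site_id, n_rows)
```

For the real file you would open it with `open(path, newline="")`, where `path` points to wherever your copy of sites_to_cdec_stations.csv was written.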

Hi @motoki,

Can you please provide more detail about what errors you are seeing? For example, please post a stack trace or error log message.

It is the case that 25 of the 26 sites have associated USGS monitoring stations. However, not all 25 sites will have available data for every year (this varies year by year). For example, I have data for 22 sites for FY2005 and 18 sites for FY2023. We will soon publish a list of the files that will be present in the code execution runtime for the test split.

Regarding uploading the files: because of the wide range of possible locations and years that teams may want for both training and testing, we are generally not planning to directly upload feature data for you to download.

Thank you for your help, @jayqi.

But the new cdec.py didn't work for me.
I encountered the same error: requests.exceptions.ConnectionError.
The situation hasn't changed: I can partially download the CDEC data, but then it fails.

I am using the same library versions as in the data_download repository, so I don't understand why it keeps happening.

Thanks.

Hi @RyoyaKatafuchi,
It still wasn't working for me either; I kept getting the same error very quickly. It started to work when I doubled wait_initial, wait_max and wait_exp_base (https://github.com/drivendataorg/water-supply-forecast-rodeo-runtime/blob/e756497ea06942b626d84e243f6fd854cb4549ae/data_download/wsfr_download/cdec.py#L98-L102). It certainly isn't the best way to deal with the problem, but it worked for me, even though it took hours, the process froze 3 or 4 times, and I had to restart from the first year that hadn't been downloaded yet.
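To give a feel for why doubling those values slows requests down so much, here is a small sketch of capped exponential backoff. The formula and the numbers are assumptions for illustration; I have not checked that cdec.py computes its waits exactly this way or uses these defaults, and `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(wait_initial, wait_max, wait_exp_base, attempts):
    """Return the wait before each retry under capped exponential backoff.

    Assumed formula: min(wait_max, wait_initial * wait_exp_base ** n)
    for retry number n = 0, 1, 2, ...
    """
    return [
        min(wait_max, wait_initial * wait_exp_base ** n)
        for n in range(attempts)
    ]


# Illustrative values only, not the defaults from the repository.
baseline = backoff_delays(wait_initial=1, wait_max=30, wait_exp_base=2, attempts=4)
doubled = backoff_delays(wait_initial=2, wait_max=60, wait_exp_base=4, attempts=4)

print(baseline)  # [1, 2, 4, 8]
print(doubled)   # [2, 8, 32, 60]
```

Under these assumptions, the doubled settings back off much more aggressively after each failure, which is gentler on the server but also explains the long total runtime.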

I can confirm that doubling the values of those parameters increases the stability of the download process a lot.

Hi @progin,
Thanks for your advice.

I can also confirm that doubling the values makes the download process more stable.
But at the same time, it seems like it will take forever to download everything.

Anyway, you saved me, thanks!!

Hi everyone,

I’ve made an update to the CDEC code that should improve reliability. It now downloads data for a batch of up to 100 stations at a time, instead of for each station individually. It also now runs serially without multithreading. Based on my testing, the reduced number of network calls to CDEC servers should improve things. The new code is available as of this commit.
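As a rough illustration of the batching idea (a sketch, not the code from the commit), the snippet below splits a list of station IDs into batches of up to 100 and processes them one after another. `batched`, `download_serially`, and `fetch_batch` are hypothetical names, and the station IDs are placeholders.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def download_serially(station_ids, fetch_batch, batch_size=100):
    """Fetch one batch at a time, with no threads, and collect the results.

    fetch_batch is a caller-supplied function (hypothetical here) that takes
    a list of station IDs and returns a list of records.
    """
    results = []
    for batch in batched(station_ids, batch_size):
        results.extend(fetch_batch(batch))
    return results


# 373 placeholder station IDs, matching the count in the log earlier in the thread.
station_ids = [f"STN{i:03d}" for i in range(373)]
batch_sizes = [len(batch) for batch in batched(station_ids)]
print(batch_sizes)  # [100, 100, 100, 73]
```

With 373 nearby stations, that is 4 requests per forecast year instead of 373.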