Provisions for R-users

As a user primarily skilled in R and with no skills in Python, the current requirements pose a challenge. The statement mentions:

“that solutions can be implemented in either Python or R, but the provided sample code will only be in Python. Additionally, solutions in R may need to call their R code through a Python wrapper.”

This seems a bit contradictory for someone using R and with no or limited Python skills. Embedding R code into Python requires a level of proficiency in Python. If a participant had this proficiency, they might not have chosen R for implementing a solution. Could the competition organizers please reconsider this requirement?

Another challenge is the limited options for data retrieval, since some of it is accessible only via designated Python-based scripts. Would it be possible to use R libraries to directly call some of the data? E.g., the snotelr package (GitHub - bluegreen-labs/snotelr: a snow data network (SNOTEL) R package) facilitates access to SNOTEL data from R.

I am in a similar situation with a lack of Python skills. Another R package useful for data download is dataRetrieval, for downloading USGS streamflow data. I think we could use the Data Request form (deadline today!) to ask that specific R packages be added as approved methods to access the data. Some data sources (i.e., NRCS SNOTEL) already have two approved methods for access.


Thanks! I have sent the approval request for the two packages. However, even if use of these packages is approved, the main challenge would still remain: according to the code submission requirements, everything has to be done and run in Python (Competition: Water Supply Forecast Rodeo: Hindcast Evaluation). I hope there will be an option to submit R code; otherwise I have no way to participate in the competition.

Hi @tabumis, @riverbend,

Thanks for the feedback. In this challenge, Python is the primary language that is being supported. Solutions are allowed to use R in order to make the challenge accessible to more people. While R will not be supported to the same level as Python, the available resources and requirements should not be extremely limiting for R users.

  • The feature data download code is provided as a command-line program. As a prerequisite, you will need to set up a Python virtual environment and install the package in order to use it, but you do not need to have any proficiency in reading or writing Python code to use it. There should be many resources online for different ways to install Python and set up a virtual environment—here’s a guide for setting up Python with conda. Instructions for setup and use of the data download program are in the README of the data and runtime repository.
  • During code execution, rehosted feature data is available as files on disk in whatever raw formats (e.g., CSVs, netCDF, etc.). You can use whatever you’d like to read these data files. Use of the provided sample Python code is not required.
  • For any data sources where direct API access is permitted during code execution, you may use any method you’d like to download data from the approved sources.
  • The requirement to wrap your prediction code in Python is fairly lightweight. See below for a simple example of calling a model.R script from the required predict function in a solution.py. You’d include both of these files together in your submission.zip.

We are happy to provide further tips to help you implement your solution, and to accept dependency requests for R packages that are available via conda.

## model.R

# Read site_id and issue_date from command-line arguments
args <- commandArgs(trailingOnly = TRUE)
site_id <- args[1]
issue_date <- args[2]

print("Printing from model.R")
print(paste("site_id:", site_id))
print(paste("issue_date:", issue_date))

# Calculate your predictions here
predictions <- c(100.5, 110.9, 120.4)

# Write predictions to a file so Python code can read them in
out_file <- paste0("preprocessed/predictions/", site_id, "_", issue_date, ".txt")
print(paste("Writing predictions to", out_file))
write(predictions, file = out_file, sep = ",")

## solution.py

import subprocess
from pathlib import Path
from typing import Any, Hashable

from loguru import logger

def predict(
    site_id: str,
    issue_date: str,
    assets: dict[Hashable, Any],
    src_dir: Path,
    data_dir: Path,
    preprocessed_dir: Path,
) -> tuple[float, float, float]:
    logger.info("Logging from solution.py")

    logger.info("Prediction for site_id={}, issue_date={}", site_id, issue_date)

    logger.info("Using subprocess to call model.R via shell command.")
    subprocess.run(("Rscript", "model.R", site_id, issue_date), check=True)
    logger.info("model.R completed")

    # Read text file containing "100.5,110.9,120.4", split on comma, cast to float
    preds_path = preprocessed_dir / "predictions" / f"{site_id}_{issue_date}.txt"
    logger.info("Reading predictions from {}", preds_path)
    preds_text = preds_path.read_text()
    preds = tuple(float(y) for y in preds_text.split(","))

    logger.success("Successfully read predictions: {}", preds)

    return preds

Here’s some example logging output from running this code:

2023-12-05 11:39:21.519 | INFO | solution:predict:15 - Logging from solution.py
2023-12-05 11:39:21.520 | INFO | solution:predict:17 - Prediction for site_id=hungry_horse_reservoir_inflow, issue_date=2015-03-15
2023-12-05 11:39:21.520 | INFO | solution:predict:19 - Using subprocess to call model.R via shell command.
[1] "Printing from model.R"
[1] "site_id: hungry_horse_reservoir_inflow"
[1] "issue_date: 2015-03-15"
[1] "Writing predictions to preprocessed/predictions/hungry_horse_reservoir_inflow_2015-03-15.txt"
2023-12-05 11:39:21.781 | INFO | solution:predict:21 - model.R completed
2023-12-05 11:39:21.782 | INFO | solution:predict:25 - Reading predictions from preprocessed/predictions/hungry_horse_reservoir_inflow_2015-03-15.txt
2023-12-05 11:39:21.783 | SUCCESS | solution:predict:29 - Successfully read predictions: (100.5, 110.9, 120.4)

Thanks @jayqi, that is very helpful! Though it will still take me some time to (hopefully) be able to implement. Before seeing your response, I had posted an issue at I would like to be able to use R for this project · Issue #5 · drivendataorg/water-supply-forecast-rodeo-runtime · GitHub. A few follow-up questions:

  • What version of R will we have access to in the runtime environment?
  • How do we find out which R packages are available via conda? Is it Search :: Anaconda.org? I have zero experience with conda or Docker.
  • What is the process for requesting that specific R packages be added to the runtime environment? I am not very familiar with GitHub (I just used it once briefly a few years ago), so it would be easiest for me if I could just provide a list of the requested packages in a forum post such as this, rather than having to write the package installation code and submit a pull request (i.e., I think that is what is meant by “If you want to use a package that is not in the runtime environment, make a pull request to this repository” on GitHub - drivendataorg/water-supply-forecast-rodeo-runtime: Data and runtime repository for the Water Supply Forecast Rodeo competition on DrivenData).

Responding to a few different items in one place for transparency for everyone who is interested in using R.


Another R package useful for data download is dataRetrieval for download of USGS streamflow data.

Thanks! I have sent the approval request for the two packages.

@tabumis (and CC @riverbend since you brought it up in thread)—we received your request for the R packages snotelr and USGS’s dataRetrieval to be available. This is under review.

However, I do want to point out that in the code execution environment, we are rehosting predownloaded data files from both SNOTEL and USGS for locations associated with the forecast sites for the test years. These are just CSV files, and you can read these files however you want, such as using R.

For training, you can download data for the training years using either Python or R, without the necessary dependencies needing to be included in the runtime environment. Your training environment is separate from the code execution runtime; the code execution runtime is for you to submit a trained model for performing inference.

If you are planning to download data for additional SNOTEL or USGS stations during the code execution run at inference time, then you are permitted to make network calls to the NRCS and USGS web service APIs. We will review the snotelr and dataRetrieval packages for possible inclusion in the runtime environment, but you are also free to use generic HTTP request libraries to download that data.


@riverbend

What version of R will we have access to in the runtime environment?

We will choose a relatively up-to-date version of R (4.0.0+) that is likely to be compatible with packages. This may be the current version of R (4.3.2). If you have known constraints (e.g., a package you need to use has particular requirements), please let us know.

How do we find out which R packages are available via conda

You will need to determine whether conda-forge has a particular package available. The convention is typically that this is named "r-<packagename>", e.g., see r-dplyr. A Google search like "conda-forge r dplyr" typically will turn up the relevant package.

it would be easiest for me if I could just provide a list of the requested packages in a forum post such as this

While we prefer pull requests to the runtime repository, providing a list in a GitHub issue will also be accepted.

I do not have any Python experience; is there a way to run this completely in RStudio, the environment I typically use for my analyses? Alternatively, I do have team members who have Python experience, so we could probably figure out how to run R from within Python if that is allowed. From my brief research on this topic, it looks like the Python rpy2 package is the best way to run R within Python.

In the context of the code execution runtime, our recommendation for the simplest approach would be to conceptually follow the example that I previously posted (here). In this approach, you do whatever you like entirely in R (and you can use RStudio as your editor for writing and testing). Then, you would submit that R code along with a Python script that calls your R scripts as if from the command line, using Python’s subprocess.run function to invoke shell commands.

Using a framework like rpy2 is likely more difficult if you are not proficient in Python. With rpy2, one actually writes Python code against rpy2 APIs, and rpy2 turns that into R code that it runs with the R program under the hood.

Dear @jayqi , thank you for your feedback and clarifications. I will try to run my code from Python, following the example you posted, maybe with some external assistance. I would need however some more clarifications:

You will need to determine whether conda-forge has a particular package available. The convention is typically that this is named "r-<packagename>" , e.g., see r-dplyr. A Google search like "conda-forge r dplyr" typically will turn up the relevant package.

Does this imply that running R from Python requires the R packages used in the script to have Python equivalents? If yes, then it complicates the situation further, as there is no guarantee that the libraries I use could be called from Python.

I would like to ask the organizers to reconsider this requirement. It would be a pity if, because of this requirement, we R users are unable to submit our solutions. I believe the primary focus of this competition is to determine optimal solutions to the water availability problem, rather than to assess participants’ mastery of Python.

Hi @tabumis,

Calling R scripts from Python using a subprocess is a very lightweight requirement. The Python required is minimal, and the example I provided earlier in the thread is likely all that you need. To reiterate, this challenge is a primarily Python challenge and we have additionally allowed for use of R code in order to support participants who would like to use R for their solution.

Aside from the Python code I wrote in my example, there is zero impact or interaction needed between Python and the R code in your solution. Conda is a language-agnostic package manager that allows you to install and manage packages from many languages, including R. These packages are just regular R packages and have nothing to do with Python.
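To make the conda point concrete, here is a rough sketch of what a conda environment file pulling in R and R packages from conda-forge might look like. The file name, version pins, and package list are hypothetical illustrations, not the actual runtime specification:

```yaml
# Hypothetical environment.yml sketch: conda installs R itself (r-base) and
# regular R packages (r-<packagename>) from conda-forge alongside Python.
# No Python bindings or wrappers are involved in the R packages.
name: example-runtime
channels:
  - conda-forge
dependencies:
  - python=3.10
  - r-base=4.3
  - r-readr
  - r-dplyr
```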

There are already several R packages available in the runtime environment that were requested by @riverbend. You can track the updates in his issue: Please add the following R packages to the runtime · Issue #13 · drivendataorg/water-supply-forecast-rodeo-runtime · GitHub

Hi @jayqi ,
Thank you for this information and for the example you posted earlier. A preliminary test using part of my R script suggests that this solution may work. I hope it will go smoothly for the whole script too.

I’d like to ask for some more clarifications:

What is the purpose of predownloading the data for the test years only? The data from SNOTEL and USGS is downloadable up to near-real time, and I (and presumably all participants) download and use the whole training period up to the present time, excluding the test years, for training the models. Or does your comment refer to data to be used during the Forecast Stage?

I train the models depending on the data available by each issue date, because we don’t know the data latency, and there is a risk that a source/station the models were pretrained on is temporarily unavailable. The training and prediction for test issue dates are therefore implemented simultaneously. But I can save each pretrained model for your reference. Does this align with the requirements?

Hi @tabumis,

The code execution runtime is intended for participants to demonstrate working inference code when evaluated on the test set years. In general, we do not expect training inside the code execution runtime and therefore do not provide any training data. Participants are responsible for training on their own local hardware.

Based on my understanding of your description of your modeling approach, what you call “training and prediction…simultaneously” seems like it is conceptually your inference process, because it depends on the test set feature data. You should upload in your submission.zip whatever model weights or preprocessed training set feature parameters are needed for your approach to work.

I think it will be more relevant for the Forecast Stage track, where there will be unknowns, e.g., data not updated up to specified lag dates, a station closed, etc.

For example, if the model depends on weather site A and it is somehow down or closed, then the trained model weights will not work properly and will need to be updated/retrained without the feature from site A.

Hi @jayqi, thank you for the clarifications.

We’ve been testing the R solution wrapped into Python using the predict function, and everything seems to be working well. The only issue is that generating predictions for each site_id and issue_date takes a bit too long, around 4–6 seconds per prediction. This would take far longer than the 2-hour time limit.

The bottleneck is caused by solution.py, which calls our R code for each site_id and issue_date. Every time it does, the R libraries, preprocessed training set, and parameter files get loaded again, eating up most of those 4–6 seconds. When we run just the R code on its own (without embedding it into the predict function), things move much faster: loading the necessary R resources happens only once, and a prediction for a single site_id and issue_date takes less than 0.8 seconds.

One possible solution could be supplying the site_id and issue_date arguments to the predict function as vectors or lists containing all sites and issue dates. Is there a possibility for such an amendment on your end? We would amend our R code accordingly, so it generates predictions for all sites and dates in a single call.

If this is not possible, we’d appreciate any other suggestions you might have.

Hi @rasyidstat,

Exactly. I assume some pretrained models will not work at all, if one of the predictor features is absent. Such uncertainty in data availability by issue date is what we have previously been recommended to take into account.

Hi @tabumis,

I encourage you to also make use of the optional preprocess function. The purpose of having this function, which runs before the predict loop, is to allow you to do data processing upfront more efficiently. You can move redundant computations or use of heavier-duty R libraries into this step, and reduce what you need to do in the predict loop. See also this recent related thread.
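To sketch the idea in code: run the expensive R work once up front, write all predictions to a single file, and make each predict call a cheap in-memory lookup. This is a minimal illustration only; the preprocess signature, the file name all_predictions.csv, and its column layout are assumptions for this example, not the competition’s exact specification.

```python
# Sketch: do the heavy R work ONCE in preprocess(), then make predict()
# a dictionary lookup so no new R session starts per call.
import csv
from pathlib import Path
from typing import Any, Hashable


def preprocess(src_dir: Path, data_dir: Path, preprocessed_dir: Path) -> dict[Hashable, Any]:
    # In a real solution, an R script would be invoked here once, e.g.:
    #   subprocess.run(("Rscript", "predict_all.R"), check=True)
    # and it would write preprocessed_dir / "all_predictions.csv" with
    # hypothetical rows like: site_id,issue_date,p10,p50,p90
    predictions = {}
    with open(preprocessed_dir / "all_predictions.csv", newline="") as f:
        for site_id, issue_date, *values in csv.reader(f):
            predictions[(site_id, issue_date)] = tuple(float(v) for v in values)
    return {"predictions": predictions}


def predict(
    site_id: str,
    issue_date: str,
    assets: dict[Hashable, Any],
    src_dir: Path,
    data_dir: Path,
    preprocessed_dir: Path,
) -> tuple[float, float, float]:
    # Pure in-memory lookup: no R startup cost inside the predict loop.
    return assets["predictions"][(site_id, issue_date)]
```

The per-call cost here is a dictionary lookup, so the R session and library-loading overhead is paid exactly once regardless of how many (site_id, issue_date) pairs the predict loop covers.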

Hi @jayqi ,

Yes, we tried it in different implementations. If we move the data preprocessing/loading operations into the preprocess function, we need to create a separate .R script. The solution.py first runs the preprocess function with an embedded preprocess.R script, and then the predict function with the model.R script. However, any objects created in the preprocess.R environment are not transferable to model.R. Similarly, the libraries loaded in each script are not transferable: every time solution.py runs any of the R scripts, it starts a new R session.

But aside from that, let’s assume we use only the predict function and embed a model.R script there. Our script loads the data/model, makes a prediction, and saves it. One would need at least three libraries for that task, something like ‘readr’ to read the csv file, ‘dplyr’ to make some data transformations, and maybe ‘caret’ to run the prediction. You could probably test how long it takes to run, with solution.py, an R script that consists of just three lines:

library(readr)
library(dplyr)
library(caret)

In our case it took more than 1 second just to load the libraries, and this does not yet include loading and transforming the files needed for prediction. This has to be multiplied by 7250, since with every new site_id and issue_date argument the predict function starts a new R session. Our submission consequently failed during execution because it exceeded even the extended time limit.

Can you please suggest any other solution that could help to overcome this issue?

Hi @tabumis,

In general, my suggestion is to continue moving work from model.R to preprocess.R so that you don’t need to load as many libraries in model.R. Per this recent announcement, we’ve increased the time limit from 2 hours to 4 hours (14,400 seconds), which should help you out.

You may also be interested in the discussion in this thread which also has discussion about using preprocess and predict in a way to fit in the time limit.

Hi @tabumis,

I’m not at all familiar with R, but you could try spawning the R interpreter using subprocess.Popen() in Python instead of os.system() (or however you’re currently forking off a new process). With Popen(..., stdin=PIPE, stdout=PIPE), the process is created once and its STDIN and STDOUT are connected, essentially like a standard *NIX shell pipe, to the Python object returned from Popen().

If you store the object in the assets returned from preprocess(), you could just reference it when predict() is called. Not exactly an elegant solution, but it could work with some fiddling about.
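A rough sketch of that persistent-worker pattern, under stated assumptions: the worker below is a Python one-liner standing in for a long-running Rscript process (which would load its R libraries once at startup), and the comma/colon request-and-reply line format is made up for illustration.

```python
# Sketch: keep ONE worker process alive across predict() calls and talk to
# it over stdin/stdout pipes, instead of spawning a new process per call.
import subprocess
import sys

# Stand-in for a long-running R worker: reads request lines, prints replies.
WORKER_CODE = r"""
import sys
for line in sys.stdin:  # block until the parent sends a request line
    site_id, issue_date = line.strip().split(",")
    # A real R worker would compute predictions here; we echo fixed values.
    print(f"{site_id}:{issue_date}:100.5,110.9,120.4", flush=True)
"""


def start_worker() -> subprocess.Popen:
    # Started once (e.g., in preprocess()) and stored in the returned assets.
    return subprocess.Popen(
        [sys.executable, "-c", WORKER_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,  # line-buffered pipes
    )


def ask(worker: subprocess.Popen, site_id: str, issue_date: str) -> tuple[float, ...]:
    # One request line out, one reply line back; no new process is spawned.
    worker.stdin.write(f"{site_id},{issue_date}\n")
    worker.stdin.flush()
    reply = worker.stdout.readline().strip()
    return tuple(float(v) for v in reply.split(":")[2].split(","))
```

With Rscript substituted for the Python stand-in, the R session and its library loads would persist across every ask() call, which is exactly what avoids the per-prediction startup cost discussed above.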

I think your best bet would be to submit any solution that succeeds at all (just return all 0s even) so you can have more time to work on it, and before the Forecast deadline, modify your R code to basically do a select() or poll() style thing on a file in the preprocessed directory. I don’t know if a FIFO or pipe or something is feasible in the execution environment, but that’d work too. It would allow you to communicate between the processes without much issue, and wouldn’t require spawning a new R process every predict() call. The global interpreter lock in Python is a real hassle sometimes.

That’s all I can think of off-hand, personally. Good luck!

Hi @jayqi and @mmiron,

Many thanks for your suggestions. I think we found a way to overcome this issue by changing our workflow.

Hi @jayqi,

Will it be possible to submit csv files, with our solution, that contain historical observations from the approved sources? This would include:

  • historical observations from the selected SNOTEL stations starting from 1989
  • historical observations for daily streamflow for selected NRCS/USGS stations
  • some of the files from Login

Hi @tabumis,

Please be sure to review “Can I upload trained model weights and/or precomputed features for the code execution run?” in the FAQ.

My understanding of what you are describing sounds like it would broadly fall under

:white_check_mark: Should upload: Feature parameters computed on the training set (e.g., mean value of some variable over the water years of the training set).

and therefore would be allowed.