Reuse of train-data-generated features for the test phase

In the phase-one notebook, the features generated on the training data (`sender_currency_freq`, etc.) were used both to fit the RF/XGBoost models and to generate predictions on the test data.

For phase two: is this reuse of train-data-generated features also expected by the evaluators?

Reusing them makes sense, since this is a time-series application, but there also seems to be no facility in the code template for passing train-data-generated features to the test stage.

Hi @kingaj12,

I don’t see what you’re saying reflected in the notebook. The train features and test features are loaded as separate dataframes, `train` and `test`, in cell 4. While it’s true that feature engineering is often applied to both of these dataframes in the same notebook cell, they are never combined into a single dataframe and manipulated together.

If you’re referring to features like `sender_currency_freq`, which have parameters that are fitted to the train data and then applied to the test data, that is fairly normal in machine learning problems. That isn’t really reuse of training features: the features used for prediction on the test split are still test features (`X_test`, which comes from the `test` dataframe).

Regardless, if you need to store any information between the train stage and test stage, you can use the `client_dir` and `server_dir` directories that are provided to the client and strategy factory functions, respectively. This is, in fact, the expected way for you to save a trained model so that you can use that model later in test. This is documented here.

In terms of how to implement the saving and reloading of feature engineering parameters for use during the test stage, that’s up to you. I’m personally a fan of scikit-learn’s Pipeline API for bundling feature engineering together with the model for easy saving, and that is in fact what is done in the example (here). You should do whatever works for you.
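For illustration, here is a sketch of that Pipeline pattern (this is not the actual example code; the column name, toy data, and estimator choice are all made up):

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy train/test data with a hypothetical categorical column.
X_train = pd.DataFrame({"sender_currency": ["AEUR", "AEUR", "BUSD", "CGBP"]})
y_train = [0, 1, 0, 1]
X_test = pd.DataFrame({"sender_currency": ["AEUR", "DJPY"]})

# Feature engineering and model bundled together: the encoder's categories
# are fitted on the train split only, then applied to test at predict time.
pipeline = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# One artifact to save during train and reload during test.
joblib.dump(pipeline, "pipeline.joblib")
preds = joblib.load("pipeline.joblib").predict(X_test)
```

The nice part is that the fitted feature-engineering parameters travel with the model automatically, so there is nothing extra to serialize.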

Hope that helps. Let me know if there’s anything else that we can help clarify.

Hi @jayqi,

The following code fragment computes `sender_currency_freq` inside the loop using `train["sender_currency"]` and then maps this `sender_currency_freq` onto both the `train` and `test` dataframes:

# Sender-Currency Frequency and Average Amount per Sender-Currency
train["sender_currency"] = train["Sender"] + train["InstructedCurrency"]
test["sender_currency"] = test["Sender"] + test["InstructedCurrency"]

sender_currency_freq = {}
sender_currency_avg = {}

for sc in set(list(train["sender_currency"].unique()) + list(test["sender_currency"].unique())):
    sender_currency_freq[sc] = len(train[train["sender_currency"] == sc])
    sender_currency_avg[sc] = train[train["sender_currency"] == sc]["InstructedAmount"].mean()

train["sender_currency_freq"] = train["sender_currency"].map(sender_currency_freq)
test["sender_currency_freq"] = test["sender_currency"].map(sender_currency_freq)

train["sender_currency_amount_average"] = train["sender_currency"].map(sender_currency_avg)
test["sender_currency_amount_average"] = test["sender_currency"].map(sender_currency_avg)

From your response, I gather this is not what was intended!

Hi @kingaj12,

The way it’s implemented is intended, and is a normal way of implementing fitted feature engineering in machine learning.

You can consider the `sender_currency_freq` and `sender_currency_avg` mappings to be fitted on the training data: the values for each sender currency are effectively like model parameters.
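That "fit on train, apply to both" structure can also be written without the per-key loop, which makes it more explicit (a sketch with made-up toy data; note that `groupby` only sees train keys, so unseen test keys need an explicit fill to match the loop's `len(...) == 0` behavior):

```python
import pandas as pd

# Toy frames standing in for the competition data (hypothetical values).
train = pd.DataFrame({"sender_currency": ["AEUR", "AEUR", "BUSD"],
                      "InstructedAmount": [10.0, 30.0, 5.0]})
test = pd.DataFrame({"sender_currency": ["AEUR", "DJPY"]})

# "Fit": compute both mappings from the train split only.
stats = train.groupby("sender_currency")["InstructedAmount"].agg(["size", "mean"])
sender_currency_freq = stats["size"].to_dict()
sender_currency_avg = stats["mean"].to_dict()

# "Transform": apply the fitted mappings to each split. A key seen only in
# test (like "DJPY") gets frequency 0, matching the loop's len(...) == 0 case,
# and NaN for the average, matching the empty .mean().
test["sender_currency_freq"] = (
    test["sender_currency"].map(sender_currency_freq).fillna(0)
)
test["sender_currency_amount_average"] = test["sender_currency"].map(sender_currency_avg)
```

Nothing about the test data influences the fitted values; the test dataframe is only looked up against them.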

It would be bad practice in this situation to calculate new frequencies and averages from the test data, because that does not treat test observations independently: you would get different answers running inference on the whole test set at once versus feeding the test observations through the model pipeline one at a time. It is also a form of data leakage, since you would have fit what is essentially part of your model on the test data you are evaluating on.
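To make that batch-vs-one-at-a-time inconsistency concrete (a toy illustration with made-up values, not the competition code):

```python
from collections import Counter

rows = ["AEUR", "AEUR", "BUSD"]  # hypothetical test-set sender currencies

# Frequencies fitted on the test data itself, whole batch at once:
batch = Counter(rows)
batch_features = [batch[r] for r in rows]            # [2, 2, 1]

# Same rows fed through one at a time (each row is its own "test set"):
single_features = [Counter([r])[r] for r in rows]    # [1, 1, 1]

# The same observation gets a different feature value depending on what else
# happens to be in the batch -- the leakage/inconsistency described above.
assert batch_features != single_features
```

Fitting the mappings on the train split only, as the notebook does, makes the feature value for a given row the same no matter how the test set is batched.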

For reference, here is some discussion about this type of thing:
