“A simple approach may ignore the time dimension and only consider the ion abundances as a function of temperature. However, there may be nuances to how the sample was heated over time that provides additional information”

I think we should add time information. Time and temperature are highly correlated, but we cannot drop time: the ranges and slopes of the time-temperature curves differ from sample to sample. When we discretize the overall temperature range into bins (of 100 degrees), the same interval does not necessarily map to the same bin in every sample:

(-100, 0] can be the first bin for “S0000”.
(-100, 0] may not be the same bin for “S0001”; it may correspond to a different bin there (e.g. (-100, -50]).
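
To make the bin mismatch concrete, here is a small sketch. The temperature ranges used for the two samples are made up for illustration (they are not read from the actual files), and the function names are hypothetical. It compares bin edges derived from each sample's own range with edges fixed at multiples of 100 degrees.

```python
import math

BIN_WIDTH = 100  # degrees, the bin width mentioned above


def per_sample_edges(temps, n_bins):
    """Equal-width bin edges derived from one sample's own min and max."""
    lo, hi = min(temps), max(temps)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(n_bins + 1)]


def shared_bin(temp, width=BIN_WIDTH):
    """Return the (low, high] bin containing temp, with edges fixed at multiples of width."""
    high = math.ceil(temp / width) * width
    return (high - width, high)


# Hypothetical ranges: one sample heated to 900 degrees, the other only to 400.
edges_a = per_sample_edges([-100, 900], n_bins=10)  # first bin is (-100, 0]
edges_b = per_sample_edges([-100, 400], n_bins=10)  # first bin is (-100, -50]

# With shared edges, the same temperature always lands in the same bin.
same_bin_everywhere = shared_bin(-30)  # (-100, 0) for every sample
```

With per-sample edges, the first bin of the second sample covers only half the interval of the first sample's first bin, which is the mismatch described above. Shared edges avoid it, but they still ignore how fast each sample was heated.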

We need to add time information. We should combine the time and temperature columns into a single column, and then discretize the overall temperature range into bins.
What do you think, Jayqi?

When I drew this graph (abundance vs. m/z), I understood how the algorithm works and why we discretize the overall temperature range into bins. The graph represents the last moment of the accumulation process, and it may be made up of a varying number of chemical components. For example, carbon dioxide and propane have the same m/z value (44), so both accumulate at m/z 44. But the abundance at m/z 44 can come from any presence/absence combination of carbon dioxide and propane (00, 01, 10, 11), and 44 is not the only accumulation point for these two compounds.
So we can separate the compounds by using their other accumulation points. If you look at the link, you can easily understand this. We need to find out which components a sample contains (by the way, we know the compounds for train_files S0755 because it has a train label).
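
A rough sketch of that idea, under loud assumptions: the reference patterns below are invented numbers, not real mass spectra, and the names `MZ`, `PATTERNS` and `estimate_contributions` are hypothetical. The only point is that two compounds sharing m/z 44 can still be told apart when other m/z channels are included. Plain least squares is used here; it is one possible technique, not necessarily the one meant above.

```python
import numpy as np

# HYPOTHETICAL relative abundances at three m/z channels (illustrative only).
MZ = [28, 29, 44]
PATTERNS = {
    "carbon_dioxide": [0.1, 0.0, 1.0],
    "propane": [0.6, 1.0, 0.3],
}


def estimate_contributions(observed):
    """Least-squares estimate of how much each pattern contributes to the observed abundances."""
    # Columns of A are the reference patterns, rows are the m/z channels.
    A = np.array(list(PATTERNS.values()), dtype=float).T
    coef, *_ = np.linalg.lstsq(A, np.asarray(observed, dtype=float), rcond=None)
    return dict(zip(PATTERNS, coef))


# A mixture built as 2 x carbon dioxide + 1 x propane from the made-up patterns.
observed = [0.8, 1.0, 2.3]
contributions = estimate_contributions(observed)
```

On noisy data, plain least squares can return negative contributions, so a non-negative variant would probably be a better fit. In the competition itself the reference patterns are not given, so a model would have to learn them from the labeled training samples.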

By the way, we don’t need to learn chemistry or the m/z values of every compound. Using these graphs, we can do this automatically.

Friends, if you have suggestions, I would be happy to read them.
If you think any part is wrong, please reply.

As a competition administrator, I cannot contribute to discussions about modeling strategy. My role is to answer questions about the rules of the competition and to provide clarifications about the data or task.

However, you are welcome to continue sharing your thoughts here and other participants may be interested and join in. (Just don’t expect me to provide my opinions.)

I do want to note that if you’re going to add new posts regarding the same subject, please add them as replies to an existing topic rather than creating new topics entirely, so that we can keep the community forum more tidy. I’ve merged your last three other topics into this one.

For relevant domain knowledge on interpreting the data, I encourage you to read the “Understanding EGA-MS data” section on the problem description page.

Of course we can discuss. I only used PCA on time and temp, and some of the graphs came out reversed. But I think time_temp_bin is not correct. You can use linear regression on time and temp instead: find a and b, then use new_temp.
y = a x + b
x = time (input)
y = new_temp (output)

In this way, you use temperature values that depend on time.
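
A minimal sketch of this fit, assuming NumPy is available and that each sample provides `time` and `temp` arrays of equal length. The function name `fit_new_temp` and the example values are made up for illustration.

```python
import numpy as np


def fit_new_temp(time, temp):
    """Fit temp = a * time + b and return a, b and the fitted new_temp values."""
    time = np.asarray(time, dtype=float)
    temp = np.asarray(temp, dtype=float)
    a, b = np.polyfit(time, temp, deg=1)  # slope first, then intercept
    new_temp = a * time + b
    return a, b, new_temp


# Made-up ramp: 5 degrees per time unit, starting at 30 degrees.
a, b, new_temp = fit_new_temp([0, 10, 20, 30], [30, 80, 130, 180])
```

Here `new_temp` is a smoothed temperature that is a function of time only, so it increases steadily even when the measured temperature is noisy.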
Onur Koc

We have 766 training samples. They have slightly different time and temp ranges and slopes (last figure).

The first question that comes to mind:

How do we make time and temperature compatible across all samples?

The abundance values, which change with time and temperature, need to be processed in the same bin ranges for every sample, for each m/z value.
Using linear regression, convert time and temp into new_temp. Then, I think, we scale new_temp to the same range for every sample and divide it into bins. In this way, we ensure time-temperature compatibility across all the data.
By the way, because the SAM testbed data have a nonlinear time-temp function, you should use polynomial regression for them. You can also use polynomial regression for the commercial data.
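
The steps above could be sketched roughly as follows. The function name, the polynomial degree, the number of bins and the choice of scaling each sample to [0, 1] are all assumptions made for illustration, not settings taken from the competition.

```python
import numpy as np


def time_based_temp_bins(time, temp, degree=2, n_bins=10):
    """Fit temp as a polynomial in time, scale the fit to [0, 1], and bin it."""
    time = np.asarray(time, dtype=float)
    temp = np.asarray(temp, dtype=float)

    # Polynomial regression; degree=1 reduces to the linear case.
    coeffs = np.polyfit(time, temp, deg=degree)
    new_temp = np.polyval(coeffs, time)

    # Min-max scaling so every sample ends up on the same range.
    span = new_temp.max() - new_temp.min()
    scaled = (new_temp - new_temp.min()) / span if span > 0 else np.zeros_like(new_temp)

    # Equal-width bins 0 .. n_bins - 1; the maximum value goes into the last bin.
    bins = np.minimum((scaled * n_bins).astype(int), n_bins - 1)
    return new_temp, scaled, bins


# Made-up nonlinear ramp for one sample.
t = np.arange(0, 11)
new_temp, scaled, bins = time_based_temp_bins(t, 2.0 * t**2 + 3.0)
```

Because each sample is scaled by its own minimum and maximum, bin k means "the k-th tenth of this sample's heating range" rather than an absolute temperature interval. Whether that trade-off is acceptable depends on how much the absolute temperature matters for identifying compounds.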

Where the relationship between time and temperature is already linear, do we even need to perform regression between them? Do you think that is necessary? I think we could do it only for the latter case, where the SAM testbed data have a nonlinear time-temp function.