Optimizing demand-side: my approach

Here is my approach.

My solution is based on math and data exploration. In fact, my submission does not use machine learning libraries. Since my solution does not really fit the training data, I hope it would not overfit.

Optimal battery.charge

If the forecast is perfect, we can derive an optimal charge for the battery at each timestep (luckily, 95 forecast timestep are almost always enough).
For each timestep, the optimal charge is roughly the sum of energy which will be needed during the highest price minus the energy which will be collected when PV > Load or during the smallest price.
In this calculus, we should carefully account the battery efficiency and the battery power limit. And also the battery capacity (which may be tricky). This lead to approximately -0.18 on LB I think.
Then, looking to simulations, we can add some optimizations as follow:

Various optimizations

  1. The first optimization is to use the fact that the simulation ends at 10 days. So optimal charge should consider that there will be an unlimited amount of PV after 10 days, and the battery should end up at 0.

  2. Secondly, taking the uncertainty of the next forecasting as follow:

  • a) if the coming PV will be enough to fully charge the battery, it is better to charge it when PV is not too small (which avoid buying from the grid because of the variance in the forecast);
  • b) same if coming Load will fully discharge the battery;
  1. Lastly, given an optimal charge of the battery (according to forecast), it is better to wait the last time to charge, which avoids re-adjusting the charge if forecast change during the night.

Let me illustrate this by a picture.

In the picture below, the light-purple is the next energy need (need = load - PV), in black is the price and in blue our proposed_state. (The yellow is the forecasting error, and dark-purple the energy request to the grid.)

  • At time 55 (before highest price) we buy energy to cover the next consumption period (from 60 to 70); Then discharge it during this period.
  • Between 70 and 110, need < 0, then we bought energy when we are sure it is safe (2.b).
  • The same during the discharging period between 155 and 160, since we have enough energy to cover this period, we discharge our battery a little less than needed (2.a)



Thank you for sharing, Tomas.

It was a surprise to me to see 0.18 score from such approach, because we use exactly the same idea and it gave us ~0.12 :smile:
It seems that the only difference is

carefully account the battery efficiency and the battery power limit

What did you mean by that?

1 Like

Hi all,

Our Team (ORLab) used the approach almost similarly to Thomas:

  1. We did not use any ML tools. We considered the problem as a dynamic optimization problem. The problem at each step was modeled as a linear programming (LP). We selected Ortool to solve LP model optimally because Ortool seemed to be the fastest and easy to install on the docker.

  2. We then tried to adjust the solution obtained from Ortool to take into account the uncertainty of the next forecasting such that its objective value is unchanged. The idea for this part is the same as what Thomas said above.


The results are so tight that I think we all use the same idea. The difference may lie in how we implement it.

If anyone used ML methods (deep learning, for example), please share it here! I really want to know how this kind of problem can be solved with different approaches. Thank you!


First I would like to thank the challenge organizers for their hard work, and also for this very interesting and original challenge. I enjoyed it a lot, and also learned a lot.

As suggested by @ORLab, I will post here our ML approach.

  1. Approximate Dynamic Programming

We first started to implement an approach based on Approximate Dynamic Programming, due to the structure of the problem and the possibility to solve it recursively. This approach was very close to a Deep Q-Network, except that models were learned sequentially starting from the last time step of each simulation.

We did not have so much time to benchmark this method, but it seemed to be limited in the following ways:

  • Q-Networks cannot deal with continuous action space, which forces to discretize the action space of proposed battery charges at each time step
  • If the discretization of the action space is very fine, then we tend to over-estimate the Q-values due to the max operation (though double Q-Networks can help with this)
  • Since models are learned sequentially and calling each other, model errors tend to propagate (this can be mitigated by calling them as rarely as possible, as seemed to be suggested by @ironbar who apparently followed this approach)
  1. Policy network

During the implementation of this 1st approach, I actually came to see an alternative which seemed even more optimal. Indeed I understood that all operations implemented in the simulate_timestep() function of simulate.py (the function which updates the battery charge, computes the grid energy and the money saved) could actually be implemented into a DL framework such as Keras.

This may seem anecdotal, but it was actually a strong specificity that cannot be found in usual problems (indeed most RL policy implementations use techniques such as Reinforce for training http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/pg.pdf, since they cannot directly model the reward that would correspond to a given action at a given point).

So I ended up coding directly a policy network, which took the same inputs as the standard battery controller (in fact, the policy network should be understood as the battery controller), and gave in output the proposed battery charge.

On top of this policy network, further layers (without any parameters to learn) performed the “simulation work”. These layers took as inputs:

  • the current charge of the battery coming from the previous time step
  • the proposed charge coming from the policy network at the current time step
  • the actual values of load, pv, and prices for the current time step

They gave in output:

  • the money saved on the current time step ( = money_spent - money_spent_without_battery)
  • an updated current battery charge ready for the next time step

Finally the loss of this global architecture was set as the competition score: money_saved / abs(money_spent_without_battery) (the denominator had actually no influence on the optimal policy for a given simulation, but it enabled to prioritize learning to save money when the factor was small). In this way, Keras’s optimizer was directly doing the job of minimizing the simulation score.

The most important trick to make this approach work was related to the policy network output. Indeed, since the battery charge was bounded in its allowed range, any exploration of the policy network outside of this allowed range would lead to zero gradients. This was analogous to the “dying ReLU” issue, but even more problematic since it would affect the whole network.

To solve this, I ended up with a solution with 6 outputs of the policy network:

  • 3 sigmoid outputs giving the proposed position of the charge w.r.t. different ranges (1 for the allowed range [0,1] coming from the battery capacity, 1 for the allowed range coming from the current charge and the power limits, and 1 for the combination of the 2 previous ranges)
  • 3 softmax outputs to weight the proposed charge resulting from each of the 3 sigmoid outputs
  • This enabled to make sure that gradients never completely disappear, and it also gave a lot of degrees of freedom for the system not to get stuck in local minima

Here’s a list of other choices:

  • Due to the strong constraints on computing time, we only applied the policy network every 4 time steps (so every hour), and the policy network gave its 6 outputs for each of the 4 following time steps
  • We used a simple Data Augmentation scheme, that consisted in randomly sampling time periods for the load and prices series, and randomly shifting the pv series by a number of days between -14 and +14
  • In our final training, we merged “train/” and “submit/” data
  • We used simple Dense layers for the policy networks, since Convolutional layers would have been too computationally expensive
  • We restricted the depth of the policy network, since Batch Normalization layers lead to unexpected behaviour in this context where some layers were shared between time steps
  • The total number of time steps learned at the same time in the policy network was a tradeoff between the increase in training time, and the potential benefit of having a longer policy horizon (it was set to 6 days in our final results, but I’m not sure whether the gain was significant compared to shorter horizons)

We obtained an average score of -0.1985 on submit data. We could not train as long as we wanted due to the competition deadline, but I am not really sure whether further optimization would have been in the overfitting regime, or would have lead to a further decrease of the test score.

I’ve put the code on GitHub at this address if you’re interested: https://github.com/alabatie/optim-pv-battery



That’s a very interesting aproach. Thanks for sharing.
How many parameters did your model had?

In total we had around 3M parameters.

Given we had a relatively large amount of data, and given the Data Augmentation, this choice seemed reasonable. It also seemed ok on preliminary tests I made by training only on train/ and testing on submit/.

But again, we had not so much time for tests and verifications, so it might be that it lead to overfitting and made the whole approach less competitive.

I wonder which was your best score on the public test set.


Our best score was around -0.1962.

1 Like