Hello,

First I would like to thank the challenge organizers for their hard work, and also for this very interesting and original challenge. I enjoyed it a lot, and also learned a lot.

As suggested by @ORLab, I will post here our ML approach.

- Approximate Dynamic Programming

We first started to implement an approach based on Approximate Dynamic Programming, due to the structure of the problem and the possibility to solve it recursively. This approach was very close to a Deep Q-Network, except that models were learned sequentially starting from the last time step of each simulation.

We did not have so much time to benchmark this method, but it seemed to be limited in the following ways:

- Q-networks cannot deal with a continuous action space, which forced us to discretize the space of proposed battery charges at each time step
- If the discretization of the action space is very fine, we tend to overestimate the Q-values due to the max operation (though double Q-networks can help with this)
- Since models are learned sequentially and call each other, model errors tend to propagate (this can be mitigated by calling them as rarely as possible, as seemed to be suggested by @ironbar, who apparently followed this approach)
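To illustrate the second point, here is a small demo (my own, not from our solution) of the overestimation bias of the max operation: even when every discretized action has the same true value, the max over noisy Q-value estimates grows with the number of actions.

```python
import numpy as np

# Illustrative demo: every action has the same true Q-value (0.0), but the
# estimates are noisy. Taking the max over more discretized actions inflates
# the estimated value more and more.
rng = np.random.default_rng(0)
noise_std = 1.0
n_trials = 10_000

for n_actions in (4, 16, 64, 256):
    estimates = rng.normal(0.0, noise_std, size=(n_trials, n_actions))
    bias = estimates.max(axis=1).mean()  # true value is 0, so this is the bias
    print(f"{n_actions:4d} actions -> overestimation bias ~ {bias:.2f}")
```

The bias increases steadily with the fineness of the discretization, which is exactly the regime we were worried about.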

- Policy network

While implementing this first approach, I actually came to see an alternative that seemed even better: I realized that all operations implemented in the simulate_timestep() function of simulate.py (the function which updates the battery charge and computes the grid energy and the money saved) could be implemented directly in a DL framework such as Keras.

This may seem anecdotal, but it is actually a strong specificity rarely found in usual problems (most RL policy implementations rely on training techniques such as REINFORCE, see http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/pg.pdf, since they cannot directly model the reward that would correspond to a given action at a given point).

So I ended up directly coding a policy network, which took the same inputs as the standard battery controller (in fact, the policy network should be understood as the battery controller) and output the proposed battery charge.

On top of this policy network, further layers (without any parameters to learn) performed the "simulation work". These layers took as inputs:

- the current charge of the battery coming from the previous time step
- the proposed charge coming from the policy network at the current time step
- the actual values of load, pv, and prices for the current time step

Their outputs were:

- the money saved on the current time step ( = money_spent - money_spent_without_battery)
- an updated current battery charge ready for the next time step

Finally, the loss of this global architecture was set to the competition score: money_saved / abs(money_spent_without_battery) (the denominator had no influence on the optimal policy for a given simulation, but it made learning prioritize the simulations where this factor was small). In this way, Keras's optimizer was directly doing the job of minimizing the simulation score.
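As a rough illustration of what these parameter-free simulation layers compute (hypothetical function and variable names, simplified battery dynamics; the real ops live in simulate_timestep() and the Keras graph, here written with numpy only to show the arithmetic):

```python
import numpy as np

# Minimal sketch of one simulation step. In the actual model, these operations
# are differentiable Keras/TensorFlow ops, so the optimizer can backpropagate
# the competition score straight into the policy network.
def simulate_step(charge_prev, proposed_charge, load, pv, price_buy, price_sell,
                  capacity=1.0):
    # keep the charge inside the allowed range [0, 1]
    charge = np.clip(proposed_charge, 0.0, 1.0)
    battery_energy = (charge - charge_prev) * capacity
    # energy drawn from the grid = consumption - solar production + battery charging
    grid = load - pv + battery_energy
    grid_no_batt = load - pv                      # baseline without battery
    cost = np.where(grid > 0, grid * price_buy, grid * price_sell)
    cost_no_batt = np.where(grid_no_batt > 0, grid_no_batt * price_buy,
                            grid_no_batt * price_sell)
    money_saved = cost - cost_no_batt             # negative = money saved
    return charge, money_saved
```

For example, discharging a full battery during an expensive period yields a negative money_saved, which is exactly what the loss pushes the policy network toward.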

The most important trick to make this approach work was related to the policy network output. Since the battery charge was bounded in its allowed range, any exploration of the policy network outside of this allowed range would lead to zero gradients. This was analogous to the "dying ReLU" issue, but even more problematic since it would affect the whole network.

To solve this, I ended up giving the policy network 6 outputs:

- 3 sigmoid outputs giving the proposed position of the charge w.r.t. different ranges (one for the allowed range [0,1] coming from the battery capacity, one for the allowed range coming from the current charge and the power limits, and one for the combination of the two previous ranges)
- 3 softmax outputs to weight the proposed charge resulting from each of the 3 sigmoid outputs
- This ensured that gradients never completely disappeared, and it also gave the system many degrees of freedom to avoid getting stuck in local minima
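One possible reading of this output scheme is sketched below (the concrete ranges, the interpretation of the third range as an intersection, and the final clip are my own assumptions; the post above does not fully specify how the three proposals are combined). Each sigmoid places a charge inside one bounded range, and the softmax weights blend the three proposals, so the result is always finite and the gradients of the bounded sigmoids never all vanish at once.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical combination of the 6 policy-network outputs: 3 sigmoid logits
# propose a charge as a position inside one range each, and 3 softmax logits
# weight the three proposals.
def combine_outputs(sigmoid_logits, softmax_logits,
                    cap_range=(0.0, 1.0), power_range=(0.2, 0.8)):
    # third range: intersection of the capacity and power-limit ranges
    inter = (max(cap_range[0], power_range[0]), min(cap_range[1], power_range[1]))
    positions = sigmoid(np.asarray(sigmoid_logits, dtype=float))
    proposals = np.array([
        cap_range[0] + positions[0] * (cap_range[1] - cap_range[0]),
        power_range[0] + positions[1] * (power_range[1] - power_range[0]),
        inter[0] + positions[2] * (inter[1] - inter[0]),
    ])
    weights = softmax(np.asarray(softmax_logits, dtype=float))
    proposed_charge = float(weights @ proposals)
    # final clip to the tightest range, so the result is always feasible
    return min(max(proposed_charge, inter[0]), inter[1])
```

Even when one sigmoid saturates, the softmax can shift weight to the other proposals, which is one way to see why gradients do not die globally.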

Here's a list of other choices:

- Due to the strong constraints on computing time, we only applied the policy network every 4 time steps (so every hour), and the policy network gave its 6 outputs for each of the 4 following time steps
- We used a simple data augmentation scheme, consisting of randomly sampling time periods for the load and price series, and randomly shifting the pv series by a number of days between -14 and +14
- In our final training, we merged "train/" and "submit/" data
- We used simple Dense layers for the policy network, since convolutional layers would have been too computationally expensive
- We restricted the depth of the policy network, since Batch Normalization layers led to unexpected behaviour in this context where some layers were shared between time steps
- The total number of time steps learned at the same time in the policy network was a tradeoff between the increase in training time and the potential benefit of a longer policy horizon (it was set to 6 days in our final results, but I'm not sure whether the gain was significant compared to shorter horizons)
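The data augmentation above can be sketched as follows (hypothetical array layout and function names; I assume here one value per 15-minute step, i.e. 96 steps per day, consistent with applying the policy every 4 time steps = every hour):

```python
import numpy as np

STEPS_PER_DAY = 96  # assumed: 15-minute time steps

# Illustrative sketch of the augmentation: sample a random training window
# for the load and price series, and shift the pv series by -14..+14 whole
# days relative to that window.
def augment(load, prices, pv, window_days, rng):
    window = window_days * STEPS_PER_DAY
    start = rng.integers(0, len(load) - window)
    shift_days = rng.integers(-14, 15)                 # inclusive of +14
    pv_start = int(np.clip(start + shift_days * STEPS_PER_DAY,
                           0, len(pv) - window))       # stay inside the series
    return (load[start:start + window],
            prices[start:start + window],
            pv[pv_start:pv_start + window])
```

The random pv shift decorrelates solar production from the sampled load/price window while keeping its daily periodicity intact.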

We obtained an average score of -0.1985 on submit data. We could not train as long as we wanted due to the competition deadline, but I am not really sure whether further optimization would have entered the overfitting regime, or would have led to a further decrease of the test score.

I've put the code on GitHub at this address if you're interested: https://github.com/alabatie/optim-pv-battery

Best,

Antoine