It took a while to put together, but I finally finished cleaning up and documenting the code, as well as writing the solution's technical report. It is more or less finished; expect a few minor changes here and there.
The Tesla T4 GPUs that I used for most of my training runs do not support bfloat16 training. If you try to run a training job with this setting, you will get `RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.` To fully utilize this form of mixed-precision training, a GPU with an Ampere or later architecture is needed. Since I have an RTX 3090 in my local PC (which I used to submit all solutions), switching from `16-mixed` to `bf16-mixed` gave me a consistent, although very marginal, bump in LB scores. That said, running inference with `32-true` precision did not improve the LB scores.
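A small guard at the start of the training script avoids the `RuntimeError` above by selecting the precision string based on what the GPU supports. This is a minimal sketch assuming PyTorch Lightning's precision strings; `pick_precision` is a hypothetical helper, not part of the solution code:

```python
def pick_precision(bf16_supported: bool) -> str:
    """Return a Lightning precision string based on hardware capability."""
    # Ampere (compute capability 8.0) and newer support bfloat16;
    # older cards like the Tesla T4 (7.5) must fall back to float16.
    return "bf16-mixed" if bf16_supported else "16-mixed"


# In a real script the flag would come from PyTorch, e.g.:
#   import torch
#   precision = pick_precision(torch.cuda.is_bf16_supported())
#   trainer = pl.Trainer(precision=precision, ...)
print(pick_precision(False))  # T4-style GPU -> 16-mixed
print(pick_precision(True))   # RTX 3090 / Ampere -> bf16-mixed
```

Querying the capability at runtime keeps the same training script usable on both the T4 instances and the local RTX 3090 without manual edits.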