IMPORTANT: How to speed up Inference & Queue time

Max_Schaefer · December 31, 2021, 8:09am

Hey fellow participants.

I think I found the reason for the long queue waiting times. I noticed that the code provided in the benchmark notebook doesn't leverage the GPU, but instead lets your model do inference on the CPU. Since most participants probably start with that notebook, I'm guessing that most didn't adjust the code in main.py to let their model do inference on the GPU. I adjusted the code and tested the difference:

CPU: 1h 44 minutes
GPU: 0h 23 minutes

Here is a little hard-coded solution to speed up inference:

main.py

line 79:
change to:
x = batch["chip"].to("cuda")

instead of:
x = batch["chip"]

line 81:
change to:
preds = (preds > 0.5).detach().to("cpu").numpy().astype("uint8")

instead of:
preds = (preds > 0.5).detach().numpy().astype("uint8")

line 117:
change to:
model = CloudModel(bands=bands, hparams={"weights": None, "gpu": True})

instead of:
model = CloudModel(bands=bands, hparams={"weights": None})

Please make sure to adjust your code. It would benefit everybody. It would also help if the blog post with the benchmark notebook were adjusted, so that we don't run into the same long queues again.
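A device-agnostic variant of the changes above might look like the sketch below. It is not the benchmark's main.py: the toy model and the helper names `pick_device` and `predict_batch` are made up for illustration, and the code falls back to the CPU when no GPU is visible, so it also runs on a machine without CUDA.

```python
import torch


def pick_device() -> torch.device:
    # Use the GPU when PyTorch can see one, otherwise stay on the CPU.
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


@torch.no_grad()  # inference only, so no gradients are tracked
def predict_batch(model, chips, device, threshold=0.5):
    model.eval()
    x = chips.to(device)  # move the input to the same device as the model
    preds = model(x)
    # Bring the result back to the CPU before converting to NumPy.
    return (preds > threshold).to("cpu").numpy().astype("uint8")


device = pick_device()

# Stand-in for the real segmentation model (assumption: 4 input bands).
toy_model = torch.nn.Sequential(
    torch.nn.Conv2d(4, 1, kernel_size=1),
    torch.nn.Sigmoid(),
).to(device)

chips = torch.rand(2, 4, 8, 8)  # fake batch of 2 chips, 4 bands, 8x8 pixels
masks = predict_batch(toy_model, chips, device)
print(device, masks.shape, masks.dtype)
```

The point is the same as in the hard-coded version: the model and the input have to sit on the same device, and the predictions have to come back to the CPU before `.numpy()` is called.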

ajijohn · December 31, 2021, 1:15pm

This is great! Thanks for taking the time to write this up.

One follow-up question on the benchmark code side: to speed up training, are there other places that need to be modified to use the GPU?

Thank you

Max_Schaefer · January 1, 2022, 9:37am

Hey @ajijohn.

I just saw that you are struggling with waiting hours per epoch. Your model very likely trains on CPU.

You should add "gpu": True to the hparams when setting up your model:

cloud_model = CloudModel(
    bands=BANDS,
    x_train=train_x,
    y_train=train_y,
    x_val=val_x,
    y_val=val_y,
    hparams={"gpu": True}
)

Set gpus=1 when setting up the pytorch_lightning.Trainer object:

trainer = pl.Trainer(
    gpus=1,
    fast_dev_run=False,
    callbacks=[checkpoint_callback, early_stopping_callback],
)

One epoch in Google Colab takes about 5-7 minutes.

ajijohn · January 4, 2022, 6:42pm

Thx @Max_Schaefer, I have the settings you indicated; the only difference is that the images are on my Google Drive. Will try to debug it.

harish5p · January 4, 2022, 8:49pm

@Max_Schaefer

I got a CUDA out of memory error. Would changing the batch_size help fix this error?
I trained with a batch_size of 8. Would testing with a different batch_size give an error?

Full error:
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 11.17 GiB total capacity; 10.54 GiB already allocated; 99.88 MiB free; 10.67 GiB reserved in total by PyTorch)

ajijohn · January 4, 2022, 10:59pm

@harish5p, I was having the same issue, and on a community member's suggestion I changed the batch size to resolve the OOM error. I think it depends on your backbone too: I reduced my batch_size from 23 to 10 and monitored the usage, and it's hovering around 7GB (well below the 11GB max we get). Also, I'm only using 4 bands now with a resnet34 backbone, so you might want to experiment with your model framework.

I plan to resubmit soon, but it looks like it should be fine to go through inference.

Hope this helps.
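For anyone trying to read the error message quoted above: the numbers can be pulled out with a small script to see how far the request overshoots the free memory. This is only a sketch; `parse_oom` is a hypothetical helper, and the regular expressions assume the exact message wording shown in this thread.

```python
import re

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB "
    "(GPU 0; 11.17 GiB total capacity; 10.54 GiB already allocated; "
    "99.88 MiB free; 10.67 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Return the sizes mentioned in the message, all converted to MiB."""
    requested = re.search(r"Tried to allocate ([\d.]+) (MiB|GiB)", message)
    if requested is None:
        raise ValueError("not a CUDA out-of-memory message in the expected format")
    fields = {"requested": float(requested.group(1)) * UNIT_TO_MIB[requested.group(2)]}
    pattern = r"([\d.]+) (MiB|GiB) (total capacity|already allocated|free|reserved)"
    for value, unit, label in re.findall(pattern, message):
        fields[label] = float(value) * UNIT_TO_MIB[unit]
    return fields


info = parse_oom(MESSAGE)
shortfall = info["requested"] - info["free"]
print(f"requested {info['requested']:.2f} MiB, free {info['free']:.2f} MiB")
print(f"short by {shortfall:.2f} MiB")
```

Here the request (128 MiB) is larger than what is still free (99.88 MiB) because almost the whole card is already allocated, which is why a smaller batch size helps.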

Max_Schaefer · January 5, 2022, 2:19am

Make sure to restart your runtime before trying another batch size. @ajijohn is right: changing your batch size will solve the problem. Deeper backbones often require smaller batch sizes. Using batch_size = 16 with a resnet34 backbone in the Google Colab environment works for me.
PyTorch Lightning provides a function to find the right batch_size for your current environment; see the tutorial for using auto_scale_batch_size. The solution that worked for me:

Add the auto_scale_batch_size=True flag to the trainer:

trainer = pl.Trainer(
    gpus=1,
    fast_dev_run=False,
    callbacks=[checkpoint_callback, early_stopping_callback],
    auto_scale_batch_size=True
)

Use trainer.tune instead of trainer.fit to find the right size:

trainer.tune(cloud_model)

The output should look similar to this:

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Global seed set to 42
Batch size 2 succeeded, trying batch size 4
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Global seed set to 42
Batch size 4 succeeded, trying batch size 8
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
…

Finished batch size finder, will continue with full run using batch size 32
Restoring states from the checkpoint path at …
{'scale_batch_size': 32}

Although batch_size 32 succeeded in my case, I still ran into issues using it. Batch size 16 worked for me. Make sure to remove these flags again and restart the runtime before training again.
There is even an 'auto_lr_find' flag that finds the right learning rate for you. I recommend checking out the PyTorch Lightning tutorials to find more tricks to optimize your pipeline.
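To give an idea of what the batch size finder is doing in the log above, here is a rough pure-Python sketch of a doubling search. It is not Lightning's implementation: `find_batch_size` and `fits_on_gpu` are hypothetical names, and the 0.3 GiB-per-sample figure is an invented number chosen only to make the example concrete.

```python
def find_batch_size(fits, start=2, max_trials=25):
    """Double the batch size until `fits` reports a failure.

    `fits(batch_size)` should return True when one training step succeeds
    and False when it runs out of memory. Returns the largest size that
    succeeded, or None if even the starting size failed.
    """
    size = start
    largest_ok = None
    for _ in range(max_trials):
        if not fits(size):
            break
        largest_ok = size
        size *= 2
    return largest_ok


def fits_on_gpu(batch_size, gib_per_sample=0.3, gib_available=11.0):
    # Made-up memory model: memory grows linearly with the batch size.
    return batch_size * gib_per_sample <= gib_available


best = find_batch_size(fits_on_gpu)
print("largest batch size that fit:", best)
print("with one halving as safety margin:", best // 2)
```

A search like this only proves that a single step fits, so backing off to the next smaller size, as I did with 16, leaves some headroom for the full run.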
