Submission fails, but no explicit reason is provided

Our submission works fine when run on our local hardware using the provided runtime, but fails without an explicit reason on the DD site with:

Your submission did not output the expected file so it could not be scored. This may be due to an unhandled exception or syntax error in your code. The log output may have more details.

However, there is no sign of any exception occurring in the logs.

We have a lot of preprocessing and need to create a temporary directory with preprocessed images. Could the problem be caused by the processing taking too long or by a disk space shortage? If the cause were one of these two, would we get an explicit message about exceeding a particular submission resource?

Knowing whether the error message we got covers “exceeding system resources” errors would help us to identify the source of the problem.


Are you referring to the submission on 2022-01-03? I’ve looked at those logs, and I agree there’s nothing in there to indicate why it failed. Do you have an estimate for how much disk space your submission uses? It might help to run shutil.disk_usage to log the available disk space remaining at various points while your submission is running.

import shutil
shutil.disk_usage("/")
# usage(total=..., used=..., free=...)
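To make the suggestion above concrete, here is a minimal sketch of logging free disk space at a few named checkpoints. The helper names `free_disk_gib` and `log_disk` and the checkpoint labels are hypothetical, chosen only for illustration; the only library call used is `shutil.disk_usage`.

```python
import shutil


def free_disk_gib(path="/"):
    """Return the free space on the filesystem containing `path`, in GiB."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / 1024**3


def log_disk(checkpoint, path="/"):
    """Print one compact line per checkpoint so the log stays short."""
    line = f"[disk] {checkpoint}: {free_disk_gib(path):.1f} GiB free"
    print(line, flush=True)
    return line


# Hypothetical checkpoint label; call again after preprocessing, inference, etc.
log_disk("start")
```

Printing a single line per stage, rather than per image, keeps this well under the log line limit.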

Are you referring to the submission on 2022-01-03?

All of our submissions, including this one (except the one on 2022-01-10, which we ran as a debug run and expected to fail).

Do you have an estimate for how much disk space your submission uses?

We have to convert the whole dataset into a specific format, save it, run inference, and then convert the output back, so it is quite a lot of memory. What is the memory limit? (I can’t find it in the “Code submission format” section of the competition’s page.)

It might help to run shutil.disk_usage to log the available disk space remaining at various points while your submission is running.

The log file is truncated after 2k lines, which makes it hard to track the progress (time/memory usage) in detail throughout the execution.

Do I understand correctly that the message:

Your submission did not output the expected file so it could not be scored. This may be due to an unhandled exception or syntax error in your code. The log output may have more details.

may be (but is not necessarily) caused by exceeding the resource (time/mem) limits? Or, if this occurred, would I get a more specific message?

Thank you for your response

Are we allowed to write additional band files to the file system? I tried to pull additional bands for chips and save them in the folder, but got a permission error:

OSError: [Errno 30] Read-only file system: '/codeexecution/data/test_features/aaaa/AOT.tif

Hi, I had the exact same issue with my submission of ‘2022-01-10 00:28:11 UTC’.

The execution failed with empty logs. What could be the reasons?

Thanks.

Hi, mziaja. I have been getting the same error since 2022-01-08. I uploaded three submissions, and all of them failed without an explicit reason. They work well locally, and I am still trying to find out why they failed. I also think this may be an “exceeding system resources” error, so I tried:

  • change batch size from 4 to 2
  • change dataloader’s num_worker from 6 to 2

But they still failed. My submission uses only 6 GB of GPU memory and does not save any postprocessing files, so it should not exceed the GPU memory or disk limits.

Hi, I just found that my code may be exceeding CPU memory. I have fixed it and am preparing to try again.

@haijunli The error:

OSError: [Errno 30] Read-only file system: '/codeexecution/data/test_features/aaaa/AOT.tif

is because submissions are not permitted to write to the test features. Instead, write your temporary files to a new folder, e.g., /codeexecution/tmp.
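As a rough sketch of that advice, the helper below builds output paths under a scratch folder instead of under the read-only test features. It assumes /codeexecution/tmp is writable in the runtime, as suggested above; the function name `scratch_path` and the one-folder-per-chip layout are hypothetical.

```python
from pathlib import Path

# Assumption: this folder is writable inside the runtime, per the reply above.
SCRATCH_DIR = Path("/codeexecution/tmp")


def scratch_path(chip_id, band, root=SCRATCH_DIR):
    """Return a path for an extra band file, creating the chip folder if needed."""
    out_dir = Path(root) / chip_id
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f"{band}.tif"
```

For example, `scratch_path("aaaa", "AOT")` would give /codeexecution/tmp/aaaa/AOT.tif, leaving /codeexecution/data/test_features untouched.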

@mziaja

may be (but is not necessarily) caused by exceeding the resource (time/mem) limits? Or, if this occurred, would I get a more specific message?

That’s right; it could be caused by any number of failures, including running out of memory or out of disk. At least one of your submissions exceeded the 4 hour time limit. We are working on making the failure messages more informative, but in the meantime, I’d suggest finding a way to suppress some of your logs so you don’t exceed the log line limit. Hope this helps!
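One possible way to cut down log volume while still tracking time is to print a progress line only every N items. The sketch below is an assumption about how that could look, not part of the competition runtime: the class name `ProgressLogger`, the `every` interval, and the message format are all hypothetical, and the default budget simply mirrors the 4 hour limit mentioned above.

```python
import time


class ProgressLogger:
    """Print one line every `every` items (and at the end), with elapsed time."""

    def __init__(self, total, every=500, budget_seconds=4 * 60 * 60):
        self.total = total
        self.every = every
        self.budget_seconds = budget_seconds  # assumed 4 hour time limit
        self.start = time.monotonic()
        self.lines_printed = 0

    def update(self, done):
        # Skip everything except every N-th item and the final item.
        if done % self.every != 0 and done != self.total:
            return
        elapsed = time.monotonic() - self.start
        remaining = self.budget_seconds - elapsed
        print(
            f"[progress] {done}/{self.total} chips, "
            f"{elapsed / 60:.1f} min elapsed, {remaining / 60:.1f} min left",
            flush=True,
        )
        self.lines_printed += 1
```

Calling `update(i)` once per processed chip with `every=500` produces roughly one line per 500 chips, which makes it easier to stay under the 2k-line truncation while still seeing how close the run is to the time limit.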

Hi @rbgb

What is the disk storage limit? If we pull additional bands and save them to the temporary directory, could we read those files in every subsequent submission?

Or should we just pull additional bands and process them in memory in every submission and execution?

Thanks!

After starting up, it looks like the runtime has around 60GB of unused disk available. You can always use shutil.disk_usage to monitor that. The disk space is temporary – it will be cleared when your submission finishes running, so you’ll need to repeat your processing for every submission.

FYI, we managed to shorten the execution time and this solved the issue. Thanks for the support.
