I have some jobs that should take no more than 10 minutes but are
taking hours to complete. Looking at their logs, I see several entries
like the ones below:
001 (10945.000.000) 09/29 16:44:14 Job executing on host: <172.24.89.68:46508>
...
004 (10945.000.000) 09/29 16:45:48 Job was evicted.
    (0) Job was not checkpointed.
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
    2367357 - Run Bytes Sent By Job
    10400133 - Run Bytes Received By Job
[and so on]
The job is repeatedly evicted and restarted on another node until it
finally completes.
Is there a way to investigate this behaviour in more detail? Is there
anything I can do to avoid it, or at least minimise it?
What makes it stranger is that I submit several other jobs at the same
time, and almost all of them finish within a couple of minutes, as
expected.
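
One way to start looking deeper is to tally, for each eviction, which
host the job was on and how long it had been running. Below is a rough
sketch in Python, assuming the log format shown above, with event code
001 marking the start of execution and 004 marking an eviction. The
function name eviction_runs and the log file name are hypothetical, and
the log carries no year, so a fixed one is assumed when parsing
timestamps.

```python
import re
from collections import Counter
from datetime import datetime

# Event header, e.g. "004 (10945.000.000) 09/29 16:45:48 Job was evicted."
HEADER = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)
# Host address as it appears in the log, e.g. "<172.24.89.68:46508>"
HOST = re.compile(r"<([^>]+)>")


def parse_time(stamp):
    # Assumption: the log has no year, so a leap year is supplied
    # to make sure a 02/29 timestamp still parses.
    return datetime.strptime("2000/" + stamp, "%Y/%m/%d %H:%M:%S")


def eviction_runs(log_text):
    """Return a list of (job_id, host, seconds_run_before_eviction)."""
    running = {}  # job_id -> (host, start time) of the current execution
    runs = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        m = HEADER.match(line.strip())
        if not m:
            continue  # detail line or "..." separator
        code, job_id, stamp, rest = m.groups()
        when = parse_time(stamp)
        if code == "001":  # job started executing
            # The host may sit on the header line or be wrapped onto the next.
            nxt = lines[i + 1] if i + 1 < len(lines) else ""
            h = HOST.search(rest) or HOST.search(nxt)
            running[job_id] = (h.group(1) if h else "unknown", when)
        elif code == "004" and job_id in running:  # job was evicted
            host, started = running.pop(job_id)
            seconds = int((when - started).total_seconds())
            runs.append((job_id, host, seconds))
    return runs


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python evictions.py job.log
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as fh:
            runs = eviction_runs(fh.read())
        for job_id, host, seconds in runs:
            print(f"{job_id} evicted from {host} after {seconds}s")
        print(Counter(host for _, host, _ in runs).most_common())
```

If most evictions come from a handful of hosts, or always happen after
roughly the same run time, that narrows down where to look next.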