Greetings,
Yesterday there was a building-wide power outage, and the HTCondor cluster was eventually rebooted.
Now the same jobs (the same .sub files) that worked yesterday no longer run and stay IDLE.
I used the command
condor_q -better-analyze 363334.0 # where 363334.0 is the Job number
to try to understand, but I can't figure out where the problem really is.
I can see the following:
I don't understand the machine's "own requirements". I also tried the extended command:
condor_q -better-analyze 363334.0 -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
that provides a summary:
I find these 2 statements conflicting in their meaning...
The output for both commands is very long and rather cryptic.
These jobs use "Universe = Docker", and I tested simpler
.sub files that ran OK, so the Docker universe is available.
The two .sub files I submitted this morning as a test are the same as yesterday's.
What could have changed as a result of the reboot? Is there any way to find out?
Thanks
Jean-Yves