We recently upgraded to Condor 6.6.6, and I think the upgrade coincided
with a change in Condor's behavior for some users.
One user's runs began getting stuck in the queue: each run would be
started and then stopped immediately. A little snooping in the log files
revealed that each run failed with:
8/3 21:43:31 ERROR "Uid not found in passwd file & SOFT_UID_DOMAIN is
False" at line 336 in file starter_common.C
in the StarterLog on the compute nodes. The UID in question is valid
and available via NIS only, not literally in the passwd file (which is
the case with all real user UIDs). NIS is functioning properly on the
compute nodes in question.
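
To double-check the claim above on a compute node, here is a minimal Python sketch that compares a literal scan of the passwd file with a lookup through the system name service (which includes NIS when it is configured). The function names and the default path are my own choices for illustration. I am not assuming this is how the Condor starter does its lookup; it only shows what the operating system itself reports for a given UID.

```python
import pwd


def uid_in_passwd_lines(uid, lines):
    """Return True if any passwd-format line literally lists this numeric UID."""
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and NIS compat entries such as "+::::::".
        if not line or line.startswith(("#", "+", "-")):
            continue
        fields = line.split(":")
        # passwd format: name:password:UID:GID:gecos:home:shell
        if len(fields) >= 3 and fields[2] == str(uid):
            return True
    return False


def uid_resolves(uid):
    """Return True if the system name service (files, NIS, ...) knows the UID."""
    try:
        pwd.getpwuid(uid)
        return True
    except KeyError:
        return False


def describe(uid, passwd_path="/etc/passwd"):
    """Report both views of a UID; passwd_path is an assumed default location."""
    with open(passwd_path) as f:
        local = uid_in_passwd_lines(uid, f)
    return {"uid": uid, "in_passwd_file": local, "resolves": uid_resolves(uid)}
```

Calling describe() with the affected user's numeric UID on a compute node shows whether the UID is missing from the local file but still resolvable, which is the situation described above.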
As a workaround I've set SOFT_UID_DOMAIN=true, and the runs now
start and complete successfully.
Was there a change in UID management in version 6.6.6? Is there
anything further I can do to determine the root cause of the problem?
Is Condor expected to work with NIS? (I assume so, as it worked
perfectly before the upgrade.)