
[Condor-users] jobs never start running



Hi,

I'm having problems with DAG jobs that never start running.
All of these jobs are identical to others that did complete successfully, so
the requirements should be correct.

condor_q -analyze says that the jobs are "being serviced", but nothing happens:
no negotiation and no claim. I've checked the log files on both the submitting computer and
the server, but the only trace of the process ID was this line:

Match record (<192.168.0.104:2486>, 23225, -1) deleted.


While checking the SchedLog of the submitting computer I found some
messages that look suspicious:

5/30 13:57:29 Zombie process has not been cleaned up by reaper - pid 9372
5/30 13:57:29 Zombie process has not been cleaned up by reaper - pid 10164
5/30 13:57:29 Zombie process has not been cleaned up by reaper - pid 9128
5/30 13:57:29 Zombie process has not been cleaned up by reaper - pid 5440
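
In case it helps with cross-checking against the process list, here is a rough
sketch of how the PIDs could be pulled out of such lines. It assumes the message
format is exactly as shown above; the names ZOMBIE_RE and zombie_pids are just
made up for this illustration and are not part of Condor.

```python
import re
from collections import Counter

# Assumed line format (copied from the SchedLog excerpt above):
# 5/30 13:57:29 Zombie process has not been cleaned up by reaper - pid 9372
ZOMBIE_RE = re.compile(
    r"^\d+/\d+ \d+:\d+:\d+ Zombie process has not been cleaned up by reaper - pid (\d+)$"
)

def zombie_pids(lines):
    """Count how often each PID appears in zombie warnings."""
    counts = Counter()
    for line in lines:
        match = ZOMBIE_RE.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts

# The four lines quoted above, used as sample input.
sample = [
    "5/30 13:57:29 Zombie process has not been cleaned up by reaper - pid 9372",
    "5/30 13:57:29 Zombie process has not been cleaned up by reaper - pid 10164",
    "5/30 13:57:29 Zombie process has not been cleaned up by reaper - pid 9128",
    "5/30 13:57:29 Zombie process has not been cleaned up by reaper - pid 5440",
]

for pid, count in sorted(zombie_pids(sample).items()):
    print(pid, count)
```

For a real log, the lines would be read from the SchedLog file instead of the
sample list.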


I'm not really sure what state the jobs are in (condor_q -l says they are running,
but none of the computers is actually running them), how I could resolve this issue,
or how to avoid it in the future.

And by the way... Is it possible to restart a DAG job (with all pre and post scripts executing
again)? Sometimes I'd like to resubmit a few DAG jobs, but the submit files are modified by
pre scripts that do not execute if I simply restart a completed job.

Thanks in advance, any help is much appreciated.

Cheers,
Szabolcs