[condor-users] Submission and Worker Machines Out of Sync
- Date: Wed, 21 Jan 2004 10:58:40 +0000
- From: Alexander Klyubin <A.Kljubin@xxxxxxxxxxx>
- Subject: [condor-users] Submission and Worker Machines Out of Sync
Hello!
I'm experiencing a strange problem with Condor. When a machine that is
running a simulation is rebooted, switched off, or simply disconnected
from the network for a while and then reconnected, the shadow process on
the submission machine still thinks the simulation is running on that
machine. After the "outage" the worker machine reappears in the pool in
the Unclaimed or Owner state. So simply comparing the output of
condor_status and condor_q -run reveals the discrepancy: according to
condor_status the machine is not running anything, whereas according to
condor_q it is. Moreover, Condor may even start another job on this
worker machine, in which case condor_q shows the machine running several
jobs at the same time.
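The comparison described above could be automated roughly as follows. This is only a sketch, not something Condor provides: the function name, the input shapes, and the host names are all hypothetical, and it assumes the two command outputs have already been parsed by hand or by a separate script. It also assumes one job slot per machine, so that more than one job on a host counts as suspicious.

```python
from collections import Counter


def find_discrepancies(machine_states, running_hosts):
    """Compare the pool view with the queue view (hypothetical helper).

    machine_states: dict mapping host name -> state as reported by
        condor_status, e.g. "Claimed", "Unclaimed" or "Owner".
    running_hosts: list of host names, one entry per running job as
        reported by condor_q -run.

    Returns (stale, overbooked):
        stale      -- hosts the queue says are running a job although
                      the pool reports them as Unclaimed or Owner
        overbooked -- hosts the queue lists for more than one job
                      (assumes one job slot per machine)
    """
    jobs_per_host = Counter(running_hosts)
    stale = sorted(
        host for host in jobs_per_host
        if machine_states.get(host) in ("Unclaimed", "Owner")
    )
    overbooked = sorted(
        host for host, count in jobs_per_host.items() if count > 1
    )
    return stale, overbooked


# Made-up example data: node1 and node3 were "lost", and node3 has
# since been given a second job.
states = {"node1": "Unclaimed", "node2": "Claimed", "node3": "Owner"}
hosts = ["node1", "node2", "node3", "node3"]
stale, overbooked = find_discrepancies(states, hosts)
print("stale:", stale)
print("overbooked:", overbooked)
```

Hosts that appear in the queue but are missing from the pool entirely are not flagged here; whether they should be is a separate decision.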
Condor does not see this discrepancy even hours after the "outage".
Basically, it never notices anything.
The worker machines are running Condor 6.5.5 and 6.5.4, whereas the central
manager/submission machine (Linux) is running 6.6.0 (the same problem
also appeared with 6.5.5 on the central manager/submission machine). Has
anybody else experienced this problem, or is it a mistake in how I have
configured the pool? I'm using the default settings as far as I can see.
Kind Regards,
Alexander Klyubin