Mailing List Archives
[Condor-users] Condor doesn't recognize that tasks have finished
- Date: Mon, 16 May 2005 07:58:12 -0700
- From: John Wheez <john@xxxxxxxxxx>
- Subject: [Condor-users] Condor doesn't recognize that tasks have finished
I am finding that if I have a large number of jobs that each take a while to
compute, Condor will eventually stop recognizing that tasks have
finished. Once this happens, all my CPUs become unavailable because each
CPU thinks it is still processing a job. If I go to the actual machine, the CPU
is idle (0% usage) and the result of the program has been saved to the network.
This happens on both dual-CPU and single-CPU machines.
If I place the tasks on HOLD, Condor starts to work again, but
eventually the CPUs sometimes become unavailable again for the same reason.
Here is the error I see in the StarterLog of a machine which still
thought the task was running:
5/15 23:00:50 About to exec C:\WINDOWS\System32\cmd.exe /Q /C task01.bat
5/15 23:00:50 Create_Process succeeded, pid=2588
5/15 23:01:16 Process exited, pid=2588, status=0
5/15 23:01:39 getpeername failed so connect must have failed
5/15 23:02:04 Connect failed for 30 seconds; returning FALSE
5/15 23:02:04 FileTransfer: Unable to connect to server <192.168.0.3:9611> <<<----- Windows XP STARTD Machine
5/15 23:02:04 JIC::allJobsDone() failed, waiting for job lease to expire or for a reconnect attempt <<<----- Lease never expires
Condor Collector/Negotiator: Condor 6.7.6 on Linux 7.2 / DEC Alpha
Condor StartD Machines: Condor 6.7.6 on Windows XP
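
To spot which machines are stuck in this state, the StarterLog can be scanned for a "Process exited" line that is later followed by "JIC::allJobsDone() failed". Below is a minimal Python sketch of that check. It assumes the log lines look like the excerpt above (a "M/D HH:MM:SS" timestamp followed by the message); the function name `find_stuck_jobs` and the regular expressions are illustrative and not part of Condor.

```python
import re

# Timestamp format as it appears in the StarterLog excerpt, e.g. "5/15 23:01:16".
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")
EXIT_RE = re.compile(r"Process exited, pid=(\d+), status=(-?\d+)")
STUCK_MARKER = "JIC::allJobsDone() failed"


def find_stuck_jobs(log_text):
    """Return (pid, exit_time, failure_time) for each job that exited
    but whose completion the starter could not report (hypothetical helper)."""
    stuck = []
    last_exit = None  # (pid, timestamp) of the most recent unpaired exit
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # wrapped continuation line or a hand-written annotation
        stamp, message = m.groups()
        exited = EXIT_RE.search(message)
        if exited:
            last_exit = (int(exited.group(1)), stamp)
        elif STUCK_MARKER in message and last_exit is not None:
            pid, exit_stamp = last_exit
            stuck.append((pid, exit_stamp, stamp))
            last_exit = None
    return stuck


# Sample input copied from the log excerpt in this message.
SAMPLE = """\
5/15 23:00:50 Create_Process succeeded, pid=2588
5/15 23:01:16 Process exited, pid=2588, status=0
5/15 23:01:39 getpeername failed so connect must have failed
5/15 23:02:04 Connect failed for 30 seconds; returning FALSE
5/15 23:02:04 JIC::allJobsDone() failed, waiting for job lease to expire
"""

if __name__ == "__main__":
    for pid, exited_at, failed_at in find_stuck_jobs(SAMPLE):
        print(f"pid {pid} exited at {exited_at}, still unreported at {failed_at}")
```

Running this against the StarterLog on each execute machine should list the jobs that finished locally but were never reported back, which is the symptom described above. It only detects the condition; it does not explain why the connection attempt fails.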