I am finding that if I have a large number of jobs that take a while
to compute, Condor will eventually stop noticing that tasks have
finished. Once this happens, all my CPUs become unavailable because
each CPU thinks it is still processing a job. If I go to the actual
machine, CPU usage is at 0% (idle) and the result of the program has
been saved to the network. This happens on both dual-CPU and
single-CPU machines.
If I place the tasks on HOLD, Condor starts to work again, but
eventually the CPUs sometimes become unavailable again for the same
reason.
Here is the error I see in the StarterLog of a machine that still
thought the task was running:
5/15 23:00:50 About to exec C:\WINDOWS\System32\cmd.exe /Q /C task01.bat
5/15 23:00:50 Create_Process succeeded, pid=2588
5/15 23:01:16 Process exited, pid=2588, status=0
5/15 23:01:39 getpeername failed so connect must have failed
5/15 23:02:04 Connect failed for 30 seconds; returning FALSE
5/15 23:02:04 FileTransfer: Unable to connect to server <192.168.0.3:9611>   <<<----- Windows XP STARTD machine
5/15 23:02:04 JIC::allJobsDone() failed, waiting for job lease to expire or for a reconnect attempt   << lease never expires
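
For anyone who wants to check their own execute machines for the same pattern, here is a rough Python sketch. It is not part of Condor, and the name find_stuck_exits is my own invention. It scans StarterLog lines for a "Process exited" message that is later followed by the "JIC::allJobsDone() failed" message, and it assumes the message text matches the excerpt above exactly.

```python
import re

# Patterns taken from the StarterLog excerpt above; other Condor
# versions may word these messages differently (assumption).
EXITED = re.compile(r"Process exited, pid=(\d+), status=(-?\d+)")
STUCK_MARKER = "JIC::allJobsDone() failed"


def find_stuck_exits(lines):
    """Return (pid, status) pairs for processes that exited and were
    later followed by a 'JIC::allJobsDone() failed' message."""
    stuck = []
    last_exit = None
    for line in lines:
        match = EXITED.search(line)
        if match:
            # Remember the most recent process exit seen so far.
            last_exit = (int(match.group(1)), int(match.group(2)))
        elif STUCK_MARKER in line and last_exit is not None:
            # The starter could not finish cleanup for that job.
            stuck.append(last_exit)
            last_exit = None
    return stuck


SAMPLE = [
    "5/15 23:00:50 Create_Process succeeded, pid=2588",
    "5/15 23:01:16 Process exited, pid=2588, status=0",
    "5/15 23:02:04 FileTransfer: Unable to connect to server",
    "5/15 23:02:04 JIC::allJobsDone() failed, waiting for job lease to",
]

if __name__ == "__main__":
    print(find_stuck_exits(SAMPLE))
```

To run it against a real log, pass an open file object instead of SAMPLE. The location of StarterLog depends on the local Condor configuration, so I have not hard-coded a path.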
Condor Collector/Negotiator: Condor 6.7.6 on Linux 7.2 / Dec Alpha
Condor StartD Machines: Condor 6.7.6 on Windows XP
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users