Addendum: The job finally came back this time.
But out of curiosity: does anyone know if that was a known problem in the Condor 6.6 series? I've experienced it many times; otherwise I would not have mailed the condor-users list right away.
----- Original Message ----
From: Thorsten Lampe <kardinallehmann@xxxxxxxx>
To: condor-users@xxxxxxxxxxx
Sent: Wednesday, March 7, 2007, 13:06:38
Subject: [Condor-users] Condor claims jobs running forever (never terminate)
We're running a Windows pool using a Windows 2003 Server as Central Manager, another Windows 2003 Server as a dedicated submit node and a bunch of XP boxes for job execution. The Condor version is 6.8.4 on all nodes.
I have now submitted 351 jobs, each of which should take about 50 minutes. 350 of them executed and terminated properly, while the last one has been "stuck" for over two hours now. The execution node is still in the "claimed" state and the job is marked as executing, although all output data has already been transferred back to the submit node and the process is no longer running on the execute node! It seems as if Condor just loves the job and doesn't want to release it :-)
I experienced this problem quite a few times with Condor 6.6.* in the past, which was actually my primary reason for updating our pool to 6.8. I can't imagine I'm the only one running into it, so I would guess it is a fairly well-known problem...
Does anyone have a clue?
Thanks,
Thorsten