Our jobs are submitted with stream_output = true.
When the scheduler server gets busy, jobs seem to die and get placed back into the queue, I think because the machine can't keep up with the file transfer and I/O. Is there a way to figure this out?
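The only idea I've had so far is to turn up the daemon logging on both sides of the shadow/starter link, something like this in the condor_config (not sure these are the most useful flags to enable):

  # more verbose logging on the submit side (schedd + shadow)
  SCHEDD_DEBUG = D_FULLDEBUG
  SHADOW_DEBUG = D_FULLDEBUG
  # and on the execute side (startd + starter)
  STARTD_DEBUG = D_FULLDEBUG
  STARTER_DEBUG = D_FULLDEBUG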
In the SchedLog I see the following for a particular job:
...cur_host=1, status=2
...cur_host=1, status=2
...cur_host=1, status=2
...cur_host=1, status=2
Shadow pid 23323 for job 145.3 exited with status 107
Deleting Shadow rec for PID 23323, job (145.3)
Marked job as IDLE
In the ShadowLog, at around the same time, I see this:
condor_read(): socket closed when trying to read 5 bytes from startd
slot1@xxxxxxx
IO: EOF reading packet header
Can no longer talk to condor_starter
FileLock::obtain(1) ... now WRITE
FileLock::obtain(2) ... now UNLOCKED
Trying to reconnect...
Trying to reconnect disconnected job
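Since the shadow does try to reconnect, I also wondered whether lengthening the job lease would give it a bigger window before the claim is given up; something like this in the submit file (the value is just an example, I haven't tested it):

  # give the shadow a longer window to reconnect to the running job
  job_lease_duration = 3600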
Any thoughts or ideas on why the daemons would be behaving like this? Are there any tuning parameters I can use to get better performance?
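For context, these are the schedd knobs I've been looking at so far, though I'm not sure which of them (if any) are the right ones to tune for this; the values below are just placeholders:

  # cap the number of shadows / running jobs this schedd will manage
  MAX_JOBS_RUNNING = 200
  # pace how quickly new shadows are spawned
  JOB_START_COUNT = 5
  JOB_START_DELAY = 2
  # limit simultaneous file transfers handled by the schedd
  MAX_CONCURRENT_UPLOADS = 10
  MAX_CONCURRENT_DOWNLOADS = 10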
--
Get your facts first, then you can distort them as you please.