Hi,
I'm running jobs on Windows machines at Purdue that take a number of
days to complete, and in the process I am wasting a lot of computing
time. I have my jobs set up so that the dumpfiles are transferred to
the spool directory when a job is evicted from a machine, and the
updated spool files are then transferred to the next machine when the
job executes again. The problem is that the machines often seem to be
manually rebooted (or something else unexpected happens that suddenly
takes the machine off the network). This is why a large fraction of my
jobs stop executing on a given machine, and when it happens I lose all
of that computing: Condor never goes through the normal eviction
process, so my dumpfiles are not updated.
An example of this can be found in this log file (more text below excerpt):
----------------------------------------------------------------------------------
006 (78912.994.000) 08/25 11:46:21 Image size of job updated: 4980
...
022 (78912.994.000) 08/25 11:49:23 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx
<128.210.37.33:1061>
...
024 (78912.994.000) 08/25 12:09:23 Job reconnection failed
Job disconnected too long: JobLeaseDuration (1200 seconds) expired
Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (78912.994.000) 08/25 14:38:31 Job executing on host:
<128.210.59.72:1057>
----------------------------------------------------------------------
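To get a feel for how often this happens, the event codes in a log like the one above could be tallied with a short script. This is only a rough sketch: the regular expression is my guess at the header format based on the lines in the excerpt, and count_events and SAMPLE are names made up for the illustration.

```python
import re
from collections import Counter

# Assumed header shape, taken from the excerpt above:
#   "024 (78912.994.000) 08/25 12:09:23 Job reconnection failed"
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


def count_events(log_text):
    """Count log events by their three-digit event code."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# The event header lines from the excerpt, with the detail lines kept in place.
SAMPLE = """\
006 (78912.994.000) 08/25 11:46:21 Image size of job updated: 4980
...
022 (78912.994.000) 08/25 11:49:23 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
...
024 (78912.994.000) 08/25 12:09:23 Job reconnection failed
Job disconnected too long: JobLeaseDuration (1200 seconds) expired
...
001 (78912.994.000) 08/25 14:38:31 Job executing on host:
<128.210.59.72:1057>
"""

if __name__ == "__main__":
    for code, n in sorted(count_events(SAMPLE).items()):
        print(code, n)
```

In this excerpt, code 022 marks the disconnect and code 024 the failed reconnection, so the 024 count is the number of times the job was rescheduled without a normal eviction.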
So far the only way I've found to get around this is to run a
condor_vacate_job /cluster/ command periodically to force the jobs to
vacate and update normally. I have two questions: 1) Do you know why so
many jobs are suddenly killed in the way I've described above? and
2) Is there an easier or more efficient way to update the spool files
periodically to avoid this problem?
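For reference, the periodic workaround amounts to something like the sketch below. It only wraps the condor_vacate_job command mentioned above; the function names, the interval, and the number of rounds are hypothetical choices for illustration, not anything Condor prescribes.

```python
import subprocess
import time


def build_vacate_command(cluster_id):
    """Build the argument list for vacating every job in one cluster."""
    return ["condor_vacate_job", str(cluster_id)]


def vacate_periodically(cluster_id, interval_seconds, rounds,
                        run=subprocess.run, sleep=time.sleep):
    """Issue the vacate command `rounds` times, pausing between calls.

    `run` and `sleep` are parameters only so the loop can be exercised
    without a Condor installation.
    """
    for i in range(rounds):
        # check=False: a failed call should not stop later attempts.
        run(build_vacate_command(cluster_id), check=False)
        if i < rounds - 1:
            sleep(interval_seconds)
```

Calling vacate_periodically(78912, 6 * 3600, 4) would issue the command four times, six hours apart. The drawback is the same as doing it by hand: every forced vacate interrupts jobs that were running fine.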
Thanks,
Nate Kaib