[Condor-users] Recovering spool files from windows machines
- Date: Mon, 25 Aug 2008 18:00:05 -0700
- From: Nathan Kaib <kaib@xxxxxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] Recovering spool files from windows machines
Hi,
I'm running jobs on Windows machines at Purdue that take a number of
days to complete, and in the process I am wasting a lot of computing
time. I have my jobs set up so that the dumpfiles are transferred to
the spool directory when a job is evicted from a machine, and the
updated spool files are then transferred to the next machine when the
job executes again. The problem is that the machines often seem to be
manually rebooted (or something else unexpected happens that suddenly
takes the machine off the network). Because of this, a large fraction
of my jobs stop executing on a given machine, and when that happens I
lose all of that computing: Condor never goes through its normal
eviction process, so my dumpfiles are never updated. An example of
this can be found in this log file excerpt (more text below it):
----------------------------------------------------------------------------------
006 (78912.994.000) 08/25 11:46:21 Image size of job updated: 4980
...
022 (78912.994.000) 08/25 11:49:23 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx
<128.210.37.33:1061>
...
024 (78912.994.000) 08/25 12:09:23 Job reconnection failed
Job disconnected too long: JobLeaseDuration (1200 seconds) expired
Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (78912.994.000) 08/25 14:38:31 Job executing on host:
<128.210.59.72:1057>
----------------------------------------------------------------------------------
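For reference, my submit file looks roughly like the following (the
executable and dumpfile names here are just placeholders):
----------------------------------------------------------------------------------
universe                = vanilla
executable              = mysim.exe
# transfer the dumpfile back to the spool when the job is evicted,
# not only when it exits normally
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_output_files   = dumpfile.dat
# reconnect window before the job is rescheduled; this is the
# 1200-second JobLeaseDuration that shows up in the log above
job_lease_duration      = 1200
queue
----------------------------------------------------------------------------------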
So far the only way I've found to get around this is to run a
condor_vacate_job <cluster> command periodically to force the jobs to
vacate and update their spool files normally.
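At the moment that amounts to a cron entry on the submit host along
these lines (the interval and cluster number are just examples; 78912
is the cluster from the log above):
----------------------------------------------------------------------------------
# force a normal evict, and therefore a spool update, for cluster
# 78912 every six hours
0 */6 * * * condor_vacate_job 78912
----------------------------------------------------------------------------------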
I have two questions: 1) Do you know why so many jobs are suddenly
killed in the way I've described above? And 2) is there an easier or
more efficient way to update the spool files periodically and avoid
this problem?
Thanks,
Nate Kaib