[Condor-users] Recovering spool files from windows machines
- Date: Mon, 25 Aug 2008 18:00:05 -0700
- From: Nathan Kaib <kaib@xxxxxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] Recovering spool files from windows machines
Hi,
I'm running jobs on Windows machines at Purdue that take a number of
days to complete, and in the process I am wasting a lot of computing
time. I have my jobs set up so that the dumpfiles are transferred to
the spool directory when a job is evicted from a machine, and the
updated spool files are then transferred to the next machine when the
job executes again. The problem is that the machines often seem to be
manually rebooted (or something else unexpected happens that suddenly
takes the machine off the network). Because of this, a large fraction
of my jobs stop executing on a given machine, and when that happens I
lose all of that computing: Condor never goes through its normal
eviction process, so my dumpfiles are never updated. An example of
this can be found in this log file excerpt (more text below it):
----------------------------------------------------------------------------------
006 (78912.994.000) 08/25 11:46:21 Image size of job updated: 4980
...
022 (78912.994.000) 08/25 11:49:23 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx
<128.210.37.33:1061>
...
024 (78912.994.000) 08/25 12:09:23 Job reconnection failed
Job disconnected too long: JobLeaseDuration (1200 seconds) expired
Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (78912.994.000) 08/25 14:38:31 Job executing on host:
<128.210.59.72:1057>
----------------------------------------------------------------------------------
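For reference, my submit file looks roughly like the following (the
executable and dumpfile names here are just placeholders):
----------------------------------------------------------------------------------
universe                = vanilla
executable              = mysim.exe
# transfer the dumpfile back to the spool when the job is evicted,
# not only when it exits normally
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_output_files   = dumpfile.dat
# reconnect window before the job is rescheduled; this is the
# 1200-second JobLeaseDuration that shows up in the log above
job_lease_duration      = 1200
queue
----------------------------------------------------------------------------------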
So far the only way I've found to get around this is to run a
condor_vacate_job <cluster> command periodically to force the jobs to
vacate and update their spool files normally.
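At the moment that amounts to a cron entry on the submit host along
these lines (the interval and cluster number are just examples; 78912
is the cluster from the log above):
----------------------------------------------------------------------------------
# force a normal evict, and therefore a spool update, for cluster
# 78912 every six hours
0 */6 * * * condor_vacate_job 78912
----------------------------------------------------------------------------------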
I have two questions: 1) Do you know why so many jobs are suddenly
killed in the way I've described above? And 2) is there an easier or
more efficient way to update the spool files periodically and avoid
this problem?
Thanks,
Nate Kaib