Re: [Condor-users] Jobs don't run on execute machines
- Date: Wed, 3 Feb 2010 09:27:20 -0800
- From: "Finch, Ralph" <rfinch@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] Jobs don't run on execute machines
I should have mentioned that everything is set right again by issuing a
condor_restart -all. But of course I'd rather find the root of the
problem.
RF
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Finch, Ralph
> Sent: Wednesday, February 03, 2010 9:25 AM
> To: condor-users@xxxxxxxxxxx
> Subject: [Condor-users] Jobs don't run on execute machines
>
> $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
> $CondorPlatform: INTEL-WINNT50 $
>
> The pool is about a dozen Windows XP computers, most are 4-core with a
> few 2-core machines. I am submitting from a 4-core machine which
> potentially can also execute on all 4 cores, as can all the other
> machines except the condor master machine; that one we limit to running
> on just 3 cores in an attempt to not overload it.
>
> The program run is a numerical model which is both cpu- and
> disk-intensive, so using even 3 cores noticeably impacts the interactive
> use of a given computer.
>
> The problem is that after some time--perhaps a few hours--the jobs fail
> to run on any execute machine, instead generating these errors:
>
> 024 (5193.000.000) 02/03 07:15:44 Job reconnection failed
>     Job disconnected too long: JobLeaseDuration (300 seconds) expired
>     Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
> ...
> 022 (5193.000.000) 02/03 08:11:22 Job disconnected, attempting to reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxxxxx <136.200.32.179:4314>
>
> This unwanted behavior *may* be triggered by my interactive use of the
> submitting machine. I say this because, for instance, today hundreds of
> jobs ran successfully overnight, only to start disconnecting when I
> remotely logged in to the submitting machine to check progress. Might
> be a coincidence.
>
> I wonder if I can prevent the disconnecting by running fewer or no jobs
> on the submitting machine? Even though it has 4 cores, it is also
> running 30-40 condor_shadows and receiving and sending a few 100MB per
> job from and to the remote jobs. Having read about job leases in the
> manual, it seems the problem lies with the submitting machine. Or could
> it be the condor master machine?
>
> Ralph Finch, P.E.
> Senior Engineer, W.R.
> California Dept. of Water Resources
> Bay-Delta Office, Delta Modeling Section
> Room 215-13
> 1416 9th Street
> Sacramento, CA 95814
>
> 916-653-7552
> rfinch@xxxxxxxxxxxx
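
One way to test whether the disconnects really coincide with interactive use of the submitting machine is to tally the 022 and 024 events from the job's user log by hour and compare that with login times. The following is only a rough sketch: the regular expression assumes event header lines shaped like the two in the excerpt quoted above, and `disconnects_per_hour` and `sample` are hypothetical names introduced here for illustration, not part of Condor.

```python
import re
from collections import Counter

# Header line of a user-log event, assumed to look like the excerpt, e.g.
# "024 (5193.000.000) 02/03 07:15:44 Job reconnection failed"
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.\d+\) "
    r"(?P<date>\d{2}/\d{2}) (?P<hour>\d{2}):\d{2}:\d{2} (?P<text>.*)$"
)

# Event numbers as they appear in the excerpt:
# 022 = job disconnected, 024 = job reconnection failed.
WATCHED = {"022", "024"}


def disconnects_per_hour(log_text):
    """Count watched events, keyed by (date, hour, event code)."""
    counts = Counter()
    for line in log_text.splitlines():
        # Detail lines and the "..." terminator do not match the header
        # pattern, so they are skipped.
        m = EVENT_RE.match(line.strip())
        if m and m.group("code") in WATCHED:
            counts[(m.group("date"), m.group("hour"), m.group("code"))] += 1
    return counts


# Hypothetical sample built from the two events quoted in the message.
sample = """\
024 (5193.000.000) 02/03 07:15:44 Job reconnection failed
    Job disconnected too long: JobLeaseDuration (300 seconds) expired
...
022 (5193.000.000) 02/03 08:11:22 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
...
"""

for key, n in sorted(disconnects_per_hour(sample).items()):
    print(key, n)
```

Run against the full user log rather than the sample, a cluster of counts in the hours following a remote login would support the interactive-use theory; an even spread would point elsewhere.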