
Re: [Condor-users] Our pool appears to work inefficiently

Date: Tue, 20 Sep 2005 13:13:36 +0200
From: Alain EMPAIN <alain.empain@xxxxxxxxx>
Subject: Re: [Condor-users] Our pool appears to work inefficiently

Hello,

My 1 euro cent:

I got the same kind of symptoms after tripling the number of nodes: the NFS server was simply not answering fast enough. In my case, the solution was to allow more NFS server instances and to mount the shared partition with the options 'rw,hard,nointr,tcp,vers=3,rsize=32k,wsize=32k,bg'.

Now submitted jobs are picked up quickly, and on all the nodes.
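
To check which nodes already carry these mount options, a small script can compare them with the NFS entries of an fstab-style file. This is only a sketch: the helper names are made up for illustration, and it assumes the usual layout of device, mount point, type and options, with sizes written either as '32k' or as plain bytes.

```python
# Hypothetical helpers (not part of Condor or of any NFS tool): compare the
# options of NFS entries in fstab-style text with a wanted option string.

WANTED = "rw,hard,nointr,tcp,vers=3,rsize=32k,wsize=32k,bg"
SIZE_KEYS = {"rsize", "wsize"}


def parse_size(value):
    """Convert '32k' or '32768' to an integer number of bytes."""
    value = value.strip().lower()
    if value.endswith("k"):
        return int(value[:-1]) * 1024
    return int(value)


def parse_options(option_string):
    """Turn 'rw,vers=3' into {'rw': True, 'vers': '3'}, sizes as integers."""
    opts = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, val = item.partition("=")
        if sep and key in SIZE_KEYS:
            opts[key] = parse_size(val)
        else:
            opts[key] = val if sep else True
    return opts


def differing_options(wanted, actual):
    """Map each wanted option that is missing or different to (wanted, actual)."""
    want, have = parse_options(wanted), parse_options(actual)
    return {k: (v, have.get(k)) for k, v in want.items() if have.get(k) != v}


def check_nfs_mounts(fstab_text, wanted=WANTED):
    """Return {mount_point: differing options} for every NFS entry."""
    report = {}
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 4 and fields[2].startswith("nfs"):
            report[fields[1]] = differing_options(wanted, fields[3])
    return report


if __name__ == "__main__":
    with open("/etc/fstab") as fh:
        for mount_point, diff in check_nfs_mounts(fh.read()).items():
            print(mount_point, diff if diff else "matches")
```

An empty result for a mount point means its entry already lists every wanted option. The script only reads the file; it does not remount anything.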

	Hope this helps,

	Alain

Michael Yoder wrote:

I could be exposing my lack of knowledge of the mechanics of condor
pools, however on hand I am quite surprised that the performance of the
pool is, on the whole, quite poor. The composition of the pool is
complicated -- there are machines from different departments and/or
subnet, and so this may be a very difficult issue to analyse or for any
one to advise us on...

According to condor_status most of the machines are unclaimed, however
when I submit a batch of 100 simple jobs I find that maybe 50% of them
will run simultaneously in the pool -- the rest are rejected, and
condor_q tells me that machines do match however reject the jobs for
some unknown reason. The vast majority of the machines are running XP
with SP2.

Can anyone please advise us in this respect? For example, what might be
wrong in the pool, or what analysis might we consider doing?

1216 match, match, but reject the job for unknown reasons

The trick to figuring this out would be to track down these "unknown
reasons".  Are there certain machines that are consistently able to run
jobs?  Are there certain machines that consistently fail to run jobs?
You can find successful machines by looking at the "LastRemoteHost"
attribute that condor_history <cluster.proc> -l reports.  Then see if
you can find failures by looking at the ShadowLog on the submitting
machine.  You may want to have a look at my Troubleshooting page:

http://docs.optena.com/display/CONDOR/Troubleshooting

My guess is that some of your machines are somehow mis-configured and that jobs are going there, dying, and getting kicked off, only to start somewhere else and succeed.
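
The first step above (finding the machines that do run jobs) can be sketched as a small tally of the LastRemoteHost attribute. This is an illustration under assumptions, not a Condor tool: the function names are hypothetical, and the parser assumes the long-format output is a series of `Name = value` lines with a blank line between job ads.

```python
# Hypothetical sketch: count which hosts appear as LastRemoteHost in the
# long-format output of condor_history. Function names are illustrative.
import subprocess
from collections import Counter


def parse_long_ads(text):
    """Split 'Name = value' text into one dict per job ad (blank-line separated)."""
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition(" = ")
        if sep:
            current[name.strip()] = value.strip().strip('"')
    if current:
        ads.append(current)
    return ads


def strip_slot(host):
    """Drop a leading 'vm1@'-style prefix so slots on one machine are merged."""
    return host.split("@", 1)[-1]


def hosts_that_ran_jobs(text):
    """Count completed jobs per machine, based on LastRemoteHost."""
    return Counter(
        strip_slot(ad["LastRemoteHost"])
        for ad in parse_long_ads(text)
        if "LastRemoteHost" in ad
    )


def run_condor_history(cluster):
    """Run 'condor_history <cluster> -l' and return its output as text."""
    result = subprocess.run(
        ["condor_history", str(cluster), "-l"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    import sys
    counts = hosts_that_ran_jobs(run_condor_history(sys.argv[1]))
    for host, n in counts.most_common():
        print(f"{n:5d}  {host}")
```

Machines that appear in condor_status but never in this tally are the first candidates to look for in the ShadowLog.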

Mike Yoder
Principal Member of Technical Staff
Ask Mike: http://docs.optena.com
Direct  : +1.408.321.9000
Fax     : +1.408.321.9030
Mobile  : +1.408.497.7597
yoderm@xxxxxxxxxx

Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134
http://www.optena.com




--
------------------------------------------------------------
Dr Alain EMPAIN  <alain.empain@xxxxxxxxx> <alain@xxxxxxxxxx>
      Bioinformatics, Molecular Genetics,
      Fac. Med. Vet., University of LIEGEe, Belgium
      Bd de Colonster, B43   B-4000 LIEGEe (Sart-Tilman)
WORK: +32 4 366 4159         FAX: +32 4 366 4122
HOME: rue des Martyrs,7      B- 4550 Nandrin
      +32 85 51 2341         GSM: +32 497 70 1764
-------------------------------------------------------------------------------
"I worry about my child and the Internet all the time, even though she's
too young to have logged on yet. Here's what I worry about. I worry that
10 or 15 years from now, she will come to me and say 'Daddy, where were
you when they took freedom of the press away from the Internet?'"
--Mike Godwin, Electronic Frontier Foundation
-------------------------------------------------------------------------------
