Re: [Condor-users] Problems with jobs
Basically, I have a pool with a shared file system and 25 machines, all of
them very powerful.
I think the weakest link in my chain is my submitting machine, which is a
lone server with its own configuration. It's a 1 GHz, 512 MB Mini-ITX box:
not the fastest in the world, and it has a few other (required) applications
running.
Is this the machine I should set JOB_START_COUNT on, or should it be set on
the machines that actually run the jobs?
My submitting machine is the one where I see the condor_shadow daemons
firing up:
chris 18253 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.9 <146.191.100.202:46251> -
chris 18325 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.12 <146.191.100.202:46251> -
chris 18362 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.13 <146.191.100.202:46251> -
chris 18396 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.14 <146.191.100.202:46251> -
chris 18454 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.15 <146.191.100.202:46251> -
chris 18464 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.16 <146.191.100.202:46251> -
chris 18499 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.17 <146.191.100.202:46251> -
chris 18533 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.18 <146.191.100.202:46251> -
chris 18570 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.19 <146.191.100.202:46251> -
chris 18579 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.20 <146.191.100.202:46251> -
How many of these is normal? I am submitting a 1000-job cluster to a pool
of 25 machines (50 VMs).
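
To put a number on the listing, here is a minimal Python sketch that counts
the condor_shadow lines in ps output. It assumes the value after -f is the
job id (cluster.proc) and that each running job has one shadow on the
submitting machine, so a 50-VM pool would top out at roughly 50 shadows. The
function name and the VM_COUNT constant are illustrative only; they are not
part of Condor.

```python
import re

# ps lines copied from this message (wrapped lines rejoined).
PS_OUTPUT = """\
chris 18253 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.9 <146.191.100.202:46251> -
chris 18325 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.12 <146.191.100.202:46251> -
chris 18362 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.13 <146.191.100.202:46251> -
chris 18396 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.14 <146.191.100.202:46251> -
chris 18454 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.15 <146.191.100.202:46251> -
chris 18464 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.16 <146.191.100.202:46251> -
chris 18499 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.17 <146.191.100.202:46251> -
chris 18533 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.18 <146.191.100.202:46251> -
chris 18570 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.19 <146.191.100.202:46251> -
chris 18579 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.20 <146.191.100.202:46251> -
"""

# Assumption: the argument after -f is the job id in cluster.proc form.
SHADOW_RE = re.compile(r"condor_shadow -f (\d+\.\d+)")

# Assumption: one shadow per running job, one running job per VM.
VM_COUNT = 50


def shadow_job_ids(ps_text):
    """Return the id found on every condor_shadow line of the ps output."""
    ids = []
    for line in ps_text.splitlines():
        match = SHADOW_RE.search(line)
        if match:
            ids.append(match.group(1))
    return ids


ids = shadow_job_ids(PS_OUTPUT)
print(f"{len(ids)} shadows running, rough ceiling {VM_COUNT}")
print("job ids:", ", ".join(ids))
```

On the listing above this finds 10 shadows, well short of the 50 the pool
could hold under those assumptions.
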
Looks like I may be running low on memory on my submitting machine as well.
top - 14:01:52 up 20 days, 21:24, 3 users, load average: 0.79, 1.21, 0.92
Tasks: 71 total, 1 running, 70 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3% us, 1.7% sy, 0.7% ni, 97.0% id, 0.0% wa, 0.3% hi, 0.0% si
Mem: 484284k total, 476868k used, 7416k free, 14216k buffers
Swap: 999928k total, 0k used, 999928k free, 249768k cached
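
The "free" figure alone can be misleading, because buffers and page cache
are normally handed back when programs need memory. A minimal Python sketch
of that arithmetic, assuming the "cached" value on the Swap line is the page
cache (as in this older top layout); the helper name is illustrative:

```python
import re

# The two memory lines from the top output in this message.
TOP_MEM = "Mem: 484284k total, 476868k used, 7416k free, 14216k buffers"
TOP_SWAP = "Swap: 999928k total, 0k used, 999928k free, 249768k cached"


def fields(line):
    """Map each label (total, used, free, ...) to its value in kilobytes."""
    return {label: int(value) for value, label in re.findall(r"(\d+)k (\w+)", line)}


mem = fields(TOP_MEM)
swap = fields(TOP_SWAP)

# Assumption: buffers and cache are reclaimable, so count them as available.
available_kb = mem["free"] + mem["buffers"] + swap["cached"]
share = available_kb / mem["total"]

print(f"free only:            {mem['free']}k")
print(f"free + buffers/cache: {available_kb}k ({share:.0%} of total)")
print(f"swap in use:          {swap['used']}k")
```

With these numbers the second figure is 271400k and no swap is in use, so
the box may have more headroom than the 7416k "free" value suggests.
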
I'm still unsure about some of this. Where exactly does the problem lie:
with the submitter or with the execute machines?
thanks again
Chris
----- Original Message -----
From: "Matt Hope" <matthew.hope@xxxxxxxxx>
To: "Condor-Users Mail List" <condor-users@xxxxxxxxxxx>
Cc: "Ian Chesal" <ICHESAL@xxxxxxxxxx>
Sent: Thursday, December 08, 2005 8:21 AM
Subject: Re: [Condor-users] Problems with jobs
On 12/7/05, Chris Miles <chrismiles@xxxxxxxxxxxxxxxx> wrote:
> I have managed to get that number up as high as 20 and even 50 with only
> little difference. I am seeing more running jobs, but not much more.
> Only 7 VMs max so far.
How many (non held) clusters and jobs* are in your queue and how often
do you negotiate?
Since the schedd can only do one of the two tasks (starting shadows
and serving queue info requests) at a time, it can fail to keep up.
A similar situation can occur if something/someone is running condor_q
against your schedd repeatedly.
* If NEGOTIATE_ALL_JOBS_IN_CLUSTER is true then jobs matter; if not,
then clusters matter.
Matt
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users