Re: [Condor-users] Problems with jobs

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Good idea stopping the startd process. The less you have running on the machine the easier this is to solve. On all the machines reporting Claimed + Idle -- which schedd in your system has the claim? Try:

condor_status -const 'State == "Claimed" && Activity == "Idle"' -f "%s\n" GlobalJobId

The GlobalJobId identifies the schedd the job came from -- is it the same schedd that's running three jobs? Or is it a different schedd in your system that's maybe hung up and hanging on to those machines?

- Ian
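
If the command above prints many IDs, a short script can group them by schedd. This is a minimal sketch, assuming the schedd name is the text before the first `#` in each GlobalJobId; the sample IDs and the function name `claims_per_schedd` are made up for illustration and do not come from this pool.

```python
from collections import Counter


def claims_per_schedd(global_job_ids):
    """Count GlobalJobId values by the schedd name they start with."""
    counts = Counter()
    for line in global_job_ids:
        job_id = line.strip()
        if not job_id:
            continue  # skip blank lines in the command output
        # Assumption: the schedd name is everything before the first '#'.
        schedd = job_id.split("#", 1)[0]
        counts[schedd] += 1
    return counts


# Hypothetical IDs, for illustration only.
sample_ids = [
    "submit1.example.org#1133800000#12.0",
    "submit1.example.org#1133800000#12.1",
    "submit2.example.org#1133800100#3.0",
]

for schedd, n in claims_per_schedd(sample_ids).most_common():
    print(schedd, n)
```

In practice the list would come from the saved output of the condor_status command, one ID per line. A schedd that holds many claims but runs few jobs is the one to look at first.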

From: Chris Miles [mailto:chrismiles@xxxxxxxxxxxxxxxx]
Sent: December 5, 2005 7:19 PM
To: Ian Chesal
Cc: Condor-Users Mail List
Subject: Re: [Condor-users] Problems with jobs

I have removed the STARTD daemon from my central manager to try to improve performance, and have also set TESTINGMODE in the config file.

Still only 3 VMs are actually running jobs.

Chris

----- Original Message -----

From: Ian Chesal

To: Chris Miles

Cc: Condor-Users Mail List

Sent: Monday, December 05, 2005 5:03 PM

Subject: RE: [Condor-users] Problems with jobs

Ahh, but do you only have the one schedd? It looks like you have three jobs running (Claimed + Busy) according to your output -- are they all from the same schedd? Maybe there's another schedd in your system that's not able to respond to its claims in time?

- Ian

From: Chris Miles [mailto:chrismiles@xxxxxxxxxxxxxxxx]
Sent: December 5, 2005 11:55 AM
To: Ian Chesal
Cc: Condor-Users Mail List
Subject: Re: [Condor-users] Problems with jobs

It's a meaty machine: 2 GB of memory and a 4 GHz CPU (64-bit). It's an IBM-built cluster and doesn't have any resource problem that I can see. There isn't much else running on it.

Chris

----- Original Message -----

From: Ian Chesal

To: Chris Miles

Sent: Monday, December 05, 2005 4:26 PM

Subject: RE: [Condor-users] Problems with jobs

The dreaded Claimed+Idle. It generally happens to us when our schedd can't keep up with the processing required to start our jobs. Check the resources on your schedd machine: can your machine handle spawning all the necessary shadows? Or is it running out of CPU, memory, disk, etc.?

- Ian

From: Chris Miles [mailto:chrismiles@xxxxxxxxxxxxxxxx]
Sent: December 5, 2005 11:11 AM
To: Condor-Users Mail List; Ian Chesal
Subject: Re: [Condor-users] Problems with jobs

Hi, thanks for the response.

SUBMIT_SEND_RESCHEDULE has not been specified in any of my config files, which means that it is automatically set to True, does it not?

condor_q -ana says the jobs are being serviced.

It seems a lot of machines go into the Claimed state but stay Idle.

tux.neuralgri LINUX       INTEL  Unclaimed  Idle       0.250  512 0+00:34:48
vm1@xxxxxxxxx LINUX       X86_64 Owner      Idle       0.750 2048 0+00:00:02
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:05
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.120 2048 0+00:01:20
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:01:21
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:05
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.270 2048 0+00:00:05
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.180 2048 0+00:00:07
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:11
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.050 2048 0+00:00:07
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:10
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.100 2048 0+00:00:07
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:11
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.100 2048 0+00:00:07
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:11
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:07
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.210 2048 0+00:00:11
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.080 2048 0+00:00:08
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:03
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.050 2048 0+00:00:02
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:03
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.160 2048 0+00:00:04
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:05
vm1@xxxxxxxxx LINUX       X86_64 Owner      Idle       1.000 2048 0+20:15:59
vm2@xxxxxxxxx LINUX       X86_64 Owner      Idle       0.310 2048 0+00:00:02
vm1@xxxxxxxxx LINUX       X86_64 Owner      Idle       1.000 2048 0+20:16:02
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.220 2048 0+00:00:09
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.130 2048 0+00:00:05
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:06
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:04
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.460 2048 0+00:00:05
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.200 2048 0+00:00:04
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:05
vm1@xxxxxxxxx LINUX       X86_64 Unclaimed  Idle       0.130 2048 0+00:00:05
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:05
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.170 2048 0+00:00:04
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Busy       0.000 2048 0+00:00:06
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Busy       0.020 2048 0+00:00:04
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Busy       0.000 2048 0+00:00:06
vm1@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.110 2048 0+00:00:05
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:05

Chris
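
For a listing this long, tallying the State and Activity pairs is quicker than reading it by eye. The sketch below assumes each machine line has eight whitespace-separated fields with State and Activity in the fourth and fifth positions, as in the listing above. The function name `tally_states` is hypothetical, and the four sample lines are copied from that listing.

```python
from collections import Counter


def tally_states(status_lines):
    """Count (State, Activity) pairs in condor_status-style lines."""
    counts = Counter()
    for line in status_lines:
        fields = line.split()
        # Assumption: a machine line has exactly eight fields;
        # anything else (headers, blanks, totals) is skipped.
        if len(fields) != 8:
            continue
        counts[(fields[3], fields[4])] += 1
    return counts


# Four lines taken from the listing in this message.
sample_listing = """\
tux.neuralgri LINUX       INTEL  Unclaimed  Idle       0.250  512 0+00:34:48
vm1@xxxxxxxxx LINUX       X86_64 Owner      Idle       0.750 2048 0+00:00:02
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Idle       0.000 2048 0+00:00:05
vm2@xxxxxxxxx LINUX       X86_64 Claimed    Busy       0.000 2048 0+00:00:06
""".splitlines()

for (state, activity), n in sorted(tally_states(sample_listing).items()):
    print(state, activity, n)
```

A large Claimed/Idle count next to a small Claimed/Busy count is the pattern discussed in this thread.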

----- Original Message -----

From: Ian Chesal

To: Condor-Users Mail List

Sent: Monday, December 05, 2005 2:39 PM

Subject: Re: [Condor-users] Problems with jobs

Hi.

When I'm submitting jobs into my pool it seems to take ages to start running the jobs unless I run condor_reschedule. Is there a way to speed the process up without me running this command?

[Ian Chesal] See: http://www.cs.wisc.edu/condor/manual/v6.7/3_3Configuration.html#11494 -- make sure you have SUBMIT_SEND_RESCHEDULE set to True in the config file on the machine you're calling condor_submit from. It will automatically issue a reschedule after submission.

My second problem is that job results are not returning to me any quicker than if I ran my jobs on a one-machine pool. I.e. I'm checking condor_q and the queue is only going down 1 at a time, at roughly the same speed as if there were only one machine in the pool. It is also slower than if I actually ran my jobs sequentially on one machine using a batch file or shell script.

[Ian Chesal] What does condor_q -ana say? Are you setting your job requirements such that only one VM in the system is able to match with all your jobs in your cluster? What about the MAX_JOBS_RUNNING setting on your schedd? Make sure that isn't set to 1.
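
To check by hand what a config file says about settings such as MAX_JOBS_RUNNING or SUBMIT_SEND_RESCHEDULE, a rough lookup like the one below may help. It is only a sketch under several assumptions: plain `NAME = value` lines, `#` comments, later assignments overriding earlier ones, and case-insensitive names. It ignores macro expansion and any additional config files. The helper `effective_setting` and the sample config are hypothetical, and the default passed for SUBMIT_SEND_RESCHEDULE should be checked against the manual for your version. On a real installation, `condor_config_val` is the tool to trust.

```python
def effective_setting(config_text, name, default=None):
    """Return the last value assigned to `name`, or `default` if unset."""
    value = default
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, rhs = line.partition("=")
        # Assumption: names compare case-insensitively, last one wins.
        if key.strip().upper() == name.upper():
            value = rhs.strip()
    return value


# Hypothetical local config, for illustration only.
sample_config = """\
# local overrides
MAX_JOBS_RUNNING = 1
"""

print(effective_setting(sample_config, "MAX_JOBS_RUNNING"))
print(effective_setting(sample_config, "SUBMIT_SEND_RESCHEDULE", default="True"))
```

With the sample above, the first lookup finds the explicit assignment and the second falls back to the supplied default because the name never appears.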

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
