
Re: [Condor-users] DedicatedScheduler hogging resources



This is a snippet from the log file illustrating what I was referring to a few weeks ago:

009 (4225.000.000) 03/15 11:47:30 Job was aborted by the user.
        via condor_rm (by user roskar)
...
000 (4226.000.000) 03/15 11:48:00 Job submitted from host: <128.95.98.82:36131>
...
014 (4226.000.000) 03/15 11:48:55 Node 0 executing on host: <128.95.99.141:32785>
...
014 (4226.000.001) 03/15 11:48:55 Node 1 executing on host: <128.95.99.88:32785>
...
014 (4226.000.000) 03/15 11:50:05 Node 0 executing on host: <128.95.99.141:32785>
...
014 (4226.000.001) 03/15 11:50:05 Node 1 executing on host: <128.95.99.88:32785>
...
014 (4226.000.000) 03/15 11:51:15 Node 0 executing on host: <128.95.99.141:32785>
...
014 (4226.000.001) 03/15 11:51:15 Node 1 executing on host: <128.95.99.88:32785>
...
014 (4226.000.000) 03/15 11:52:25 Node 0 executing on host: <128.95.99.141:32785>
...
014 (4226.000.001) 03/15 11:52:25 Node 1 executing on host: <128.95.99.88:32785>
...
014 (4226.000.000) 03/15 11:53:35 Node 0 executing on host: <128.95.99.141:32785>
...
014 (4226.000.001) 03/15 11:53:36 Node 1 executing on host: <128.95.99.88:32785>
...


The job requests 4 machines, but for some reason only 2 are given to it when execution begins.
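For reference, the request comes from an MPI-universe submit file along these lines
(the executable and file names here are placeholders, not the ones I actually use):

    # MPI universe under Condor 6.6; names below are illustrative only
    universe      = MPI
    executable    = my_mpi_program
    machine_count = 4
    output        = mpi.out
    error         = mpi.err
    log           = mpi.log
    queue

With machine_count = 4 I would expect four "Node N executing" events per attempt,
not just the two shown above.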

-----------------------------------------
Rok Roskar
University of Washington
Department of Astronomy

On Mar 9, 2006, at 12:59 PM, Greg Thain wrote:

Rok Roskar wrote:
I'm running MPI under Condor 6.6:

DedicatedScheduler holds on to resources even after all MPI jobs have been
removed from the queue - any way to fix this? Or is it an unfortunate
byproduct of mixing parallel and serial jobs on the same set of resources?

It will hold onto claims for UNUSED_CLAIM_TIMEOUT seconds after the job
leaves the queue, where UNUSED_CLAIM_TIMEOUT is a parameter in the
condor_config file.  The default is 300 seconds, and you can lower this
as you like.
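
For example, something like this in condor_config on the submit machine (60 is
just an example value) will release idle dedicated claims after a minute instead
of five:

    # Example only: release unused dedicated-scheduler claims after
    # 60 seconds rather than the default 300
    UNUSED_CLAIM_TIMEOUT = 60

followed by a condor_reconfig so the schedd picks up the change.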

Also, my jobs sometimes try to start even when the DedicatedScheduler doesn't
have enough resources for them. The job then loops endlessly through failed
execution attempts, so all of the claimed resource time is wasted. For example,
my job requests 8 machines but only 7 are available; Condor tries to execute
the job anyway, and because there aren't enough resources, it never runs.
Solutions?

Does the job try to start, or do the machines just get claimed?

-greg

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users