[Condor-users] condor's calculated memory vs image size of jobs in queue
- Date: Wed, 16 May 2007 11:55:44 -0500 (CDT)
- From: Paul Armor <parmor@xxxxxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] condor's calculated memory vs image size of jobs in queue
Hi,
I'm wondering if others are seeing similar problems, how they're working
with or around them, or whether I've just got something misconfigured.
I've apparently got a hole in my config, and I'm wondering how others
handle this.
I'm noticing an interesting edge case in our pool where a user has lots
of jobs queued up. Some of them get evicted after some amount of run time
and then fail to match when they try to pick up where they left off after
the checkpoint/eviction, because their SIZE has grown larger than the
"Memory" value the compute node determined at startup. When such a job has
the lowest job id for that user in the queue, the schedd just spins from
that point on, trying to schedule only that job and no others.
An example is far better than my description:
A compute node has an SMP CPU and 2048M of memory; in the condor init
script we set a ulimit of 1300000 to keep a job from running the machine
into the weeds. On startup condor auto-detects 2 CPUs, with 1004M per CPU
(per condor_status).
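In case the mechanics matter, the limit goes into the init script right
before condor_master starts; a rough sketch, assuming it's the
virtual-memory limit (which ulimit takes in KB) and with an illustrative
install path:

  # in the condor init script, before starting the master
  ulimit -v 1300000              # cap each process at ~1.3GB of address space
  /opt/condor/sbin/condor_master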
A job starts running on a machine and is eventually checkpointed/evicted.
Its image size is then reported as 1220.7 by the schedd (via condor_q).
Any jobs with a lower job id run to completion, but from that point on any
jobs with a higher job id fail to match, so the queue looks like this for
that user:
   ID        OWNER   SUBMITTED     RUN_TIME   ST PRI   SIZE  CMD
 1515840.0   user    5/12 21:01   1+03:42:12  I  0    1220.7 net.sh
 1517884.0   user    5/13 01:53   1+01:23:22  I  0    1025.4 net.sh
 1517885.0   user    5/13 01:53   1+00:40:21  I  0    1220.7 net.sh
 1582585.0   user    5/15 20:54   0+01:04:55  I  0     459.0 net.sh
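My reading (so treat it as an assumption) is that condor_submit appends a
memory clause to each job's Requirements along the lines of the comment
below, which would explain why a 1220.7M image can never match a 1004M
slot; the condor_q line just dumps the actual expression for the stuck job:

  # default memory clause in Requirements: Memory is the slot's MB,
  # ImageSize is the job's KB
  #   ( ( Memory * 1024 ) >= ImageSize )
  condor_q -l 1515840.0 | grep -i '^Requirements'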
In the MatchLog on the collector/negotiator machine, I only see job
1515840.0 being considered for a match; all other jobs by that user are
ignored. If the user blows away jobs 1515840.0, 1517884.0, and 1517885.0,
all other jobs start getting scheduled/matched/run, until this happens again.
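("Blows away" here just means removing them by hand, i.e. something like:

  condor_rm 1515840.0 1517884.0 1517885.0
)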
Thanks!
Paul