[Condor-users] condor's calculated memory vs image size of jobs in queue
- Date: Wed, 16 May 2007 11:55:44 -0500 (CDT)
- From: Paul Armor <parmor@xxxxxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] condor's calculated memory vs image size of jobs in queue
Hi,
I'm wondering if others are seeing similar problems, how they're working
with or around them, or whether I've just got something misconfigured.
I've apparently got a hole in my config, and I'm wondering how others
handle this.
I'm noticing an interesting edge case in our pool where a user has lots
of jobs queued up. Some of them get evicted after some amount of run time
and then fail to match when they try to pick up where they left off after
the checkpoint/eviction, because their SIZE has grown larger than the
"Memory" value the compute node determined at startup. When such a job has
the lowest job id for that user in the queue, the schedd just spins from
that point on, trying to schedule only that job and no others.
An example is far better than my description:
A compute node has an SMP CPU and 2048M of memory; in the condor init
script we set a ulimit of 1300000 to keep a job from running the machine
into the weeds. On startup condor auto-detects 2 CPUs, with 1004M per CPU
(per condor_status).
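In case the mechanics matter, the limit goes into the init script right
before condor_master starts; a rough sketch, assuming it's the
virtual-memory limit (which ulimit takes in KB) and with an illustrative
install path:

  # in the condor init script, before starting the master
  ulimit -v 1300000              # cap each process at ~1.3GB of address space
  /opt/condor/sbin/condor_master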
A job starts running on a machine and is eventually checkpointed/evicted.
Its image size is then reported as 1220.7 by the schedd (via condor_q).
Any jobs with a lower job id run to completion, but from that point on any
jobs with a higher job id fail to match, so the queue looks like this for
that user:
   ID        OWNER   SUBMITTED     RUN_TIME   ST PRI   SIZE  CMD
 1515840.0   user    5/12 21:01   1+03:42:12  I  0    1220.7 net.sh
 1517884.0   user    5/13 01:53   1+01:23:22  I  0    1025.4 net.sh
 1517885.0   user    5/13 01:53   1+00:40:21  I  0    1220.7 net.sh
 1582585.0   user    5/15 20:54   0+01:04:55  I  0     459.0 net.sh
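My reading (so treat it as an assumption) is that condor_submit appends a
memory clause to each job's Requirements along the lines of the comment
below, which would explain why a 1220.7M image can never match a 1004M
slot; the condor_q line just dumps the actual expression for the stuck job:

  # default memory clause in Requirements: Memory is the slot's MB,
  # ImageSize is the job's KB
  #   ( ( Memory * 1024 ) >= ImageSize )
  condor_q -l 1515840.0 | grep -i '^Requirements'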
In the MatchLog on the collector/negotiator machine, I only see job
1515840.0 being considered for a match; all other jobs by that user are
ignored. If the user blows away jobs 1515840.0, 1517884.0, and 1517885.0,
all other jobs start getting scheduled/matched/run, until this happens again.
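("Blows away" here just means removing them by hand, i.e. something like:

  condor_rm 1515840.0 1517884.0 1517885.0
)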
Thanks!
Paul