
[Condor-users] condor_negotiator/condor_collector scheduling problem



Hi Condor Team,

A few weeks ago I described a problem I was having with Condor not scheduling jobs on available resources. I've recreated the problem in a simpler way, without needing a DAG. It seems like condor_negotiator and/or condor_collector are misbehaving: jobs are left idle even when idle resources that satisfy their requirements exist.
My Condor system contains a number of VMs; one VM provides MY_RESOURCE_1, and another VM provides MY_RESOURCE_2. First, I submit 4 jobs that require MY_RESOURCE_1 and wait until MY_RESOURCE_1 gets claimed by one of them; the other three jobs sit idle. I then submit 4 jobs that require MY_RESOURCE_2. Even though the VM providing MY_RESOURCE_2 is idle, Condor won't run any of this second batch of 4 jobs. Condor only schedules a job needing MY_RESOURCE_2 once the last job needing MY_RESOURCE_1 has started running.
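For concreteness, the jobs involved are ordinary vanilla-universe jobs whose requirements name one of the custom attributes. A minimal sketch of what a foo_*.sub file looks like (illustrative only; the exact submit files are in the tarball below, and the bar_*.sub files differ only in the attribute name):

universe     = vanilla
executable   = mysleep
arguments    = 600 -n 0
# only match the VM that advertises MY_RESOURCE_1
requirements = (MY_RESOURCE_1 =?= TRUE)
log          = foo.log
output       = foo.out
error        = foo.err
queue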
I am fairly certain it is a negotiator/collector problem, because I 
don't see anything egregious in the StartLog of the machine providing 
MY_RESOURCE_2.
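If it helps anyone dig into the matchmaking side, the standard way to get more detail out of the negotiator (nothing specific to my setup) would be to add the following to the central manager's condor_config.local, run condor_reconfig there, and then watch NegotiatorLog and MatchLog during a negotiation cycle:

# raise negotiator log verbosity on the central manager
NEGOTIATOR_DEBUG = D_FULLDEBUG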
To reproduce the problem, download the 3.8-kbyte tar.gz file linked below. MY_RESOURCE_1 is needed by the foo_*.sub jobs, and MY_RESOURCE_2 is needed by the bar_*.sub jobs, so configure two separate VMs with these resources. condor_submit the foo_*.sub jobs, wait for one to start, then condor_submit the bar_*.sub jobs. You should see all the bar jobs sitting idle, even though one (I think) should be running. A bar job won't start running until the last foo job is running.
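After the bar jobs go idle, two quick checks make the mismatch visible (substitute your own hostnames and job ids):

# confirm the MY_RESOURCE_2 VM is advertised and unclaimed
condor_status -constraint 'MY_RESOURCE_2 =?= TRUE'
# ask why an idle bar job isn't being matched
condor_q -analyze <bar-job-id>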
Is there anything I can do to speed up progress on resolving this 
issue?  If the Condor team doesn't have resources to track down this 
issue, could I get a link to the Condor source code so that I can 
examine this problem in greater detail? This problem is really holding up some of my work, and though I could put together a complex hack to work around it, I'd rather pursue a proper fix first. Of course, I'm
willing to wait a release or two for this (dare I say) bug to be fixed, 
but I haven't received any acknowledgement or feedback as to the cause 
of this problem.
I'm using Condor 6.7.18, and have reproduced this problem on X86_64 and 
INTEL architectures, with custom and (more-or-less) default Condor 
configurations.
As always, feel free to email me or call me if you'd like greater 
detail.  Thanks!
 - Armen

Armen Babikyan wrote:

Hi,

A couple of weeks ago I made a post about a problem with Condor's scheduling mechanism not advancing a pipeline that involves two consecutive singleton dedicated resources. I followed up last week, but haven't received any replies. Has this issue been a problem for anyone else? I could be making a simple configuration mistake, but I'm pretty sure my configuration is as it ought to be. In any case, I've tarballed up my scenario and posted it online:
http://www.static.net/~armenb/condor-pipeline-test.tar.gz

First, add the following machine attributes to one execute machine's condor_config.local:
MY_RESOURCE_1 = TRUE
VM1_STARTD_EXPRS = MY_RESOURCE_1

and add these machine attributes to a different execute machine:

MY_RESOURCE_2 = TRUE
VM1_STARTD_EXPRS = MY_RESOURCE_2

and restart condor on both machines. After unpacking the tarball, enter the condor-pipeline-test directory, and run these commands:
condor_submit_dag -no_submit special_0.dag
condor_submit_dag -no_submit special_1.dag
condor_submit_dag -no_submit special_2.dag
condor_submit_dag -no_submit special_3.dag
condor_submit_dag test.dag

After running the above commands, the problem I describe in my previous post (quoted below) should manifest itself. I am running Condor 6.7.18 on a 48-CPU Opteron (X86_64) system that has a globally-accessible NFS partition.
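To double-check that the attributes actually made it into the VM ads after the restart, something like the following should show MY_RESOURCE_1 on one execute host and MY_RESOURCE_2 on the other (hostnames here are placeholders):

condor_status -l vm1@execute-host-1 | grep MY_RESOURCE_1
condor_status -l vm1@execute-host-2 | grep MY_RESOURCE_2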
If you'd like any additional information, by all means, feel free to ask.

Thanks!

 - Armen

Armen Babikyan wrote:

Hi,

This issue is still causing problems for me. I've created a mini-scenario (independent of my larger project) that manifests the problem I described, though in a slightly different way (***). I will gladly provide a tar.gz of my scenario (I'd rather not spam this mailing list with attachments). Here is more detailed output of the problems I am seeing:
 1.0   armenb          4/19 14:19   0+00:17:50 R  0   3.8  condor_dagman -f -
 4.0   armenb          4/19 14:22   0+00:15:05 R  0   3.8  condor_dagman -f -
 5.0   armenb          4/19 14:22   0+00:14:59 R  0   3.8  condor_dagman -f -
 7.0   armenb          4/19 14:22   0+00:14:59 R  0   3.8  condor_dagman -f -
 8.0   armenb          4/19 14:22   0+00:14:59 R  0   3.8  condor_dagman -f -
 9.0   armenb          4/19 14:22   0+00:04:49 R  0   0.0  mysleep 600 -n 0
10.0   armenb          4/19 14:22   0+00:00:00 I  0   0.0  mysleep 600 -n 0
11.0   armenb          4/19 14:22   0+00:00:00 I  0   0.0  mysleep 600 -n 0
12.0   armenb          4/19 14:33   0+00:00:00 I  0   0.0  mysleep2 600 -n 0
mysleep requires MY_RESOURCE_1, which is provided by VM1 of machine grid-2. mysleep2 requires MY_RESOURCE_2, which is provided by VM1 of machine grid-3. grid-3's VM1 is currently idle and should be running mysleep2.

condor_q has this to say about job 12:

012.000:  Run analysis summary.  Of 50 machines,
   49 are rejected by your job's requirements
    0 reject your job because of their own requirements
    0 match but are serving users with a better priority in the pool
    1 match but reject the job for unknown reasons
    0 match but will not currently preempt their existing job
    0 are available to run your job

There is no mention of job "12.0" anywhere in the condor logs - only the following in the submit machine's spool directory:
local.grid-8/spool/job_queue.log:101 12.0 Job Machine
local.grid-8/spool/job_queue.log:103 12.0 GlobalJobId "grid-8.llan.ll.mit.edu#1145471586#12.0"
local.grid-8/spool/job_queue.log:103 12.0 ProcId 0
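(For anyone reproducing this, the same absence can be confirmed by grepping the daemon logs for the job id; the log directory below is a placeholder for wherever your LOG setting points.)

grep '12\.0' /path/to/condor/log/*Log*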

When job 9 finishes, condor_dagman adds another mysleep2 to the pipeline, but it also remains idle in the queue:
 1.0   armenb          4/19 14:19   0+00:25:40 R  0   3.8  condor_dagman -f -
 4.0   armenb          4/19 14:22   0+00:22:55 R  0   3.8  condor_dagman -f -
 5.0   armenb          4/19 14:22   0+00:22:49 R  0   3.8  condor_dagman -f -
 7.0   armenb          4/19 14:22   0+00:22:49 R  0   3.8  condor_dagman -f -
 8.0   armenb          4/19 14:22   0+00:22:49 R  0   3.8  condor_dagman -f -
10.0   armenb          4/19 14:22   0+00:02:36 R  0   0.0  mysleep 600 -n 0
11.0   armenb          4/19 14:22   0+00:00:00 I  0   0.0  mysleep 600 -n 0
12.0   armenb          4/19 14:33   0+00:00:00 I  0   0.0  mysleep2 600 -n 0
13.0   armenb          4/19 14:43   0+00:00:00 I  0   0.0  mysleep2 600 -n 0
(***) I should mention here that in my actual application, job 9 would 
not terminate until job 12 terminated, and this would be controlled by 
an external program that both job 9 and job 12 would be talking to.  
With the current problem I am describing, this causes deadlock.
*Finally*, only when job 10 exits does Condor fire up job 12 on grid-3:

 1.0   armenb          4/19 14:19   0+00:34:01 R  0   3.8  condor_dagman -f -
 4.0   armenb          4/19 14:22   0+00:31:16 R  0   3.8  condor_dagman -f -
 5.0   armenb          4/19 14:22   0+00:31:10 R  0   3.8  condor_dagman -f -
 7.0   armenb          4/19 14:22   0+00:31:10 R  0   3.8  condor_dagman -f -
 8.0   armenb          4/19 14:22   0+00:31:10 R  0   3.8  condor_dagman -f -
11.0   armenb          4/19 14:22   0+00:00:55 R  0   0.0  mysleep 600 -n 0
12.0   armenb          4/19 14:33   0+00:00:51 R  0   0.0  mysleep2 600 -n 0
13.0   armenb          4/19 14:43   0+00:00:00 I  0   0.0  mysleep2 600 -n 0
14.0   armenb          4/19 14:53   0+00:00:00 I  0   0.0  mysleep2 600 -n 0
Why doesn't Condor schedule an idle job when the resource it requires is available? How do I get Condor to be more opportunistic in its scheduling?
Any advice would be very helpful,

Thanks!

- Armen

Armen Babikyan wrote:
Hi,

In my Condor setup, I'm using Condor's DAG functionality to enforce ordering dependencies between different programs and to maximize the pipeline throughput of my system. I'm having a problem though:
In two consecutive stages of the pipeline (called B and C, 
respectively), interaction with some external hardware device occurs.  
The design of the hardware allows for pipelined control - the iteration 
#N of stage B can occur while iteration #(N-1)'s stage C is taking 
place.  Furthermore, there is only one of these hardware resources 
available.  I've written a proxy between Condor and this hardware, and 
have two programs which (should) effectively run and exit right after 
each other.
My experiment generates many of these "B -> C" DAGs. Several B jobs appear as idle in 'condor_q'; one gets to run, and the other B jobs sit idle, waiting for the first B job to exit. When that B job exits, Condor schedules a C job and another B job.
When only one B -> C process is occurring, the system runs fine: B runs, B exits, Condor schedules C, C runs, C exits, and the pipeline continues.
The problem I'm having is that with more than one "B -> C" DAG, the second B job runs, but the first C job sits idle in 'condor_q' forever. I'm not sure why. The single machine controlling this external hardware has two VMs on it (configured with NUM_CPUS = 2; it also has hyperthreading, but I'm pretty sure that shouldn't matter). I've made sure to define my two resources (in the machine and job configurations) and to add one resource to each of the VM1_STARTD_EXPRS and VM2_STARTD_EXPRS variables in the machine's config. In other words, Job B and Job C require different resources (e.g. JOB_B and JOB_C, the former provided by VM1 and the latter by VM2).
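Concretely, the relevant pieces of that setup look roughly like this (a sketch using the attribute names above, not a verbatim copy of my files):

# condor_config.local on the machine driving the hardware
NUM_CPUS         = 2
JOB_B            = TRUE
JOB_C            = TRUE
VM1_STARTD_EXPRS = JOB_B
VM2_STARTD_EXPRS = JOB_C

# Job B's submit file then has:
#   requirements = (JOB_B =?= TRUE)
# and Job C's submit file has:
#   requirements = (JOB_C =?= TRUE)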
I looked up C's job number among the log files of the machines in my 
cluster, and none have any mention of the job.  I can only find mention 
of my job in the spool directory of the submitting machine.  'condor_q 
-analyze' has this to say about the C job that won't run:
129.000:  Run analysis summary.  Of 50 machines,
  49 are rejected by your job's requirements
   0 reject your job because of their own requirements
   0 match but are serving users with a better priority in the pool
   0 match but reject the job for unknown reasons
   0 match but will not currently preempt their existing job
   1 are available to run your job

Any ideas or advice would be most helpful.  Thanks!

- Armen
--
Armen Babikyan
MIT Lincoln Laboratory
armenb@xxxxxxxxxx . 781-981-1796