Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


[Condor-users] problem with matched job sitting idle

Date: Wed, 12 Apr 2006 15:23:36 -0400
From: Armen Babikyan <armenb@xxxxxxxxxx>
Subject: [Condor-users] problem with matched job sitting idle

Hi,

In my Condor setup, I'm using Condor's DAG functionality to ensure order of dependencies between different programs, and to maximize efficiency of the pipeline through my system. I'm having a problem though:

In two consecutive stages of the pipeline (called B and C, respectively), interaction with some external hardware device occurs. The design of the hardware allows for pipelined control - the iteration #N of stage B can occur while iteration #(N-1)'s stage C is taking place. Furthermore, there is only one of these hardware resources available. I've written a proxy between Condor and this hardware, and have two programs which (should) effectively run and exit right after each other.

My experiment generates lots of these "B -> C" DAGs. Among several job B's that will appear as idle in 'condor_q', one will get to run, and the other job B's will sit idle, waiting for the first job B to exit. When the job B exits, Condor will schedule a job C, and another job B.
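
For illustration, one such "B -> C" DAG could be written as a DAGMan input file along the lines of the sketch below. The node names match the stages above, but the submit file names (stage_b.submit, stage_c.submit) are hypothetical and not taken from the actual setup.

```
# Hypothetical DAGMan input file for a single "B -> C" pair.
# stage_b.submit and stage_c.submit are placeholder submit file names.
JOB B stage_b.submit
JOB C stage_c.submit
# C becomes eligible to run only after B exits successfully.
PARENT B CHILD C
```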

When only one B -> C process is occurring, the system runs fine: B runs, B exits, Condor schedules C, C runs, C exits, pipeline continues.

The problem I'm having is that with more than one "B -> C" DAG, the second B-job runs, but the first C-job sits idle in 'condor_q' forever. I'm not sure why. The single machine controlling this external hardware has two VMs on it (configured with NUM_CPUS = 2; it also has hyperthreading, but that shouldn't matter, I'm pretty sure). I've made sure to define my two resources (in the machine and job configurations), and to add one resource to each of the VM1_STARTD_EXPRS and VM2_STARTD_EXPRS variables in the machine's config. IOW, Job B and Job C require different resources (e.g. JOB_B and JOB_C, the former provided by VM1 and the latter by VM2).
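
As a rough sketch of the arrangement described above, the machine's configuration and the jobs' requirements might look something like the following. Only NUM_CPUS, VM1_STARTD_EXPRS, VM2_STARTD_EXPRS, JOB_B and JOB_C are named in this message; the attribute values and the requirements expressions are assumptions for illustration, not the actual configuration.

```
# Machine configuration (sketch; the True values are assumed)
NUM_CPUS = 2
JOB_B = True
JOB_C = True
# Advertise one resource per VM: JOB_B on VM1, JOB_C on VM2.
VM1_STARTD_EXPRS = JOB_B
VM2_STARTD_EXPRS = JOB_C

# In job B's submit description file (assumed):
#   requirements = (JOB_B =?= True)
# In job C's submit description file (assumed):
#   requirements = (JOB_C =?= True)
```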

I looked up C's job number among the log files of the machines in my cluster, and none have any mention of the job. I can only find mention of my job in the spool directory of the submitting machine. 'condor_q -analyze' has this to say about the C job that won't run:


129.000:  Run analysis summary.  Of 50 machines,
    49 are rejected by your job's requirements
     0 reject your job because of their own requirements
     0 match but are serving users with a better priority in the pool
     0 match but reject the job for unknown reasons
     0 match but will not currently preempt their existing job
     1 are available to run your job

Any ideas or advice would be most helpful.  Thanks!

 - Armen

--
Armen Babikyan
MIT Lincoln Laboratory

armenb@xxxxxxxxxx . 781-981-1796

