Hi,
A couple of weeks ago I posted about Condor's scheduling mechanism not
advancing a pipeline that involves two singleton dedicated resources in
a row. I followed up last week, but haven't received any replies. Has
this been a problem for anyone else? I could be making a simple
configuration mistake, but I'm pretty sure my configuration is as it
ought to be. In any case, I've tarballed up my scenario and posted it
online:
http://www.static.net/~armenb/condor-pipeline-test.tar.gz
First, add the following machine attributes to one execute machine's
condor_config.local:
MY_RESOURCE_1 = TRUE
VM1_STARTD_EXPRS = MY_RESOURCE_1
and add these machine attributes to a different execute machine:
MY_RESOURCE_2 = TRUE
VM1_STARTD_EXPRS = MY_RESOURCE_2
and restart Condor on both machines. After unpacking the tarball, enter
the condor-pipeline-test directory, and run these commands:
condor_submit_dag -no_submit special_0.dag
condor_submit_dag -no_submit special_1.dag
condor_submit_dag -no_submit special_2.dag
condor_submit_dag -no_submit special_3.dag
condor_submit_dag test.dag
After running the above commands, the problem I describe in my previous
post (quoted below) should manifest itself. I am running Condor 6.7.18
on a 48-CPU Opteron (X86_64) system that has a globally-accessible NFS
partition.
If you'd like any additional information, by all means, feel free to ask.
Thanks!
- Armen
Armen Babikyan wrote:
Hi,
This issue is still causing problems for me. I've created a
mini-scenario (independent of my larger project) that manifests the
problem described, though in a slightly different way (***). I will
gladly provide a tar.gz of my scenario (I'd rather not spam this mailing
list with attachments). Here is more detailed output of the problems I
am seeing:
 1.0  armenb  4/19 14:19  0+00:17:50  R  0  3.8  condor_dagman -f -
 4.0  armenb  4/19 14:22  0+00:15:05  R  0  3.8  condor_dagman -f -
 5.0  armenb  4/19 14:22  0+00:14:59  R  0  3.8  condor_dagman -f -
 7.0  armenb  4/19 14:22  0+00:14:59  R  0  3.8  condor_dagman -f -
 8.0  armenb  4/19 14:22  0+00:14:59  R  0  3.8  condor_dagman -f -
 9.0  armenb  4/19 14:22  0+00:04:49  R  0  0.0  mysleep 600 -n 0
10.0  armenb  4/19 14:22  0+00:00:00  I  0  0.0  mysleep 600 -n 0
11.0  armenb  4/19 14:22  0+00:00:00  I  0  0.0  mysleep 600 -n 0
12.0  armenb  4/19 14:33  0+00:00:00  I  0  0.0  mysleep2 600 -n 0
mysleep requires MY_RESOURCE_1, which is provided by VM1 of machine
grid-2. mysleep2 requires MY_RESOURCE_2, which is provided by VM1 of
machine grid-3. grid-3's VM1 is currently idle, so Condor should be
scheduling mysleep2 on it.
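For anyone reading without the tarball, that requirement could be
expressed in a submit file roughly as follows. This is only a minimal
sketch: the file name mysleep.submit, the vanilla universe, and the log
file name are assumptions for illustration, not copied from the tarball.

```shell
# Hypothetical sketch: write a submit description that restricts mysleep
# to a VM advertising MY_RESOURCE_1. File names here are illustrative.
cat > mysleep.submit <<'EOF'
universe     = vanilla
executable   = mysleep
arguments    = 600 -n 0
requirements = (MY_RESOURCE_1 =?= TRUE)
log          = mysleep.log
queue
EOF
```

The submit file for mysleep2 would be the same apart from the executable
name and a requirement on MY_RESOURCE_2 instead.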
condor_q -analyze has this to say about job 12:
012.000: Run analysis summary. Of 50 machines,
49 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
1 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 are available to run your job
There is no mention of job "12.0" anywhere in the condor logs - only the
following in the submit machine's spool directory:
local.grid-8/spool/job_queue.log:101 12.0 Job Machine
local.grid-8/spool/job_queue.log:103 12.0 GlobalJobId "grid-8.llan.ll.mit.edu#1145471586#12.0"
local.grid-8/spool/job_queue.log:103 12.0 ProcId 0
When job 9 finishes, condor_dagman adds another mysleep2 to the
pipeline, but it also remains idle in the queue:
 1.0  armenb  4/19 14:19  0+00:25:40  R  0  3.8  condor_dagman -f -
 4.0  armenb  4/19 14:22  0+00:22:55  R  0  3.8  condor_dagman -f -
 5.0  armenb  4/19 14:22  0+00:22:49  R  0  3.8  condor_dagman -f -
 7.0  armenb  4/19 14:22  0+00:22:49  R  0  3.8  condor_dagman -f -
 8.0  armenb  4/19 14:22  0+00:22:49  R  0  3.8  condor_dagman -f -
10.0  armenb  4/19 14:22  0+00:02:36  R  0  0.0  mysleep 600 -n 0
11.0  armenb  4/19 14:22  0+00:00:00  I  0  0.0  mysleep 600 -n 0
12.0  armenb  4/19 14:33  0+00:00:00  I  0  0.0  mysleep2 600 -n 0
13.0  armenb  4/19 14:43  0+00:00:00  I  0  0.0  mysleep2 600 -n 0
(***) I should mention here that in my actual application, job 9 would
not terminate until job 12 terminated, and this would be controlled by
an external program that both job 9 and job 12 would be talking to.
With the current problem I am describing, this causes deadlock.
*Finally*, only when job 10 exits does Condor fire up job 12 on grid-3:
 1.0  armenb  4/19 14:19  0+00:34:01  R  0  3.8  condor_dagman -f -
 4.0  armenb  4/19 14:22  0+00:31:16  R  0  3.8  condor_dagman -f -
 5.0  armenb  4/19 14:22  0+00:31:10  R  0  3.8  condor_dagman -f -
 7.0  armenb  4/19 14:22  0+00:31:10  R  0  3.8  condor_dagman -f -
 8.0  armenb  4/19 14:22  0+00:31:10  R  0  3.8  condor_dagman -f -
11.0  armenb  4/19 14:22  0+00:00:55  R  0  0.0  mysleep 600 -n 0
12.0  armenb  4/19 14:33  0+00:00:51  R  0  0.0  mysleep2 600 -n 0
13.0  armenb  4/19 14:43  0+00:00:00  I  0  0.0  mysleep2 600 -n 0
14.0  armenb  4/19 14:53  0+00:00:00  I  0  0.0  mysleep2 600 -n 0
Why doesn't Condor schedule an idle job whose required resource should be
available? How do I get Condor to be more opportunistic in its scheduling?
Any advice would be very helpful.
Thanks!
- Armen
Armen Babikyan wrote:
Hi,
In my Condor setup, I'm using Condor's DAG functionality to ensure order
of dependencies between different programs, and to maximize efficiency
of the pipeline through my system. I'm having a problem though:
In two consecutive stages of the pipeline (called B and C,
respectively), interaction with some external hardware device occurs.
The design of the hardware allows for pipelined control - stage B of
iteration #N can occur while stage C of iteration #(N-1) is taking
place. Furthermore, there is only one of these hardware resources
available. I've written a proxy between Condor and this hardware, and
have two programs which (should) effectively run and exit right after
each other.
My experiment generates lots of these "B -> C" DAGs. Among several job
B's that will appear as idle in 'condor_q', one will get to run, and the
other job B's will sit idle, waiting for the first job B to exit. When
the job B exits, Condor will schedule a job C, and another job B.
When only one B -> C process is occurring, the system runs fine: B runs,
B exits, Condor schedules C, C runs, C exits, pipeline continues.
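To make the shape of one such DAG concrete, here is a minimal sketch of
a single "B -> C" DAG input file. The file names b_then_c.dag,
stage_b.submit and stage_c.submit are placeholders I'm using for
illustration, not the names from my actual setup.

```shell
# Hypothetical sketch: one "B -> C" DAG, where C may only start after B
# has exited successfully. All file names are illustrative.
cat > b_then_c.dag <<'EOF'
JOB B stage_b.submit
JOB C stage_c.submit
PARENT B CHILD C
EOF
```

Each iteration of the experiment would get its own DAG of this form,
submitted with condor_submit_dag.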
The problem I'm having is that with more than one "B -> C" DAG, the
second B-job runs, but the first C-job sits idle in 'condor_q' forever.
I'm not sure why. The single machine controlling this external hardware
has two VMs on it (configured with NUM_CPUS = 2; it also has
hyperthreading, but I'm pretty sure that shouldn't matter). I've made
sure to define my two resources (in the machine and job configurations),
and add one resource to each of the VM1_STARTD_EXPRS and
VM2_STARTD_EXPRS variables in the machine's config. IOW, Job B and Job
C require different resources (e.g. JOB_B and JOB_C, the former provided
by VM1 and the latter by VM2).
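Roughly, the machine-side part of that setup looks like the sketch
below. The attribute values (TRUE) and the output file name are
assumptions for illustration; the real settings live in that machine's
condor_config.local.

```shell
# Hypothetical sketch of the machine configuration described above:
# two VMs, with JOB_B advertised by VM1 and JOB_C advertised by VM2.
# Written to an illustrative file rather than the real config.
cat > condor_config.local.sketch <<'EOF'
NUM_CPUS = 2
JOB_B = TRUE
JOB_C = TRUE
VM1_STARTD_EXPRS = JOB_B
VM2_STARTD_EXPRS = JOB_C
EOF
```

The B and C submit files then each carry a requirement on the matching
attribute, so that a B job can only match VM1 and a C job only VM2.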
I looked up C's job number among the log files of the machines in my
cluster, and none have any mention of the job. I can only find mention
of my job in the spool directory of the submitting machine. 'condor_q
-analyze' has this to say about the C job that won't run:
129.000: Run analysis summary. Of 50 machines,
49 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
1 are available to run your job
Any ideas or advice would be most helpful. Thanks!
- Armen