Re: [HTCondor-users] Tracking DAGMan jobs
- Date: Mon, 30 Dec 2013 05:17:38 -0800 (PST)
- From: nathan.panike@xxxxxxxxx
- Subject: Re: [HTCondor-users] Tracking DAGMan jobs
Wrap it in a nested DAG and this should be pretty easy: the top-level DAG will handle all the messy details.

SUBDAG EXTERNAL mydag the-original-dag.dag
SCRIPT POST mydag post.script
SCRIPT PRE mydag pre.script
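The POST script on the SUBDAG node is the natural hook for the database update: DAGMan substitutes $JOB (the node name) and $RETURN (the sub-DAG's dagman exit code) into POST script arguments, so you'd attach it as `SCRIPT POST mydag post.script $JOB $RETURN`. A minimal sketch of post.script; the table and column names are made up, and here it just prints the SQL so you can pipe it to whatever client you use:

```shell
#!/bin/sh
# post.script - sketch only. Attach it with:
#   SCRIPT POST mydag post.script $JOB $RETURN
# record_dag_status prints an UPDATE statement; pipe it to your DB
# client (e.g. `... | sqlite3 jobs.db`). Schema is hypothetical.
record_dag_status() {
  node="$1"    # $JOB: the SUBDAG node name
  status="$2"  # $RETURN: exit code of the sub-DAG's dagman
  printf "UPDATE jobs SET status=%s, finished=datetime('now') WHERE node='%s';\n" \
    "$status" "$node"
}

record_dag_status mydag 0
```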
Nathan Panike
> From: Brian Candler <b.candler-e+AXbWqSrlAAvxtiuMwx3w@xxxxxxxxxxxxxxxx>
> Subject: Tracking DAGMan jobs
> Date: Mon, 30 Dec 2013 12:16:47 +0000
>
> I wish to submit DAGs and track them in a database. When each DAG
> completes, I want the database to update, and record the success/fail
> status. I'm sure I can't be the first person to want to do this :-)
>
> However I'm having trouble working out the best way to interact with Condor.
>
>
> 1. I could add a FINAL node to the DAG itself - ideally a NOOP job with
> a SCRIPT PRE. This gets $DAG_STATUS and $FAILED_COUNT parameters.
>
> However, when I go to update the database, I'll want to know the
> clusterID of the dagman process itself (to find the corresponding row
> for the job submission). This won't be known until condor_submit_dag is
> run, so I can't hardcode it in the FINAL node unless I allocate my own
> independent IDs. Is there a way to get this?
>
> $JOBID doesn't seem to help - it's the ID of an individual DAG node, not
> the dagman job itself. Indeed, dagman rejects it:
>
> 12/30/13 11:23:36 Warning: $JOBID macro should not be used as a PRE
> script argument!
> 12/30/13 11:23:36 ERROR: Warning is fatal error because of
> DAGMAN_USE_STRICT setting
>
> Similarly, $CLUSTER and $(CLUSTER) are also rejected.
>
> Now, I've done a bit of experimentation:
>
> $ cat testfinal.dag
> FINAL final_node /dev/null NOOP
> SCRIPT PRE final_node do_final.sh $DAG_STATUS $FAILED_COUNT
>
> $ cat do_final.sh
> #!/bin/sh
> exec >>/tmp/do_final.out
> echo "Args: $@"
> printenv
>
> $ condor_submit_dag testfinal.dag
>
> With this, I find the dagman cluster ID is in environment variable
> "CONDOR_ID" (without a leading underscore). This seems to be completely
> undocumented; the manual only talks about the CONDOR_IDS setting, which
> is unrelated.
>
> Looking at the source, this behaviour happens *only* for scheduler
> universe jobs:
>
> # src/condor_schedd.V6/schedd.cpp
> Scheduler::start_sched_universe_job(PROC_ID* job_id)
> ....
> // stick a CONDOR_ID environment variable in job's environment
> char condor_id_string[PROC_ID_STR_BUFLEN];
> ProcIdToStr(*job_id,condor_id_string);
> envobject.SetEnv("CONDOR_ID",condor_id_string);
>
> Furthermore, I don't see any code which makes use of this value. How
> safe is it to rely on this? If it's used by some well-known external
> application (e.g. Pegasus) then it could be dependable.
>
> Quick look at Pegasus source: yes, I think that's what it's there for.
>
> $ grep -R CONDOR_ID .
> ../bin/pegasus-dagman: arguments.insert(0, "condor_scheduniv_exec."+os.getenv("CONDOR_ID"))
> ../bin/pegasus-dagman: dagman_bin=os.path.join(os.getcwd(),"condor_scheduniv_exec."+os.getenv("CONDOR_ID"))
> ../test/exitcode/largecode.out: <env key="CONDOR_ID">511497.0</env>
>
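If you do rely on CONDOR_ID, the FINAL-node script only needs to split it apart, since it holds "<cluster>.<proc>" of the dagman job itself. A sketch of how do_final.sh could use it; the record it prints is a stand-in for the real database write:

```shell
#!/bin/sh
# Sketch extending do_final.sh: CONDOR_ID is "<cluster>.<proc>" of the
# dagman job, set by the schedd for scheduler-universe jobs. A real
# script would write this record to the database instead of printing it.
dag_db_record() {
  dag_status="$1"            # $DAG_STATUS from the SCRIPT PRE line
  failed_count="$2"          # $FAILED_COUNT from the SCRIPT PRE line
  cluster="${CONDOR_ID%%.*}" # strip ".<proc>" to get the cluster ID
  printf "cluster=%s dag_status=%s failed=%s\n" \
    "$cluster" "$dag_status" "$failed_count"
}

CONDOR_ID=70.0
dag_db_record 0 0
```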
>
> 2. I could use -append or -insert_sub_file to modify the dagman
> submission file. This will have $(cluster) available, and I could try to
> use +PostCmd. But the documentation says this is only for vanilla
> universe jobs (and the PostCmd runs on the execute machine for the job),
> whereas dagman is a scheduler universe job, and runs on the scheduler host.
>
> http://research.cs.wisc.edu/htcondor/manual/current/condor_submit.html#80133
>
> Also, I can't see any way to get at the DAGman exit code in a macro
> which could be passed to PostArgs.
>
>
> 3. I can create a NODE_STATUS_FILE or JOBSTATE_LOG file, or take the
> *.dagman.out file, and poll it periodically. Or I can poll the queue and
> look for the dagman clusterID, wait for it to vanish from the queue,
> then check the file. Both of these seem pretty messy to me.
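Option 3 can be made a little less messy with condor_wait, which blocks until the jobs in a log file have left the queue; after that you can scrape the status from the *.dagman.out file rather than polling. A sketch, assuming the "EXITING WITH STATUS" line dagman prints at shutdown (worth verifying against your version's output):

```shell
#!/bin/sh
# Sketch of option 3: block until dagman finishes, then scrape its exit
# status from the .dagman.out file. The sample line below mimics what
# dagman writes at shutdown (format may vary between versions).
#   condor_wait testfinal.dag.dagman.log   # blocks until dagman exits
dagman_exit_status() {
  # print the number after "EXITING WITH STATUS" on the last matching line
  sed -n 's/.*EXITING WITH STATUS \([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}

printf '%s\n' '... (condor_DAGMAN) pid 1234 EXITING WITH STATUS 0' > sample.dagman.out
dagman_exit_status sample.dagman.out
```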
>
> 4. I could send out an E-mail on completion to a special address which
> triggers a handler script which parses the mail. I really really don't
> want to do this.
>
> Anybody else done something like this?
>
> It's really only DAGs I'm worried about for now, although I suppose it
> would be good to be able to track one-off jobs in the same way. They
> could always be wrapped in a DAG.
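Wrapping a one-off job really is just a one-node DAG around the existing submit file, so the same FINAL-node bookkeeping applies unchanged. A sketch (file names are placeholders):

```
# wrap-job.dag - one-node wrapper around an ordinary submit file
JOB onejob the-job.sub
FINAL final_node /dev/null NOOP
SCRIPT PRE final_node do_final.sh $DAG_STATUS $FAILED_COUNT
```

Then submit it with `condor_submit_dag wrap-job.dag` instead of `condor_submit the-job.sub`.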
>
> Thanks,
>
> Brian.
>