Hi Christopher,
As you have noticed, the POST script for a DAG node executes when the whole job cluster is complete (success or failure), not once per proc. The only way to run a POST script at the end of each job proc is to avoid multi-proc nodes and spread the procs out into individual single-proc nodes, each with its own POST script; a minimal sketch is below.
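For a concrete picture, here is a minimal sketch of that layout (the node names, work.sub, and check_result.sh are hypothetical placeholders; work.sub would queue a single job and use $(procid) to select its work item):

    # One single-proc node per former ProcId, each with its own POST script.
    # $RETURN expands to the exit code of the node's job.
    JOB work0 work.sub
    VARS work0 procid="0"
    SCRIPT POST work0 check_result.sh $RETURN

    JOB work1 work.sub
    VARS work1 procid="1"
    SCRIPT POST work1 check_result.sh $RETURN

    # ... and so on, out to the last proc

A short script can generate those stanzas, so the 50,000 nodes would not have to be written by hand.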
I will also note that with the current setup, if any of the ~50,000 procs fails, DAGMan will remove all the other procs and put the node into failure mode. I would also recommend against using a wrapper script around the actual executable, since wrapper scripts are evil: they hide vital information from HTCondor, risk not handling certain failures correctly, and make debugging even more difficult.
All this being said, one idea does come to mind: create a local universe job that runs under the DAG and monitors the other nodes, either by querying the queue or by tailing the *.nodes.log file the way DAGMan proper does. A rough sketch follows the questions below.

What are the full requirements of this situation? If one of the many procs succeeds, you want to remove the entire cluster, but should the entire DAG exit? Is there only a single node in the DAG that this applies to? Are there other nodes in the DAG? If so, how complex is the DAG structure? Also, what version of HTCondor are you currently using?
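As a rough illustration of the monitor idea (a sketch, not a drop-in solution), the monitor could be a local universe job so it runs on the access point next to DAGMan:

    # Hypothetical monitor.sub
    universe    = local
    executable  = monitor.py
    arguments   = work
    log         = monitor.log
    output      = monitor.out
    error       = monitor.err
    queue

The script below is a sketch against the htcondor Python bindings; the node name "work" and the assumption that exit code 0 marks a positive result are both placeholders:

    #!/usr/bin/env python3
    # monitor.py (sketch): watch a sibling DAG node's cluster and remove
    # the whole cluster as soon as any proc exits successfully.
    import sys
    import time

    import htcondor

    node_name = sys.argv[1] if len(sys.argv) > 1 else "work"
    schedd = htcondor.Schedd()

    # DAGMan sets DAGNodeName on every node job it submits, so poll the
    # queue until the big node's cluster shows up.
    cluster = None
    while cluster is None:
        ads = schedd.query(f'DAGNodeName == "{node_name}"',
                           projection=["ClusterId"])
        if ads:
            cluster = ads[0]["ClusterId"]
        else:
            time.sleep(30)

    # Completed procs leave the queue, so look in the history for one
    # that exited with code 0 (assumed here to be the positive result).
    while True:
        hits = schedd.history(f"ClusterId == {cluster} && ExitCode == 0",
                              projection=["ProcId"], match=1)
        if any(True for _ in hits):
            # Positive result found: remove every remaining proc.
            schedd.act(htcondor.JobAction.Remove, f"ClusterId == {cluster}")
            break
        time.sleep(60)

Depending on your HTCondor version, this could even be a DAGMan SERVICE node, so it starts with the DAG and is cleaned up when the DAG ends.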
-Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Christopher Phipps <hawtdogflvrwtr@xxxxxxxxx>
Sent: Monday, January 29, 2024 8:32 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] POST on each Proc

I'm in a situation where I need a DAG to run a POST script (or something equivalent) after each ProcId finishes when queue is greater than 1 in the submit file, to determine whether the remainder of the jobs within that cluster, or other jobs performing a similar action within another DAG, should be aborted. Obviously the DAG only runs POST after an entire job cluster completes, but I'm curious whether there is another way to run something after each ProcId finishes so we can kill the remainder of the jobs if we get a result elsewhere.

Our clusters generally contain about 50,000 15-hour jobs that can/should be exited if one of the procs (ProcIds) finishes with a positive result. Given we only have 10,000 cores to work against, we could have a positive result minutes into processing, but have to wait the days necessary for all jobs to complete in order to know this in POST.

I've researched ABORT-DAG-ON, but we have little control over the application we run, so I'd need to write a wrapper that interprets the results and exits appropriately to stop the jobs within the cluster, and then handle the removal of like jobs in FINAL after the abort. I'm just curious whether there is a way to keep using the native binary without a wrapper, and without having to define each ProcId manually in the DAG.

Thanks,
ChrisP