
Re: [HTCondor-users] Max Idle and DAGS



Hi Jeff,

MaxIdle is something we are actively working on at the CHTC, along with improving how DAGMan manages nodes that contain more than a single job. DAGMan's MaxIdle behaves very differently from the max_idle command in the Job Description Language (JDL): the JDL version creates a late materialization factory in the Schedd, while DAGMan's MaxIdle acts as a threshold for placing more jobs on the AP. I say threshold because DAGMan can currently place jobs past this 'max' limit of idle jobs in various situations.

The issue you are likely experiencing comes from the fact that DAGMan has two methods of placing jobs on the AP. The first is shelling out to condor_submit on behalf of the user. The second is materializing the jobs itself and placing them on the AP directly. Until recently (v24.2.1), the latter did not respect any late materialization settings defined in the job description. If your AP is running an earlier version, try setting DAGMAN_USE_DIRECT_SUBMIT = False in the AP configuration.
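
As a minimal sketch of that workaround, the setting could go into a local configuration file on the AP. Where the file lives and what it is called depend on your installation, so the comment below only assumes a typical local config directory:

```
# AP-side HTCondor configuration (file name and location are site-specific;
# a file in the local config directory is assumed here).
# Makes DAGMan shell out to condor_submit instead of placing jobs directly,
# so that max_idle in the node's submit description is handled by the Schedd.
DAGMAN_USE_DIRECT_SUBMIT = False
```

I would expect only DAGs submitted after the change to pick it up, since DAGMan reads its configuration when it starts.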

Cheers,
Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Jeff Templon <templon@xxxxxxxxx>
Sent: Monday, January 6, 2025 2:07 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Max Idle and DAGS
 
Hi,

We’re trying to get our users to use the late materialisation factory stuff, to help avoid tens of thousands of queued jobs.

It doesn’t seem to work though, with DAGs - even though the docs suggest it should.  Quoting one of our users:

I submitted these jobs using condor_submit_dag and setting the -MaxIdle flag as suggested when doing condor_submit_dag -h, as well as max_idle in the submission file of the sub-job. According to the documentation the flag is instead -maxidle so I also tried that to no avail.

According to the documentation these flags should set the DAGMAN_MAX_JOBS_IDLE key in the configuration. By looking at this key in the logs it appears that the flags are indeed ignored. Creating a local config file with DAGMAN_MAX_JOBS_IDLE = 20 changes the value in the logs, but from condor_q it still appears that more than 20 jobs can be idle. 

Any ideas?

Happy New Year,

JT