Thanks Mark,
I have many users sending similar DAGs and they have no issues. This particular submitter is sending a lot more jobs, so maybe it's load related.
This issue happens sporadically.
The DAG file:
SUBDAG EXTERNAL A xxx/xxx/a.dag
SUBDAG EXTERNAL B yyy/yyy/b.dag
PARENT A CHILD B
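To make the structure explicit, here is a minimal sketch that reads those three lines and lists the sub-DAGs and the dependency between them. `parse_dag` is a hypothetical helper written for this message, not part of HTCondor, and it only understands the two commands used above.

```python
def parse_dag(text):
    """Parse SUBDAG EXTERNAL and PARENT/CHILD lines (hypothetical helper)."""
    subdags = {}  # node name -> path of the external .dag file
    edges = []    # (parent, child) pairs
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue
        keyword = parts[0].upper()
        if keyword == "SUBDAG" and len(parts) >= 4 and parts[1].upper() == "EXTERNAL":
            subdags[parts[2]] = parts[3]
        elif keyword == "PARENT":
            upper = [p.upper() for p in parts]
            if "CHILD" not in upper:
                continue  # malformed line, ignored in this sketch
            split = upper.index("CHILD")
            parents = parts[1:split]
            children = parts[split + 1:]
            # Every parent must finish before any child starts.
            edges.extend((p, c) for p in parents for c in children)
    return subdags, edges


DAG_TEXT = """\
SUBDAG EXTERNAL A xxx/xxx/a.dag
SUBDAG EXTERNAL B yyy/yyy/b.dag
PARENT A CHILD B
"""

subdags, edges = parse_dag(DAG_TEXT)
print(subdags)
print(edges)
```

So the top-level DAG only submits two external sub-DAGs, with B waiting for A.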
This is our configuration (DAG related):
DAGMAN_MAX_JOBS_IDLE = 25000
DAGMAN_MAX_JOB_HOLDS = 5000
DAGMAN_MAX_SUBMITS_PER_INTERVAL = 100
DAGMAN_MAX_SUBMIT_ATTEMPTS = 1
DAGMAN_USE_STRICT = 0
DAGMAN_USE_CONDOR_SUBMIT = TRUE
DAGMAN_MAX_JOBS_SUBMITTED is not configured, so its value is 0.
Sometimes I can see that the shared port daemon is under high load and refuses connections, but that does not happen at the same time as this issue.
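To check which DAGMan job ads lack the attributes, a sketch like the following could be run on the text printed by `condor_q -long` for a DAGMan job, assuming the usual one `Name = value` line per attribute. `missing_dagman_attrs` is a hypothetical helper and the sample ad is made up for illustration; only the attribute names come from the warnings in the DAGMan output.

```python
# Attribute names taken from the warnings in the DAGMan output.
EXPECTED_ATTRS = [
    "DAGMan_MaxIdle",
    "DAGMan_MaxJobs",
    "DAGMan_MaxPreScripts",
    "DAGMan_MaxPostScripts",
    "DAGMan_MaxHoldScripts",
]


def missing_dagman_attrs(long_ad_text):
    """Return the expected attributes that do not appear in the ad text."""
    present = set()
    for line in long_ad_text.splitlines():
        name, sep, _value = line.partition("=")
        if sep:
            # ClassAd attribute names are case-insensitive, so compare lowercased.
            present.add(name.strip().lower())
    return [attr for attr in EXPECTED_ATTRS if attr.lower() not in present]


# Made-up sample ad, for illustration only.
SAMPLE_AD = """\
ClusterId = 1
DAGMan_MaxIdle = 25000
DAGMAN_MaxJobs = 0
"""

missing = missing_dagman_attrs(SAMPLE_AD)
print(missing)
```

Comparing the result for a parent DAG and one of its child DAGs would show whether only the children are missing the attributes.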
Many Thanks
David
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Mark Coatsworth <coatsworth@xxxxxxxxxxx>
Sent: 07 June 2021 19:19
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Dagman Issue V9
Hi David,
Can you describe the dag where this is happening (or better yet, send me the .dag file)? When you mention a child dag, are you talking about an external subdag or something different?
By default every dag is supposed to have these attributes in its classad. I just did a quick test to verify this. So I'm wondering if there's something special about your environment causing it to not be there. Are you setting a custom value for max jobs? (either with the DAGMAN_MAX_JOBS_SUBMITTED configuration knob or the -maxjobs submit flag)
Mark
On Mon, Jun 7, 2021 at 6:54 AM <duduhandelman@xxxxxxxxxxx> wrote:
>
> Hi Again,
> I forgot to mention it's happening on dag that submit dag only the child dag are effected.
>
> Also, the child dag does not have those classads. which I don't know if it's ok or not.
>
> DAGMan_MaxIdle
> DAGMan_MaxJobs
> DAGMan_MaxPreScripts
> DAGMan_MaxPostScripts
> DAGMan_MaxHoldScripts
>
> Thanks Again,
> David
> ________________________________
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of duduhandelman@xxxxxxxxxxx <duduhandelman@xxxxxxxxxxx>
> Sent: 07 June 2021 14:29
> To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] Dagman Issue V9
>
> Hi All,
> A week ago I have upgrade to condor 9.0.1 from 8.8 I'm facing an issue with Dagman Jobs,
> Most of the jobs running as expected but some DAGMan are not submitting jobs after a while.
> It seems that Dagman job is asking for DagMan_Max_jobs and sometimes gets a positive value but sometimes gets negative number and that causing the issue I assume.
>
> The Sched debug print:
> GetAttributeInt(968372, 0 , DAGMAN_MaxJobs) not found.
>
> The Dag output display every few minutes:
> Warning: failed to get attribute DAGMan_MaxIdle
> Warning: failed to get attribute DAGMan_MaxJobs
> Warning: failed to get attribute DAGMan_MaxPreScripts
> Warning: failed to get attribute DAGMan_MaxPostScripts
> Warning: failed to get attribute DAGMan_MaxHoldScripts
>
>
> It seems like the value is garbage, probably not initialized.
> Any clues? can it be a security issue?
>
> Many Thanks
> David
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison