
Re: [HTCondor-users] job factory universe changed?



There were three relevant changes.

HTCONDOR-756 apply JOB_TRANSFORM_* and JOB_REQUIREMENT_* config of the schedd to late materialization jobs at submit time. 
https://opensciencegrid.atlassian.net/browse/HTCONDOR-756

HTCONDOR-1369 apply submit transforms to late mat factories and jobs
https://opensciencegrid.atlassian.net/browse/HTCONDOR-1369

HTCONDOR-1483 JOB_TRANSFORM vars that indicate cluster transforms and late materialization
https://opensciencegrid.atlassian.net/browse/HTCONDOR-1483

The first change went into 9.4.0, but it would not have caused you any problems, because it merely moved the transform and submit requirements checks from materialization time to the submit time of the factory.

This led to a bug report: transforms that *modified* job attributes whose values vary for each job at materialization time would give incorrect answers.  Note that if a transform sets the attribute, that would work.  The issue was when a transform did a modify operation on the attribute.

The fix for that is HTCONDOR-1369, which applies the transforms to *both* factories at submit time and jobs at materialization time.  This change went into 10.0.0, and it requires transforms to be stable, i.e., they must give the same result whether they are applied to both the factory and the subsequent job, or to just the job itself (for non-factory jobs).
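
To make "stable" concrete, here is a rough toy model in Python.  It is not HTCondor code: plain dicts stand in for job ClassAds, the helper names (tag_job_original, tag_job_stable, remove_acct_group, submit_late_materialization) are made up for illustration, and the ad contents are example values.  It only mimics the TagJob and RemoveAcctGroup transforms from this thread, under the assumption that a COPY of a missing attribute leaves the target undefined.

```python
# Toy model of schedd job transforms; dicts stand in for job ClassAds.
# All function names here are hypothetical, not part of any HTCondor API.

ACCT_ATTRS = ("AccountingGroup", "AcctGroup", "AcctGroupUser")


def tag_job_original(ad):
    """Mimics COPY_AcctGroup = "LigoSearchTag" followed by
    EVAL_SET_LigoSearchTag = LigoSearchTag ?: "None"."""
    ad = dict(ad)
    if "AcctGroup" in ad:
        ad["LigoSearchTag"] = ad["AcctGroup"]
    else:
        # Assumption: copying a missing attribute wipes out the target.
        ad.pop("LigoSearchTag", None)
    ad["LigoSearchTag"] = ad.get("LigoSearchTag", "None")
    return ad


def tag_job_stable(ad):
    """Variant that keeps an existing LigoSearchTag when AcctGroup is gone."""
    ad = dict(ad)
    if "AcctGroup" in ad:
        ad["LigoSearchTag"] = ad["AcctGroup"]
    ad.setdefault("LigoSearchTag", "None")
    return ad


def remove_acct_group(ad):
    """Mimics the RemoveAcctGroup transform for non-scheduler-universe jobs."""
    return {k: v for k, v in ad.items() if k not in ACCT_ATTRS}


def submit_late_materialization(ad, tag_job):
    """Apply the transforms to the factory, then again to the materialized job."""
    factory = remove_acct_group(tag_job(ad))   # submit time
    job = remove_acct_group(tag_job(factory))  # materialization time
    return job


submitted = {"AcctGroup": "llo.test", "Owner": "michael.thomas"}

broken = submit_late_materialization(submitted, tag_job_original)
fixed = submit_late_materialization(submitted, tag_job_stable)

print(broken["LigoSearchTag"])  # None
print(fixed["LigoSearchTag"])   # llo.test
```

The original transform gives a different answer on the second pass because the first pass deleted the attribute it copies from.  The stable variant gives the same answer whether it runs once or twice.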

The last change makes it possible for a transform to know whether it is being run on a job factory, a normal job submission, or a job materialization.  This change went into 10.0.1.

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Michael Thomas
Sent: Wednesday, June 21, 2023 12:20 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] job factory universe changed?

Hi TJ,

One last question on this topic.  Do you happen to know which condor 
version introduced this bugfix/change in behavior?

Thanks,

--Mike

On 6/19/23 10:44, John M Knoeller via HTCondor-users wrote:
> There was a change.  A bug fix actually.
> 
> Transforms and submit requirements are now applied to both the factory at submit time, and to the jobs as they materialize.  You can see that happening in the log
> 
> 06/16/23 15:00:56 (pid:5534) job_transforms for 19803671.-1: 2
> considered, 2 applied (TagJob,RemoveAcctGroup)
> ...
> 06/16/23 15:00:56 (pid:5534) Trying to Materializing new job 19803671.0
> step=0 row=0
> 06/16/23 15:00:56 (pid:5534) Trying to Materializing new job 19803671.1
> step=0 row=1
> 06/16/23 15:00:56 (pid:5534) job_transforms for 19803671.0: 2
> considered, 2 applied (TagJob,RemoveAcctGroup)
> 06/16/23 15:00:56 (pid:5534) CommitTransaction() failed for cluster
> 19803671 rval=-1 (Invalid value for search tag: None)
> 
> The first line is applying the transform to the factory.  When that finishes, the factory has no value for AccountingGroup, AcctGroupUser, and AcctGroup.
> 
> So when job 19803671.0 is materialized, it *also* has no value for these attributes, which it inherits from the factory.  So the transform does a COPY on these missing attributes and ends up replacing the LigoSearchTag which this job also inherited with undefined.
> 
> Then the submit requirement rejects the job because LigoSearchTag is missing.
> 
> What you need to do is change the TagJob transform so it does not overwrite a LigoSearchTag value if the job already has one.
> 
> -tj
> 
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Michael Thomas
> Sent: Friday, June 16, 2023 3:40 PM
> To: condor-users@xxxxxxxxxxx
> Subject: [HTCondor-users] job factory universe changed?
> 
> I'm trying to submit a set of jobs using the schedd late materialization
> job factory in condor 10.0.4.  I note that the same submit file and
> schedd configuration worked fine in condor v9, so I'm guessing there was
> some behavior change that I overlooked.
> 
> My submit file contains an accounting_group, which a job transform turns
> into a LigoSearchTag and validates that it has an acceptable value.
> 
> To start, here is my submit file:
> 
> executable = validate_files.sh
> log = /home/michael.thomas/condor/rawtrend/job.log.$(Process)
> universe = vanilla
> accounting_group=llo.test
> request_disk = 2048MB
> notification = Always
> notify_user = michael.thomas@xxxxxxxx
> should_transfer_files = YES
> stream_output = True
> request_HeavyNetwork = 1
> max_materialize = 5
> arguments = input/condor_input_$(Process)
> error = /home/michael.thomas/condor/rawtrend/validation/job.err.$(Process)
> output = /home/michael.thomas/condor/rawtrend/validation/job.out.$(Process)
> transfer_input_files = input/condor_input_$(Process),validaterawtrend
> transfer_output_files = validation
> preserve_relative_paths = True
> queue 10
> 
> ...and here are the job transforms:
> 
> JOB_TRANSFORM_NAMES = TagJob,RemoveAcctGroup
> 
> JOB_TRANSFORM_TagJob @=end
> [
>     COPY_AcctGroup = "LigoSearchTag";
>     COPY_AcctGroupUser = "LigoSearchUser";
>     EVAL_SET_LigoSearchTag = LigoSearchTag ?: "None";
>     EVAL_SET_LigoSearchUser = LigoSearchUser ?: Owner;
> ]
> @end
> 
> # do not strip accounting classads from scheduler universe
> # because their presence is necessary to propagate to child
> # jobs and sub-DAGs
> JOB_TRANSFORM_RemoveAcctGroup @=end
> [
> Requirements = JobUniverse != 7;
> delete_AccountingGroup = True;
> delete_AcctGroup = True;
> delete_AcctGroupUser = True;
> ]
> @end
> 
> SCHEDD_CLASSAD_USER_MAP_NAMES = $(SCHEDD_CLASSAD_USER_MAP_NAMES)
> ValidSearchTags ValidSearchUsers
> CLASSAD_USER_MAPFILE_ValidSearchTags = /etc/condor/accounting/valid_tags
> CLASSAD_USER_MAPFILE_ValidSearchUsers = /etc/condor/accounting/valid_users
> 
> SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) ValidateSearchTag
> ValidateSearchUser
> 
> SUBMIT_REQUIREMENT_ValidateSearchTag = JobUniverse == 7 || \
>     userMap("ValidSearchTags",LigoSearchTag) isnt undefined
> SUBMIT_REQUIREMENT_ValidateSearchTag_REASON = \
>     strcat("Invalid value for search tag: ",LigoSearchTag ?: "<undefined>")
> 
> SUBMIT_REQUIREMENT_ValidateSearchUser = \
>     JobUniverse == 7 || \
>     userMap("ValidSearchUsers",Owner,LigoSearchUser) is LigoSearchUser || \
>     userMap("ValidSearchUsers",Owner) is undefined && Owner =?=
> LigoSearchUser
> SUBMIT_REQUIREMENT_ValidateSearchUser_REASON = \
>     strcat("Invalid value for search user: ", LigoSearchUser ?:
> "<undefined>", "\n", \
>            "       Valid values are: ",userMap("ValidSearchUsers",Owner))
> 
> 
> Now when I submit, I'm getting an error that my search tag isn't found:
> 
> 06/16/23 15:00:56 (pid:5534) Calling HandleReq <handle_q> (0) for
> command 1112 (QMGMT_WRITE_CMD) from
> michael.thomas@xxxxxxxxxxxxxxxxxxxxxxxx <10.13.5.32:27419>
> 06/16/23 15:00:56 (pid:5534) job_transforms for 19803671.-1: 2
> considered, 2 applied (TagJob,RemoveAcctGroup)
> 06/16/23 15:00:56 (pid:5534) Return from HandleReq <handle_q> (handler:
> 0.045252s, sec: 0.002s, payload: 0.001s)
> 06/16/23 15:00:56 (pid:5534) Return from Handler
> <DaemonCore::HandleReqPayloadReady> 0.045702s
> 06/16/23 15:00:56 (pid:5534) Trying to Materializing new job 19803671.0
> step=0 row=0
> 06/16/23 15:00:56 (pid:5534) Trying to Materializing new job 19803671.1
> step=0 row=1
> 06/16/23 15:00:56 (pid:5534) job_transforms for 19803671.0: 2
> considered, 2 applied (TagJob,RemoveAcctGroup)
> 06/16/23 15:00:56 (pid:5534) CommitTransaction() failed for cluster
> 19803671 rval=-1 (Invalid value for search tag: None)
> 
> Which I presume means that either the transform failed to copy
> AccountingGroup to LigoSearchTag, or that it didn't execute in the
> scheduler universe and deleted the AccountingGroup tag.  Any tips on how
> to debug this or what might have changed between v9 and v10 are appreciated.
> 
> --Mike
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
> 