
Re: [HTCondor-users] transfer plugin exit codes



Hi Thomas,

If you want ideas, here is the job transform that we use to add the necessary attributes to any job that uses "stash://" URLs:

---
# These are the default values; they can be overridden later.
StashRetry_MaxRetries = 3
# Delay at least 300 seconds ...
StashRetry_MinimumDelay = 300
# ... and up to 300 more seconds, before retrying.
StashRetry_RandomDelay = 300

JOB_TRANSFORM_StashRetry @=jt
    REQUIREMENTS regexp("stash://", TransferInput)

    # https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html#HoldReasonCode
    transfer_input_error_code = 13
    # FT plugin error codes are left-shifted by 8.
    # stash plugin uses 11 for retriable failures.
    EVALMACRO transfer_input_error_subcode_retriable = 11 << 8
    EVALMACRO retry_delay = $(StashRetry_MinimumDelay) + random($(StashRetry_RandomDelay))

    SET StashRetryCondition \
        ( HoldReasonCode == $(transfer_input_error_code) && \
          HoldReasonSubCode == $(transfer_input_error_subcode_retriable) && \
          NumHoldsByReason.TransferInputError > 0 && \
          NumHoldsByReason.TransferInputError <= $(StashRetry_MaxRetries) \
         ) ?: false

    SET StashRetryTime EnteredCurrentStatus + $(retry_delay)

    SET PeriodicRelease ($(MY.PeriodicRelease:false)) || (StashRetryCondition && (time() > StashRetryTime))
@jt
---
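
For readers who want to reason about when the transform above releases a job, here is a rough Python model of the same logic. It is only a sketch: the function names and the dict-based job representation are made up for illustration, and it approximates ClassAd semantics (for example, the `?: false` fallback is modelled by returning False when an attribute is missing).

```python
import random

TRANSFER_INPUT_ERROR = 13      # HoldReasonCode checked by the transform
RETRIABLE_SUBCODE = 11 << 8    # plugin exit code 11, left-shifted by 8


def retry_delay(minimum=300, random_extra=300, rng=random):
    """Mirror StashRetry_MinimumDelay + random(StashRetry_RandomDelay)."""
    # randrange(n) yields an integer r with 0 <= r < n
    return minimum + rng.randrange(random_extra)


def stash_retry_condition(job, max_retries=3):
    """Approximate StashRetryCondition; missing attributes give False."""
    try:
        holds = job["NumHoldsByReason"]["TransferInputError"]
        return (
            job["HoldReasonCode"] == TRANSFER_INPUT_ERROR
            and job["HoldReasonSubCode"] == RETRIABLE_SUBCODE
            and 0 < holds <= max_retries
        )
    except KeyError:
        return False


def should_release(job, now, delay, max_retries=3, existing_release=False):
    """Approximate the PeriodicRelease expression set by the transform."""
    # Any pre-existing PeriodicRelease is OR-ed in first, as in the transform.
    return existing_release or (
        stash_retry_condition(job, max_retries)
        and now > job["EnteredCurrentStatus"] + delay
    )
```

In the transform, the delay is computed by EVALMACRO and stored in StashRetryTime as `EnteredCurrentStatus + <number>`, which is why the model takes `delay` as a plain argument instead of drawing a new random value on every check.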

-Mat


On 11/15/2022 10:24 AM, Thomas Hartmann wrote:
Hi Mat,

that sounds good :D

Something like that is what we envisage - where the jobs get released
occasionally and put back on hold until all their files have been staged
from tape for good.

Cheers and thanks,
    Thomas

On 15/11/2022 16.12, Mátyás Selmeci via HTCondor-users wrote:
Not strict at all -- OSG uses 11 in one of our plugins to indicate a
"retryable" failure. Any nonzero exit code results in a hold with the
HoldReasonSubCode being the exit code left shifted by 8 (so multiplied
by 256). We have a PeriodicRelease that retries the job after a random
delay in case it was one of these failures.
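
To make the arithmetic concrete, here is a small Python sketch of the shift described above. The helper names are hypothetical and exist only for this illustration; the only fact taken from the thread is that the subcode is the plugin exit code left-shifted by 8.

```python
PLUGIN_EXIT_SHIFT = 8  # exit code is left-shifted by 8, i.e. multiplied by 256


def subcode_from_exit_code(exit_code: int) -> int:
    """HoldReasonSubCode that results from a given plugin exit code."""
    return exit_code << PLUGIN_EXIT_SHIFT


def exit_code_from_subcode(subcode: int) -> int:
    """Recover the plugin exit code from a HoldReasonSubCode."""
    return subcode >> PLUGIN_EXIT_SHIFT


print(subcode_from_exit_code(11))    # 2816
print(exit_code_from_subcode(2816))  # 11
```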

-Mat

On 11/15/2022 8:25 AM, Thomas Hartmann wrote:
Hi all,

quick question on transfer plugins - how strict is the constraint on
exit codes 0,1,2? According to
https://htcondor.readthedocs.io/en/latest/admin-manual/setting-up-special-environments.html#enabling-the-transfer-of-files-specified-by-a-url
these three exit codes are the (only?) ones expected by Condor.

Potentially, I would like to distinguish between a few failure reasons,
e.g., a file that is not present vs. a file that is only nearline. That
way one could send a job back into hold and maybe release it later if a
file was nearline, but not release it if the file was not found in the
namespace, i.e., by evaluating `HoldReasonSubCode` occasionally.
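
As a sketch of that idea in Python: the two exit code values below are hypothetical, picked purely for illustration, and the function name is made up. The shift by 8 is the encoding described in the reply in this thread.

```python
from enum import IntEnum


class PluginExit(IntEnum):
    # Hypothetical exit codes, not values defined by HTCondor or any plugin.
    FILE_NOT_FOUND = 20   # file missing from the namespace: keep the job held
    FILE_NEARLINE = 21    # file still being staged from tape: retry later


def may_release_later(hold_reason_sub_code: int) -> bool:
    """Return True only for holds caused by a nearline file."""
    exit_code = hold_reason_sub_code >> 8  # undo the left shift by 8
    return exit_code == PluginExit.FILE_NEARLINE
```

A release policy built this way would leave jobs with a missing file on hold and only re-release the ones that were waiting on staging.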

Cheers,
    Thomas


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/