
Re: [HTCondor-users] Resubmitting failed job to a different site



Hi Mat

On 8 Feb 2023, at 18:36, Mátyás Selmeci via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Hi Jeff,

We (OSG staff) *strongly* discourage users from avoiding specifically named sites because:


Thanks - this was also not the user's intent.  The observation made is that if a job failed at a site right now, the chance that that site is fixed within the next two minutes is pretty low.  Hence, the desire to make a directive that does not exclude a specific site, it just says "for a failed job, don't re-run it at the place it just failed at, whatever that site happens to be."  It prevents the "black hole syndrome" where a site always has free slots because all jobs are failing, so it attracts a large set of jobs and fails them all.

The user is now in contact with someone from Miron's team and someone from Pegasus, to try and find an appropriate solution that works for all parties involved.

JT


1. if a site is persistently broken for certain types of jobs, we want to know about it so we can get it fixed
2. user jobs are likely going to keep avoiding the site even after the site gets fixed (because they won't know that the site has gotten fixed)

Instead, we'd like to figure out what property of an execute point is causing the failure, and tell the user to match based on that property.

We only tell a user to avoid a site to "stop the bleeding" in case it's very broken and, in that case, we tell the user to stop avoiding it when the site gets fixed.


That being said, to avoid a site in the OSPool (our preferred term for the pool of EPs running OSG VO jobs), add the site name (the GLIDEIN_Site attribute of the EPs on the site) to the job attribute "UNDESIRED_Sites".  This is a StringList (comma-or-space-separated list) that the START expressions of the EPs look at.  This is something that is specific to OSPool pilots, not a general HTCondor or GlideinWMS feature, but it can be added in the Glidein Frontend config.
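
As an illustration of the above, here is a minimal Python sketch that builds the submit-description line for that attribute.  It assumes the job is submitted with a plain HTCondor submit file, where a leading "+" sets a custom job attribute; how to pass the line through Pegasus is not covered.  The function name and the site names "SiteA" and "SiteB" are made up for the example.

```python
def undesired_sites_line(sites):
    """Build a submit-description line that sets the UNDESIRED_Sites job attribute.

    `sites` is an iterable of GLIDEIN_Site names. The names used in the usage
    note are hypothetical placeholders, not real OSPool sites.
    """
    cleaned = [name.strip() for name in sites if name.strip()]
    for name in cleaned:
        # The attribute is a comma-or-space-separated StringList, so a name
        # containing a separator or a quote would corrupt the list.
        if any(ch in name for ch in ',"') or any(ch.isspace() for ch in name):
            raise ValueError("invalid site name: {!r}".format(name))
    # A leading "+" in a submit file sets a custom attribute in the job ClassAd.
    return '+UNDESIRED_Sites = "{}"'.format(",".join(cleaned))
```

Calling undesired_sites_line(["SiteA", "SiteB"]) gives a line to paste into the submit file.  Per the advice above, it should be removed again once the site is fixed.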


-Mat (OSG Software)


On 2/8/2023 4:52 AM, Jeff Templon wrote:
Hi,
A user just asked me this question:
Using OSG sites, some jobs fail, if I just resubmit them they usually are retried at the same site and fail again.  Is there a way to ask to retry a job and explicitly avoid it being resubmitted to the same location?
Ps Not sure if it's relevant, but they're using PEGASUS.  He showed me a pegasus config file which was full of GLIDEIN directives.
Thanks!
JT
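
One possible way to express "don't re-run it where it just failed" without naming any site is HTCondor's job_machine_attrs submit command, which records chosen machine attributes of earlier run attempts in the job ad as MachineAttr<Name>0, MachineAttr<Name>1, and so on.  The Python sketch below only generates the submit lines.  It is an assumption, not something confirmed in this thread, that this fits the OSPool or Pegasus setup.  It also only helps when the same job is retried from the queue (for example after a hold and release), not when a fresh job is submitted.  The function name is made up for the example.

```python
def avoid_previous_sites(history_length=2, attr="GLIDEIN_Site"):
    """Return submit-description lines that steer a retried job away from the
    sites of its most recent run attempts.

    Assumption: the execute points advertise `attr` (GLIDEIN_Site in the
    OSPool, per the thread). The helper name is hypothetical.
    """
    if history_length < 1:
        raise ValueError("history_length must be at least 1")
    # MachineAttr<attr>0 is the most recent attempt, 1 the one before, etc.
    # On the first run these are undefined, so each "=!=" clause is true
    # whenever the execute point defines `attr`.
    clauses = [
        "(TARGET.{a} =!= MY.MachineAttr{a}{i})".format(a=attr, i=i)
        for i in range(history_length)
    ]
    return [
        "job_machine_attrs = {}".format(attr),
        "job_machine_attrs_history_length = {}".format(history_length),
        # Combine with any existing requirements expression using && rather
        # than replacing it.
        "requirements = " + " && ".join(clauses),
    ]
```

Printing "\n".join(avoid_previous_sites(2)) shows the three lines to merge into a submit file.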
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/