
Re: [HTCondor-users] Keeping track of RemoteHosts for restarted or preempted jobs



Hi,

Ok, I understand. Thank you so much. I'm enabling the JOB_EPOCH_HISTORY in my test cluster to see if it can help me get the information I need.

Cheers,

Carles

On Thu, 10 Jul 2025 at 17:50, John M Knoeller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
You should be aware that SYSTEM_JOB_MACHINE_ATTRS_HISTORY_LENGTH will grow your job_queue.log file fairly quickly if jobs are repeatedly trying to start and failing.

The EPOCH history file or the job's LOG file is a better way to get a record of where the job has run.
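
As a rough illustration of the job LOG route, the sketch below lists the execute points recorded in a job event log. It assumes that execute events begin with "001 (" and contain the text "Job executing on host: <address>"; the function name hosts_from_job_log is hypothetical, and the exact event layout may differ between HTCondor versions.

```shell
# Hypothetical helper: print one line per execute event found in a job
# event log, in the order the events were written.
# Assumption: execute events start with "001 (" and contain
# "Job executing on host: <address>".
hosts_from_job_log() {
  awk '/^001 \(/ && /Job executing on host: / {
         # Drop everything up to and including the marker text,
         # leaving only the recorded address.
         sub(/^.*Job executing on host: /, "")
         print
       }' "$1"
}

# Example call (the log path is a placeholder):
# hosts_from_job_log /path/to/job.log
```

Each start of the job should then appear as its own line, so a job that ran three times yields three addresses.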

I would recommend that you use SYSTEM_JOB_MACHINE_ATTRS_HISTORY_LENGTH only if you intend to reference the job attributes it creates in job policy expressions like Requirements. The history length should be no more than what you need for Requirements, etc.

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Carles Acosta <cacosta@xxxxxx>
Sent: Thursday, July 10, 2025 8:28 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Keeping track of RemoteHosts for restarted or preempted jobs

Hi Christoph,

Thank you very much! I'll try playing with SYSTEM_JOB_MACHINE_ATTRS_HISTORY_LENGTH :)

Cheers,

Carles

On Thu, 10 Jul 2025 at 15:04, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
Hi Carles,

Try on the schedd:

e.g.

SYSTEM_JOB_MACHINE_ATTRS_HISTORY_LENGTH = 10

SYSTEM_JOB_MACHINE_ATTRS_HISTORY_LENGTH

The integer number of run attempts to store in the job ClassAd when recording the values of machine attributes listed in SYSTEM_JOB_MACHINE_ATTRS. The default is 1. The history length may also be extended on a per-job basis by using the submit file command job_machine_attrs_history_length. The larger of the system and per-job history lengths will be used. A history length of 0 disables recording of machine attributes.


Also maybe interesting:

SYSTEM_JOB_MACHINE_ATTRS = " ... "

If you want to use it in a START expression, e.g. do not start on the same machine twice:

STARTD_ATTRS = JobMachineAttrs < ...>

set_Requirements = Base2Requirements && Target.Machine =!= MachineAttrMachine0 && Target.Machine =!= MachineAttrMachine1 <.. >

Best
christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Carles Acosta" <cacosta@xxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, 10 July 2025 14:34:34
Subject: [HTCondor-users] Keeping track of RemoteHosts for restarted or preempted jobs

Dear all,

On our site, the jobs can be preempted or restarted several times for various reasons. When a job finishes, the only host information we can retrieve is from the LastRemoteHost attribute. We have no record of the other execution nodes where the job has previously run.

We're looking for a way to keep track of the full list of hosts on which a job has been running.

We've been playing with condor_chirp to implement a custom ExecutionHostHistory attribute. The idea is to append the current host to a history variable in the job wrapper. Something like this:

# Host history
host=$(hostname)

previous_history=$(/usr/libexec/condor/condor_chirp get_job_attr ExecutionHostHistory)

if [[ $previous_history != "UNDEFINED" ]]; then
  new_history="${previous_history},${host}"
else
  new_history="${host}"
fi

/usr/libexec/condor/condor_chirp set_job_attr ExecutionHostHistory "\"${new_history}\""

This works correctly on the first run, and we can see ExecutionHostHistory set with the hostname value. However, when the job is restarted again, the attribute appears as undefined.
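
One possible cause, offered as an assumption rather than something confirmed in this thread: condor_chirp get_job_attr may print a string attribute together with its surrounding double quotes. On the second run the wrapper would then build a value with nested quotes, which is not a valid ClassAd string. The sketch below strips one pair of quotes before appending; the helper name append_host is hypothetical, and the chirp calls are shown only as comments.

```shell
# Hypothetical helper: build the new history string from the raw value
# returned by condor_chirp and the current host name.
# The body runs in a subshell, so its variables do not leak to the caller.
append_host() (
  previous=$1
  host=$2
  # Assumption: string attributes come back wrapped in double quotes,
  # so remove one leading and one trailing quote if present.
  previous=${previous#\"}
  previous=${previous%\"}
  if [ -z "$previous" ] || [ "$previous" = "UNDEFINED" ]; then
    printf '%s\n' "$host"
  else
    printf '%s,%s\n' "$previous" "$host"
  fi
)

# Intended use inside the job wrapper (not executed here):
# chirp=/usr/libexec/condor/condor_chirp
# raw=$("$chirp" get_job_attr ExecutionHostHistory)
# new_history=$(append_host "$raw" "$(hostname)")
# "$chirp" set_job_attr ExecutionHostHistory "\"${new_history}\""
```

Keeping the string handling in a separate function makes it easy to check outside of a running job which value would be written back.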

Has anyone tried to do something similar? Or maybe there is already a variable with this information that I haven't found?

Thank you very much in advance.

Best regards,

Carles

--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice: http://legal.ifae.es

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/

--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es