
Re: [HTCondor-users] aggregating job statistics over all job instances



Hi Cole and Matt,

For now, the idea is to provide our users with integrated usage statistics for their jobs. At the moment, they get a very rough summary of their last week's resource usage in CO2 equivalents. It would be nice to have a better handle for integrating over all instances of a job.

We have per-worker efficiency and benchmark ads that are added to the job ads [1], and it would be neat to use them for weighting each instance's run time per worker. E.g., take a job with three failed instances, where the last Runtime is in the "final" job ad as
  Runtime = 789
and maybe a list ad like
  RuntimeArrayAd = [123, 456, 789]
where each instance's Runtime is appended (or the Job*Date ads)? That way, one could calculate a "complete" weighted summary over all of a job's instances.

One likely difficulty is that the machine ads added to the jobs are actually stored as separate, incrementally numbered ads, so maybe something similar like
  Runtime0
  Runtime1
  ...
might be better(??) than a list?
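
To make the weighting idea concrete, here is a minimal sketch in Python. It assumes the hypothetical numbered ads Runtime0, Runtime1, ... from above exist and that index N refers to the same instance as MachineAttrHS06PerSlotN; neither is something HTCondor provides today, and the function name is made up for illustration.

```python
def weighted_runtime(job_ad, max_instances=100):
    """Sum the run times of all recorded instances of one job.

    Returns (plain seconds, HS06-weighted seconds).
    Assumption: Runtime<N> and MachineAttrHS06PerSlot<N> describe the
    same job instance; Runtime<N> is a hypothetical attribute.
    """
    total_seconds = 0
    hs06_seconds = 0.0
    for n in range(max_instances):
        runtime = job_ad.get(f"Runtime{n}")
        if runtime is None:
            break  # no further instances recorded
        # An instance without a benchmark value contributes weight 0.
        hs06 = job_ad.get(f"MachineAttrHS06PerSlot{n}", 0.0)
        total_seconds += runtime
        hs06_seconds += runtime * hs06
    return total_seconds, hs06_seconds


if __name__ == "__main__":
    # Job ad as a plain dict, using the example values from this mail.
    example_ad = {
        "Runtime0": 123,
        "MachineAttrHS06PerSlot0": 41.53061224489796,
    }
    print(weighted_runtime(example_ad))
```

With numbered ads, the lookup per instance stays a simple key access, which is the main argument for them over a list.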

---

Alternatively, one could use the EventLogs. In principle, we could probably parse the JSON events with Spark or similar and calculate the job instance details from the events, but that would likely add a bit of overhead.
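
A rough sketch of the event-based variant, in plain Python rather than Spark. The key names ("MyType", "EventTime", "Cluster", "Proc"), the event type names and the ISO timestamp format are assumptions about how the JSON events look, and the events are assumed to arrive in chronological order; all of this would need checking against real logs.

```python
from datetime import datetime

# Assumed event type names marking the start and end of one job instance.
START_EVENT = "ExecuteEvent"
STOP_EVENTS = {"JobEvictedEvent", "JobTerminatedEvent", "ShadowExceptionEvent"}


def instance_runtimes(events):
    """Return {(cluster, proc): [seconds per instance, ...]}.

    `events` is an iterable of dicts, one per JSON event, already
    sorted by time (assumption).
    """
    started = {}   # job id -> start time of the currently running instance
    runtimes = {}  # job id -> list of finished instance durations
    for event in events:
        job = (event["Cluster"], event["Proc"])
        when = datetime.fromisoformat(event["EventTime"])
        if event["MyType"] == START_EVENT:
            started[job] = when
        elif event["MyType"] in STOP_EVENTS and job in started:
            seconds = (when - started.pop(job)).total_seconds()
            runtimes.setdefault(job, []).append(seconds)
    return runtimes
```

The per-instance list could then be combined with the per-worker benchmark values, provided the execute event also tells us which worker ran the instance.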

Cheers,
  Thomas

[1]
MachineAttrCpus0 = 1
MachineAttrHS060 = 2035
MachineAttrHS06PerSlot0 = 41.53061224489796
MachineAttrHS06perWatt0 = 3.7
MachineAttrMachine0 = "batch1369.desy.de"


On 19/01/2023 17.26, Cole Bollig via HTCondor-users wrote:
Hi Thomas,

At the moment there isn't a very elegant solution: you either need to set up a job post-run analysis or have some other program/script that understands job states to query information about jobs. As Matthew stated, the Python bindings may be useful for this sort of monitoring/job management.

However, I have been working for a bit now on a feature for this exact query, where the shadow writes the job ad to a file like normal job history. It should hopefully be officially announced within a couple of feature series releases.

Cheers,
Cole Bollig
------------------------------------------------------------------------
*From:* HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Hartmann <thomas.hartmann@xxxxxxx>
*Sent:* Thursday, January 19, 2023 10:03 AM
*To:* HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
*Subject:* [HTCondor-users] aggregating job statistics over all job instances
Hi all,

I would like to collect job metrics for all runs of a job. So far my
approach would be a postCMD - but that seems not very elegant/condor-like.
I.e., a postCMD script following the payload job, that chirps a class ad
array, attaches a new element and chirps the updated job ad again to the
collector. However, one would need to be careful to not drop user postCMDs.
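
A minimal sketch of what such a postCMD script could do, assuming it runs where condor_chirp is available to the job. The attribute name RuntimeArrayAd and both function names are hypothetical; only the condor_chirp subcommands get_job_attr and set_job_attr are existing tools, and the exact text returned for a missing attribute is not relied upon.

```python
import subprocess


def append_to_classad_list(current, value):
    """Return a ClassAd list literal with `value` appended.

    `current` is the text of the existing attribute, e.g. '{123, 456}'.
    Anything that is not a '{...}' list (e.g. an undefined attribute)
    is treated as an empty list.
    """
    text = current.strip()
    if text.startswith("{") and text.endswith("}"):
        items = [item.strip() for item in text[1:-1].split(",") if item.strip()]
    else:
        items = []
    items.append(str(value))
    return "{ " + ", ".join(items) + " }"


def record_runtime(runtime, attr="RuntimeArrayAd"):
    """Read the list attribute via chirp, append, and write it back."""
    current = subprocess.run(
        ["condor_chirp", "get_job_attr", attr],
        capture_output=True, text=True,
    ).stdout
    updated = append_to_classad_list(current, runtime)
    subprocess.run(["condor_chirp", "set_job_attr", attr, updated], check=True)
```

This only handles flat lists of numbers, and it does nothing about chaining to a postCMD the user may have set.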

Maybe there is a better way?

Cheers,
  Thomas


