
Re: [HTCondor-users] [External] Additional GPU statistics



The script that's launched by the startd can do whatever it wants, really. The only thing Condor cares about is the ClassAd it gets back on stdout when the run finishes, so the script could deliver the data anywhere you wish, such as a Grafana receiver or the like. It doesn't even have to deliver a ClassAd at the end. It might be useful to set a "GpuStatsLastRun" attribute so you can easily check that it's still running. Just be sure to keep the run as tight as possible for best performance: a startd_cron job ideally shouldn't run for more than a couple of seconds. It's also good to have sanity checking, so that if the script's internal write fails it still exits promptly and doesn't leave the startd hanging around waiting for a ClassAd that isn't coming.
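As a rough illustration of the above, here is a minimal sketch of such a script in Python. It assumes nvidia-smi is on the PATH and that a StatsD-style UDP receiver is listening somewhere; the address, the metric names, and the "GpuStatsSamplesSent" attribute are made up for the example. Every external step has a timeout and swallows its own failures, so the script always finishes quickly and still prints a small ClassAd.

```python
#!/usr/bin/env python3
"""Sketch of a startd cron probe: ship GPU power readings elsewhere, print a tiny ClassAd."""
import socket
import subprocess
import time

# Hypothetical StatsD-style UDP receiver; host, port and metric names are assumptions.
METRICS_ADDR = ("127.0.0.1", 8125)


def parse_power_draw(text):
    """Turn nvidia-smi CSV output into one reading per GPU (None if not a number)."""
    readings = []
    for line in text.splitlines():
        value = line.strip()
        if not value:
            continue
        try:
            readings.append(float(value))
        except ValueError:
            readings.append(None)  # e.g. "[N/A]"; keep the slot so GPU indices line up
    return readings


def read_power_draw(timeout=2.0):
    """Query per-GPU power draw in watts; return [] on any failure or timeout."""
    cmd = ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout, check=True)
    except (OSError, subprocess.SubprocessError):
        return []
    return parse_power_draw(result.stdout)


def send_metrics(readings, addr=METRICS_ADDR):
    """Best-effort UDP send of gauge metrics; never raises. Returns datagrams sent."""
    sent = 0
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(0.5)
            for index, watts in enumerate(readings):
                if watts is None:
                    continue
                payload = f"gpu.{index}.power_draw_watts:{watts}|g"
                sock.sendto(payload.encode("ascii"), addr)
                sent += 1
    except OSError:
        pass  # a failed delivery must not delay or break the ClassAd output
    return sent


def format_classad(attrs):
    """Render a dict as 'Name = value' lines, quoting strings ClassAd-style."""
    lines = []
    for name, value in attrs.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"{name} = {rendered}")
    return "\n".join(lines)


if __name__ == "__main__":
    samples_sent = send_metrics(read_power_draw())
    print(format_classad({
        "GpuStatsLastRun": int(time.time()),
        "GpuStatsSamplesSent": samples_sent,  # hypothetical attribute name
    }))
```

UDP is used here only because a send to an absent receiver returns immediately; a TCP or HTTP receiver would need an explicit short timeout to get the same behavior.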

Also, this may be old info, but as I recall you can terminate a ClassAd output with a "==" double-equals sign which flags the startd that it doesn't have to keep looking for more attributes.

 

Michael Pelletier

Principal Technologist

High Performance Computing

Classified Infrastructure Services

 

C: +1 339.293.9149
michael.v.pelletier@xxxxxxx

 

From: Benedikt Riedel <briedel@xxxxxxxxxxxxxxxx>
Sent: Wednesday, March 20, 2024 9:17 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Pelletier, Michael V. RTX <Michael.V.Pelletier@xxxxxxx>
Subject: Re: [HTCondor-users] [External] Additional GPU statistics

 

Hi,

 

Is there a way to have the startd cron write somewhere else than the class ads to get a finer granularity?

 

Benedikt

 

On Wed, Mar 20, 2024 at 14:05 Pelletier, Michael V. RTX via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Hello,

 

What you'd want to do is set up a startd cron job. The ClassAd output from the job is pulled into the Machine ClassAd, where it becomes queryable by condor_status.

 

I do something similar with a job that calls ipmitool to check the power and cooling status of the machine and set a PowerOrCoolingFault Boolean attribute, allowing it to reject jobs if a PSU or fan fault is flagged.
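A probe along those lines might look roughly like the sketch below. This is not the actual script; it assumes the pipe-separated output of "ipmitool sdr type ..." with the sensor status in the third column, and it assumes that any status other than "ok" or "ns" counts as a fault. Both assumptions should be checked against the real hardware.

```python
#!/usr/bin/env python3
"""Sketch of a startd cron probe that sets PowerOrCoolingFault from ipmitool output."""
import subprocess

# Sensor types to check; which ones matter is site-specific (assumption).
SENSOR_TYPES = ("Power Supply", "Fan")


def has_fault(sdr_text, healthy=("ok", "ns")):
    """Return True if any sensor row reports a status outside the healthy set."""
    for line in sdr_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3:
            continue  # not a sensor row
        if fields[2].lower() not in healthy:
            return True
    return False


def read_sdr(sensor_type, timeout=2.0):
    """Run ipmitool for one sensor type; return its output, or None on failure."""
    cmd = ["ipmitool", "sdr", "type", sensor_type]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout, check=True)
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout


def render_ad(fault):
    """Render the single Boolean attribute as a ClassAd line."""
    return f"PowerOrCoolingFault = {'true' if fault else 'false'}"


if __name__ == "__main__":
    outputs = [read_sdr(sensor_type) for sensor_type in SENSOR_TYPES]
    # If ipmitool could not be queried at all, print nothing rather than guess.
    if all(output is not None for output in outputs):
        print(render_ad(any(has_fault(output) for output in outputs)))
```

The machine's START expression could then reference the attribute, for example something along the lines of START = $(START) && (PowerOrCoolingFault =!= True), though that exact expression is untested here.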

 

You can set the interval for startd cron jobs in the configuration. Bear in mind that the collector is only updated periodically, so running the job more often than that doesn't gain you anything. I think it's possible to push updates immediately from startd cron, but you'd want to keep an eye on the collector load in that case if you have a lot of machines.

 

-Michael Pelletier. 


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Benedikt Riedel <briedel@xxxxxxxxxxxxxxxx>
Sent: Wednesday, March 20, 2024 5:08:58 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] [HTCondor-users] Additional GPU statistics

 

Hi,

 

Is there a way to get additional GPU statistics like the power draw through condor? Is there a way to increase the query rate for GPU statistics from HTCondor?

 

Thanks,

 

Benedikt


--

Benedikt Riedel

Global Computing Coordinator IceCube Neutrino Observatory

Technical Coordinator IceCube Neutrino Observatory

Computing Manager Wisconsin IceCube Particle Astrophysics Center

University of Wisconsin-Madison
