
Re: [HTCondor-users] Pool-wide event log



On 11/2/2022 6:55 AM, Joachim Meyer wrote:
Hi all,

I didn't find any documentation on this: can I have a single pool-wide event 
log on the condor master node or is the event log only supported on the submit 
nodes?

By adding the following lines to a submit node's config, I was able to get an 
event log on that machine. On the master node, though, this does not have any 
effect.

EVENT_LOG = /var/log/condor/EventLog
EVENT_LOG_FORMAT_OPTIONS = JSON
EVENT_LOG_JOB_AD_INFORMATION_ATTRS = CpusProvisioned,MemoryProvisioned,GPUsProvisioned,DockerImage,JobBatchName

If a single pool-wide event log is possible, do I have to configure that 
specially?

A note on why we use the event log at all: we want to plug the HTCondor job 
information into ClusterCockpit <https://github.com/ClusterCockpit> and 
afaict, monitoring the event log is one of the better solutions to do so.

Thanks,
Joachim

Hi Joachim,

As you surmised above, the job event log is per-job, and can optionally be consolidated across all jobs/users on each access point (submit node) via EVENT_LOG.  There is currently no built-in mechanism to consolidate all the job event logs pool-wide.
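
If you do end up collecting the per-access-point EVENT_LOG files yourself, a reader for the JSON variant might look roughly like the sketch below. It only assumes that the file is a sequence of JSON objects written one after another, possibly spread over several lines, and that the last one may be only partially written while the schedd is still appending. The function name iter_json_events and the sample events (including the attribute names MyType, Cluster and Proc) are illustrative assumptions, not taken from a real log.

```python
import json
from typing import Iterator


def iter_json_events(text: str) -> Iterator[dict]:
    """Yield each complete JSON object found in `text`, in order.

    Assumption: the log is a plain concatenation of JSON objects,
    separated only by whitespace.
    """
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode() does not skip leading whitespace, so do it here.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        try:
            event, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Most likely a trailing event that is still being written;
            # stop here and let the caller retry later.
            break
        if isinstance(event, dict):
            yield event


# Hypothetical sample: two complete events and one truncated event.
SAMPLE = '''
{"MyType": "SubmitEvent", "Cluster": 12, "Proc": 0}
{
  "MyType": "ExecuteEvent",
  "Cluster": 12,
  "Proc": 0
}
{"MyType": "JobTerminat'''

events = list(iter_json_events(SAMPLE))
```

To use it on a real log, read the file named by EVENT_LOG on each access point (for example /var/log/condor/EventLog from the config above) and pass its contents to iter_json_events. Merging the streams from several access points into one pool-wide view would be up to your own tooling.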

Depending on the information ClusterCockpit wants to observe, the EVENT_LOG may be the perfect tool for the job.  Alternatively, be aware that condor_history can gather complete information for jobs that have left the system (i.e. either completed or removed) from both the access points (submit nodes) and execution points (execute nodes), and that condor_history can query remote systems.  In fact, running condor_history against remote machines is what condor_adstash does to push completed job information into Elasticsearch [1].

As for jobs still in the system, you can get job aggregates (total jobs running, idle, held, transferring files, etc.) from the condor_collector via "condor_status -schedd" or "condor_status -submitter", and of course information about the execute nodes themselves (including any attribute from the jobs running there).  Polling the condor_collector for this information is how condor_gangliad pushes information to Ganglia / Nagios [2].
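
A remote condor_history query of the kind described above could be scripted along these lines. This is a minimal sketch, not how condor_adstash itself is implemented: the helper names build_history_cmd and fetch_history and the host name ap1.example.org are made up for illustration, and the options used (-name, -json, -attributes, -limit) should be checked against the condor_history manual page for your installed version.

```python
import json
import subprocess
from typing import List, Optional, Sequence


def build_history_cmd(
    schedd_name: str,
    attrs: Sequence[str] = (),
    limit: Optional[int] = None,
) -> List[str]:
    """Build a condor_history command line for one (possibly remote) schedd."""
    cmd = ["condor_history", "-name", schedd_name, "-json"]
    if attrs:
        # Restrict the output to the listed job ClassAd attributes.
        cmd += ["-attributes", ",".join(attrs)]
    if limit is not None:
        cmd += ["-limit", str(limit)]
    return cmd


def fetch_history(
    schedd_name: str,
    attrs: Sequence[str] = (),
    limit: Optional[int] = None,
) -> list:
    """Run condor_history and return the parsed job ads.

    Assumption: with -json the tool prints a JSON array of job ads,
    or nothing at all when no job matches.
    """
    cmd = build_history_cmd(schedd_name, attrs, limit)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    output = result.stdout.strip()
    return json.loads(output) if output else []


# Hypothetical access point name; replace with one of your submit nodes.
example_cmd = build_history_cmd(
    "ap1.example.org", ["ClusterId", "ProcId", "JobStatus"], limit=5
)
```

Calling fetch_history once per access point and concatenating the results gives a pool-wide list of finished jobs, at the cost of polling rather than following events as they happen.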

[1] https://htcondor.readthedocs.io/en/latest/admin-manual/monitoring.html#elasticsearch

[2] https://htcondor.readthedocs.io/en/latest/admin-manual/monitoring.html#ganglia

regards,
Todd