Hello David!
There are many different reasons why your memory usage is not being recorded. One reason could be that the job never ran, or it ran for too short a time for HTCondor's periodic scan of usage to pick it up.
Another reason could be that the connection between the shadow and the starter is lost and the job runs to completion before the reconnect. Even if the reconnect succeeds, there may never be a memory usage update. This is fairly common for jobs that run for only
a few minutes when the schedd is restarted.
However, if your job did run, it should always be present in condor_history. So check the condor_history output for your job and see whether the memory usage is being tracked correctly there. It's best not to make any assumptions about memory usage from the user
job log and to check condor_history for the proper value.
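As a rough sketch of that check, something like the following could be used. The job ID 477410.0 is taken from the log event quoted below; the choice of attributes is my assumption about what is useful to look at, and the helper name job_mem_history is just illustrative. As far as I know, MemoryUsage and RequestMemory are reported in MB and ResidentSetSize_RAW in KB, but please verify against the manual for your version.

```shell
# Hypothetical helper: print memory-related attributes for one job
# from the schedd's history (attribute selection is an assumption).
job_mem_history() {
    condor_history "$1" -limit 1 \
        -af ClusterId ProcId MemoryUsage ResidentSetSize_RAW RequestMemory
}

# Only run the query where the HTCondor tools are installed.
if command -v condor_history >/dev/null 2>&1; then
    job_mem_history 477410.0
else
    echo "condor_history not found on this host" >&2
fi
```

An attribute that was never set for the job is printed as "undefined", which would itself point to the usage update never having arrived.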
I hope this helps!
Joe Reuss
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of David Cohen <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
Sent: Monday, April 1, 2024 9:34 AM
To: condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Job log not reporting Memory usage

Hi,
We are running condor version 10.9.0. Some of the jobs don't report memory usage in the job summary, nor do they update ResidentSetSize while running. For example, this job, run manually on a login node:
$ /usr/bin/time -v ./rsem_one_rep.sh
And when sent as a job, its log shows:
    Command being timed: "./rsem_one_rep.sh"
    User time (seconds): 12634.50
    System time (seconds): 191.66
    Percent of CPU this job got: 490%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 43:35.72
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 28911148
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 59384808
    Voluntary context switches: 65380
    Involuntary context switches: 20889
    Swaps: 0
    File system inputs: 63054832
    File system outputs: 39826176
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

005 (477410.000.000) 2024-04-01 14:14:56 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :     4.74        8         8
       Disk (KB)            :        1        1   1610814
       Memory (MB)          :        1    32768     32768
    Job terminated of its own accord at 2024-04-01T11:14:56Z with exit-code 0.
...