
Re: [HTCondor-users] Job log not reporting Memory usage



Hello David!

There are many different reasons why your memory usage is not being recorded. One reason could be that the job never ran, or it ran for too short a time for HTCondor's periodic scan of usage to pick it up.

Another reason could be that the connection between the shadow and the starter is lost and the job runs to completion before the reconnect. Even if the reconnect succeeds, there may never be a memory usage update. This is fairly common for jobs that run only a few minutes while the schedd is being restarted.

However, if your job did run, it should always be present in condor_history. So check condor_history for your job and see whether the memory usage is tracked correctly there. It's best not to make any assumptions about memory usage based on the user job log; check condor_history for the proper value.
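
As a rough sketch of that check, something like the following should print the memory attributes stored in the history for a single job. The helper name job_memory is made up for illustration, and the attribute list is only a suggestion. MemoryUsage is reported in MB and ResidentSetSize_RAW in KB.

```shell
# Hypothetical helper (not part of HTCondor): show what the history
# recorded for one job. The argument is a cluster.proc job ID.
job_memory() {
    # -limit 1 stops after the first matching history record.
    # -af (autoformat) prints the listed job ClassAd attributes on one line.
    condor_history "$1" -limit 1 -af ClusterId ProcId MemoryUsage ResidentSetSize_RAW RequestMemory
}

# Example call for the job in the log below (cluster 477410, proc 0):
#   job_memory 477410.0
```

If MemoryUsage comes back as undefined or as a very small number there too, the usage update most likely never reached the schedd. Running condor_history -l on the same job ID dumps the full ad if you want to look at everything that was recorded.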

I hope this helps!
Joe Reuss


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of David Cohen <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
Sent: Monday, April 1, 2024 9:34 AM
To: condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Job log not reporting Memory usage
 
Hi,
We are running condor version 10.9.0.
Some of the jobs report no memory usage in the job summary and do not update ResidentSetSize while running.

For example, this job, running manually on a login node:
$ /usr/bin/time -v ./rsem_one_rep.sh  
       Command being timed: "./rsem_one_rep.sh"
       User time (seconds): 12634.50
       System time (seconds): 191.66
       Percent of CPU this job got: 490%
       Elapsed (wall clock) time (h:mm:ss or m:ss): 43:35.72
       Average shared text size (kbytes): 0
       Average unshared data size (kbytes): 0
       Average stack size (kbytes): 0
       Average total size (kbytes): 0
       Maximum resident set size (kbytes): 28911148
       Average resident set size (kbytes): 0
       Major (requiring I/O) page faults: 0
       Minor (reclaiming a frame) page faults: 59384808
       Voluntary context switches: 65380
       Involuntary context switches: 20889
       Swaps: 0
       File system inputs: 63054832
       File system outputs: 39826176
       Socket messages sent: 0
       Socket messages received: 0
       Signals delivered: 0
       Page size (bytes): 4096
       Exit status: 0

And when sent as a job, its log shows:

005 (477410.000.000) 2024-04-01 14:14:56 Job terminated.
       (1) Normal termination (return value 0)
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
       0  -  Run Bytes Sent By Job
       0  -  Run Bytes Received By Job
       0  -  Total Bytes Sent By Job
       0  -  Total Bytes Received By Job
       Partitionable Resources :    Usage  Request Allocated  
          Cpus                 :     4.74        8         8  
          Disk (KB)            :     1           1   1610814  
          Memory (MB)          :     1       32768     32768  

       Job terminated of its own accord at 2024-04-01T11:14:56Z with exit-code 0.
...