
Re: [HTCondor-users] Job log not reporting Memory usage




Hi Joe,
That is not the case here: the job runs for about an hour.
Querying ResidentSetSize while the job is running shows either undefined at the beginning or, once updated, a few MB, the same as in the job log.
We observe that behavior when the job is submitted on its own.

When we submit it as part of a cluster of jobs, we often get over-reporting instead.
That is, jobs that need 28 GB and are submitted with a 40 GB request are put on hold for exceeding memory, with reported values of up to 73 GB.
Those jobs are monitored manually and stay within the memory limit.

Regards,
David


On Wed, Apr 3, 2024 at 11:29 PM Joe Reuss via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
Hello David!

There are many different reasons why your memory usage is not being recorded. One reason could be that the job never ran, or it ran for too short a time for HTCondor's periodic scan of usage to pick it up.

Another reason could be that the connection between the shadow and the starter was lost and the job ran to completion before the reconnect. In that case, even if the reconnect was successful, there may never be a memory usage update. This is fairly common for jobs that run only a few minutes when the schedd is restarted.

However, if your job did run, the memory usage should always be present in its condor_history record. So check condor_history for your job and see whether the memory usage is being tracked correctly there. It's best not to make any assumptions about memory usage based on the user job log, and to check condor_history for the proper value.
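The condor_history check suggested above could be scripted roughly as follows. This is only a sketch: it assumes condor_history is on the PATH of the submit node and uses its -limit and -af (autoformat) options. The helper names (history_command, parse_history_line, query_history) are made up for illustration. The attributes queried are MemoryUsage and RequestMemory (in MB) and ResidentSetSize_RAW (in KB).

```python
import subprocess

# Job ClassAd attributes to pull from the history record.
ATTRS = ["MemoryUsage", "ResidentSetSize_RAW", "RequestMemory"]


def history_command(job_id):
    """Build the condor_history command line for one job (e.g. "477410.0")."""
    return ["condor_history", job_id, "-limit", "1", "-af", *ATTRS]


def parse_history_line(line):
    """Map one line of -af output to {attribute: int or None}.

    An attribute that was never set is printed as "undefined"
    and is returned here as None.
    """
    values = line.split()
    if len(values) != len(ATTRS):
        raise ValueError(f"expected {len(ATTRS)} fields, got {len(values)}")
    result = {}
    for name, raw in zip(ATTRS, values):
        result[name] = None if raw == "undefined" else int(float(raw))
    return result


def query_history(job_id):
    """Run condor_history (needs an HTCondor submit node) and parse the result."""
    out = subprocess.run(history_command(job_id), capture_output=True,
                         text=True, check=True)
    lines = out.stdout.splitlines()
    return parse_history_line(lines[0]) if lines else None
```

On a submit node, query_history("477410.0") would return the three values for the job in this thread, with None for any attribute that was never updated.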

I hope this helps!
Joe Reuss


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of David Cohen <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
Sent: Monday, April 1, 2024 9:34 AM
To: condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Job log not reporting Memory usage

Hi,
We are running HTCondor version 10.9.0.
Some of our jobs report no memory usage in the job summary and do not update ResidentSetSize while running.

For example, this job, running manually on a login node:
$ /usr/bin/time -v ./rsem_one_rep.sh
        Command being timed: "./rsem_one_rep.sh"
        User time (seconds): 12634.50
        System time (seconds): 191.66
        Percent of CPU this job got: 490%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 43:35.72
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 28911148
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 59384808
        Voluntary context switches: 65380
        Involuntary context switches: 20889
        Swaps: 0
        File system inputs: 63054832
        File system outputs: 39826176
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

And when submitted as a job, its log shows:

005 (477410.000.000) 2024-04-01 14:14:56 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
        Partitionable Resources :    Usage  Request Allocated
           Cpus                 :     4.74        8         8
           Disk (KB)            :        1        1   1610814
           Memory (MB)          :        1    32768     32768

        Job terminated of its own accord at 2024-04-01T11:14:56Z with exit-code 0.
...
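
To put the two outputs above side by side, a small Python sketch like the following could extract the peak RSS from the /usr/bin/time -v output and the Memory (MB) row from the job log, then flag a large mismatch. The function names and the 0.5 threshold are illustrative assumptions, not part of HTCondor.

```python
import re


def max_rss_mb(time_output):
    """Peak RSS from GNU time -v output, converted from KB and rounded up to MB."""
    match = re.search(r"Maximum resident set size \(kbytes\):\s*(\d+)", time_output)
    if match is None:
        return None
    return (int(match.group(1)) + 1023) // 1024


def job_log_memory(log_text):
    """Usage, request and allocation (MB) from the job log's 'Memory (MB)' row."""
    match = re.search(r"Memory \(MB\)\s*:\s*(\d+)\s+(\d+)\s+(\d+)", log_text)
    if match is None:
        return None
    usage, request, allocated = (int(g) for g in match.groups())
    return {"usage": usage, "request": request, "allocated": allocated}


def looks_underreported(measured_mb, logged_usage_mb, threshold=0.5):
    """True when the logged usage is below threshold * the measured peak.

    The threshold is an arbitrary cut-off chosen for this sketch.
    """
    return logged_usage_mb < measured_mb * threshold
```

With the numbers in this message, the manual run peaks at 28911148 KB (about 28234 MB) while the job log reports a usage of 1 MB against a 32768 MB request, so looks_underreported flags the job.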

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/