Mailing List Archives
Authenticated access
|
|
|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [HTCondor-users] RemoteWallClockTime vs CommittedTime
- Date: Wed, 16 Oct 2013 11:25:32 -0500
- From: John Weigand <weigand@xxxxxxxx>
- Subject: Re: [HTCondor-users] RemoteWallClockTime vs CommittedTime
Many thanks Dan.... more comments embedded..
John Weigand
On 10/16/2013 9:34 AM, Daniel Forrest wrote:
Hi John,
No, this is always after the job has completed or been killed. The
RemoteWallClockTime is always > 0.
Interesting that you mention that if the job gets evicted, then
RemoteWallClockTime is updated. I am now wondering if this is an additive
process or just a plain update. By that I mean, if a job gets evicted
multiple times, does the RemoteWallClockTime get added to each time or just
updated with the latest time.
RemoteWallClockTime is cumulative.
or
If there is some means of killing the job (maybe kill -9) that would leave
the values in the state I am seeing.
Maybe I am not understanding the doc correctly but it seems that
RemoteWallClockTime should always be >= CommittedTime.
I have access to almost a decade worth of Condor history data so I ran
a script overnight to look for RemoteWallClockTime < CommittedTime.
I have found multiple cases (and luckily some still had UserLog files
lying around) with two different scenarios.
1.) Errors connecting to the schedd after writing a checkpoint.
The CommittedTime is logged when the checkpoint is written, but
the RemoteWallClockTime is logged at shadow exit. If this second
update fails then some RemoteWallClockTime will be lost.
This then may explain where I am seeing these...
21,690 CPU
20,289 RemoteWallClockTime
22,869 CommittedTime
which using RemoteWallClockTime, CPU efficiency = 106.9%
using CommittedTime, CPU efficiency = 94.8%
Even though 94.8 is probably a not so realistic utilization factor it is more
believable than 106.9.
Could also explain a lot of these (many more)...
22,054 CPU
20,097 RemoteWallClockTime
0 CommittedTime
This 2nd update failure may be more common than #2 below.
I also see a lot where RemoteWallClockTime = CommittedTime yet CPU is > than
both and cores don't appear to be a factor (but I can't rule cores out yet).
Is there a way to tell when that update fails?
2.) Job running a second time with TerminationPending = TRUE.
A comment from condor_shadow.V6/shadow.C:
/* If the completed job had been committed to the job queue,
but for some reason the shadow exited wierdly and the
schedd is trying to run it again, then simply write
the job termination events and send the email as if the job had
just ended. */
I believe this is the case you are seeing when RemoteWallClockTime
is only 1 second. That is the amount of time it takes for the job
to exit immediately from TerminationPending set. It appears to be
a bug that RemoteWallClockTime is being reset. To verify this you
should check for TerminationPending in the classAd.
Unfortunately, I only see the accounting data sent up to our accounting system
(Gratia) from many grids within OSG. I will have to get one of the admins to
check their condor history logs for some of these to verify. This case is much
more rare than #1 above but it was the one that triggered my curiosity as to how
we are handling the data. I have a lot more #1s.
This particular one (RemoteWallClockTime = 1) came up from 1 site on 1 day then
stopped.