Sorry for all the spam! I’ve made a table of CommitedSlotTime and (CompletionDate – JobCurrentStartExecutingDate) values. In all cases where they do NOT match (within a couple of seconds) the CommitedSlotTime is ALWAYS an integer multiple (sometimes 2, sometimes 3) of the (CompletionDate – JobCurrentStartExecutingDate) values (allowing for the few seconds difference between JobCurrentStartDate and JobCurrentStartExecutingDate).
Cheers Greg From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of Greg.Hitchen@xxxxxxxx While kludging our scripts to work around this I’ve discovered other instances where there are not –ve values BUT: CumulativeSlotTime < CommittedSlotTime This also doesn’t make sense? In this instance the JobLastStartDate is also later than the CompletionDate by 1.5 days!? From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of Greg.Hitchen@xxxxxxxx Hi John Thanks for the replies. I'm not sure that the clock jumping scenario is the cause in this case. Below is an extract of part of the output of condor_history for a particular history file. As you can see there is a mix of -ve values amongst the correct +ve ones and these always correspond to a jump in the JobCurrentStartDate. For this history file this JobCurrentStartDate is ALWAYS the same. For other history files that also have -ve CommittedSlotTime they also have a constant (but different for each history file) JobCurrentStartDate value. If you want I can send you this history file offline? condor_history -f history.20180320T072752 -format "%6s " ClusterId -format "%6s " ProcId -format "%10s " CommittedSlotTime -format "%d " CompletionDate -format "%d\n" JobCurrentStartDate 865057 18518 1707.0 1521453470 1521451762 865057 1842 1703.0 1521453099 1521451396 865057 19534 1700.0 1521452998 1521451297 865057 17784 1680.0 1521451762 1521450082 865057 15067 1718.0 1521451396 1521449647 865057 13836 1694.0 1521451297 1521449603 865057 14087 -186856.0 1521450290 1521637146 865057 12124 1800.0 1521450129 1521448329 865057 13952 -187151.0 1521449995 1521637146 865057 10558 1673.0 1521449895 1521448222 865057 14468 -187269.0 1521449877 1521637146 865057 13944 -187322.0 1521449824 1521637146 865057 10696 -187575.0 1521449571 1521637146 865057 16028 -187576.0 1521449570 1521637146 865057 12565 -187625.0 1521449521 1521637146 865057 27255 -187676.0 1521449470 1521637146 865057 15906 -187707.0 1521449439 1521637146 865057 27356 -187709.0 1521449437 1521637146 865057 15751 -187719.0 1521449429 1521637146 865057 15977 -187774.0 1521449372 1521637146 865057 26993 -187778.0 1521449368 1521637146 865057 26663 -187815.0 1521449331 1521637146 865057 26529 -187831.0 1521449315 1521637146 865057 6636 1765.0 1521449315 1521447550 865057 9149 -187842.0 1521449304 1521637146 865057 25891 -187858.0 1521449288 1521637146 865057 13271 -187865.0 1521449281 1521637146 865057 11931 -187935.0 1521449211 1521637146 865057 13019 -187938.0 1521449208 1521637146 865057 5811 1720.0 1521449198 1521447478 865057 13516 -187952.0 1521449194 1521637146 865057 15757 -187953.0 1521449193 1521637146 865057 6362 -187955.0 1521449192 1521637146 865057 15131 -187957.0 1521449189 1521637146 865057 15861 -187977.0 1521449169 1521637146 865057 15777 -187977.0 1521449169 1521637146 865057 5322 1737.0 1521449157 1521447404 865057 9882 -188014.0 1521449157 1521637146 865057 15754 -188003.0 1521449157 1521637146 I’ve done some more checking on the submit node itself and the 2 history files with –ve CommitedSlotTime values are either side of the .schedd_address.super file (time wise) in the spool directory. i.e. history.20180320T221705 and history.20180320T072752 This time also corresponds to what looks like a condor restart at that same time, based on the times for these files in the log directory. Maybe this has something to do with the issue? Thanks. Cheers Greg -----Original Message----- We've done some digging in the code, and the only thing we can think that would explain this is if the clock jumped ahead by 2 days just at the time the job started, then it jumped back again for the remainder of the job execution.
JobCurrentStartDate is set by the schedd at the time it creates the shadow, the rest of the numbers are set or calculated by the shadow based on the current system clock time. So in order for this to happen, the schedd would need to get 1521637146 from the system clock when it creates the shadow and the shadow would need to get 1521445871 from the system clock when the job actually begins running just a few seconds later. Otherwise, we can't explain how the CommittedSlotTime is negative - it is calculated by the shadow which means that the shadow would have to SEE the JobCurrentStartDate that is in the future relative to it's own clock. If this were a case of all of the values but JobCurrentStartDate being stale somehow, then the CommittedSlotTime would not be negative. -tj -----Original Message----- From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller Sent: Thursday, April 5, 2018 10:12 AM To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx> Subject: Re: [HTCondor-users] History file with incorrect JobCurrentStartDate? - version windows 8.4.4 This is not a known bug. but it seems similar to this bug https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6626 which will be fixed in the upcoming 8.6.11 release. could you send me the entire job ad so I can have a look? thanks -tj -----Original Message----- From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of
Greg.Hitchen@xxxxxxxx Sent: Thursday, April 5, 2018 2:00 AM To: htcondor-users@xxxxxxxxxxx Subject: [HTCondor-users] History file with incorrect JobCurrentStartDate? - version windows 8.4.4 Hi All We have a number of windows submit nodes in our HTCondor pool. We have reporting scripts on a linux machine that download all the history files and can provide monthly (or daily) usage on a per user basis. In our latest report for March 2018 we noticed something strange, -ve run times! After tracking back through the scripts we eventually found the answer in the history files themselves. e.g. for one job these are the relevant values in the history file JobCurrentStartDate 1521637146 CompletionDate 1521450290 EnteredCurrentStatus 1521450290 JobCurrentStartTransferOutputDate 1521450290 JobFinishedHookDone 1521450290 LastJobLeaseRenewal 1521450290 JobCurrentStartExecutingDate 1521445871 LastMatchTime 1521445870 LastVacateTime 1521444418 JobLastStartDate 1521442616 LastRejMatchTime 1521436370 JobStartDate 1521192534 QDate 1521181266 CumulativeSlotTime 45124 RemoteWallClockTime 45124 RemoteUserCpu 2970 RemoteSysCpu 1286 CommittedSuspensionTime 0 CumulativeSuspensionTime 0 CommittedSlotTime -186856 CommittedTime -186856 Note the 2 -ve committed time values. These are equal to (CompletionDate - JobCurrentStartDate). In fact the JobCurrentStartDate is nearly 2 days AFTER the CompletionDate!? For the moment we will need to change our scripts to NOT use the CommittedSlotTime BUT calculate it from (CompletionDate - JobLastStartDate). Is this a known bug? Or something that someone has come across before? I've had a look through the 8.2.* and 8.4.* release notes and bug fixes but couldn't see anything. Thanks for any help. Cheers Greg P.S. we will soon(ish) be upgrading our submit nodes from win2008 to win2016 and will upgrade to the latest 8.6.* version then. Meanwhile the history files already exist so we will kludge a workaround. _______________________________________________ HTCondor-users mailing list To unsubscribe, send a message to
htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe You can also unsubscribe by visiting https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users The archives can be found at: https://lists.cs.wisc.edu/archive/htcondor-users/ _______________________________________________ HTCondor-users mailing list To unsubscribe, send a message to
htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe You can also unsubscribe by visiting https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users The archives can be found at: https://lists.cs.wisc.edu/archive/htcondor-users/ _______________________________________________ HTCondor-users mailing list To unsubscribe, send a message to
htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe You can also unsubscribe by visiting https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users The archives can be found at: |