Yo,
I wrote a while back about using condor_history to generate an "accounting log file" that reproduces what we have with our Torque LRMS. Note that here I use accounting in the Torque sense: accounting means transaction records, like a bank account. What HTCondor calls accounting is something different; I already know that.
For now, our HTCondor system has only one schedd, so the simplest approach is to run condor_history on the schedd node. My prototype command looks like this:
condor_history -json -completedsince $(date -d "2021-09-13 15:00:00" +"%s") \
    -af:hj Owner Cmd Args QDate JobCurrentStartDate CompletionDate GlobalJobId \
    RemoteHost RequestCpus RequestMemory ExitCode CpusUsage \
    ResidentSetSize ImageSize RemoteWallClockTime
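The prototype above can be wrapped so that the attribute list and the start time live in one place. This is only a minimal sketch: the names ATTRS, since_epoch and history_cmd are made up for illustration and are not part of HTCondor, and GNU date is assumed for the -d option. The helper prints the command rather than running it, so it can be inspected first.

```shell
# Sketch only: ATTRS, since_epoch and history_cmd are illustrative names,
# not HTCondor features. Assumes GNU date (for the -d option).

# Attribute list from the prototype command, kept in one place.
ATTRS="Owner Cmd Args QDate JobCurrentStartDate CompletionDate GlobalJobId \
RemoteHost RequestCpus RequestMemory ExitCode CpusUsage \
ResidentSetSize ImageSize RemoteWallClockTime"

# Convert a human-readable local time to epoch seconds.
since_epoch() {
    date -d "$1" +%s
}

# Print (do not run) the condor_history command for a given start time.
history_cmd() {
    _since=$(since_epoch "$1") || return 1
    printf 'condor_history -json -completedsince %s -af:hj %s\n' \
        "$_since" "$ATTRS"
}

history_cmd "2021-09-13 15:00:00"
```

The epoch value printed depends on the local time zone of the machine, which is why the conversion is kept in its own function.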
Everything was fine until I realised that I had made a mistake while submitting a bunch of jobs aimed at testing the command. I executed condor_rm to remove those jobs, and doing so not only removed the jobs from the queue, it removed them from the history! This violates a fundamental principle of accounting for me: deleting something from the queue should not erase the record of it ever having happened.

While looking into this, I discovered something very weird: using a different form of the constraint (above: completedsince) gives different results. I am using "grep CpusUs" to grab one line of output per found job, and then wc -l to count how many jobs were found.

$ date -d "2021-09-13 15:00:00" +"%s"
1631538000
$ condor_history -json -completedsince $(date -d "2021-09-13 15:00:00" +"%s") \
    -af:hj Owner Cmd Args QDate JobCurrentStartDate CompletionDate GlobalJobId \
    RemoteHost RequestCpus RequestMemory ExitCode CpusUsage ResidentSetSize \
    ImageSize RemoteWallClockTime | grep CpusUs | wc -l
23
$ condor_history -json -constraint "CompletionDate > 1631538000 " -af:hj Owner \
    Cmd Args QDate JobCurrentStartDate CompletionDate GlobalJobId RemoteHost \
    RequestCpus RequestMemory ExitCode CpusUsage ResidentSetSize ImageSize \
    RemoteWallClockTime | grep CpusUs | wc -l
331
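The counting step can be made a little less dependent on the CpusUsage field. The sketch below counts on GlobalJobId instead, on the assumption that every job ad in the -json output carries that attribute on a line of its own; count_jobs and compare_counts are hypothetical helper names, not HTCondor commands.

```shell
# Sketch only: count_jobs and compare_counts are illustrative helpers.
# Assumes each job ad in the JSON output prints "GlobalJobId" on its own line.

# Count job records in condor_history -json output read from stdin.
count_jobs() {
    grep -c '"GlobalJobId"'
}

# Report whether the two query forms found the same number of jobs.
# $1: count from the -completedsince form, $2: count from the -constraint form.
compare_counts() {
    if [ "$1" -eq "$2" ]; then
        echo "match: $1"
    else
        echo "MISMATCH: completedsince=$1 constraint=$2"
    fi
}

# The two counts reported above.
compare_counts 23 331
```

In practice the output of each condor_history command would be piped into count_jobs and the two results passed to compare_counts.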
This difference seems to depend on whether condor_rm has been issued or not. As I write this, it's 16:00:

$ date -d "2021-09-13 16:00:00" +"%s"
1631541600
$ condor_history -json -constraint "Owner == \"templon\" && CompletionDate > 1631541600" \
    -af:hj Owner Cmd Args QDate JobCurrentStartDate CompletionDate GlobalJobId \
    RemoteHost RequestCpus RequestMemory ExitCode CpusUsage ResidentSetSize \
    ImageSize RemoteWallClockTime | grep CpusUs | wc -l
0
$ condor_history -json -completedsince $(date -d "2021-09-13 16:00:00" +"%s") \
    -constraint "Owner == \"templon\"" -af:hj Owner Cmd Args QDate \
    JobCurrentStartDate CompletionDate GlobalJobId RemoteHost RequestCpus \
    RequestMemory ExitCode CpusUsage ResidentSetSize ImageSize \
    RemoteWallClockTime | grep CpusUs | wc -l
0
All looks fine. Now I submit a bunch of jobs, some of which will complete quickly, wait a bit, and try again.
$ condor_history -json -constraint "Owner == \"templon\" && CompletionDate > 1631541600" \
    -af:hj Owner Cmd Args QDate JobCurrentStartDate CompletionDate GlobalJobId \
    RemoteHost RequestCpus RequestMemory ExitCode CpusUsage ResidentSetSize \
    ImageSize RemoteWallClockTime | grep CpusUs | wc -l
40
$ condor_history -json -completedsince $(date -d "2021-09-13 16:00:00" +"%s") \
    -constraint "Owner == \"templon\"" -af:hj Owner Cmd Args QDate \
    JobCurrentStartDate CompletionDate GlobalJobId RemoteHost RequestCpus \
    RequestMemory ExitCode CpusUsage ResidentSetSize ImageSize \
    RemoteWallClockTime | grep CpusUs | wc -l
40
Still looks fine. Now use condor_rm to delete the entire set of jobs:
$ condor_history -json -constraint "Owner == \"templon\" && CompletionDate > 1631541600" \
    -af:hj Owner Cmd Args QDate JobCurrentStartDate CompletionDate GlobalJobId \
    RemoteHost RequestCpus RequestMemory ExitCode CpusUsage ResidentSetSize \
    ImageSize RemoteWallClockTime | grep CpusUs | wc -l
41
$ condor_history -json -completedsince $(date -d "2021-09-13 16:00:00" +"%s") \
    -constraint "Owner == \"templon\"" -af:hj Owner Cmd Args QDate \
    JobCurrentStartDate CompletionDate GlobalJobId RemoteHost RequestCpus \
    RequestMemory ExitCode CpusUsage ResidentSetSize ImageSize \
    RemoteWallClockTime | grep CpusUs | wc -l
0
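To see exactly which jobs one query form returns and the other does not, the two JSON outputs could be saved to files and compared by job ID. The following is a rough sketch, not a tested accounting tool: job_ids and missing_ids are hypothetical helper names, the file names in the usage comment are placeholders, and the pattern assumes the JSON output writes each attribute as "GlobalJobId": "value" on one line.

```shell
# Sketch only: job_ids and missing_ids are illustrative helpers.
# Assumes attributes appear as  "GlobalJobId": "value"  on a single line.

# Print the sorted, unique GlobalJobId values found in JSON on stdin.
job_ids() {
    grep -o '"GlobalJobId": *"[^"]*"' \
        | sed 's/^"GlobalJobId": *"//; s/"$//' \
        | sort -u
}

# Print IDs that are present in file $2 but missing from file $1.
missing_ids() {
    _a=$(mktemp) || return 1
    _b=$(mktemp) || return 1
    job_ids < "$1" > "$_a"
    job_ids < "$2" > "$_b"
    comm -13 "$_a" "$_b"      # lines unique to the second list
    rm -f "$_a" "$_b"
}

# Usage (placeholder file names):
#   condor_history -json -completedsince ... > since.json
#   condor_history -json -constraint "..."   > constraint.json
#   missing_ids since.json constraint.json
```

Comparing IDs rather than counts shows whether the completedsince form loses only the removed jobs or the completed ones as well.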
The 41 result instead of 40 is because one more job completed between the last condor_history command and the condor_rm command. Using the CompletionDate form of the constraint, all the jobs are still there, but completedsince is no longer accurate.

What's going on? The accounting records (note again: accounting in the Torque/bank account sense!) should be holy, and using completedsince they are not holy. This makes me wonder in what other circumstances the records are not holy. Is there some documentation on how to ensure that only holy output will result from my commands?

Thanks,
JT
ps: I did indeed check the jobs disappear (and not just the CpusUsage field).
$ condor_history -json -completedsince $(date -d "2021-09-13 16:00:00" +"%s") \
    -constraint "Owner == \"templon\"" -af:hj Owner Cmd Args QDate \
    JobCurrentStartDate CompletionDate GlobalJobId RemoteHost RequestCpus \
    RequestMemory ExitCode CpusUsage ResidentSetSize ImageSize \
    RemoteWallClockTime
[ ]
HTCondor-users mailing list
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/