
Re: [HTCondor-users] exit hook not always report correct ImageSize



Todd, thanks a lot for the detailed explanation.

On Fri, Jan 27, 2023 at 11:12 AM Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 1/26/2023 9:35 PM, JM wrote:
Todd,

By the way, I noticed that the ResidentSetSize read from stdin is actually ResidentSetSize_RAW (not rounded up) during the life of the exit hook. It would be nice to be consistent everywhere and at any time.

Thanks!

J.

Hi J,

Glad I could be of assistance!

Unfortunately, from a scalability standpoint, having data consistency everywhere all the time is not feasible in a distributed system like HTCondor; the best we can hope for is eventual consistency between what all the components "see" (starter, startd, shadow, schedd, collector, etc).

As for attribute naming consistency, I totally agree with you. Some background on the naming of job attributes like ResidentSetSize vs. ResidentSetSize_RAW: the condor_schedd is the component that rounds up the ResidentSetSize_RAW value into bucketed values in order to batch up matchmaking requests [*]. For example, when the schedd is requesting resources from the central manager, this rounding allows the schedd to make one request like "give me 50 machines with 2GB RAM", instead of making 50 separate requests like "give me one machine with 1.87GB RAM, and another machine with 1.89GB RAM, ...". In hindsight, the raw uncooked value should have just been named "ResidentSetSize" everywhere, and the rounded value something else like "ResidentSetSize_Rounded" just in the job ad inside the schedd, but changing it retroactively in a manner that maintains backwards-compatibility everywhere is not trivial...
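To illustrate why rounding helps batching, here is a minimal Python sketch. It is not HTCondor's actual rounding rule: the function names, the 1024 MB bucket size, and the sample values are all hypothetical and chosen only to mirror the "2GB" example above.

```python
import math
from collections import Counter


def round_up_to_bucket(raw_mb, bucket_mb=1024):
    """Round a raw memory value (MB) up to the next multiple of bucket_mb.

    bucket_mb is a hypothetical bucket size, not an HTCondor default.
    """
    return math.ceil(raw_mb / bucket_mb) * bucket_mb


def batch_requests(raw_values_mb, bucket_mb=1024):
    """Count how many jobs land in each rounded bucket."""
    return Counter(round_up_to_bucket(v, bucket_mb) for v in raw_values_mb)


# Hypothetical raw sizes in MB (roughly 1.87 GB to 1.99 GB).
raw = [1915, 1935, 1990, 2040]
print(batch_requests(raw))
```

With these sample values, four distinct raw sizes collapse into a single bucket, so one request for four machines can stand in for four separate requests.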

Hope the above helps,
Todd



On Thu, Jan 26, 2023 at 9:15 PM JM <jm@xxxxxxxxxxxxxxxxxxxx> wrote:
Todd,

Reading from stdin works great.

Regarding updating .job.ad upon job exit, it sounds like a good idea to do so, to keep this file in a consistent state.

Thank you.

J.

On Thu, Jan 26, 2023 at 4:51 PM Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 1/26/2023 1:25 PM, JM wrote:
HTCondor users,

I have an exit hook that sends .job.ad to a database. However, I noticed that in some cases, for reasons I cannot pin down, ImageSize is 1250 instead of the real ImageSize from condor_history. The jobs run much longer than 15 seconds. I would expect the startd to update .job.ad. I even tried sleeping 30 seconds in the exit hook to make sure the update happens.

Does anyone have a clue why?

Hi,

I did not positively confirm this, but my guess is the .job.ad file sitting in the scratch directory is written at the start of job execution, and not re-written every time the job ad is updated.

However, note that HTCondor will give a current/updated copy of the job classad to your exit hook script via stdin [*]. Instead of having your exit hook read the .job.ad file, I suggest you use the information passed to it via stdin. Let us know if you have any additional problems or questions here. It would not be a big deal for us to patch HTCondor to update the .job.ad upon job exit (i.e. before invoking the exit hook), but using the standard input should do what you want today....
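As a rough sketch of that suggestion, the Python below parses a job ad supplied as text and picks out a few attributes. It assumes the ad arrives as one "Attr = Value" pair per line; the helper names parse_job_ad and select_attrs are hypothetical, and values are kept as raw strings rather than evaluated as ClassAd expressions.

```python
def parse_job_ad(text):
    """Parse 'Attr = Value' lines into a dict of raw string values.

    Assumes one attribute per line; lines without '=' are skipped.
    """
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # blank or malformed line
        ad[name.strip()] = value.strip()
    return ad


def select_attrs(ad, names):
    """Return the requested attributes, using None for any that are absent."""
    return {name: ad.get(name) for name in names}


# Hypothetical sample input, standing in for what the hook reads on stdin.
sample = "ImageSize = 1250\nResidentSetSize = 4321\n"
print(select_attrs(parse_job_ad(sample), ["ImageSize", "ResidentSetSize"]))
```

In an actual exit hook, the script would pass sys.stdin.read() to parse_job_ad instead of the sample string, then send the selected attributes to the database.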

Hope this helps,
Todd

[*] = In the manual at link:
   https://htcondor.readthedocs.io/en/latest/admin-manual/hooks.html#work-fetching-hooks-invoked-by-htcondor
look for "HOOK_JOB_EXIT" and note what it says in the section "Standard input given to the hook".



-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685