04/21/17 22:50:18 (pid:4780) (D_ALWAYS:2) UserProfile::destroy: Removing condor-slot1's profile directory failed. (last-error = 145)

Error 145 is “The directory is not empty.” It comes back from a call to the Windows API function DeleteProfile(). This is consistent with a process still being around after the job exits. A somewhat less likely reason is that the job created a file or directory under the profile directory with permissions that prevent the profile directory from being deleted.

The other error is less clear. 203 is “The system could not find the environment option that was entered.” It’s hard to tell whether this is a real problem or just a consequence of an existence check. I think we can ignore it for now.

I think the way to go from here would be to have a special build of the HTCondor starter that has extra logging when DeleteProfile fails. With HTCondor Week coming up next week, it will be a few weeks before I have time to make a special build. You could probably make progress more quickly if you can build HTCondor yourself. There are instructions here:
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=BuildingHtcondorOnWindows

The place where more logging is needed is where the __leave statement is in src\condor_utils\profile.WINDOWS.cpp, at around line 349.

-tj

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of Michael Schwarzfischer

Hi,

I finally caught another orphaned user folder with a corresponding log. The only thing I found are the following lines in StarterLog on the client:

04/21/17 22:50:17 (pid:4780) (D_ALWAYS:2) In OwnerProfile::loaded()
04/21/17 22:50:17 (pid:4780) (D_ALWAYS:2) In OwnerProfile::unload()
04/21/17 22:50:17 (pid:4780) (D_ALWAYS:2) In OwnerProfile::loaded()
04/21/17 22:50:17 (pid:4780) (D_ALWAYS:2) In OwnerProfile::unloadProfile()
04/21/17 22:50:17 (pid:4780) (D_ALWAYS:2) OwnerProfile::unloadProfile: Unloading condor-slot1's profile succeeded. (last-error = 0)
04/21/17 22:50:17 (pid:4780) (D_ALWAYS:2) In OwnerProfile::destroy()
04/21/17 22:50:17 (pid:4780) (D_ALWAYS:2) UserProfile::destroy: Loading condor-slot1's SID succeeded. (last-error = 0)
04/21/17 22:50:17 (pid:4780) (D_ALWAYS:2) UserProfile::destroy: Converting SID to a string succeeded. (last-error = 0)
04/21/17 22:50:18 (pid:4780) (D_ALWAYS:2) UserProfile::destroy: Removing condor-slot1's profile directory failed. (last-error = 145)

Not sure if this is relevant, but all the jobs show a “last-error” during loading:

04/21/17 22:50:50 (pid:464) (D_ALWAYS:2) In OwnerProfile::update()
04/21/17 22:50:50 (pid:464) (D_ALWAYS:2) In OwnerProfile::load()
04/21/17 22:50:50 (pid:464) (D_ALWAYS:2) In OwnerProfile::loaded()
04/21/17 22:50:50 (pid:464) (D_ALWAYS:2) In OwnerProfile::directory()
04/21/17 22:50:50 (pid:464) (D_ALWAYS:2) OwnerProfile::directory: this user has no profile directory.
04/21/17 22:50:50 (pid:464) (D_ALWAYS:2) OwnerProfile::load: Profile directory does not exist, so we're going to create one.
04/21/17 22:50:50 (pid:464) (D_ALWAYS:2) In OwnerProfile::create()
04/21/17 22:50:50 (pid:464) (D_ALWAYS:2) In OwnerProfile::loadProfile()
04/21/17 22:50:52 (pid:464) (D_ALWAYS:2) OwnerProfile::loadProfile: Loading the condor-slot1's profile succeeded. (last-error = 0)
04/21/17 22:50:52 (pid:464) (D_ALWAYS:2) In OwnerProfile::unloadProfile()
04/21/17 22:50:52 (pid:464) (D_ALWAYS:2) OwnerProfile::unloadProfile: Unloading condor-slot1's profile succeeded. (last-error = 0)
04/21/17 22:50:52 (pid:464) (D_ALWAYS:2) In OwnerProfile::directory()
04/21/17 22:50:52 (pid:464) (D_ALWAYS:2) OwnerProfile::load: Creation of profile for condor-slot1 succeeded. (last-error = 0)
04/21/17 22:50:52 (pid:464) (D_ALWAYS:2) In OwnerProfile::loadProfile()
04/21/17 22:50:52 (pid:464) (D_ALWAYS:2) OwnerProfile::loadProfile: Loading the condor-slot1's profile succeeded. (last-error = 0)
04/21/17 22:50:52 (pid:464) (D_ALWAYS:2) TokenCache contents:
condor-slot1@.
04/21/17 22:50:52 (pid:464) (D_ALWAYS:2) In OwnerProfile::environment()
04/21/17 22:50:52 (pid:464) (D_ALWAYS:2) OwnerProfile::environment: Loading succeeded while retrieving condor-slot1's environment (last-error = 203)

Any ideas what I could do next?

Thanks,
Best,
Michael

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of John M Knoeller

The condor_master runs a task called ‘preen’ once a day that cleans up orphaned files; that’s most likely how the execute directories get cleaned up. HTCondor doesn’t have a lot of Windows-specific knowledge, but there is some. If you configure

    STARTER_DEBUG = $(STARTER_DEBUG) D_CAT D_ALWAYS:2
    STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_ALWAYS:2

you should start to see messages in the log with the pattern “OwnerProfile:”. These come from the bits of code that create and delete the user directories on Windows. I’d be curious to know what error messages have this pattern.

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of Michael Schwarzfischer

Hey,

Thanks for your quick response. We do sometimes see errors logged saying that the execute folder cannot be deleted, and we know that there is a problem in that type of job. However, in many other cases (a different job type) we do not see any logging that would hint at a problem deleting the execute folder. In either case, the execute folder eventually gets deleted somehow.

What persists are the user folders in the Windows user directory, which contain only empty folders. It is always the “AppData” folder with some further subfolders, mostly Local\Microsoft\something or Roaming\Microsoft\somethingElse. Furthermore, the “something” is different every time…

Thanks for any further ideas!

Best,
Michael

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of John M Knoeller

We have seen this happen when the user’s job creates worker processes that are still alive when the job exits. HTCondor tries to clean up, but detection of worker processes is imperfect on Windows because Windows doesn’t actually keep track of parent-child relationships between processes. If there is a process that has one of the directories we are trying to delete as its current working directory, or that has a file open in that directory, then it is simply not possible to delete the directory without first killing the process.

Is there anything in the logs on the execute node that indicates that we tried and failed to delete the execute directory?

It’s likely that the problem is caused by a specific job. You can use Process Explorer (one of the Sysinternals tools) to identify which processes are keeping the directories from being deleted.
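
One way to check the logs for failed cleanup attempts is to scan the StarterLog for messages that either contain "failed" or report a non-zero last-error. The Python sketch below assumes log lines shaped like the StarterLog excerpts quoted elsewhere in this thread (`MM/DD/YY HH:MM:SS (pid:N) (category) message (last-error = N)`); the names `find_failures` and `Finding` are hypothetical and not part of HTCondor.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Assumed line shape, taken from the StarterLog excerpts in this thread:
#   04/21/17 22:50:18 (pid:4780) (D_ALWAYS:2) <message> (last-error = 145)
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\(pid:(?P<pid>\d+)\) "
    r"(?:\([^)]*\) )?"                          # optional debug category
    r"(?P<message>.*?)"
    r"(?: \(last-error = (?P<code>\d+)\))?\s*$"  # optional trailing error code
)


class Finding(NamedTuple):
    stamp: str
    pid: int
    code: Optional[int]
    message: str


def find_failures(lines: Iterable[str]) -> Iterator[Finding]:
    """Yield log entries that mention a failure or carry a non-zero last-error."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # continuation line or a format this sketch does not know
        code = int(match["code"]) if match["code"] is not None else None
        message = match["message"]
        if code not in (None, 0) or "failed" in message.lower():
            yield Finding(match["stamp"], int(match["pid"]), code, message)


if __name__ == "__main__":
    sample = [
        "04/21/17 22:50:17 (pid:4780) (D_ALWAYS:2) In OwnerProfile::destroy()",
        "04/21/17 22:50:18 (pid:4780) (D_ALWAYS:2) UserProfile::destroy: "
        "Removing condor-slot1's profile directory failed. (last-error = 145)",
    ]
    for finding in find_failures(sample):
        print(finding.stamp, finding.pid, finding.code, finding.message)
```

To scan a real log, pass an open file object instead of the sample list, for example `find_failures(open(path, errors="replace"))`, where `path` points at the StarterLog on the execute node.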

-tj

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of Almansour Blanco

Hello,

We are using Windows 7 and condor version 8.4.9. When condor runs on the system, it creates condor-slot and TEMP directories in the user home directory. However, in some cases, when the condor job is done, the condor-slot* directories are not cleaned up even though they are empty. They keep accumulating until there are hundreds of them, and at some point condor jobs stop executing on that machine, maybe because it can’t create any more folders.

Has someone faced this problem before? And is there any way to solve this issue and prevent it from happening?

Kind regards
Almansour Belleh Blanco
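
To see how many leftover directories have accumulated, a small script can list the condor-slot* directories that contain no files at any depth. The Python sketch below is only an illustration: the base directory `C:\Users` and the names `is_tree_empty` and `find_empty_slot_dirs` are assumptions, not anything HTCondor provides, and the script only reports directories without deleting them.

```python
import fnmatch
import os
import sys
from pathlib import Path
from typing import List


def is_tree_empty(root: Path) -> bool:
    """Return True if no file exists anywhere under root (only directories).

    os.walk silently skips directories it cannot read, so an unreadable
    subtree is treated as empty by this sketch.
    """
    for _dirpath, _dirnames, filenames in os.walk(root):
        if filenames:
            return False
    return True


def find_empty_slot_dirs(users_dir: Path, pattern: str = "condor-slot*") -> List[Path]:
    """List directories directly under users_dir that match pattern and hold no files."""
    matches = []
    for entry in sorted(users_dir.iterdir()):
        if entry.is_dir() and fnmatch.fnmatch(entry.name, pattern) and is_tree_empty(entry):
            matches.append(entry)
    return matches


if __name__ == "__main__":
    # The default base directory is an assumption; pass the real one as an argument.
    base = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(r"C:\Users")
    if base.is_dir():
        leftovers = find_empty_slot_dirs(base)
        for path in leftovers:
            print(path)
        print(f"{len(leftovers)} empty condor-slot* directories under {base}")
```

Running it periodically and comparing the counts would show whether the directories grow with every job or only with particular ones.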