On Mar 19, 2013, at 4:47 AM, Michael Hanke <michael.hanke@xxxxxxxxx> wrote:

> On Mon, Mar 18, 2013 at 9:38 PM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
>
>> Hi Michael,
>>
>> I suspect we are chasing an incorrect lead with respect to the job suspension; the fakeroot is being leaked to the mount namespace, not the HTCondor one (so the bug I thought of does not apply here).
>>
>> However, if you add:
>>
>> MOUNT_UNDER_SCRATCH=/tmp
>>
>> it should make those warning/error messages go away.
>
> Could you tell a little more on why bind-mounting /tmp will disable the warnings? From the documentation it is not obvious to me.

Sorry, I'm too terse sometimes -

Condor is complaining about sandbox cleaning (I think) because it is finding files owned by root in the job sandbox (there are assumptions littered throughout the code, especially sandbox cleanup, that there is only one UID for files in a sandbox; we hit similar issues when using glexec).

It sounds like the root-owned files are all from filesystems which are remounted / bind-mounted into the sandbox by pbuilder (/proc, /dev/pts). By enabling MOUNT_UNDER_SCRATCH, HTCondor will put the job in a separate "mount namespace" that makes mounts in the job invisible to the rest of the system; this is required to give the job a private /tmp, but the private /tmp is a side-effect in this case.

Hence, /proc and /dev/pts would be invisible to the condor_starter and wouldn't be cleaned up.
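For reference, one way to apply the setting on the worker node might look like the sketch below. The helper name `enable_mount_under_scratch` and the default path `/etc/condor/condor_config.local` are assumptions for illustration; the actual local config file depends on how LOCAL_CONFIG_FILE is set on your node.

```shell
#!/bin/sh
# Hypothetical helper (not part of HTCondor): append the knob to a local
# HTCondor config file unless that file already sets it.
# The default path is an assumption; pass the real file as the first argument.
enable_mount_under_scratch() {
    conf="${1:-/etc/condor/condor_config.local}"
    # Skip if the file already contains a MOUNT_UNDER_SCRATCH assignment.
    if grep -q '^[[:space:]]*MOUNT_UNDER_SCRATCH[[:space:]]*=' "$conf" 2>/dev/null; then
        return 0
    fi
    printf 'MOUNT_UNDER_SCRATCH = /tmp\n' >> "$conf"
}
```

After calling the helper, running `condor_reconfig` on the node and then `condor_config_val MOUNT_UNDER_SCRATCH` should show whether the value was picked up.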
>> What are your SUSPEND-related attributes set to on that worker node?
>
> % condor_config_val -dump | grep -i suspend
> MAXSUSPENDTIME = 10 * $(MINUTE)
> SUSPEND = $(UWCS_SUSPEND)
> TESTINGMODE_SUSPEND = False
> TESTINGMODE_WANT_SUSPEND = False
> UWCS_PREEMPT = ( ((Activity == "Suspended") && ($(ActivityTimer) > $(MaxSuspendTime))) || (SUSPEND && (WANT_SUSPEND == False)) )
> UWCS_SUSPEND = ( $(KeyboardBusy) || ( (CpuBusyTime > 2 * $(MINUTE)) && $(ActivationTimer) > 90 ) )
> UWCS_WANT_SUSPEND = ( $(SmallJob) || $(KeyboardNotBusy) || $(IsVanilla) ) && ( $(SUSPEND) )
> VM_SOFT_SUSPEND = True
> WANT_SUSPEND = $(UWCS_WANT_SUSPEND)
>
> This is a dedicated cluster node -- no keyboard.

Ah -

What does CpuBusyTime look like? If there's enough system activity (or if the root-owned processes are not being tracked by the procd and counting as system activity), then the SUSPEND expression could trigger.
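To make the question concrete, here is a rough sketch of the CPU clause of UWCS_SUSPEND as quoted above, written as a shell check. The function name `cpu_suspend_clause` is hypothetical, the two values are supplied by hand, MINUTE is assumed to be 60 seconds, and the `$(KeyboardBusy)` term is left out on the assumption that it is always false on a node without a keyboard.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when
#   (CpuBusyTime > 2 * MINUTE) && ActivationTimer > 90
# would hold.
#   $1 = CpuBusyTime in seconds
#   $2 = ActivationTimer in seconds
cpu_suspend_clause() {
    minute=60   # assumption: MINUTE expands to 60 in the node's config
    [ "$1" -gt $((2 * minute)) ] && [ "$2" -gt 90 ]
}

# Example: CPU busy for 5 minutes, job activated 10 minutes ago.
if cpu_suspend_clause 300 600; then
    echo "SUSPEND CPU clause would be true"
fi
```

The current CpuBusyTime for a slot should be visible in the machine ad, for example via `condor_status -long` for that node, filtered with `grep -i cpubusy`.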