[HTCondor-devel] Fwd: [HTCondor-users] CGROUPS + OOM / HOLD on exit


Date: Thu, 25 Jul 2013 13:07:59 -0400 (EDT)
From: Tim St Clair <tstclair@xxxxxxxxxx>
Subject: [HTCondor-devel] Fwd: [HTCondor-users] CGROUPS + OOM / HOLD on exit
I couldn't agree more.  

But please send the agreement to Todd (cc'd). 

Cheers,
Tim

From: "Joan J. Piles" <jpiles@xxxxxxxxx>
To: "Tim St Clair" <tstclair@xxxxxxxxxx>
Sent: Thursday, July 25, 2013 11:55:57 AM
Subject: Re: Fwd: [HTCondor-users] CGROUPS + OOM / HOLD on exit

I just thought there was a more streamlined way to open a ticket (having to physically sign and scan an agreement for what amounts to a bug report is somewhat convoluted, to say the least).

Anyway, I'm out of the office right now, so I'll send you the signed form tomorrow morning.

Cheers,

Joan

On 25/07/13 18:29, Tim St Clair wrote:
Joan - 

The "process" is outlined here: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=MakingContributions, and imho it is obtuse.

Cheers,
Tim


From: "Joan J. Piles" <jpiles@xxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, July 25, 2013 11:16:13 AM
Subject: Re: [HTCondor-users] CGROUPS + OOM / HOLD on exit

Hi, Tim,

What is the procedure to open a ticket? I didn't manage to find a registration form or anything similar.

Regards,

Joan

On 25/07/13 15:55, Tim St Clair wrote:
Hi Joan - 

Would you like to open a ticket?  If not, I'll open it.  

Cheers,
Tim


From: "Joan J. Piles" <jpiles@xxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, July 25, 2013 5:47:24 AM
Subject: Re: [HTCondor-users] CGROUPS + OOM / HOLD on exit

Hi,

Just in case it is useful to somebody: we have been able to solve (or work around) the problem with a little patch to the HTCondor source:

diff -ur condor-8.0.0.orig/src/condor_starter.V6.1/vanilla_proc.cpp condor-8.0.0/src/condor_starter.V6.1/vanilla_proc.cpp
--- condor-8.0.0.orig/src/condor_starter.V6.1/vanilla_proc.cpp    2013-05-29 18:58:09.000000000 +0200
+++ condor-8.0.0/src/condor_starter.V6.1/vanilla_proc.cpp    2013-07-25 12:13:09.000000000 +0200
@@ -798,6 +798,18 @@
 int
 VanillaProc::outOfMemoryEvent(int /* fd */)
 {
+
+    /* If we have no jobs left, return and do nothing */
+    if (num_pids == 0) {
+        dprintf(D_FULLDEBUG, "Closing event FD pipe %d.\n", m_oom_efd);
+        daemonCore->Close_Pipe(m_oom_efd);
+        close(m_oom_fd);
+        m_oom_efd = -1;
+        m_oom_fd = -1;
+
+        return 0;
+    }
+
     std::stringstream ss;
     if (m_memory_limit >= 0) {
         ss << "Job has gone over memory limit of " << m_memory_limit << " megabytes.";

I don't know if it is the best way to work around this problem, but at least it seems to work for us. We have forced a (true) OOM condition and it responded as it should, and jobs are no longer put on hold at exit.

I don't think it's very clean either, but as I've said, it's more of a quick-and-dirty hack to get this feature (which is really interesting for us) running.
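For what it's worth, an equivalent (untested) shape for the same guard would be to tear the notification down in the reaper as soon as the last process exits, so a stale event can never reach outOfMemoryEvent() at all. A sketch only, using just the members the patch above already touches; the exact JobReaper signature and the chaining to OsProc::JobReaper are assumptions inferred from the starter log messages below:

int
VanillaProc::JobReaper(int pid, int status)
{
	/* Sketch only: once the last job process is gone, close the OOM
	   notification FDs so a late event from cgroup teardown is never
	   delivered, instead of ignoring it when it arrives. */
	if (num_pids == 0 && m_oom_efd != -1) {
		dprintf(D_FULLDEBUG, "Last process reaped; closing event FD pipe %d.\n", m_oom_efd);
		daemonCore->Close_Pipe(m_oom_efd);
		close(m_oom_fd);
		m_oom_efd = -1;
		m_oom_fd = -1;
	}
	return OsProc::JobReaper(pid, status);
}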

Regards,

Joan

On 24/07/13 17:24, Paolo Perfetti wrote:
Hi,

On 24/07/2013 13:07, Joan J. Piles wrote:
Hi all:

We are having some problems using cgroups for memory limiting. When jobs
exit, the OOM-killer routines get called, placing the job on hold
instead of letting it end normally. With full starter debug logging
(and a really short job) we see:

I've been going crazy over the same problem for a week now.
My system is an up-to-date Debian Wheezy with HTCondor version 8.0.1-148801 (from the research.cs.wisc.edu repository):
odino:~$ uname  -a
Linux odino 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 GNU/Linux


cgroups seem to be working properly:
odino:~$ condor_config_val BASE_CGROUP
htcondor
odino:~$ condor_config_val CGROUP_MEMORY_LIMIT_POLICY
soft
odino:~$ grep cgroup /etc/default/grub
GRUB_CMDLINE_LINUX="cgroup_enable=memory"
odino:~$ cat /etc/cgconfig.conf
mount {
        cpu     = /cgroup/cpu;
        cpuset  = /cgroup/cpuset;
        cpuacct = /cgroup/cpuacct;
        memory  = /cgroup/memory;
        freezer = /cgroup/freezer;
        blkio   = /cgroup/blkio;
}

group htcondor {
        cpu {}
        cpuset {}
        cpuacct {}
        memory {
# Tested both memory.limit_in_bytes and memory.soft_limit_in_bytes
#memory.limit_in_bytes = 16370672K;
          memory.soft_limit_in_bytes = 16370672K;
        }
        freezer {}
        blkio {}
}
odino:~$ mount | grep cgrou
cgroup on /cgroup/cpu type cgroup (rw,relatime,cpu)
cgroup on /cgroup/cpuset type cgroup (rw,relatime,cpuset)
cgroup on /cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
cgroup on /cgroup/memory type cgroup (rw,relatime,memory)
cgroup on /cgroup/freezer type cgroup (rw,relatime,freezer)
cgroup on /cgroup/blkio type cgroup (rw,relatime,blkio)

The submit file is trivial:
universe = parallel
executable = /bin/sleep
arguments = 15
machine_count = 4
#request_cpu = 1
request_memory = 128
log = log
output = output
error  = error
notification = never
should_transfer_files = always
when_to_transfer_output = on_exit
queue

Below is my StarterLog.

Any suggestion would be appreciated.
tnx, Paolo


07/24/13 16:56:09 Enumerating interfaces: lo 127.0.0.1 up
07/24/13 16:56:09 Enumerating interfaces: eth0 192.168.100.161 up
07/24/13 16:56:09 Enumerating interfaces: eth1 10.5.0.2 up
07/24/13 16:56:09 Initializing Directory: curr_dir = /etc/condor/config.d
07/24/13 16:56:09 ******************************************************
07/24/13 16:56:09 ** condor_starter (CONDOR_STARTER) STARTING UP
07/24/13 16:56:09 ** /usr/sbin/condor_starter
07/24/13 16:56:09 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
07/24/13 16:56:09 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
07/24/13 16:56:09 ** $CondorVersion: 8.0.1 Jul 15 2013 BuildID: 148801 $
07/24/13 16:56:09 ** $CondorPlatform: x86_64_Debian7 $
07/24/13 16:56:09 ** PID = 31181
07/24/13 16:56:09 ** Log last touched 7/24 16:37:26
07/24/13 16:56:09 ******************************************************
07/24/13 16:56:09 Using config source: /etc/condor/condor_config
07/24/13 16:56:09 Using local config sources:
07/24/13 16:56:09    /etc/condor/config.d/00-asgard-common
07/24/13 16:56:09    /etc/condor/config.d/10-asgard-execute
07/24/13 16:56:09    /etc/condor/condor_config.local
07/24/13 16:56:09 Running as root.  Enabling specialized core dump routines
07/24/13 16:56:09 Not using shared port because USE_SHARED_PORT=false
07/24/13 16:56:09 DaemonCore: command socket at <192.168.100.161:35626>
07/24/13 16:56:09 DaemonCore: private command socket at <192.168.100.161:35626>
07/24/13 16:56:09 Setting maximum accepts per cycle 8.
07/24/13 16:56:09 Will use UDP to update collector odino.bo.ingv.it <192.168.100.160:9618>
07/24/13 16:56:09 Not using shared port because USE_SHARED_PORT=false
07/24/13 16:56:09 Entering JICShadow::receiveMachineAd
07/24/13 16:56:09 Communicating with shadow <192.168.100.160:36378?noUDP>
07/24/13 16:56:09 Shadow version: $CondorVersion: 8.0.1 Jul 15 2013 BuildID: 148801 $
07/24/13 16:56:09 Submitting machine is "odino.bo.ingv.it"
07/24/13 16:56:09 Instantiating a StarterHookMgr
07/24/13 16:56:09 Job does not define HookKeyword, not invoking any job hooks.
07/24/13 16:56:09 setting the orig job name in starter
07/24/13 16:56:09 setting the orig job iwd in starter
07/24/13 16:56:09 ShouldTransferFiles is "YES", transfering files
07/24/13 16:56:09 Submit UidDomain: "bo.ingv.it"
07/24/13 16:56:09  Local UidDomain: "bo.ingv.it"
07/24/13 16:56:09 Initialized user_priv as "username"
07/24/13 16:56:09 Done moving to directory "/var/lib/condor/execute/dir_31181"
07/24/13 16:56:09 Job has WantIOProxy=true
07/24/13 16:56:09 Initialized IO Proxy.
07/24/13 16:56:09 LocalUserLog::initFromJobAd: path_attr = StarterUserLog
07/24/13 16:56:09 LocalUserLog::initFromJobAd: xml_attr = StarterUserLogUseXML
07/24/13 16:56:09 No StarterUserLog found in job ClassAd
07/24/13 16:56:09 Starter will not write a local UserLog
07/24/13 16:56:09 Done setting resource limits
07/24/13 16:56:09 Changing the executable name
07/24/13 16:56:09 entering FileTransfer::Init
07/24/13 16:56:09 entering FileTransfer::SimpleInit
07/24/13 16:56:09 FILETRANSFER: protocol "http" handled by "/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "ftp" handled by "/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "file" handled by "/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "data" handled by "/usr/lib/condor/libexec/data_plugin"
07/24/13 16:56:09 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:09 TransferIntermediate="(none)"
07/24/13 16:56:09 entering FileTransfer::DownloadFiles
07/24/13 16:56:09 entering FileTransfer::Download
07/24/13 16:56:09 FileTransfer: created download transfer process with id 31184
07/24/13 16:56:09 entering FileTransfer::DownloadThread
07/24/13 16:56:09 entering FileTransfer::DoDownload sync=1
07/24/13 16:56:09 DaemonCore: No more children processes to reap.
07/24/13 16:56:09 DaemonCore: in SendAliveToParent()
07/24/13 16:56:09 REMAP: begin with rules:
07/24/13 16:56:09 REMAP: 0: condor_exec.exe
07/24/13 16:56:09 REMAP: res is 0 ->  !
07/24/13 16:56:09 Sending GoAhead for 192.168.100.160 to send /var/lib/condor/execute/dir_31181/condor_exec.exe and all further files.
07/24/13 16:56:09 Completed DC_CHILDALIVE to daemon at <192.168.100.161:53285>
07/24/13 16:56:09 DaemonCore: Leaving SendAliveToParent() - success
07/24/13 16:56:09 Received GoAhead from peer to receive /var/lib/condor/execute/dir_31181/condor_exec.exe.
07/24/13 16:56:09 get_file(): going to write to filename /var/lib/condor/execute/dir_31181/condor_exec.exe
07/24/13 16:56:09 get_file: Receiving 31136 bytes
07/24/13 16:56:09 get_file: wrote 31136 bytes to file
07/24/13 16:56:09 ReliSock::get_file_with_permissions(): going to set permissions 755
07/24/13 16:56:09 DaemonCore: No more children processes to reap.
07/24/13 16:56:09 File transfer completed successfully.
07/24/13 16:56:09 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:10 Calling client FileTransfer handler function.
07/24/13 16:56:10 HOOK_PREPARE_JOB not configured.
07/24/13 16:56:10 Job 90.0 set to execute immediately
07/24/13 16:56:10 Starting a PARALLEL universe job with ID: 90.0
07/24/13 16:56:10 In OsProc::OsProc()
07/24/13 16:56:10 Main job KillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Main job RmKillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Main job HoldKillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Constructor of ParallelProc::ParallelProc
07/24/13 16:56:10 in ParallelProc::StartJob()
07/24/13 16:56:10 Found Node = 0 in job ad
07/24/13 16:56:10 ParallelProc::addEnvVars()
07/24/13 16:56:10 No Path in ad, $PATH in env
07/24/13 16:56:10 before: /bin:/sbin:/usr/bin:/usr/sbin
07/24/13 16:56:10 New env: PATH=/usr/bin:/bin:/sbin:/usr/bin:/usr/sbin _CONDOR_PROCNO=0 CONDOR_CONFIG=/etc/condor/condor_config _CONDOR_NPROCS=4 _CONDOR_REMOTE_SPOOL_DIR=/var/lib/condor/spool/90/0/cluster90.proc0.subproc0
07/24/13 16:56:10 in VanillaProc::StartJob()
07/24/13 16:56:10 Requesting cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxx for job.
07/24/13 16:56:10 Value of RequestedChroot is unset.
07/24/13 16:56:10 PID namespace option: false
07/24/13 16:56:10 in OsProc::StartJob()
07/24/13 16:56:10 IWD: /var/lib/condor/execute/dir_31181
07/24/13 16:56:10 Input file: /dev/null
07/24/13 16:56:10 Output file: /var/lib/condor/execute/dir_31181/_condor_stdout
07/24/13 16:56:10 Error file: /var/lib/condor/execute/dir_31181/_condor_stderr
07/24/13 16:56:10 About to exec /var/lib/condor/execute/dir_31181/condor_exec.exe 15
07/24/13 16:56:10 Env = TEMP=/var/lib/condor/execute/dir_31181 _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_31181 _CONDOR_SLOT=slot1_1 TMPDIR=/var/lib/condor/execute/dir_31181 _CONDOR_PROCNO=0 _CONDOR_JOB_PIDS= TMP=/var/lib/condor/execute/dir_31181 _CONDOR_REMOTE_SPOOL_DIR=/var/lib/condor/spool/90/0/cluster90.proc0.subproc0 _CONDOR_JOB_AD=/var/lib/condor/execute/dir_31181/.job.ad _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_31181 CONDOR_CONFIG=/etc/condor/condor_config PATH=/usr/bin:/bin:/sbin:/usr/bin:/usr/sbin _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_31181/.machine.ad _CONDOR_NPROCS=4
07/24/13 16:56:10 Setting job's virtual memory rlimit to 17179869184 megabytes
07/24/13 16:56:10 ENFORCE_CPU_AFFINITY not true, not setting affinity
07/24/13 16:56:10 Running job as user username
07/24/13 16:56:10 track_family_via_cgroup: Tracking PID 31185 via cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxx.
07/24/13 16:56:10 About to tell ProcD to track family with root 31185 via cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxx
07/24/13 16:56:10 Create_Process succeeded, pid=31185
07/24/13 16:56:10 Initializing cgroup library.
07/24/13 16:56:18 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:18 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:18 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:18 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:18 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:18 Entering JICShadow::updateShadow()
07/24/13 16:56:18 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:18 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:18 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:18 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:18 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:18 Sent job ClassAd update to startd.
07/24/13 16:56:18 Leaving JICShadow::updateShadow(): success
07/24/13 16:56:25 DaemonCore: No more children processes to reap.
07/24/13 16:56:25 Process exited, pid=31185, status=0
07/24/13 16:56:25 Inside VanillaProc::JobReaper()
07/24/13 16:56:25 Inside OsProc::JobReaper()
07/24/13 16:56:25 Inside UserProc::JobReaper()
07/24/13 16:56:25 Reaper: all=1 handled=1 ShuttingDown=0
07/24/13 16:56:25 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:25 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:25 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:25 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:25 HOOK_JOB_EXIT not configured.
07/24/13 16:56:25 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:25 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:25 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:25 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:25 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:25 Entering JICShadow::updateShadow()
07/24/13 16:56:25 Sent job ClassAd update to startd.
07/24/13 16:56:25 Leaving JICShadow::updateShadow(): success
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 JICShadow::transferOutput(void): Transferring...
07/24/13 16:56:25 Begin transfer of sandbox to shadow.
07/24/13 16:56:25 entering FileTransfer::UploadFiles (final_transfer=1)
07/24/13 16:56:25 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Sending new file _condor_stdout, time==1374677770, size==0
07/24/13 16:56:25 Skipping file in exception list: .job.ad
07/24/13 16:56:25 Sending new file _condor_stderr, time==1374677770, size==0
07/24/13 16:56:25 Skipping file in exception list: .machine.ad
07/24/13 16:56:25 Skipping file chirp.config, t: 1374677769==1374677769, s: 54==54
07/24/13 16:56:25 Skipping file condor_exec.exe, t: 1374677769==1374677769, s: 31136==31136
07/24/13 16:56:25 FileTransfer::UploadFiles: sent TransKey=1#51efeb09437ffa2dcc159bc
07/24/13 16:56:25 entering FileTransfer::Upload
07/24/13 16:56:25 entering FileTransfer::DoUpload
07/24/13 16:56:25 DoUpload: sending file _condor_stdout
07/24/13 16:56:25 FILETRANSFER: outgoing file_command is 1 for _condor_stdout
07/24/13 16:56:25 Received GoAhead from peer to send /var/lib/condor/execute/dir_31181/_condor_stdout.
07/24/13 16:56:25 Sending GoAhead for 192.168.100.160 to receive /var/lib/condor/execute/dir_31181/_condor_stdout and all further files.
07/24/13 16:56:25 ReliSock::put_file_with_permissions(): going to send permissions 100644
07/24/13 16:56:25 put_file: going to send from filename /var/lib/condor/execute/dir_31181/_condor_stdout
07/24/13 16:56:25 put_file: Found file size 0
07/24/13 16:56:25 put_file: sending 0 bytes
07/24/13 16:56:25 ReliSock: put_file: sent 0 bytes
07/24/13 16:56:25 DoUpload: sending file _condor_stderr
07/24/13 16:56:25 FILETRANSFER: outgoing file_command is 1 for _condor_stderr
07/24/13 16:56:25 Received GoAhead from peer to send /var/lib/condor/execute/dir_31181/_condor_stderr.
07/24/13 16:56:25 ReliSock::put_file_with_permissions(): going to send permissions 100644
07/24/13 16:56:25 put_file: going to send from filename /var/lib/condor/execute/dir_31181/_condor_stderr
07/24/13 16:56:25 put_file: Found file size 0
07/24/13 16:56:25 put_file: sending 0 bytes
07/24/13 16:56:25 ReliSock: put_file: sent 0 bytes
07/24/13 16:56:25 DoUpload: exiting at 3294
07/24/13 16:56:25 End transfer of sandbox to shadow.
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Inside OsProc::JobExit()
07/24/13 16:56:25 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Notifying exit status=0 reason=100
07/24/13 16:56:25 Sent job ClassAd update to startd.
07/24/13 16:56:25 Hold all jobs
07/24/13 16:56:25 All jobs were removed due to OOM event.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Closing event FD pipe 65536.
07/24/13 16:56:25 ShutdownFast all jobs.
07/24/13 16:56:25 Got ShutdownFast when no jobs running.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Got SIGQUIT.  Performing fast shutdown.
07/24/13 16:56:25 ShutdownFast all jobs.
07/24/13 16:56:25 Got ShutdownFast when no jobs running.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 dirscat: dirpath = /
07/24/13 16:56:25 dirscat: subdir = /var/lib/condor/execute
07/24/13 16:56:25 Initializing Directory: curr_dir = /var/lib/condor/execute/
07/24/13 16:56:25 Removing /var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Attempting to remove /var/lib/condor/execute/dir_31181 as SuperUser (root)
07/24/13 16:56:25 **** condor_starter (condor_STARTER) pid 31181 EXITING WITH STATUS 0

07/24/13 12:47:39 Initializing cgroup library.
07/24/13 12:47:44 DaemonCore: No more children processes to reap.
07/24/13 12:47:44 Process exited, pid=32686, status=0
07/24/13 12:47:44 Inside VanillaProc::JobReaper()
07/24/13 12:47:44 Inside OsProc::JobReaper()
07/24/13 12:47:44 Inside UserProc::JobReaper()
07/24/13 12:47:44 Reaper: all=1 handled=1 ShuttingDown=0
07/24/13 12:47:44 In VanillaProc::PublishUpdateAd()
07/24/13 12:47:44 Inside OsProc::PublishUpdateAd()
07/24/13 12:47:44 Inside UserProc::PublishUpdateAd()
07/24/13 12:47:44 HOOK_JOB_EXIT not configured.
07/24/13 12:47:44 In VanillaProc::PublishUpdateAd()
07/24/13 12:47:44 Inside OsProc::PublishUpdateAd()
07/24/13 12:47:44 Inside UserProc::PublishUpdateAd()
07/24/13 12:47:44 Entering JICShadow::updateShadow()
07/24/13 12:47:44 Sent job ClassAd update to startd.
07/24/13 12:47:44 Leaving JICShadow::updateShadow(): success
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 JICShadow::transferOutput(void): Transferring...
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)
07/24/13 12:47:44 Inside OsProc::JobExit()
07/24/13 12:47:44 Notifying exit status=0 reason=100
07/24/13 12:47:44 Sent job ClassAd update to startd.
07/24/13 12:47:44 Hold all jobs
07/24/13 12:47:44 All jobs were removed due to OOM event.
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)
07/24/13 12:47:44 Closing event FD pipe 0.
07/24/13 12:47:44 Close_Pipe on invalid pipe end: 0
07/24/13 12:47:44 ERROR "Close_Pipe error" at line 2104 in file
/slots/01/dir_5373/userdir/src/condor_daemon_core.V6/daemon_core.cpp
07/24/13 12:47:44 ShutdownFast all jobs.
07/24/13 12:47:44 Got ShutdownFast when no jobs running.
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)

It seems an event is fired at the OOM eventfd for some reason (the
cgroup itself being destroyed, perhaps?). Has anybody else seen the same
issue? Could it be a change in the kernel's cgroups interface?
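
For reference, this is roughly how a cgroup-v1 OOM notification is wired up. A minimal standalone sketch, not HTCondor code, assuming the memory controller is mounted at /cgroup/memory and using a hypothetical test cgroup; as far as I can tell, the kernel also signals a registered eventfd when the watched cgroup is removed, which would explain a spurious "OOM" exactly at job exit:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical test cgroup; create it first with mkdir. */
	const char *cg = "/cgroup/memory/htcondor/oom_test";
	char buf[128];

	int efd = eventfd(0, 0);                       /* notification endpoint */

	snprintf(buf, sizeof(buf), "%s/memory.oom_control", cg);
	int ofd = open(buf, O_RDONLY);                 /* file being watched */

	/* Register by writing "<eventfd> <watched fd>" to cgroup.event_control. */
	snprintf(buf, sizeof(buf), "%s/cgroup.event_control", cg);
	int cfd = open(buf, O_WRONLY);
	int n = snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
	write(cfd, buf, n);
	close(cfd);

	/* This read returns on a real OOM in the group -- and, apparently,
	   also when the cgroup itself is rmdir'ed. (Error handling omitted.) */
	uint64_t count;
	read(efd, &count, sizeof(count));
	printf("event signaled, count=%llu\n", (unsigned long long)count);
	return 0;
}

Running something like this against a throwaway cgroup and then rmdir'ing it from another shell should confirm (or refute) that theory.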

Thanks,

Joan

--
--------------------------------------------------------------------------
Joan Josep Piles Contreras - Systems analyst
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 876 55 51 47 (ext. 845147)
http://i3a.unizar.es -- jpiles@xxxxxxxxx
--------------------------------------------------------------------------





_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


