I couldn't agree more.
But please send the agreement to Todd (cc'd).
Cheers, Tim
From: "Joan J. Piles" <jpiles@xxxxxxxxx> To: "Tim St Clair" <tstclair@xxxxxxxxxx> Sent: Thursday, July 25, 2013 11:55:57 AM Subject: Re: Fwd: [HTCondor-users] CGROUPS + OOM / HOLD on exit
I just thought there was a more streamlined way to open a ticket (having to physically sign and scan an agreement for what amounts to a bug report is somewhat convoluted, to say the least). Anyway, I'm out of the office right now, so I'll send you the signed form tomorrow morning.

Cheers, Joan

On 25/07/13 18:29, Tim St Clair wrote:
Joan -
Cheers, Tim
From: "Joan J. Piles" <jpiles@xxxxxxxxx> To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx> Sent: Thursday, July 25, 2013 11:16:13 AM Subject: Re: [HTCondor-users] CGROUPS + OOM / HOLD on exit
Hi Tim,

What is the procedure to open a ticket? I couldn't find a registration form or anything similar.

Regards, Joan

On 25/07/13 15:55, Tim St Clair wrote:
Hi Joan -
Would you like to open a ticket? If not, I'll open it.
Cheers, Tim
From: "Joan J. Piles" <jpiles@xxxxxxxxx> To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx> Sent: Thursday, July 25, 2013 5:47:24 AM Subject: Re: [HTCondor-users] CGROUPS + OOM / HOLD on exit
Hi,

In case it's useful for somebody else: we have been able to solve (or at least work around) the problem with a small patch to the Condor source:
diff -ur condor-8.0.0.orig/src/condor_starter.V6.1/vanilla_proc.cpp condor-8.0.0/src/condor_starter.V6.1/vanilla_proc.cpp
--- condor-8.0.0.orig/src/condor_starter.V6.1/vanilla_proc.cpp	2013-05-29 18:58:09.000000000 +0200
+++ condor-8.0.0/src/condor_starter.V6.1/vanilla_proc.cpp	2013-07-25 12:13:09.000000000 +0200
@@ -798,6 +798,18 @@
 int VanillaProc::outOfMemoryEvent(int /* fd */)
 {
+
+	/* If we have no jobs left, return and do nothing */
+	if (num_pids == 0) {
+		dprintf(D_FULLDEBUG, "Closing event FD pipe %d.\n", m_oom_efd);
+		daemonCore->Close_Pipe(m_oom_efd);
+		close(m_oom_fd);
+		m_oom_efd = -1;
+		m_oom_fd = -1;
+
+		return 0;
+	}
+
 	std::stringstream ss;
 	if (m_memory_limit >= 0) {
 		ss << "Job has gone over memory limit of " << m_memory_limit << " megabytes.";
I don't know whether this is the best way to work around the problem, but it seems to work for us: we forced a (true) OOM condition and the starter responded as it should, and jobs are no longer put on hold at exit. It isn't particularly clean either, but as I said, it's more of a quick-and-dirty hack to get this feature (which is really interesting for us) running.

Regards, Joan
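For context on where such a spurious event can come from: under cgroup v1, a listener registers for OOM notifications by binding an eventfd to memory.oom_control through cgroup.event_control, and the kernel signals registered eventfds not only on a real OOM but also when the watched cgroup is removed. The m_oom_fd/m_oom_efd members in the patch suggest the starter uses this mechanism, but the sketch below is only an independent, minimal illustration, not HTCondor code; it assumes the memory controller is mounted at /cgroup/memory (as in the cgconfig.conf shown later in this thread) and omits error handling.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(int argc, char **argv)
{
	char path[256], reg[64];
	uint64_t count;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <cgroup name>\n", argv[0]);
		return 1;
	}

	/* eventfd the kernel will signal */
	int efd = eventfd(0, 0);

	/* OOM state file of the cgroup to watch */
	snprintf(path, sizeof(path), "/cgroup/memory/%s/memory.oom_control", argv[1]);
	int ofd = open(path, O_RDONLY);

	/* registration: write "<eventfd> <oom_control fd>" to cgroup.event_control */
	snprintf(path, sizeof(path), "/cgroup/memory/%s/cgroup.event_control", argv[1]);
	int cfd = open(path, O_WRONLY);
	int len = snprintf(reg, sizeof(reg), "%d %d", efd, ofd);
	write(cfd, reg, len);

	/* Blocks until the eventfd is signalled. In cgroup v1 this happens on a
	 * real OOM in the group *and* when the group is removed, so a wakeup
	 * here is ambiguous -- consistent with the hold-on-exit seen here. */
	read(efd, &count, sizeof(count));
	printf("eventfd signalled (count=%llu)\n", (unsigned long long)count);
	return 0;
}

A wakeup from read() therefore has two possible causes, which would explain the starter seeing an "OOM" exactly when the job's cgroup is torn down at exit.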
On 24/07/13 17:24, Paolo Perfetti wrote:
Hi,

On 24/07/2013 13:07, Joan J. Piles wrote:
Hi all: We are having problems using cgroups for memory limiting. When jobs exit, the OOM-killer routines get called, placing the job on hold instead of letting it end normally. With full starter debug logging (and a really short job) we see:
I've been going crazy over the same problem for a week now. My system is an up-to-date Debian Wheezy with Condor version 8.0.1-148801 (from the research.cs.wisc.edu repository):

odino:~$ uname -a
Linux odino 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 GNU/Linux

cgroups seem to be working properly:

odino:~$ condor_config_val BASE_CGROUP
htcondor
odino:~$ condor_config_val CGROUP_MEMORY_LIMIT_POLICY
soft
odino:~$ grep cgroup /etc/default/grub
GRUB_CMDLINE_LINUX="cgroup_enable=memory"
odino:~$ cat /etc/cgconfig.conf
mount {
	cpu = /cgroup/cpu;
	cpuset = /cgroup/cpuset;
	cpuacct = /cgroup/cpuacct;
	memory = /cgroup/memory;
	freezer = /cgroup/freezer;
	blkio = /cgroup/blkio;
}
group htcondor {
	cpu {}
	cpuset {}
	cpuacct {}
	memory {
		# Tested both memory.limit_in_bytes and memory.soft_limit_in_bytes
		#memory.limit_in_bytes = 16370672K;
		memory.soft_limit_in_bytes = 16370672K;
	}
	freezer {}
	blkio {}
}
odino:~$ mount | grep cgrou
cgroup on /cgroup/cpu type cgroup (rw,relatime,cpu)
cgroup on /cgroup/cpuset type cgroup (rw,relatime,cpuset)
cgroup on /cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
cgroup on /cgroup/memory type cgroup (rw,relatime,memory)
cgroup on /cgroup/freezer type cgroup (rw,relatime,freezer)
cgroup on /cgroup/blkio type cgroup (rw,relatime,blkio)

The submit file is trivial:

universe = parallel
executable = /bin/sleep
arguments = 15
machine_count = 4
#request_cpu = 1
request_memory = 128
log = log
output = output
error = error
notification = never
should_transfer_files = always
when_to_transfer_output = on_exit
queue

Below is my StarterLog. Any suggestions would be appreciated.

tnx, Paolo

07/24/13 16:56:09 Enumerating interfaces: lo 127.0.0.1 up
07/24/13 16:56:09 Enumerating interfaces: eth0 192.168.100.161 up
07/24/13 16:56:09 Enumerating interfaces: eth1 10.5.0.2 up
07/24/13 16:56:09 Initializing Directory: curr_dir = /etc/condor/config.d
07/24/13 16:56:09 ******************************************************
07/24/13 16:56:09 ** condor_starter (CONDOR_STARTER) STARTING UP
07/24/13 16:56:09 ** /usr/sbin/condor_starter
07/24/13 16:56:09 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
07/24/13 16:56:09 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
07/24/13 16:56:09 ** $CondorVersion: 8.0.1 Jul 15 2013 BuildID: 148801 $
07/24/13 16:56:09 ** $CondorPlatform: x86_64_Debian7 $
07/24/13 16:56:09 ** PID = 31181
07/24/13 16:56:09 ** Log last touched 7/24 16:37:26
07/24/13 16:56:09 ******************************************************
07/24/13 16:56:09 Using config source: /etc/condor/condor_config
07/24/13 16:56:09 Using local config sources:
07/24/13 16:56:09    /etc/condor/config.d/00-asgard-common
07/24/13 16:56:09    /etc/condor/config.d/10-asgard-execute
07/24/13 16:56:09    /etc/condor/condor_config.local
07/24/13 16:56:09 Running as root. Enabling specialized core dump routines
07/24/13 16:56:09 Not using shared port because USE_SHARED_PORT=false
07/24/13 16:56:09 DaemonCore: command socket at <192.168.100.161:35626>
07/24/13 16:56:09 DaemonCore: private command socket at <192.168.100.161:35626>
07/24/13 16:56:09 Setting maximum accepts per cycle 8.
07/24/13 16:56:09 Will use UDP to update collector odino.bo.ingv.it <192.168.100.160:9618>
07/24/13 16:56:09 Not using shared port because USE_SHARED_PORT=false
07/24/13 16:56:09 Entering JICShadow::receiveMachineAd
07/24/13 16:56:09 Communicating with shadow <192.168.100.160:36378?noUDP>
07/24/13 16:56:09 Shadow version: $CondorVersion: 8.0.1 Jul 15 2013 BuildID: 148801 $
07/24/13 16:56:09 Submitting machine is "odino.bo.ingv.it"
07/24/13 16:56:09 Instantiating a StarterHookMgr
07/24/13 16:56:09 Job does not define HookKeyword, not invoking any job hooks.
07/24/13 16:56:09 setting the orig job name in starter
07/24/13 16:56:09 setting the orig job iwd in starter
07/24/13 16:56:09 ShouldTransferFiles is "YES", transfering files
07/24/13 16:56:09 Submit UidDomain: "bo.ingv.it"
07/24/13 16:56:09 Local UidDomain: "bo.ingv.it"
07/24/13 16:56:09 Initialized user_priv as "username"
07/24/13 16:56:09 Done moving to directory "/var/lib/condor/execute/dir_31181"
07/24/13 16:56:09 Job has WantIOProxy=true
07/24/13 16:56:09 Initialized IO Proxy.
07/24/13 16:56:09 LocalUserLog::initFromJobAd: path_attr = StarterUserLog
07/24/13 16:56:09 LocalUserLog::initFromJobAd: xml_attr = StarterUserLogUseXML
07/24/13 16:56:09 No StarterUserLog found in job ClassAd
07/24/13 16:56:09 Starter will not write a local UserLog
07/24/13 16:56:09 Done setting resource limits
07/24/13 16:56:09 Changing the executable name
07/24/13 16:56:09 entering FileTransfer::Init
07/24/13 16:56:09 entering FileTransfer::SimpleInit
07/24/13 16:56:09 FILETRANSFER: protocol "http" handled by "/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "ftp" handled by "/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "file" handled by "/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "data" handled by "/usr/lib/condor/libexec/data_plugin"
07/24/13 16:56:09 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:09 TransferIntermediate="(none)"
07/24/13 16:56:09 entering FileTransfer::DownloadFiles
07/24/13 16:56:09 entering FileTransfer::Download
07/24/13 16:56:09 FileTransfer: created download transfer process with id 31184
07/24/13 16:56:09 entering FileTransfer::DownloadThread
07/24/13 16:56:09 entering FileTransfer::DoDownload sync=1
07/24/13 16:56:09 DaemonCore: No more children processes to reap.
07/24/13 16:56:09 DaemonCore: in SendAliveToParent()
07/24/13 16:56:09 REMAP: begin with rules:
07/24/13 16:56:09 REMAP: 0: condor_exec.exe
07/24/13 16:56:09 REMAP: res is 0 -> !
07/24/13 16:56:09 Sending GoAhead for 192.168.100.160 to send /var/lib/condor/execute/dir_31181/condor_exec.exe and all further files.
07/24/13 16:56:09 Completed DC_CHILDALIVE to daemon at <192.168.100.161:53285>
07/24/13 16:56:09 DaemonCore: Leaving SendAliveToParent() - success
07/24/13 16:56:09 Received GoAhead from peer to receive /var/lib/condor/execute/dir_31181/condor_exec.exe.
07/24/13 16:56:09 get_file(): going to write to filename /var/lib/condor/execute/dir_31181/condor_exec.exe
07/24/13 16:56:09 get_file: Receiving 31136 bytes
07/24/13 16:56:09 get_file: wrote 31136 bytes to file
07/24/13 16:56:09 ReliSock::get_file_with_permissions(): going to set permissions 755
07/24/13 16:56:09 DaemonCore: No more children processes to reap.
07/24/13 16:56:09 File transfer completed successfully.
07/24/13 16:56:09 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:10 Calling client FileTransfer handler function.
07/24/13 16:56:10 HOOK_PREPARE_JOB not configured.
07/24/13 16:56:10 Job 90.0 set to execute immediately
07/24/13 16:56:10 Starting a PARALLEL universe job with ID: 90.0
07/24/13 16:56:10 In OsProc::OsProc()
07/24/13 16:56:10 Main job KillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Main job RmKillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Main job HoldKillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Constructor of ParallelProc::ParallelProc
07/24/13 16:56:10 in ParallelProc::StartJob()
07/24/13 16:56:10 Found Node = 0 in job ad
07/24/13 16:56:10 ParallelProc::addEnvVars()
07/24/13 16:56:10 No Path in ad, $PATH in env
07/24/13 16:56:10 before: /bin:/sbin:/usr/bin:/usr/sbin
07/24/13 16:56:10 New env: PATH=/usr/bin:/bin:/sbin:/usr/bin:/usr/sbin _CONDOR_PROCNO=0 CONDOR_CONFIG=/etc/condor/condor_config _CONDOR_NPROCS=4 _CONDOR_REMOTE_SPOOL_DIR=/var/lib/condor/spool/90/0/cluster90.proc0.subproc0
07/24/13 16:56:10 in VanillaProc::StartJob()
07/24/13 16:56:10 Requesting cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxx for job.
07/24/13 16:56:10 Value of RequestedChroot is unset.
07/24/13 16:56:10 PID namespace option: false
07/24/13 16:56:10 in OsProc::StartJob()
07/24/13 16:56:10 IWD: /var/lib/condor/execute/dir_31181
07/24/13 16:56:10 Input file: /dev/null
07/24/13 16:56:10 Output file: /var/lib/condor/execute/dir_31181/_condor_stdout
07/24/13 16:56:10 Error file: /var/lib/condor/execute/dir_31181/_condor_stderr
07/24/13 16:56:10 About to exec /var/lib/condor/execute/dir_31181/condor_exec.exe 15
07/24/13 16:56:10 Env = TEMP=/var/lib/condor/execute/dir_31181 _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_31181 _CONDOR_SLOT=slot1_1 TMPDIR=/var/lib/condor/execute/dir_31181 _CONDOR_PROCNO=0 _CONDOR_JOB_PIDS= TMP=/var/lib/condor/execute/dir_31181 _CONDOR_REMOTE_SPOOL_DIR=/var/lib/condor/spool/90/0/cluster90.proc0.subproc0 _CONDOR_JOB_AD=/var/lib/condor/execute/dir_31181/.job.ad _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_31181 CONDOR_CONFIG=/etc/condor/condor_config PATH=/usr/bin:/bin:/sbin:/usr/bin:/usr/sbin _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_31181/.machine.ad _CONDOR_NPROCS=4
07/24/13 16:56:10 Setting job's virtual memory rlimit to 17179869184 megabytes
07/24/13 16:56:10 ENFORCE_CPU_AFFINITY not true, not setting affinity
07/24/13 16:56:10 Running job as user username
07/24/13 16:56:10 track_family_via_cgroup: Tracking PID 31185 via cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxx.
07/24/13 16:56:10 About to tell ProcD to track family with root 31185 via cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxx
07/24/13 16:56:10 Create_Process succeeded, pid=31185
07/24/13 16:56:10 Initializing cgroup library.
07/24/13 16:56:18 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:18 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:18 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:18 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:18 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:18 Entering JICShadow::updateShadow()
07/24/13 16:56:18 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:18 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:18 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:18 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:18 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:18 Sent job ClassAd update to startd.
07/24/13 16:56:18 Leaving JICShadow::updateShadow(): success
07/24/13 16:56:25 DaemonCore: No more children processes to reap.
07/24/13 16:56:25 Process exited, pid=31185, status=0
07/24/13 16:56:25 Inside VanillaProc::JobReaper()
07/24/13 16:56:25 Inside OsProc::JobReaper()
07/24/13 16:56:25 Inside UserProc::JobReaper()
07/24/13 16:56:25 Reaper: all=1 handled=1 ShuttingDown=0
07/24/13 16:56:25 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:25 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:25 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:25 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:25 HOOK_JOB_EXIT not configured.
07/24/13 16:56:25 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:25 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:25 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:25 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:25 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:25 Entering JICShadow::updateShadow()
07/24/13 16:56:25 Sent job ClassAd update to startd.
07/24/13 16:56:25 Leaving JICShadow::updateShadow(): success
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 JICShadow::transferOutput(void): Transferring...
07/24/13 16:56:25 Begin transfer of sandbox to shadow.
07/24/13 16:56:25 entering FileTransfer::UploadFiles (final_transfer=1)
07/24/13 16:56:25 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Sending new file _condor_stdout, time==1374677770, size==0
07/24/13 16:56:25 Skipping file in exception list: .job.ad
07/24/13 16:56:25 Sending new file _condor_stderr, time==1374677770, size==0
07/24/13 16:56:25 Skipping file in exception list: .machine.ad
07/24/13 16:56:25 Skipping file chirp.config, t: 1374677769==1374677769, s: 54==54
07/24/13 16:56:25 Skipping file condor_exec.exe, t: 1374677769==1374677769, s: 31136==31136
07/24/13 16:56:25 FileTransfer::UploadFiles: sent TransKey=1#51efeb09437ffa2dcc159bc
07/24/13 16:56:25 entering FileTransfer::Upload
07/24/13 16:56:25 entering FileTransfer::DoUpload
07/24/13 16:56:25 DoUpload: sending file _condor_stdout
07/24/13 16:56:25 FILETRANSFER: outgoing file_command is 1 for _condor_stdout
07/24/13 16:56:25 Received GoAhead from peer to send /var/lib/condor/execute/dir_31181/_condor_stdout.
07/24/13 16:56:25 Sending GoAhead for 192.168.100.160 to receive /var/lib/condor/execute/dir_31181/_condor_stdout and all further files.
07/24/13 16:56:25 ReliSock::put_file_with_permissions(): going to send permissions 100644
07/24/13 16:56:25 put_file: going to send from filename /var/lib/condor/execute/dir_31181/_condor_stdout
07/24/13 16:56:25 put_file: Found file size 0
07/24/13 16:56:25 put_file: sending 0 bytes
07/24/13 16:56:25 ReliSock: put_file: sent 0 bytes
07/24/13 16:56:25 DoUpload: sending file _condor_stderr
07/24/13 16:56:25 FILETRANSFER: outgoing file_command is 1 for _condor_stderr
07/24/13 16:56:25 Received GoAhead from peer to send /var/lib/condor/execute/dir_31181/_condor_stderr.
07/24/13 16:56:25 ReliSock::put_file_with_permissions(): going to send permissions 100644
07/24/13 16:56:25 put_file: going to send from filename /var/lib/condor/execute/dir_31181/_condor_stderr
07/24/13 16:56:25 put_file: Found file size 0
07/24/13 16:56:25 put_file: sending 0 bytes
07/24/13 16:56:25 ReliSock: put_file: sent 0 bytes
07/24/13 16:56:25 DoUpload: exiting at 3294
07/24/13 16:56:25 End transfer of sandbox to shadow.
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Inside OsProc::JobExit()
07/24/13 16:56:25 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Notifying exit status=0 reason=100
07/24/13 16:56:25 Sent job ClassAd update to startd.
07/24/13 16:56:25 Hold all jobs
07/24/13 16:56:25 All jobs were removed due to OOM event.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Closing event FD pipe 65536.
07/24/13 16:56:25 ShutdownFast all jobs.
07/24/13 16:56:25 Got ShutdownFast when no jobs running.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Got SIGQUIT. Performing fast shutdown.
07/24/13 16:56:25 ShutdownFast all jobs.
07/24/13 16:56:25 Got ShutdownFast when no jobs running.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 dirscat: dirpath = /
07/24/13 16:56:25 dirscat: subdir = /var/lib/condor/execute
07/24/13 16:56:25 Initializing Directory: curr_dir = /var/lib/condor/execute/
07/24/13 16:56:25 Removing /var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Attempting to remove /var/lib/condor/execute/dir_31181 as SuperUser (root)
07/24/13 16:56:25 **** condor_starter (condor_STARTER) pid 31181 EXITING WITH STATUS 0
07/24/13 12:47:39 Initializing cgroup library.
07/24/13 12:47:44 DaemonCore: No more children processes to reap.
07/24/13 12:47:44 Process exited, pid=32686, status=0
07/24/13 12:47:44 Inside VanillaProc::JobReaper()
07/24/13 12:47:44 Inside OsProc::JobReaper()
07/24/13 12:47:44 Inside UserProc::JobReaper()
07/24/13 12:47:44 Reaper: all=1 handled=1 ShuttingDown=0
07/24/13 12:47:44 In VanillaProc::PublishUpdateAd()
07/24/13 12:47:44 Inside OsProc::PublishUpdateAd()
07/24/13 12:47:44 Inside UserProc::PublishUpdateAd()
07/24/13 12:47:44 HOOK_JOB_EXIT not configured.
07/24/13 12:47:44 In VanillaProc::PublishUpdateAd()
07/24/13 12:47:44 Inside OsProc::PublishUpdateAd()
07/24/13 12:47:44 Inside UserProc::PublishUpdateAd()
07/24/13 12:47:44 Entering JICShadow::updateShadow()
07/24/13 12:47:44 Sent job ClassAd update to startd.
07/24/13 12:47:44 Leaving JICShadow::updateShadow(): success
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 JICShadow::transferOutput(void): Transferring...
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)
07/24/13 12:47:44 Inside OsProc::JobExit()
07/24/13 12:47:44 Notifying exit status=0 reason=100
07/24/13 12:47:44 Sent job ClassAd update to startd.
07/24/13 12:47:44 Hold all jobs
07/24/13 12:47:44 All jobs were removed due to OOM event.
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)
07/24/13 12:47:44 Closing event FD pipe 0.
07/24/13 12:47:44 Close_Pipe on invalid pipe end: 0
07/24/13 12:47:44 ERROR "Close_Pipe error" at line 2104 in file /slots/01/dir_5373/userdir/src/condor_daemon_core.V6/daemon_core.cpp
07/24/13 12:47:44 ShutdownFast all jobs.
07/24/13 12:47:44 Got ShutdownFast when no jobs running.
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)
It seems an event is fired to the OOM eventfd for some reason (perhaps by the cgroup itself being destroyed?). Has anybody else seen the same issue? Could it be a change in the kernel's cgroups interface?

Thanks,
Joan
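One way to test this hypothesis in code, sketched under the assumption that the cgroup-v1 layout from this thread is in use: after the eventfd fires, re-read memory.oom_control. A reading of "under_oom 1" indicates a genuine OOM, while a file that can no longer be opened means the cgroup is already gone, i.e. the notification came from cgroup removal. The helper name and path below are hypothetical, not HTCondor code.

/* Hypothetical helper. oom_control_path would be e.g.
 * "/cgroup/memory/htcondor/<job cgroup>/memory.oom_control" for the
 * layout shown earlier in this thread. */
#include <stdio.h>
#include <string.h>

static int was_real_oom(const char *oom_control_path)
{
	FILE *f = fopen(oom_control_path, "r");
	if (!f)
		return 0;  /* cgroup already removed: treat the event as spurious */

	/* The file reads as "oom_kill_disable <n>\nunder_oom <n>". */
	char key[32];
	int val = 0, under_oom = 0;
	while (fscanf(f, "%31s %d", key, &val) == 2) {
		if (strcmp(key, "under_oom") == 0) {
			under_oom = val;
			break;
		}
	}
	fclose(f);
	return under_oom;
}

Combined with the num_pids check in the patch above, a check like this would let the starter treat the removal notification as normal cleanup rather than holding the job.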
--
--------------------------------------------------------------------------
Joan Josep Piles Contreras - Systems analyst
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 876 55 51 47 (ext. 845147)
http://i3a.unizar.es -- jpiles@xxxxxxxxx
--------------------------------------------------------------------------
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/