Re: [HTCondor-users] CGROUPS + OOM / HOLD on exit
- Date: Wed, 24 Jul 2013 17:24:32 +0200
- From: Paolo Perfetti <paolo.perfetti@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] CGROUPS + OOM / HOLD on exit
Hi,
On 24/07/2013 13:07, Joan J. Piles wrote:
Hi all:
We are having some problems using cgroups for memory limiting. When jobs
exit, the OOM-Killer routines get called, placing the job on hold
instead of letting it end normally. With full debug output in the starter
log (and a really short job), we have:
I've been going crazy over this same problem for a week now.
My system is an up-to-date Debian Wheezy with condor version 8.0.1-148801
(from the research.cs.wisc.edu repository)
odino:~$ uname -a
Linux odino 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 GNU/Linux
cgroups seem to be working properly:
odino:~$ condor_config_val BASE_CGROUP
htcondor
odino:~$ condor_config_val CGROUP_MEMORY_LIMIT_POLICY
soft
odino:~$ grep cgroup /etc/default/grub
GRUB_CMDLINE_LINUX="cgroup_enable=memory"
odino:~$ cat /etc/cgconfig.conf
mount {
cpu = /cgroup/cpu;
cpuset = /cgroup/cpuset;
cpuacct = /cgroup/cpuacct;
memory = /cgroup/memory;
freezer = /cgroup/freezer;
blkio = /cgroup/blkio;
}
group htcondor {
cpu {}
cpuset {}
cpuacct {}
memory {
# Tested both memory.limit_in_bytes and memory.soft_limit_in_bytes
#memory.limit_in_bytes = 16370672K;
memory.soft_limit_in_bytes = 16370672K;
}
freezer {}
blkio {}
}
odino:~$ mount | grep cgrou
cgroup on /cgroup/cpu type cgroup (rw,relatime,cpu)
cgroup on /cgroup/cpuset type cgroup (rw,relatime,cpuset)
cgroup on /cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
cgroup on /cgroup/memory type cgroup (rw,relatime,memory)
cgroup on /cgroup/freezer type cgroup (rw,relatime,freezer)
cgroup on /cgroup/blkio type cgroup (rw,relatime,blkio)
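The mount check above can also be scripted. This is only a minimal sketch, assuming `mount` prints lines in the same format as shown here; the function name `cgroup_mounts` and the `KNOWN_CONTROLLERS` set are made up for illustration and are not part of HTCondor.

```python
# Controllers listed in the cgconfig.conf above (an illustrative set, not exhaustive).
KNOWN_CONTROLLERS = {"cpu", "cpuset", "cpuacct", "memory", "freezer", "blkio"}


def cgroup_mounts(mount_lines):
    """Map each mounted cgroup controller to its mount point.

    Expects lines as printed by `mount`, for example:
    'cgroup on /cgroup/memory type cgroup (rw,relatime,memory)'
    """
    mounts = {}
    for line in mount_lines:
        fields = line.split()
        # Layout: <device> on <mountpoint> type <fstype> (<options>)
        if len(fields) < 6 or fields[1] != "on" or fields[3] != "type":
            continue
        if fields[4] != "cgroup":
            continue
        options = fields[5].strip("()").split(",")
        for controller in KNOWN_CONTROLLERS.intersection(options):
            mounts[controller] = fields[2]
    return mounts


# The mount output from this message, used as sample input.
SAMPLE = """\
cgroup on /cgroup/cpu type cgroup (rw,relatime,cpu)
cgroup on /cgroup/cpuset type cgroup (rw,relatime,cpuset)
cgroup on /cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
cgroup on /cgroup/memory type cgroup (rw,relatime,memory)
cgroup on /cgroup/freezer type cgroup (rw,relatime,freezer)
cgroup on /cgroup/blkio type cgroup (rw,relatime,blkio)
""".splitlines()

mounts = cgroup_mounts(SAMPLE)
missing = sorted(KNOWN_CONTROLLERS - set(mounts))
print("memory controller mounted at:", mounts.get("memory"))
print("missing controllers:", missing)
```

On a live machine the same function could be fed the output of `subprocess.run(["mount"], capture_output=True, text=True).stdout.splitlines()`.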
The submit file is trivial:
universe = parallel
executable = /bin/sleep
arguments = 15
machine_count = 4
#request_cpu = 1
request_memory = 128
log = log
output = output
error = error
notification = never
should_transfer_files = always
when_to_transfer_output = on_exit
queue
Below is my StarterLog.
Any suggestions would be appreciated.
Thanks, Paolo
07/24/13 16:56:09 Enumerating interfaces: lo 127.0.0.1 up
07/24/13 16:56:09 Enumerating interfaces: eth0 192.168.100.161 up
07/24/13 16:56:09 Enumerating interfaces: eth1 10.5.0.2 up
07/24/13 16:56:09 Initializing Directory: curr_dir = /etc/condor/config.d
07/24/13 16:56:09 ******************************************************
07/24/13 16:56:09 ** condor_starter (CONDOR_STARTER) STARTING UP
07/24/13 16:56:09 ** /usr/sbin/condor_starter
07/24/13 16:56:09 ** SubsystemInfo: name=STARTER type=STARTER(8)
class=DAEMON(1)
07/24/13 16:56:09 ** Configuration: subsystem:STARTER local:<NONE>
class:DAEMON
07/24/13 16:56:09 ** $CondorVersion: 8.0.1 Jul 15 2013 BuildID: 148801 $
07/24/13 16:56:09 ** $CondorPlatform: x86_64_Debian7 $
07/24/13 16:56:09 ** PID = 31181
07/24/13 16:56:09 ** Log last touched 7/24 16:37:26
07/24/13 16:56:09 ******************************************************
07/24/13 16:56:09 Using config source: /etc/condor/condor_config
07/24/13 16:56:09 Using local config sources:
07/24/13 16:56:09 /etc/condor/config.d/00-asgard-common
07/24/13 16:56:09 /etc/condor/config.d/10-asgard-execute
07/24/13 16:56:09 /etc/condor/condor_config.local
07/24/13 16:56:09 Running as root. Enabling specialized core dump routines
07/24/13 16:56:09 Not using shared port because USE_SHARED_PORT=false
07/24/13 16:56:09 DaemonCore: command socket at <192.168.100.161:35626>
07/24/13 16:56:09 DaemonCore: private command socket at
<192.168.100.161:35626>
07/24/13 16:56:09 Setting maximum accepts per cycle 8.
07/24/13 16:56:09 Will use UDP to update collector odino.bo.ingv.it
<192.168.100.160:9618>
07/24/13 16:56:09 Not using shared port because USE_SHARED_PORT=false
07/24/13 16:56:09 Entering JICShadow::receiveMachineAd
07/24/13 16:56:09 Communicating with shadow <192.168.100.160:36378?noUDP>
07/24/13 16:56:09 Shadow version: $CondorVersion: 8.0.1 Jul 15 2013
BuildID: 148801 $
07/24/13 16:56:09 Submitting machine is "odino.bo.ingv.it"
07/24/13 16:56:09 Instantiating a StarterHookMgr
07/24/13 16:56:09 Job does not define HookKeyword, not invoking any job
hooks.
07/24/13 16:56:09 setting the orig job name in starter
07/24/13 16:56:09 setting the orig job iwd in starter
07/24/13 16:56:09 ShouldTransferFiles is "YES", transfering files
07/24/13 16:56:09 Submit UidDomain: "bo.ingv.it"
07/24/13 16:56:09 Local UidDomain: "bo.ingv.it"
07/24/13 16:56:09 Initialized user_priv as "username"
07/24/13 16:56:09 Done moving to directory
"/var/lib/condor/execute/dir_31181"
07/24/13 16:56:09 Job has WantIOProxy=true
07/24/13 16:56:09 Initialized IO Proxy.
07/24/13 16:56:09 LocalUserLog::initFromJobAd: path_attr = StarterUserLog
07/24/13 16:56:09 LocalUserLog::initFromJobAd: xml_attr =
StarterUserLogUseXML
07/24/13 16:56:09 No StarterUserLog found in job ClassAd
07/24/13 16:56:09 Starter will not write a local UserLog
07/24/13 16:56:09 Done setting resource limits
07/24/13 16:56:09 Changing the executable name
07/24/13 16:56:09 entering FileTransfer::Init
07/24/13 16:56:09 entering FileTransfer::SimpleInit
07/24/13 16:56:09 FILETRANSFER: protocol "http" handled by
"/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "ftp" handled by
"/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "file" handled by
"/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "data" handled by
"/usr/lib/condor/libexec/data_plugin"
07/24/13 16:56:09 Initializing Directory: curr_dir =
/var/lib/condor/execute/dir_31181
07/24/13 16:56:09 TransferIntermediate="(none)"
07/24/13 16:56:09 entering FileTransfer::DownloadFiles
07/24/13 16:56:09 entering FileTransfer::Download
07/24/13 16:56:09 FileTransfer: created download transfer process with
id 31184
07/24/13 16:56:09 entering FileTransfer::DownloadThread
07/24/13 16:56:09 entering FileTransfer::DoDownload sync=1
07/24/13 16:56:09 DaemonCore: No more children processes to reap.
07/24/13 16:56:09 DaemonCore: in SendAliveToParent()
07/24/13 16:56:09 REMAP: begin with rules:
07/24/13 16:56:09 REMAP: 0: condor_exec.exe
07/24/13 16:56:09 REMAP: res is 0 -> !
07/24/13 16:56:09 Sending GoAhead for 192.168.100.160 to send
/var/lib/condor/execute/dir_31181/condor_exec.exe and all further files.
07/24/13 16:56:09 Completed DC_CHILDALIVE to daemon at
<192.168.100.161:53285>
07/24/13 16:56:09 DaemonCore: Leaving SendAliveToParent() - success
07/24/13 16:56:09 Received GoAhead from peer to receive
/var/lib/condor/execute/dir_31181/condor_exec.exe.
07/24/13 16:56:09 get_file(): going to write to filename
/var/lib/condor/execute/dir_31181/condor_exec.exe
07/24/13 16:56:09 get_file: Receiving 31136 bytes
07/24/13 16:56:09 get_file: wrote 31136 bytes to file
07/24/13 16:56:09 ReliSock::get_file_with_permissions(): going to set
permissions 755
07/24/13 16:56:09 DaemonCore: No more children processes to reap.
07/24/13 16:56:09 File transfer completed successfully.
07/24/13 16:56:09 Initializing Directory: curr_dir =
/var/lib/condor/execute/dir_31181
07/24/13 16:56:10 Calling client FileTransfer handler function.
07/24/13 16:56:10 HOOK_PREPARE_JOB not configured.
07/24/13 16:56:10 Job 90.0 set to execute immediately
07/24/13 16:56:10 Starting a PARALLEL universe job with ID: 90.0
07/24/13 16:56:10 In OsProc::OsProc()
07/24/13 16:56:10 Main job KillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Main job RmKillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Main job HoldKillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Constructor of ParallelProc::ParallelProc
07/24/13 16:56:10 in ParallelProc::StartJob()
07/24/13 16:56:10 Found Node = 0 in job ad
07/24/13 16:56:10 ParallelProc::addEnvVars()
07/24/13 16:56:10 No Path in ad, $PATH in env
07/24/13 16:56:10 before: /bin:/sbin:/usr/bin:/usr/sbin
07/24/13 16:56:10 New env: PATH=/usr/bin:/bin:/sbin:/usr/bin:/usr/sbin
_CONDOR_PROCNO=0 CONDOR_CONFIG=/etc/condor/condor_config
_CONDOR_NPROCS=4
_CONDOR_REMOTE_SPOOL_DIR=/var/lib/condor/spool/90/0/cluster90.proc0.subproc0
07/24/13 16:56:10 in VanillaProc::StartJob()
07/24/13 16:56:10 Requesting cgroup
htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxx for job.
07/24/13 16:56:10 Value of RequestedChroot is unset.
07/24/13 16:56:10 PID namespace option: false
07/24/13 16:56:10 in OsProc::StartJob()
07/24/13 16:56:10 IWD: /var/lib/condor/execute/dir_31181
07/24/13 16:56:10 Input file: /dev/null
07/24/13 16:56:10 Output file:
/var/lib/condor/execute/dir_31181/_condor_stdout
07/24/13 16:56:10 Error file:
/var/lib/condor/execute/dir_31181/_condor_stderr
07/24/13 16:56:10 About to exec
/var/lib/condor/execute/dir_31181/condor_exec.exe 15
07/24/13 16:56:10 Env = TEMP=/var/lib/condor/execute/dir_31181
_CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_31181
_CONDOR_SLOT=slot1_1 TMPDIR=/var/lib/condor/execute/dir_31181
_CONDOR_PROCNO=0 _CONDOR_JOB_PIDS= TMP=/var/lib/condor/execute/dir_31181
_CONDOR_REMOTE_SPOOL_DIR=/var/lib/condor/spool/90/0/cluster90.proc0.subproc0
_CONDOR_JOB_AD=/var/lib/condor/execute/dir_31181/.job.ad
_CONDOR_JOB_IWD=/var/lib/condor/execute/dir_31181
CONDOR_CONFIG=/etc/condor/condor_config
PATH=/usr/bin:/bin:/sbin:/usr/bin:/usr/sbin
_CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_31181/.machine.ad
_CONDOR_NPROCS=4
07/24/13 16:56:10 Setting job's virtual memory rlimit to 17179869184
megabytes
07/24/13 16:56:10 ENFORCE_CPU_AFFINITY not true, not setting affinity
07/24/13 16:56:10 Running job as user username
07/24/13 16:56:10 track_family_via_cgroup: Tracking PID 31185 via cgroup
htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxx
07/24/13 16:56:10 About to tell ProcD to track family with root 31185
via cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxx
07/24/13 16:56:10 Create_Process succeeded, pid=31185
07/24/13 16:56:10 Initializing cgroup library.
07/24/13 16:56:18 Initializing Directory: curr_dir =
/var/lib/condor/execute/dir_31181
07/24/13 16:56:18 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:18 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:18 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:18 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:18 Entering JICShadow::updateShadow()
07/24/13 16:56:18 Initializing Directory: curr_dir =
/var/lib/condor/execute/dir_31181
07/24/13 16:56:18 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:18 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:18 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:18 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:18 Sent job ClassAd update to startd.
07/24/13 16:56:18 Leaving JICShadow::updateShadow(): success
07/24/13 16:56:25 DaemonCore: No more children processes to reap.
07/24/13 16:56:25 Process exited, pid=31185, status=0
07/24/13 16:56:25 Inside VanillaProc::JobReaper()
07/24/13 16:56:25 Inside OsProc::JobReaper()
07/24/13 16:56:25 Inside UserProc::JobReaper()
07/24/13 16:56:25 Reaper: all=1 handled=1 ShuttingDown=0
07/24/13 16:56:25 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:25 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:25 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:25 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:25 HOOK_JOB_EXIT not configured.
07/24/13 16:56:25 Initializing Directory: curr_dir =
/var/lib/condor/execute/dir_31181
07/24/13 16:56:25 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:25 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:25 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:25 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:25 Entering JICShadow::updateShadow()
07/24/13 16:56:25 Sent job ClassAd update to startd.
07/24/13 16:56:25 Leaving JICShadow::updateShadow(): success
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 JICShadow::transferOutput(void): Transferring...
07/24/13 16:56:25 Begin transfer of sandbox to shadow.
07/24/13 16:56:25 entering FileTransfer::UploadFiles (final_transfer=1)
07/24/13 16:56:25 Initializing Directory: curr_dir =
/var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Sending new file _condor_stdout, time==1374677770, size==0
07/24/13 16:56:25 Skipping file in exception list: .job.ad
07/24/13 16:56:25 Sending new file _condor_stderr, time==1374677770, size==0
07/24/13 16:56:25 Skipping file in exception list: .machine.ad
07/24/13 16:56:25 Skipping file chirp.config, t: 1374677769==1374677769,
s: 54==54
07/24/13 16:56:25 Skipping file condor_exec.exe, t:
1374677769==1374677769, s: 31136==31136
07/24/13 16:56:25 FileTransfer::UploadFiles: sent
TransKey=1#51efeb09437ffa2dcc159bc
07/24/13 16:56:25 entering FileTransfer::Upload
07/24/13 16:56:25 entering FileTransfer::DoUpload
07/24/13 16:56:25 DoUpload: sending file _condor_stdout
07/24/13 16:56:25 FILETRANSFER: outgoing file_command is 1 for
_condor_stdout
07/24/13 16:56:25 Received GoAhead from peer to send
/var/lib/condor/execute/dir_31181/_condor_stdout.
07/24/13 16:56:25 Sending GoAhead for 192.168.100.160 to receive
/var/lib/condor/execute/dir_31181/_condor_stdout and all further files.
07/24/13 16:56:25 ReliSock::put_file_with_permissions(): going to send
permissions 100644
07/24/13 16:56:25 put_file: going to send from filename
/var/lib/condor/execute/dir_31181/_condor_stdout
07/24/13 16:56:25 put_file: Found file size 0
07/24/13 16:56:25 put_file: sending 0 bytes
07/24/13 16:56:25 ReliSock: put_file: sent 0 bytes
07/24/13 16:56:25 DoUpload: sending file _condor_stderr
07/24/13 16:56:25 FILETRANSFER: outgoing file_command is 1 for
_condor_stderr
07/24/13 16:56:25 Received GoAhead from peer to send
/var/lib/condor/execute/dir_31181/_condor_stderr.
07/24/13 16:56:25 ReliSock::put_file_with_permissions(): going to send
permissions 100644
07/24/13 16:56:25 put_file: going to send from filename
/var/lib/condor/execute/dir_31181/_condor_stderr
07/24/13 16:56:25 put_file: Found file size 0
07/24/13 16:56:25 put_file: sending 0 bytes
07/24/13 16:56:25 ReliSock: put_file: sent 0 bytes
07/24/13 16:56:25 DoUpload: exiting at 3294
07/24/13 16:56:25 End transfer of sandbox to shadow.
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Inside OsProc::JobExit()
07/24/13 16:56:25 Initializing Directory: curr_dir =
/var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Notifying exit status=0 reason=100
07/24/13 16:56:25 Sent job ClassAd update to startd.
07/24/13 16:56:25 Hold all jobs
07/24/13 16:56:25 All jobs were removed due to OOM event.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Closing event FD pipe 65536.
07/24/13 16:56:25 ShutdownFast all jobs.
07/24/13 16:56:25 Got ShutdownFast when no jobs running.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Got SIGQUIT. Performing fast shutdown.
07/24/13 16:56:25 ShutdownFast all jobs.
07/24/13 16:56:25 Got ShutdownFast when no jobs running.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 dirscat: dirpath = /
07/24/13 16:56:25 dirscat: subdir = /var/lib/condor/execute
07/24/13 16:56:25 Initializing Directory: curr_dir =
/var/lib/condor/execute/
07/24/13 16:56:25 Removing /var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Attempting to remove /var/lib/condor/execute/dir_31181
as SuperUser (root)
07/24/13 16:56:25 **** condor_starter (condor_STARTER) pid 31181 EXITING
WITH STATUS 0
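The suspicious pattern in the log above is a clean exit ("Process exited, pid=..., status=0") followed shortly by "All jobs were removed due to OOM event." A small script can flag it across many StarterLog files. This is only a sketch that matches the exact message strings seen in this log; the function name `spurious_oom_holds` is hypothetical, and other HTCondor versions may word these messages differently.

```python
import re

# Message formats copied from the StarterLog in this message.
EXIT_RE = re.compile(r"Process exited, pid=(\d+), status=(\d+)")
OOM_MARKER = "All jobs were removed due to OOM event."


def spurious_oom_holds(log_lines):
    """Return PIDs that exited with status 0 and were then reported as OOM."""
    flagged = []
    last_clean_pid = None
    for line in log_lines:
        match = EXIT_RE.search(line)
        if match:
            pid, status = int(match.group(1)), int(match.group(2))
            # Only a zero exit status makes a later OOM report suspicious.
            last_clean_pid = pid if status == 0 else None
        elif OOM_MARKER in line and last_clean_pid is not None:
            flagged.append(last_clean_pid)
            last_clean_pid = None
    return flagged


# A few lines from the log above, used as sample input.
SAMPLE_LOG = [
    "07/24/13 16:56:25 Process exited, pid=31185, status=0",
    "07/24/13 16:56:25 Notifying exit status=0 reason=100",
    "07/24/13 16:56:25 Hold all jobs",
    "07/24/13 16:56:25 All jobs were removed due to OOM event.",
]

print(spurious_oom_holds(SAMPLE_LOG))
```

A job that really was killed by the OOM killer would normally not show status=0, so it is skipped by this check.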
07/24/13 12:47:39 Initializing cgroup library.
07/24/13 12:47:44 DaemonCore: No more children processes to reap.
07/24/13 12:47:44 Process exited, pid=32686, status=0
07/24/13 12:47:44 Inside VanillaProc::JobReaper()
07/24/13 12:47:44 Inside OsProc::JobReaper()
07/24/13 12:47:44 Inside UserProc::JobReaper()
07/24/13 12:47:44 Reaper: all=1 handled=1 ShuttingDown=0
07/24/13 12:47:44 In VanillaProc::PublishUpdateAd()
07/24/13 12:47:44 Inside OsProc::PublishUpdateAd()
07/24/13 12:47:44 Inside UserProc::PublishUpdateAd()
07/24/13 12:47:44 HOOK_JOB_EXIT not configured.
07/24/13 12:47:44 In VanillaProc::PublishUpdateAd()
07/24/13 12:47:44 Inside OsProc::PublishUpdateAd()
07/24/13 12:47:44 Inside UserProc::PublishUpdateAd()
07/24/13 12:47:44 Entering JICShadow::updateShadow()
07/24/13 12:47:44 Sent job ClassAd update to startd.
07/24/13 12:47:44 Leaving JICShadow::updateShadow(): success
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 JICShadow::transferOutput(void): Transferring...
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)
07/24/13 12:47:44 Inside OsProc::JobExit()
07/24/13 12:47:44 Notifying exit status=0 reason=100
07/24/13 12:47:44 Sent job ClassAd update to startd.
07/24/13 12:47:44 Hold all jobs
07/24/13 12:47:44 All jobs were removed due to OOM event.
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)
07/24/13 12:47:44 Closing event FD pipe 0.
07/24/13 12:47:44 Close_Pipe on invalid pipe end: 0
07/24/13 12:47:44 ERROR "Close_Pipe error" at line 2104 in file
/slots/01/dir_5373/userdir/src/condor_daemon_core.V6/daemon_core.cpp
07/24/13 12:47:44 ShutdownFast all jobs.
07/24/13 12:47:44 Got ShutdownFast when no jobs running.
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)
It seems an event is fired for some reason to the OOM eventfd (the
cgroup itself being destroyed, perhaps?). Has anybody else seen the same
issue? Could it be a change in the kernel's cgroups interface?
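For context on the eventfd guess, here is a rough sketch of how an OOM listener is registered with the cgroup v1 notification API: create an eventfd, open `memory.oom_control`, and write "<event_fd> <control_fd>" to `cgroup.event_control`. As far as I recall, the kernel's cgroup v1 documentation says the eventfd is woken either by the control file or when the cgroup is removed, which would fit the behaviour seen here, but that is an assumption to verify against your kernel version. This is not HTCondor's actual code; the function names are made up, and `os.eventfd` needs Linux with Python 3.10 or later.

```python
import os


def event_control_line(event_fd, control_fd):
    """Format the registration string written to cgroup.event_control."""
    return f"{event_fd} {control_fd}"


def register_oom_listener(cgroup_dir):
    """Register an eventfd for OOM events on a cgroup v1 memory cgroup.

    cgroup_dir is the job's directory under the memory hierarchy.
    Returns (event_fd, control_fd); the caller must keep both open while
    listening and close them afterwards.
    """
    event_fd = os.eventfd(0)
    try:
        control_fd = os.open(
            os.path.join(cgroup_dir, "memory.oom_control"), os.O_RDONLY
        )
    except OSError:
        os.close(event_fd)
        raise
    try:
        with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as ctl:
            ctl.write(event_control_line(event_fd, control_fd))
    except OSError:
        os.close(control_fd)
        os.close(event_fd)
        raise
    return event_fd, control_fd


def wait_for_event(event_fd):
    """Block until the eventfd is signalled and return its counter value.

    A wakeup alone does not say why the eventfd fired, so a listener
    probably should not treat every wakeup as a real OOM.
    """
    return os.eventfd_read(event_fd)
```

If the wakeup on cgroup removal is what happens here, a listener would need some extra check (for example, whether the job had already exited cleanly) before putting the job on hold.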
Thanks,
Joan
--
--------------------------------------------------------------------------
Joan Josep Piles Contreras - Analista de sistemas
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 876 55 51 47 (ext. 845147)
http://i3a.unizar.es --jpiles@xxxxxxxxx
--------------------------------------------------------------------------