Re: [HTCondor-devel] [HTCondor-users] spontaneous reboots after enabling cgroups


Date: Thu, 11 Jul 2013 13:04:15 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-devel] [HTCondor-users] spontaneous reboots after enabling cgroups

So... seems like we should do something about the below. The question is what. Some options:

1. Never mount the freezer controller. If we did this, does it mean that processes in a job could "get away" from our control if the job forks quickly?

2. Add logic in the code to not use the freezer controller if want_suspend =!= FALSE or suspend =!= false

3. ????

Thoughts?

-Todd
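
For discussion, option 2 could be sketched roughly as below. This is only an illustration of the decision, not HTCondor source: the function name `should_use_freezer` is hypothetical, and Python's `None` stands in for a ClassAd UNDEFINED value so that `is not False` behaves like `=!= FALSE`.

```python
from typing import Optional


def should_use_freezer(want_suspend: Optional[bool],
                       suspend: Optional[bool]) -> bool:
    """Return True only when both expressions are identically False.

    Hypothetical helper, not part of HTCondor. None models ClassAd
    UNDEFINED; like `=!= FALSE`, `is not False` counts UNDEFINED as
    "might suspend".
    """
    # If either expression could evaluate to something other than FALSE,
    # the job may end up SIGSTOP'd, so stay away from the freezer.
    may_suspend = (want_suspend is not False) or (suspend is not False)
    return not may_suspend
```

With this shape, a machine is treated as safe for the freezer only when suspension is explicitly ruled out; an unset expression is treated as "may suspend".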



On 7/11/2013 11:58 AM, Jason Ferrara wrote:
On 6/27/2013 10:07 PM, Brian Bockelman wrote:
Hi Jason,

If memory serves, the RHEL 6.4 kernel can crash when attempting to freeze a set of SIGSTOP'd processes.  I don't know if it is fixed in the upstream kernel though...

Two workarounds come to mind:

1) Unmount the freezer controller.  HTCondor should simply not use controllers that are not available.

That worked. Thanks!

2) Set SUSPEND=FALSE on the worker node configuration.

I tried that before making my original post. It seems that even with SUSPEND=FALSE condor is still messing around with the freezer controller.

Hope this helps,

Brian
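
To confirm that the first workaround took effect on a node, one possible check is to look for a cgroup (v1) mount that carries the freezer controller. The sketch below is an assumption-laden illustration: the helper name `freezer_is_mounted` is made up, and it relies on the usual `/proc/mounts` layout where the third field is the filesystem type and the fourth is the comma-separated option list.

```python
def freezer_is_mounted(mounts_text: str) -> bool:
    """Return True if any cgroup (v1) mount lists the freezer controller.

    Hypothetical helper. Expects text in /proc/mounts format:
    device, mount point, fs type, options, dump, pass.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "cgroup":
            continue  # not a cgroup v1 mount
        # For cgroup v1 mounts the controller names appear among the options.
        if "freezer" in fields[3].split(","):
            return True
    return False
```

On a worker node this could be called with the contents of `/proc/mounts`, for example `freezer_is_mounted(open("/proc/mounts").read())`; a result of `False` would suggest the freezer controller is no longer mounted.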

On Jun 26, 2013, at 7:31 PM, Jason Ferrara <jason.ferrara@xxxxxxxxxxxxx> wrote:

I have a pool of machines running CentOS 6.4, Kernel 2.6.32-358, and HTCondor 7.9.4.

Today, in order to try to stop jobs which underestimate their memory usage from making the machines swap a lot and get slow, I enabled cgroups and set

CGROUP_MEMORY_LIMIT_POLICY = soft
RESERVED_MEMORY = 1024

The idea was to make sure there was always at least 1G of physical memory available for system and interactive processes. This worked as intended, and the thrashing problems went away, but now I'm seeing machines randomly reboot, without any error messages in the system logs.
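
As a rough illustration of the arithmetic behind RESERVED_MEMORY: the value, in megabytes, is subtracted from the detected physical memory before the remainder is offered to jobs. The function name and the clamp at zero below are assumptions made for this sketch, not HTCondor code.

```python
def advertised_memory_mb(detected_mb: int, reserved_mb: int = 0) -> int:
    """Memory (MB) left for jobs after holding back reserved_mb.

    Hypothetical helper; clamping at zero is an assumption of this sketch.
    """
    return max(detected_mb - reserved_mb, 0)
```

For a node with 32768 MB detected and RESERVED_MEMORY = 1024, this gives 31744 MB for jobs.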

On the one machine where I have kdump enabled, the error below was in vmcore-dmesg.txt from the crash dump.


<2>kernel BUG at kernel/cgroup_freezer.c:247!
<4>invalid opcode: 0000 [#1] SMP
<4>last sysfs file: /sys/devices/virtual/block/dm-0/uevent
<4>CPU 1
<4>Modules linked in: fuse nfsd exportfs gfs2 nfs lockd fscache auth_rpcgss nfs_acl bnx2fc fcoe libfcoe libfc scsi_transport_fc scsi_tgt dlm configfs 8021q garp stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp sg dcdbas k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif ixgbe igb dca ptp pps_core ata_generic pata_acpi pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 3618, comm: condor_procd Not tainted 2.6.32-358.11.1.el6.x86_64 #1 Dell Inc.              PowerEdge C6105       /0MVKG0
<4>RIP: 0010:[<ffffffff810ca64b>] [<ffffffff810ca64b>] update_if_frozen+0x9b/0xc0
<4>RSP: 0018:ffff880803183d98  EFLAGS: 00010097
<4>RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff8800378e3e18
<4>RDX: 0000000000000000 RSI: ffff880803183da8 RDI: ffff88055242d000
<4>RBP: ffff880803183de8 R08: ffff88080527c318 R09: 0000000000000000
<4>R10: 00000000ffffffff R11: 0000000000000246 R12: ffff88055242d000
<4>R13: ffff880803183da8 R14: 0000000000000000 R15: 0000000000000002
<4>FS:  00007f2e19ca0b40(0000) GS:ffff88002c240000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>CR2: 00007f2e1b4c7000 CR3: 0000000819747000 CR4: 00000000000007e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process condor_procd (pid: 3618, threadinfo ffff880803182000, task ffff8808042c2ae0)
<4>Stack:
<4> 00007f2e1b4c7000 ffff8808197cbb80 0000000000000000 ffff8800378e3e18
<4><d> ffff88055242d000 ffff88055242d000 00000000ffffffed 0000000000000000
<4><d> ffff8808197cbb80 ffff8808197cbba4 ffff880803183e38 ffffffff810ca6fd
<4>Call Trace:
<4> [<ffffffff810ca6fd>] freezer_write+0x8d/0x1a0
<4> [<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
<4> [<ffffffff810ca670>] ? freezer_write+0x0/0x1a0
<4> [<ffffffff810c59df>] cgroup_file_write+0x16f/0x320
<4> [<ffffffff8114a8da>] ? do_mmap_pgoff+0x33a/0x380
<4> [<ffffffff811810d8>] vfs_write+0xb8/0x1a0
<4> [<ffffffff811819d1>] sys_write+0x51/0x90
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>Code: 1f 45 85 f6 75 44 4c 89 ee 4c 89 e7 e8 af 9f ff ff 48 83 c4 28 5b 41 5c 41 5d 41 5e 41 5f c9 c3 41 83 ff 01 74 12 41 39 de 74 db <0f> 0b 0f 1f 00 eb fb 66 0f 1f 44 00 00 41 39 de 75 c9 48 8b 45
<1>RIP  [<ffffffff810ca64b>] update_if_frozen+0x9b/0xc0
<4> RSP <ffff880803183d98>
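
For anyone comparing crash dumps across several machines, a small script along the following lines could pull the key facts out of a trace like the one above: the BUG location, the process that triggered it, and the call-trace frames. It is a sketch only; the function name and regular expressions are my own, written against the log format shown here, and frames the kernel marks with `?` (unreliable) are skipped.

```python
import re

# Patterns written against the oops format shown above (an assumption).
BUG_RE = re.compile(r"kernel BUG at (?P<file>\S+):(?P<line>\d+)!")
COMM_RE = re.compile(r"Pid: (?P<pid>\d+), comm: (?P<comm>\S+)")
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\] (?P<uncertain>\? )?(?P<func>\w+)\+0x")


def summarize_oops(dmesg_text: str) -> dict:
    """Extract BUG location, pid/comm, and reliable call-trace frames."""
    summary = {"bug": None, "pid": None, "comm": None, "frames": []}
    in_trace = False
    for line in dmesg_text.splitlines():
        bug = BUG_RE.search(line)
        if bug:
            summary["bug"] = (bug.group("file"), int(bug.group("line")))
        proc = COMM_RE.search(line)
        if proc and summary["comm"] is None:
            summary["pid"] = int(proc.group("pid"))
            summary["comm"] = proc.group("comm")
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            frame = FRAME_RE.search(line)
            if frame is None:
                in_trace = False  # first non-frame line ends the trace
            elif not frame.group("uncertain"):
                summary["frames"].append(frame.group("func"))
    return summary
```

Run against the dump above, the summary should show `condor_procd` as the process and `freezer_write` as the first frame, which is consistent with the crash being triggered by a write to the freezer cgroup.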



Has anyone seen this before? Does anyone know of a solution? Is anyone successfully using cgroups with HTCondor under CentOS 6.4?

Thanks

- Jason

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/




--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685