
Re: [HTCondor-users] condor_starter hanging with 100% CPU



Samik:

What's happening here is that the Debian developers have chosen to turn off the kernel's memory cgroup controller by default, because they (rightly or wrongly) think it decreases memory performance when it is enabled but unused. Condor isn't handling that well, although it should handle it better.
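To confirm that this is what a given machine is hitting, one option is to read /proc/cgroups, which lists each controller along with an "enabled" flag in its fourth column. The following is a minimal sketch rather than anything HTCondor ships; the function name memory_cgroup_enabled is made up for illustration, and the optional file argument exists only so the check can be tried against a saved copy.

```shell
# Hypothetical helper: succeed if the memory cgroup controller is enabled.
# Reads /proc/cgroups by default; pass another file to check a saved copy.
memory_cgroup_enabled() {
    # Columns in /proc/cgroups: subsys_name hierarchy num_cgroups enabled
    awk '$1 == "memory" { ok = ($4 == 1) } END { exit (ok ? 0 : 1) }' \
        "${1:-/proc/cgroups}"
}
```

Running memory_cgroup_enabled with no argument returns non-zero when the memory line is missing or its last column is 0, which is the state described above.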

But you should really be using cgroups! What you want is to ensure that the kernel boots with "cgroup_enable=memory swapaccount=1" in the parameter list. For example,

linux  /boot/vmlinuz-3.16.0-4-amd64 root=UUID=YOUR_UUID_HERE ro quiet cgroup_enable=memory swapaccount=1
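After a reboot, the running kernel's parameter list can be read back from /proc/cmdline. The sketch below checks a command-line string for both parameters; cmdline_has_cgroup_params is a hypothetical name, not an existing tool, and it only tests for the exact two settings quoted above.

```shell
# Hypothetical helper: succeed only if both parameters appear as whole
# words in the kernel command line string passed as the first argument.
cmdline_has_cgroup_params() {
    case " $1 " in
        *" cgroup_enable=memory "*) ;;
        *) return 1 ;;
    esac
    case " $1 " in
        *" swapaccount=1 "*) ;;
        *) return 1 ;;
    esac
}
```

A typical call would be cmdline_has_cgroup_params "$(cat /proc/cmdline)". Padding the string with spaces keeps a value such as swapaccount=10 from being mistaken for swapaccount=1.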

I ensure the presence of these parameters by editing /etc/default/grub to include

GRUB_CMDLINE_LINUX_DEFAULT="quiet cgroup_enable=memory swapaccount=1"

I'd be curious if there's a way to do this by creating a file under /etc/grub.d. This would allow the Condor developers to package a file that automates the addition of this line more easily. If you happen to use Puppet, this will edit /etc/default/grub in the right way:

 file_line { 'enable-cgroups-memory-controller':
  ensure => present,
  path   => '/etc/default/grub',
  line   => 'GRUB_CMDLINE_LINUX_DEFAULT="quiet cgroup_enable=memory swapaccount=1"',
  match  => '^GRUB_CMDLINE_LINUX_DEFAULT=',
 }

 exec { 'update-grub':
  command     => '/usr/sbin/update-grub',
  user        => root,
  refreshonly => true,
  subscribe   => File_line['enable-cgroups-memory-controller'],
  logoutput   => 'on_failure',
 }
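
For anyone not using Puppet, roughly the same edit can be sketched in plain shell: replace an existing GRUB_CMDLINE_LINUX_DEFAULT line if there is one, otherwise append it. The function name set_grub_cmdline_default is hypothetical, and like the Puppet resource above it overwrites whatever options the line previously held, so treat it as a starting point and not a drop-in tool.

```shell
# Hypothetical helper: set GRUB_CMDLINE_LINUX_DEFAULT in the given file,
# replacing an existing assignment or appending one if none is present.
set_grub_cmdline_default() {
    grub_file=$1
    grub_line='GRUB_CMDLINE_LINUX_DEFAULT="quiet cgroup_enable=memory swapaccount=1"'
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=' "$grub_file"; then
        grub_scratch=$(mktemp) || return 1
        # Rewrite through a scratch file, then copy back so the original
        # file keeps its ownership and permissions.
        sed "s|^GRUB_CMDLINE_LINUX_DEFAULT=.*|$grub_line|" "$grub_file" > "$grub_scratch" \
            && cat "$grub_scratch" > "$grub_file"
        grub_status=$?
        rm -f "$grub_scratch"
        return "$grub_status"
    else
        printf '%s\n' "$grub_line" >> "$grub_file"
    fi
}
```

It would be run as root against /etc/default/grub, followed by /usr/sbin/update-grub and a reboot so the new parameters take effect.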

You'll also have to create the htcondor cgroup. Section 3.14.12 of the HTCondor manual covers this (and vaguely mentions the problem you're encountering), although it's not the method I use myself.

--
Tom Downes
Senior Scientist and Data Center Manager
Center for Gravitation, Cosmology and Astrophysics
University of Wisconsin-Milwaukee
414.229.2678

On Tue, Apr 11, 2017 at 12:32 AM, Samik Raychaudhuri <samikr@xxxxxxxxx> wrote:
Hi Greg,

Yes - that fixes the starter problem. Before setting it to blank, the value was:

root@condor-01:/etc/condor# condor_config_val BASE_CGROUP
htcondor

I tried reading up a bit and followed a few trails for this variable, but things weren't too clear. Where did that previous value come from during installation?

Regards,
-Samik

On 11-Apr-17 1:55 AM, Greg Thain wrote:
On 04/10/2017 01:43 PM, Samik Raychaudhuri wrote:
Hello,

I am trying to run a single-node v8.6.1 of HTCondor on a Debian VM. The condor system seems to start up, and condor_status does show the node as available. However, once I submit a job, condor_starter goes to 100% CPU and stays there until I manually shut down the condor system. Upon further investigation, I noticed that the job is actually getting done, and the output is written in the temporary folder (see logs below), but for some reason condor_starter is not getting notification of this, or is busy with something else.

Here are some relevant logs - please let me know if someone wants to see something else. Where should I start checking?

If you set

BASE_CGROUP =

(that is, blank), does this fix the starter problem?

-greg


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

