Re: [HTCondor-users] User process tree w/PPID=1 : not valid (runaway/breakaway), confirmed?
- Date: Wed, 16 Mar 2016 09:01:46 +0000
- From: Iain Bradford Steers <iain.steers@xxxxxxx>
- Subject: Re: [HTCondor-users] User process tree w/PPID=1 : not valid (runaway/breakaway), confirmed?
Hi,
What does condor_who return when run on the worker node?
Also is this htcondor cluster sandboxed by pid_namespaces and/or cgroups?
You can tell from the worker-node by:
~]# condor_config_val USE_PID_NAMESPACES
and
~]# condor_config_val BASE_CGROUP
Cheers, Iain
________________________________________
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Winnie Lacesso [Winnie.Lacesso@xxxxxxxxxxxxx]
Sent: 16 March 2016 09:44
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] User process tree w/PPID=1 : not valid (runaway/breakaway), confirmed?
Good morning!
I'm extremely new to htcondor, having managed pbs/torque/maui CREAM-CEs
for years. On one WN converted to htcondor is a suspiciously (to me) high
load & in looking at top & tracing PIDs back, 3 process trees owned by a
pool account with PPID=1 show up. They're all using (trying to) 100% of a
CPU, thus interfering with legit jobs assigned by condor (or however it's
phrased) to that WN.
UID PID PPID C STIME TTY TIME CMD
cms457 2009833 1 0 Feb26 ? 00:00:05 ./combine -H ProfileLikelihood -t 10 -M HybridNew -m 650 -s 16 -d cut_based_X0ToHHTo2B2L2Nu_BDT_X0_650_VS_TT_DY_ll_m650_13TeV_MJJ-95-135_MVA-0p1_All.dat.root -n CutBased_X0ToHHTo2B2L2Nu_BDT_X0_650_VS_TT_DY_MJJ-95-135_MVA-0p1_All
cms457 2009838 1 0 Feb26 ? 00:00:05 ./combine -H ProfileLikelihood -t 10 -M HybridNew -m 650 -s 16 -d cut_based_X0ToHHTo2B2L2Nu_BDT_X0_650_VS_TT_DY_ll_m650_13TeV_MJJ-95-135_MVA-0p1_All.dat.root -n CutBased_X0ToHHTo2B2L2Nu_BDT_X0_650_VS_TT_DY_MJJ-95-135_MVA-0p1_All
cms457 2009858 1 0 Feb26 ? 00:00:04 ./combine -H ProfileLikelihood -t 10 -M HybridNew -m 650 -s 16 -d cut_based_X0ToHHTo2B2L2Nu_BDT_X0_650_VS_TT_DY_ll_m650_13TeV_MJJ-95-135_MVA-0p1_All.dat.root -n CutBased_X0ToHHTo2B2L2Nu_BDT_X0_650_VS_TT_DY_MJJ-95-135_MVA-0p1_All
root@sm09> pstree -lp 2009858
combine(2009858)---combine(2009924)---combine(2009932)---combine(2009938)---combine(2009947)---combine(2009955)---combine(2668426)---combine(2729346)
root@sm09> pstree -lp 2009833
combine(2009833)---combine(2090922)---combine(2102168)
root@sm09> pstree -lp 2009838
combine(2009838)---combine(3219772)---combine(3499009)---combine(3547343)
So based on years of pb/tq/maui admin coupled with their PPID=1 (& that
they've been running since Feb26!!!), I think they're breakaway/runaway
process trees, not properly killed or exited by a previous legit htcondor
job, & so should be killed.
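For illustration, the check described above (pool-account processes whose parent is PID 1) could be sketched roughly as below. This is only a sketch: it assumes `ps -ef` style output with the same eight columns as the listing above, the function name find_reparented and the "cms" prefix are made up for the example, the command lines are shortened, and the root and child rows are invented sample data. PPID=1 on its own only marks candidates (system daemons also have PPID 1), so the result should still be cross-checked against what HTCondor reports as running on the node.

```python
def find_reparented(ps_text, user_prefix):
    """Return (user, pid, cmd) for rows with PPID == 1 owned by a matching user.

    Expects `ps -ef` style columns: UID PID PPID C STIME TTY TIME CMD.
    """
    hits = []
    for line in ps_text.splitlines():
        # Split into at most 8 fields so the command line stays in one piece.
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit() or not fields[2].isdigit():
            continue  # header row or malformed line
        user, pid, ppid, cmd = fields[0], int(fields[1]), int(fields[2]), fields[7]
        if ppid == 1 and user.startswith(user_prefix):
            hits.append((user, pid, cmd))
    return hits


# Sample input: the first two data rows follow the listing above (commands
# shortened); the root row and the child row are invented for the example.
SAMPLE_PS = """\
UID PID PPID C STIME TTY TIME CMD
cms457 2009833 1 0 Feb26 ? 00:00:05 ./combine -H ProfileLikelihood
cms457 2009838 1 0 Feb26 ? 00:00:05 ./combine -H ProfileLikelihood
root 1234 1 0 Feb01 ? 00:00:01 /usr/sbin/sshd
cms457 2090922 2009833 0 Feb27 ? 00:00:02 ./combine -H ProfileLikelihood
"""

for user, pid, cmd in find_reparented(SAMPLE_PS, "cms"):
    print(user, pid, cmd)
```

On a live node the text could come from something like subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout instead of the sample string.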
My colleague who built the htcondor system here is 99.9% sure that a pool
account process tree with PPID=1 is breakaway/runaway but not 100% sure,
so recommended asking on this list.
If anyone has been a pbs/torque/maui admin & now does htcondor admin & has
built a translation table of "this way on pbs/torque/maui = howto on
htcondor" I'd be VERY grateful for a copy!
In particular, qstat in pbs/torque world gives the PID of the "start" of
any pool account job, so pstree -lp $pid shows the whole process tree.
eg:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1984254.lcgce04.lhcbpil0 long cream_190561598 4236 1 1 -- 60:00 R 25:04 sm00
root@sm00> pstree -lp 4236
bash(4236)---1984254.lcgce04(4251)---CREAM190561598_(4256)---perl(4314)-+-perl(4316)
`-sh(4315)---DIRAC_9ofzuZ_pi(4318)---python(4320)---python(4321)---python(6301)-+-Job127964102(7411)---python2.7(7412)-+-python(7444)-+-sh(7778)---python(7787)---python(7788)---bd2kstarmumu_eo(7812)
| | `-{python}(7445)
| |-{python2.7}(7415)
| `-{python2.7}(7443)
`-{python}(6434)
My colleague says he knows of no way (yet) to get that start-of-job PID in
htcondor. Does anyone on this list know how?
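As a rough sketch of the pstree-style lookup described above: once a start-of-job PID is known by whatever means (how to obtain it in htcondor is exactly the open question here), the tree below it can be rebuilt from plain (PID, PPID) pairs. The function name descendants and the PAIRS list are made up for the example; the PIDs are taken from the pstree output for 2009833 earlier in this message.

```python
from collections import defaultdict


def descendants(pairs, root_pid):
    """Return a sorted list of every PID below root_pid.

    pairs is an iterable of (pid, ppid) tuples, e.g. parsed from `ps -ef`.
    """
    children = defaultdict(list)
    for pid, ppid in pairs:
        children[ppid].append(pid)

    found = []
    seen = {root_pid}
    stack = [root_pid]
    while stack:
        # Visit each child once, then queue it so its own children are walked.
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return sorted(found)


# (pid, ppid) pairs matching the 2009833 tree shown earlier, plus one
# unrelated PPID=1 process from the same listing.
PAIRS = [
    (2009833, 1),
    (2090922, 2009833),
    (2102168, 2090922),
    (2009838, 1),
]

print(descendants(PAIRS, 2009833))
```

Any pool-account PID that is not the start PID of a job known to the batch system, and not in the descendants of one, would then be a candidate runaway.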
Grateful for advice+pointers!
PS If my above questions are answered in some online
tutorial/documentation, a URL would be most welcome!
Winnie Lacesso / Bristol University Particle Physics Computing Systems
HH Wills Physics Laboratory, Tyndall Avenue, Bristol, BS8 1TL, UK