Hi Todd and Brian,

yes, it is probably not version related. We downgraded from 8.6.0 to 8.4.11 and the node got into the strange behaviour again.

The thing is that after some time I see multiple condor_schedd processes running, each using 4GB-5GB of memory [1]. The master spawns just one schedd after being restarted [2], and I have no idea where the other schedds are coming from. Judging from the PIDs, they are spawned pretty close to each other (unfortunately, I just restarted condor and forgot to dive into their /proc/PIDs -- see the sketches after [3] for what I plan to capture next time). Suspiciously, dmesg reports several times that the original schedd 2060869, started by the master, had run out of memory [3].

Cheers,
Thomas

[1]
    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
1950398 condor    20   0 6797m 4.3g  644 D 38.3 27.6   0:31.61 condor_schedd
1950399 condor    20   0 6797m 5.2g  624 D 38.3 33.4   0:39.68 condor_schedd
...
1951012 condor    20   0 6797m 5.2g  624 R 27.7 33.2   0:38.87 condor_schedd
1950418 condor    20   0 6797m 4.7g  576 D 25.5 30.4   0:36.00 condor_schedd

[2]
> grep "/usr/sbin/condor_schedd" /var/log/condor/MasterLog | tail -n 6
12/19/16 16:41:41 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 3157
02/01/17 16:42:44 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 3734842
02/09/17 13:29:19 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 3163
02/09/17 18:00:00 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 196569
02/10/17 15:58:47 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 1852855
02/10/17 16:30:08 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 2060869

[3]
> dmesg
...
[1948198]   497 1948198    10372       17   5       0             0 nrpe
Out of memory: Kill process 2060869 (condor_schedd) score 339 or sacrifice child
Killed process 1947478, UID 0, (condor_schedd) total-vm:6960724kB, anon-rss:5505256kB, file-rss:148kB
condor_shadow invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
condor_shadow cpuset=/ mems_allowed=0
Pid: 1953543, comm: condor_shadow Not tainted 2.6.32-642.13.1.el6.x86_64 #1
Call Trace:
 [<ffffffff81131420>] ? dump_header+0x90/0x1b0
...
[1954251]     0 1954251     3576       15   5       0             0 arc-lcmaps
Out of memory: Kill process 2060869 (condor_schedd) score 339 or sacrifice child
Killed process 1950394, UID 0, (condor_schedd) total-vm:6960724kB, anon-rss:5156116kB, file-rss:636kB
condor_q invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
condor_q cpuset=/ mems_allowed=0
Pid: 1950796, comm: condor_q Not tainted 2.6.32-642.13.1.el6.x86_64 #1
Call Trace:
 [<ffffffff81131420>] ? dump_header+0x90/0x1b0
...
Out of memory: Kill process 2060869 (condor_schedd) score 339 or sacrifice child
Killed process 1950398, UID 0, (condor_schedd) total-vm:6960724kB, anon-rss:4717704kB, file-rss:8kB
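For next time, a minimal sketch of what I intend to capture before restarting, using the PIDs from [1]: the parent PID of each schedd process should tell whether the extra schedds are children forked from the master-started one (2060869 here) or independently started daemons.

# one line per schedd with parent PID, start time and RSS; forked children
# will show the master-started schedd's PID in the PPID column
> ps -C condor_schedd -o pid,ppid,lstart,rss,args
# kernel's view of a single process's parentage, e.g. for 1950398 from [1]:
> grep PPid /proc/1950398/status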
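Looking again at [3], every "Kill process 2060869 ... or sacrifice child" line is followed by a "Killed process <some other PID> (condor_schedd)" line, i.e. the oom-killer picked the master-started schedd as victim but then sacrificed one of its children instead (1950398 from the last entry even shows up in the top output in [1]). A quick way to pull all of those pairs out of the kernel ring buffer:

# each "Kill process" line plus the "Killed process" line(s) that follow it
> dmesg | grep -A2 "Kill process 2060869"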
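That pattern would also fit Brian's fork suggestion quoted below: a fork()ed child shares the parent's pages copy-on-write, but top attributes the full RSS to each child, so several 4-5GB schedd entries with the identical 6797m VIRT could be largely the same memory counted repeatedly -- and it was condor_q that invoked the oom-killer in [3]. If the extra schedds do turn out to be query forks, one knob I would look at (just a sketch, I have not tried changing it here) is SCHEDD_QUERY_WORKERS, which caps the number of children the schedd forks to answer queries:

# current cap on forked query handlers
> condor_config_val SCHEDD_QUERY_WORKERS
# a lower value could go into a local config snippet, e.g.
# /etc/condor/config.d/99-local.conf (file name is just an example):
#   SCHEDD_QUERY_WORKERS = 2
# then push the change to the running schedd
> condor_reconfig -daemon schedd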
On 2017-02-10 21:02, Todd Tannenbaum wrote:
> Just as a data point, fwiw, I just looked at the ganglia chart for a
> fairly busy (~7000 jobs running at any moment) schedd here at
> UW-Madison which has been running v8.6.0 for three weeks. No sign of
> memory leaks or bursts.
>
> regards,
> Todd
>
> On 2/10/2017 9:13 AM, Thomas Hartmann wrote:
>> Hi Brian,
>>
>> thanks for the suggestion.
>>
>> On 2017-02-10 03:19, Brian Bockelman wrote:
>>> Is it possible the extra memory usage is coming from when the
>>> condor_schedd process forks to respond to condor_q queries? Are you
>>> seeing an abnormally large amount of queries?
>>
>> Not that I am aware of -- any queries should come only from the ARC
>> CE, and afaics both our ARC CEs have been about equally busy.
>> As a cross-check, I restarted the CE daemon, but it has had no effect
>> on the memory consumption so far and only reduced the number of
>> connections to the outside [1] compared to its sibling (which should
>> be the expected behaviour).
>> On the affected node quite(?) a number of shadows were kept open [2],
>> but that should be OK, shouldn't it?
>>
>> We have now downgraded the version to 8.4.11 and will keep an eye on
>> it over the weekend. If the behaviour gets back to normal, we can at
>> least exclude Condor.
>>
>> Cheers,
>> Thomas
>>
>> [1]
>>> grid-arcce1 > wc -l /proc/net/tcp*
>> 335 /proc/net/tcp
>> 10 /proc/net/tcp6
>> 345 total
>>
>>> grid-arcce0 > wc -l /proc/net/tcp*
>> 2733 /proc/net/tcp
>> 16 /proc/net/tcp6
>> 2749 total
>>
>> [2]
>>> lsof -i TCP | grep condor | cut -d " " -f 1 | sort | uniq -c
>> 1 condor_de
>> 1 condor_ma
>> 4 condor_sc
>>> lsof | grep condor | cut -d " " -f 1 | sort | uniq -c
>> 27 condor_de
>> 30 condor_ma
>> 19 condor_pr
>> 45 condor_sc
>> 44776 condor_sh
>> 1 scan-cond
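PS: one note on the lsof numbers in [2] of my quoted mail above: lsof prints one line per process and open file descriptor, so 44776 condor_shadow lines does not mean 44776 shadows. To check whether the shadow count itself is abnormal, a plain process count (a trivial sketch) is more telling, compared against the number of running jobs:

# number of condor_shadow processes actually alive
> pgrep -cx condor_shadow
# versus the jobs the schedd thinks are running
> condor_q -totals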