
Re: [HTCondor-users] Collector killed by OOM killer



Hi,

The same problem happened a few times this morning, but we've been able to narrow down what seems to be causing this. We have an attribute StartJobs included in STARTD_ATTRS, and our START expression contains "(StartJobs =?= True)". We also have:

STARTD.SETTABLE_ATTRS_ADMINISTRATOR = StartJobs
ENABLE_PERSISTENT_CONFIG = TRUE
PERSISTENT_CONFIG_DIR = /etc/condor/ral

so that we can use condor_config_val to change the value of StartJobs. We've found that:

* changing the value of StartJobs in the config file and running condor_reconfig for lots of worker nodes at the same time (e.g. 60) is fine

* changing the value of StartJobs using condor_config_val and running condor_reconfig for lots of worker nodes at the same time is also fine (a rough sketch of these commands is included after the log excerpt below)

* someone wrote a 'clever' script which, instead of running condor_config_val, writes the appropriate files directly into PERSISTENT_CONFIG_DIR and then runs condor_reconfig. When this is run for many worker nodes at the same time, it puts an enormous load on the collectors (high memory usage, CPU load, and I/O wait) and causes lots of communication problems, e.g.

07/04/14 13:34:45 condor_write(): Socket closed when trying to write 294 bytes to <aaa.bbb.ccc.ddd.eee:48342>, fd is 6
07/04/14 13:34:45 Buf::write(): condor_write() failed
07/04/14 13:34:45 SECMAN: Error sending response classad to <aaa.bbb.ccc.ddd.eee:48342>!
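
(For reference, the condor_config_val approach mentioned above, which doesn't cause any problems, looks roughly like this for a single worker node; the hostname is just a placeholder, and the exact form of the commands is from memory:)

condor_config_val -startd -name wn0001.example.com -set "StartJobs = False"
condor_reconfig -daemon startd -name wn0001.example.com

(condor_config_val -set goes through the startd's own persistent-config machinery, i.e. the ENABLE_PERSISTENT_CONFIG / PERSISTENT_CONFIG_DIR settings above, whereas the script wrote the files into PERSISTENT_CONFIG_DIR itself before calling condor_reconfig.)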

So it seems it was our own fault, and of course we'll stop using this script :-) I'm still curious, though, why doing this puts such a high load on the collectors...

Regards,
Andrew.

________________________________________
From: andrew.lahiff@xxxxxxxxxx [andrew.lahiff@xxxxxxxxxx]
Sent: Thursday, July 03, 2014 7:44 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Collector killed by OOM killer

Hi,

We had a problem today where the collector on one of our two central managers used so much memory that the machine started swapping and the CPU load average reached almost 20. The collector was also killed twice by the OOM killer:

Jul  3 13:43:51 condor01 kernel: condor_collecto invoked oom-killer
Jul  3 13:55:59 condor01 kernel: condor_collecto invoked oom-killer

At the same time, the other central manager had a high CPU load but never got to the point of anything being killed.

It seemed to be triggered by rebooting around 10 worker nodes. In the CollectorLog for the collector which was killed, the number of active workers suddenly increased to the maximum of 16 (there normally seem to be at most 1 or 2):

07/03/14 13:36:53 Got QUERY_STARTD_ADS
07/03/14 13:36:53 Number of Active Workers 11
07/03/14 13:36:53 (Sending 10275 ads in response to query)
07/03/14 13:36:53 (Sending 10275 ads in response to query)
07/03/14 13:36:53 Number of Active Workers 13
07/03/14 13:36:53 Got QUERY_STARTD_ADS
07/03/14 13:36:53 Number of Active Workers 12
07/03/14 13:36:53 Number of Active Workers 14
07/03/14 13:36:53 Got QUERY_STARTD_ADS
07/03/14 13:36:53 Number of Active Workers 13
07/03/14 13:36:53 (Sending 10275 ads in response to query)
07/03/14 13:36:53 (Sending 10275 ads in response to query)
07/03/14 13:36:53 Number of Active Workers 15
07/03/14 13:36:53 Got QUERY_STARTD_ADS
07/03/14 13:36:53 Number of Active Workers 14
07/03/14 13:36:53 Number of Active Workers 16
07/03/14 13:36:53 Got QUERY_STARTD_ADS
07/03/14 13:36:53 ForkWork: not forking because reached max workers 16
07/03/14 13:36:53 Number of Active Workers 16
07/03/14 13:36:53 Number of Active Workers 15
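
(In case it's relevant: the "reached max workers 16" in the log is, as far as I can tell, the collector hitting its COLLECTOR_QUERY_WORKERS limit, i.e. the maximum number of child processes it will fork to answer queries, which is set with something like:

COLLECTOR_QUERY_WORKERS = 16

I don't know whether changing that would have made any difference here, though.)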

There was then a 20-minute gap in the CollectorLog. After the collector was killed twice by the OOM killer, there were failed condor_write attempts for the worker nodes which were down:

07/03/14 13:56:35 Buf::write(): condor_write() failed
07/03/14 13:56:35 Error sending query result to client -- aborting
07/03/14 13:56:35 condor_write(): Socket closed when trying to write 4096 bytes to <aaa.bbb.ccc.ddd:48596>, fd is 7

Then it seemed that every single ClassAd was removed:

07/03/14 13:56:52 Housekeeper:  Ready to clean old ads
07/03/14 13:56:52       Cleaning StartdAds ...
07/03/14 13:56:52               **** Removing stale ad: "< slot1@xxxxxxxxxxxxxx , a.b.c.d >"
07/03/14 13:56:52               **** Removing stale ad: "< slot1@xxxxxxxxxxxxxx , a.b.c.d >"
07/03/14 13:56:52               **** Removing stale ad: "< slot1@xxxxxxxxxxxxxx , a.b.c.d >"
... (many, many similar lines)
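
(I assume the mass removal itself is just normal housekeeping: the collector had effectively been wedged for ~20 minutes, which is longer than the ad expiry time, so every startd ad looked stale. If I have the right knob, that expiry is CLASSAD_LIFETIME:

CLASSAD_LIFETIME = 900

i.e. 15 minutes by default, which would match the gap we saw in the log.)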

The negotiator had trouble contacting both collectors after this (*), and they were both blacklisted. Things eventually returned to normal.

Does anyone know what happened? We are using HTCondor 8.0.6. I can provide full log files off-list if necessary.

Many Thanks,
Andrew.

(*)
07/03/14 13:56:24 ---------- Started Negotiation Cycle ----------
07/03/14 13:56:24 Phase 1:  Obtaining ads from collector ...
07/03/14 13:56:24 Not considering preemption, therefore constraining idle machines with ifThenElse(State == "Claimed","Name State Activity StartdIpAddr AccountingGroup Owner RemoteUser Requirements SlotWeight ConcurrencyLimits","")
07/03/14 13:56:24   Getting startd private ads ...
07/03/14 13:57:24 condor_read(): timeout reading 21 bytes from collector at <a.b.c.d:9618>.
07/03/14 13:57:24 IO: Failed to read packet header
07/03/14 13:57:24 Will avoid querying collector condor01.domain <a.b.c.d:9618> for 3540s if an alternative succeeds.
07/03/14 13:58:24 condor_read(): timeout reading 21 bytes from collector at <a.b.c.d:9618>.
07/03/14 13:58:24 IO: Failed to read packet header
07/03/14 13:58:24 Will avoid querying collector condor02.domain <a.b.c.d:9618> for 3541s if an alternative succeeds.
07/03/14 13:58:24 Couldn't fetch ads: communication error
07/03/14 13:58:24 Aborting negotiation cycle
07/03/14 13:58:24 ---------- Started Negotiation Cycle ----------
07/03/14 13:58:24 Phase 1:  Obtaining ads from collector ...
07/03/14 13:58:24 Not considering preemption, therefore constraining idle machines with ifThenElse(State == "Claimed","Name State Activity StartdIpAddr AccountingGroup Owner RemoteUser Requirements SlotWeight ConcurrencyLimits","")
07/03/14 13:58:24   Getting startd private ads ...
07/03/14 13:58:24 Collector condor01.domain blacklisted; skipping
07/03/14 13:58:24 Collector condor02.domain blacklisted; skipping
07/03/14 13:58:24 Couldn't fetch ads: communication error
07/03/14 13:58:24 Aborting negotiation cycle
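
(The ~3540 second avoidance period above presumably comes from DEAD_COLLECTOR_MAX_AVOIDANCE_TIME, which I believe defaults to an hour:

DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 3600

though I'm not certain that's the right setting.)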

