Hi Steffen,

have a look at condor_advertise. The docs mention precisely your desired functionality:
http://research.cs.wisc.edu/htcondor/manual/current/condor_advertise.html

Cheers,
Max

> On 11.07.2017 at 09:35, Steffen Grunewald <Steffen.Grunewald@xxxxxxxxxx> wrote:
>
> Good morning,
>
> our 400+ node HTCondor pool currently sees a lot of OOM conditions.
> Apparently, the memory in use as detected by the starter is way below the
> actual memory consumption by the jobs - I'm constantly running out
> of swap, and in a number of cases cannot connect to the nodes any longer.
> At some point, the jobs will fail on their own, and enter Hold state
> (because there's no node matching the last memory footprint) - and the
> node will be freed up for yet another greedy job.
>
> I have no means to set START=False in between, thus I cannot guarantee
> the node didn't suffer from damage to the OS itself. (Setting START
> would require remote access to run condor_reconfig, which fails.)
> Is there a way to remove a node from the pool from the side of the
> master node? Most HPC schedulers have it, but for HTCondor I cannot
> find such a feature - condor_drain is close but still wants to talk
> to the node (and apparently isn't graceful enough).
>
> There must be a way to exclude rogue nodes from a pool. Any suggestions?
>
>
> Thanks,
> Steffen
>
>
> --
> Steffen Grunewald, Cluster Administrator
> Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
> Am Mühlenberg 1
> D-14476 Potsdam-Golm
> Germany
> ~~~
> Fon: +49-331-567 7274
> Fax: +49-331-567 7298
> Mail: steffen.grunewald(at)aei.mpg.de
> ~~~