Ah, apologies, I should have been more specific.

Can you set a new config value in your condor config and then issue condor_reconfig?

SCHEDD_DEBUG = D_FULLDEBUG, D_SECURITY

Also, what's the output of:

]$ condor_config_val -v SEC_DEFAULT_AUTHENTICATION_METHODS

and

]$ condor_config_val -v SEC_WRITE_AUTHENTICATION_METHODS

Thanks,

Iain

> On Feb 2, 2016, at 21:44, Suchindra Sandhu <suchindra@xxxxxxxxx> wrote:
>
> Thanks! Is D_FULLDEBUG a config variable?
>
> I am using the default auth mechanism. TRUST_UID_DOMAIN is true.
>
>
>> On Feb 2, 2016, at 3:32 PM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
>>
>> Hi,
>>
>> Interesting, I've done large bulk submission from python bindings and not had it crash, although not on the scale of ten thousand jobs
>>
>> Did you increase the debug level of the SchedD as well, that would provide another view of the crash.
>>
>> Perhaps start with D_FULLDEBUG, D_SECURITY and go from there?
>>
>> What auth mechanism are you using? GSI or something else?
>>
>> Thanks,
>>
>> Iain
>>
>>> On Feb 2, 2016, at 21:25, Suchindra Sandhu <suchindra@xxxxxxxxx> wrote:
>>>
>>> Hi All,
>>>
>>> I am running into issues when submitting lots of jobs (tens of
>>> thousands) from the python bindings.
>>>
>>> The submit code looks like
>>>
>>> schedd = htcondor.Schedd()
>>> for i in some_list:
>>>     j = build_job_dict(i)
>>>     schedd.submit(j)
>>>
>>>
>>> Here is the output with debugging turned on. Lines starting with
>>> "Processing .." are output from my code.
>>>
>>>
>>> Tue Feb 2 16:13:58 2016 Processing A
>>> 02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from
>>> <10.x.xxx.xxx:12731>.
>>> 02/02/16 16:15:18 IO: Failed to read packet header
>>> 02/02/16 16:15:18 SECMAN: no classad from server, failing
>>> 02/02/16 16:15:18 ERROR: SECMAN:2004:Failed to create security session
>>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
>>> message.
>>> Can't send RESCHEDULE command to schedd.
>>> Tue Feb 2 16:16:46 2016 Processing B
>>> 02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from
>>> <10.x.xxx.xxx:12731>.
>>> 02/02/16 16:18:43 IO: Failed to read packet header
>>> 02/02/16 16:18:43 SECMAN: no classad from server, failing
>>> 02/02/16 16:18:43 ERROR: SECMAN:2004:Failed to create security session
>>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
>>> message.
>>> Can't send RESCHEDULE command to schedd.
>>> Tue Feb 2 16:20:13 2016 Processing C
>>> 02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from
>>> <10.x.xxx.xxx:12731>.
>>> 02/02/16 16:22:10 IO: Failed to read packet header
>>> 02/02/16 16:22:10 SECMAN: no classad from server, failing
>>> 02/02/16 16:22:10 ERROR: SECMAN:2004:Failed to create security session
>>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
>>> message.
>>> Can't send RESCHEDULE command to schedd.
>>> 02/02/16 16:22:10 condor_write() failed: send() 13 bytes to schedd at
>>> <10.x.xxx.xxx:12731> returned -1, timeout=0, errno=32 Broken pipe.
>>> 02/02/16 16:22:10 Buf::write(): condor_write() failed
>>> terminate called after throwing an instance of
>>> 'boost::python::error_already_set'
>>> Aborted
>>>
>>>
>>> My initial suspicion was that I was running a lot of jobs which finished
>>> very fast and thrashed the schedd process. But then I killed all my
>>> workers and simply tried to queue jobs and got the same error. This is
>>> not a one off occurrence and happens pretty deterministically.
>>>
>>> Any idea what is going on?
>>>
>>>
>>> Both htcondor and python bindings are for 8.4.3
>>>
>>> Installed Packages
>>> Name : condor-python
>>> Arch : x86_64
>>> Version : 8.4.3
>>> Release : 1.el7
>>> Size : 4.8 M
>>> Repo : installed
>>> From repo : htcondor-stable
>>> Summary : Python bindings for HTCondor.
>>> URL : http://www.cs.wisc.edu/condor/
>>> License : ASL 2.0
>>> Description : The python bindings allow one to directly invoke the C++
>>> implementations of
>>> : the ClassAd library and HTCondor from python
>>>
>>>
>>> Thanks,
>>> S
>>> _______________________________________________
>>> HTCondor-users mailing list
>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/htcondor-users/