Re: [HTCondor-users] Python binding crashes
- Date: Tue, 02 Feb 2016 17:21:56 -0500
- From: Suchindra Sandhu <suchindra@xxxxxxxxx>
- Subject: Re: [HTCondor-users] Python binding crashes
Will run with the debug flags and see. Meanwhile I don't have any auth
mechanisms defined.
$ condor_config_val -dump | grep SEC_
SEC_CLAIMTOBE_INCLUDE_DOMAIN = false
SEC_CLAIMTOBE_USER =
SEC_DEBUG_PRINT_KEYS = false
SEC_DEFAULT_AUTHENTICATION_TIMEOUT = 20
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = true
SEC_INVALIDATE_SESSIONS_VIA_TCP = true
SEC_PASSWORD_DOMAIN =
SEC_PASSWORD_FILE =
SEC_SESSION_DURATION_SLOP = 20
SEC_TCP_SESSION_TIMEOUT = 20
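For anyone who wants to run the same check from a script, here is a minimal sketch that parses `NAME = value` lines like the ones above and lists any `*_AUTHENTICATION_METHODS` settings with a non-empty value. The helper names (`parse_config_dump`, `authentication_methods`) are made up for illustration and are not part of HTCondor; the sketch also assumes every setting fits on one line.

```python
def parse_config_dump(text):
    """Parse 'NAME = value' lines (as printed by `condor_config_val -dump`) into a dict."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment line, skip it
        settings[name.strip()] = value.strip()
    return settings


def authentication_methods(settings):
    """Keep only the *_AUTHENTICATION_METHODS settings that have a non-empty value."""
    return {
        name: value
        for name, value in settings.items()
        if name.endswith("_AUTHENTICATION_METHODS") and value
    }


# The SEC_ lines from the dump above, pasted in as sample input.
DUMP = """\
SEC_CLAIMTOBE_INCLUDE_DOMAIN = false
SEC_CLAIMTOBE_USER =
SEC_DEBUG_PRINT_KEYS = false
SEC_DEFAULT_AUTHENTICATION_TIMEOUT = 20
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = true
SEC_INVALIDATE_SESSIONS_VIA_TCP = true
SEC_PASSWORD_DOMAIN =
SEC_PASSWORD_FILE =
SEC_SESSION_DURATION_SLOP = 20
SEC_TCP_SESSION_TIMEOUT = 20
"""

settings = parse_config_dump(DUMP)
methods = authentication_methods(settings)
print(methods or "no *_AUTHENTICATION_METHODS set explicitly")
```

An empty result only says that nothing is set explicitly in this dump; it does not show which methods the daemons fall back to.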
On Tue, Feb 2, 2016, at 03:51 PM, Iain Bradford Steers wrote:
> Ah, apologies, I should have been more specific.
>
> Can you set a new config value in your condor config and then issue
> condor_reconfig?
>
> SCHEDD_DEBUG = D_FULLDEBUG, D_SECURITY
>
> Also what's the output of:
>
> ]$ condor_config_val -v SEC_DEFAULT_AUTHENTICATION_METHODS
>
> and
>
> ]$ condor_config_val -v SEC_WRITE_AUTHENTICATION_METHODS
>
> Thanks,
>
> Iain
>
>
> > On Feb 2, 2016, at 21:44, Suchindra Sandhu <suchindra@xxxxxxxxx> wrote:
> >
> > Thanks! Is D_FULLDEBUG a config variable?
> >
> > I am using the default auth mechanism. TRUST_UID_DOMAIN is true.
> >
> >
> >> On Feb 2, 2016, at 3:32 PM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
> >>
> >> Hi,
> >>
> >> Interesting, I've done large bulk submissions from the python bindings and not had it crash, although not on the scale of ten thousand jobs.
> >>
> >> Did you increase the debug level of the SchedD as well? That would provide another view of the crash.
> >>
> >> Perhaps start with D_FULLDEBUG, D_SECURITY and go from there?
> >>
> >> What auth mechanism are you using? GSI or something else?
> >>
> >> Thanks,
> >>
> >> Iain
> >>
> >>> On Feb 2, 2016, at 21:25, Suchindra Sandhu <suchindra@xxxxxxxxx> wrote:
> >>>
> >>> Hi All,
> >>>
> >>> I am running into issues when submitting lots of jobs (tens of
> >>> thousands) from the python bindings.
> >>>
> >>> The submit code looks like
> >>>
> >>> schedd = htcondor.Schedd()
> >>> for i in some_list:
> >>>     j = build_job_dict(i)
> >>>     schedd.submit(j)
> >>>
> >>>
> >>> Here is the output with debugging turned on. Lines starting with
> >>> "Processing .." are output from my code.
> >>>
> >>>
> >>> Tue Feb 2 16:13:58 2016 Processing A
> >>> 02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from
> >>> <10.x.xxx.xxx:12731>.
> >>> 02/02/16 16:15:18 IO: Failed to read packet header
> >>> 02/02/16 16:15:18 SECMAN: no classad from server, failing
> >>> 02/02/16 16:15:18 ERROR: SECMAN:2004:Failed to create security session
> >>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
> >>> message.
> >>> Can't send RESCHEDULE command to schedd.
> >>> Tue Feb 2 16:16:46 2016 Processing B
> >>> 02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from
> >>> <10.x.xxx.xxx:12731>.
> >>> 02/02/16 16:18:43 IO: Failed to read packet header
> >>> 02/02/16 16:18:43 SECMAN: no classad from server, failing
> >>> 02/02/16 16:18:43 ERROR: SECMAN:2004:Failed to create security session
> >>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
> >>> message.
> >>> Can't send RESCHEDULE command to schedd.
> >>> Tue Feb 2 16:20:13 2016 Processing C
> >>> 02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from
> >>> <10.x.xxx.xxx:12731>.
> >>> 02/02/16 16:22:10 IO: Failed to read packet header
> >>> 02/02/16 16:22:10 SECMAN: no classad from server, failing
> >>> 02/02/16 16:22:10 ERROR: SECMAN:2004:Failed to create security session
> >>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
> >>> message.
> >>> Can't send RESCHEDULE command to schedd.
> >>> 02/02/16 16:22:10 condor_write() failed: send() 13 bytes to schedd at
> >>> <10.x.xxx.xxx:12731> returned -1, timeout=0, errno=32 Broken pipe.
> >>> 02/02/16 16:22:10 Buf::write(): condor_write() failed
> >>> terminate called after throwing an instance of
> >>> 'boost::python::error_already_set'
> >>> Aborted
> >>>
> >>>
> >>> My initial suspicion was that I was running a lot of jobs which finished
> >>> very fast and thrashed the schedd process. But then I killed all my
> >>> workers and simply tried to queue jobs and got the same error. This is
> >>> not a one-off occurrence and happens pretty deterministically.
> >>>
> >>> Any idea what is going on?
> >>>
> >>>
> >>> Both HTCondor and the python bindings are version 8.4.3.
> >>>
> >>> Installed Packages
> >>> Name : condor-python
> >>> Arch : x86_64
> >>> Version : 8.4.3
> >>> Release : 1.el7
> >>> Size : 4.8 M
> >>> Repo : installed
> >>> From repo : htcondor-stable
> >>> Summary : Python bindings for HTCondor.
> >>> URL : http://www.cs.wisc.edu/condor/
> >>> License : ASL 2.0
> >>> Description : The python bindings allow one to directly invoke the C++
> >>> implementations of
> >>> : the ClassAd library and HTCondor from python
> >>>
> >>>
> >>> Thanks,
> >>> S
> >>> _______________________________________________
> >>> HTCondor-users mailing list
> >>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> >>> subject: Unsubscribe
> >>> You can also unsubscribe by visiting
> >>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> >>>
> >>> The archives can be found at:
> >>> https://lists.cs.wisc.edu/archive/htcondor-users/