Re: [HTCondor-users] Python binding crashes
- Date: Tue, 02 Feb 2016 15:44:28 -0500
- From: Suchindra Sandhu <suchindra@xxxxxxxxx>
- Subject: Re: [HTCondor-users] Python binding crashes
Thanks! Is D_FULLDEBUG a config variable?
I am using the default auth mechanism. TRUST_UID_DOMAIN is true.
> On Feb 2, 2016, at 3:32 PM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
>
> Hi,
>
> Interesting, I've done large bulk submission from python bindings and not had it crash, although not on the scale of ten thousand jobs.
>
> Did you increase the debug level of the SchedD as well? That would provide another view of the crash.
>
> Perhaps start with D_FULLDEBUG, D_SECURITY and go from there?
>
> What auth mechanism are you using? GSI or something else?
>
> Thanks,
>
> Iain
>
>> On Feb 2, 2016, at 21:25, Suchindra Sandhu <suchindra@xxxxxxxxx> wrote:
>>
>> Hi All,
>>
>> I am running into issues when submitting lots of jobs (tens of
>> thousands) from the python bindings.
>>
>> The submit code looks like
>>
>> import htcondor
>>
>> schedd = htcondor.Schedd()
>> for i in some_list:
>>     j = build_job_dict(i)  # build the job description for this item
>>     schedd.submit(j)       # one submit call per job
>>
>>
>> Here is the output with debugging turned on. Lines starting with
>> "Processing .." are output from my code.
>>
>>
>> Tue Feb 2 16:13:58 2016 Processing A
>> 02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from
>> <10.x.xxx.xxx:12731>.
>> 02/02/16 16:15:18 IO: Failed to read packet header
>> 02/02/16 16:15:18 SECMAN: no classad from server, failing
>> 02/02/16 16:15:18 ERROR: SECMAN:2004:Failed to create security session
>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
>> message.
>> Can't send RESCHEDULE command to schedd.
>> Tue Feb 2 16:16:46 2016 Processing B
>> 02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from
>> <10.x.xxx.xxx:12731>.
>> 02/02/16 16:18:43 IO: Failed to read packet header
>> 02/02/16 16:18:43 SECMAN: no classad from server, failing
>> 02/02/16 16:18:43 ERROR: SECMAN:2004:Failed to create security session
>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
>> message.
>> Can't send RESCHEDULE command to schedd.
>> Tue Feb 2 16:20:13 2016 Processing C
>> 02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from
>> <10.x.xxx.xxx:12731>.
>> 02/02/16 16:22:10 IO: Failed to read packet header
>> 02/02/16 16:22:10 SECMAN: no classad from server, failing
>> 02/02/16 16:22:10 ERROR: SECMAN:2004:Failed to create security session
>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
>> message.
>> Can't send RESCHEDULE command to schedd.
>> 02/02/16 16:22:10 condor_write() failed: send() 13 bytes to schedd at
>> <10.x.xxx.xxx:12731> returned -1, timeout=0, errno=32 Broken pipe.
>> 02/02/16 16:22:10 Buf::write(): condor_write() failed
>> terminate called after throwing an instance of
>> 'boost::python::error_already_set'
>> Aborted
>>
>>
>> My initial suspicion was that I was running a lot of jobs which finished
>> very fast and thrashed the schedd process. But then I killed all my
>> workers and simply tried to queue jobs and got the same error. This is
>> not a one-off occurrence and happens pretty deterministically.
>>
>> Any idea what is going on?
>>
>>
>> Both HTCondor and the Python bindings are version 8.4.3.
>>
>> Installed Packages
>> Name : condor-python
>> Arch : x86_64
>> Version : 8.4.3
>> Release : 1.el7
>> Size : 4.8 M
>> Repo : installed
>> From repo : htcondor-stable
>> Summary : Python bindings for HTCondor.
>> URL : http://www.cs.wisc.edu/condor/
>> License : ASL 2.0
>> Description : The python bindings allow one to directly invoke the C++ implementations of
>>             : the ClassAd library and HTCondor from python
>>
>>
>> Thanks,
>> S
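
One way to read the debug output quoted above is to measure how long each "Processing" step ran before the next condor_read() timeout was logged. The following is only a minimal sketch for that purpose: the function name seconds_until_timeout and the LOG excerpt layout are made up for illustration, the wrapped log lines have been joined by hand, and the 02/02/16 stamps are assumed to be month/day/year.

```python
import re
from datetime import datetime

# Excerpt of the debug output quoted above, with the wrapped lines rejoined.
LOG = """\
Tue Feb 2 16:13:58 2016 Processing A
02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from <10.x.xxx.xxx:12731>.
Tue Feb 2 16:16:46 2016 Processing B
02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from <10.x.xxx.xxx:12731>.
Tue Feb 2 16:20:13 2016 Processing C
02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from <10.x.xxx.xxx:12731>.
"""

# "Tue Feb 2 16:13:58 2016 Processing A" (written by the submitting script)
PROCESSING = re.compile(r"^(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}) Processing (.+)$")
# "02/02/16 16:15:18 condor_read(): timeout ..." (written by the debug log)
TIMEOUT = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) condor_read\(\): timeout")


def seconds_until_timeout(log_text):
    """Return [(label, seconds)] from each 'Processing' line to the next timeout line."""
    gaps = []
    pending = None  # (label, start time) of the most recent 'Processing' line
    for line in log_text.splitlines():
        m = PROCESSING.match(line)
        if m:
            start = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
            pending = (m.group(2), start)
            continue
        m = TIMEOUT.match(line)
        if m and pending is not None:
            # Assumption: the log stamp is month/day/two-digit year.
            stop = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
            gaps.append((pending[0], int((stop - pending[1]).total_seconds())))
            pending = None
    return gaps


if __name__ == "__main__":
    for label, seconds in seconds_until_timeout(LOG):
        print("%s: %d s before the read timeout" % (label, seconds))
```

For the excerpt above this gives 80 s for A and 117 s for both B and C. Comparing these figures with the SchedD log for the same timestamps may help show what the schedd was doing while the client waited.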