[HTCondor-users] Python binding crashes
- Date: Tue, 02 Feb 2016 15:25:23 -0500
- From: Suchindra Sandhu <suchindra@xxxxxxxxx>
- Subject: [HTCondor-users] Python binding crashes
Hi All,
I am running into issues when submitting lots of jobs (tens of
thousands) from the python bindings.
The submit code looks like
    schedd = htcondor.Schedd()
    for i in some_list:
        j = build_job_dict(i)
        schedd.submit(j)
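To narrow down which call stalls, the same loop can be wrapped so that each submission is timed and any Python-level exception is recorded. This is only a sketch: `submit_all` and its `submit_fn` parameter are hypothetical names, and nothing here depends on the htcondor module itself. It also cannot intercept the abort shown in the log below, since that comes from an uncaught C++ exception that terminates the process.

```python
import time

def submit_all(items, build_job, submit_fn):
    """Call submit_fn(build_job(item)) for each item.

    Returns a list of (item, seconds_taken, error) tuples, where error
    is None on success. Only Python-level exceptions are caught; an
    abort inside a C++ extension still kills the process.
    """
    results = []
    for item in items:
        start = time.time()
        error = None
        try:
            submit_fn(build_job(item))
        except Exception as exc:
            error = exc
        results.append((item, time.time() - start, error))
    return results
```

With the names from the loop above, this would be called as `submit_all(some_list, build_job_dict, schedd.submit)`.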
Here is the output with debugging turned on. Lines starting with
"Processing .." are output from my code.
Tue Feb 2 16:13:58 2016 Processing A
02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from
<10.x.xxx.xxx:12731>.
02/02/16 16:15:18 IO: Failed to read packet header
02/02/16 16:15:18 SECMAN: no classad from server, failing
02/02/16 16:15:18 ERROR: SECMAN:2004:Failed to create security session
to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
message.
Can't send RESCHEDULE command to schedd.
Tue Feb 2 16:16:46 2016 Processing B
02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from
<10.x.xxx.xxx:12731>.
02/02/16 16:18:43 IO: Failed to read packet header
02/02/16 16:18:43 SECMAN: no classad from server, failing
02/02/16 16:18:43 ERROR: SECMAN:2004:Failed to create security session
to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
message.
Can't send RESCHEDULE command to schedd.
Tue Feb 2 16:20:13 2016 Processing C
02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from
<10.x.xxx.xxx:12731>.
02/02/16 16:22:10 IO: Failed to read packet header
02/02/16 16:22:10 SECMAN: no classad from server, failing
02/02/16 16:22:10 ERROR: SECMAN:2004:Failed to create security session
to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
message.
Can't send RESCHEDULE command to schedd.
02/02/16 16:22:10 condor_write() failed: send() 13 bytes to schedd at
<10.x.xxx.xxx:12731> returned -1, timeout=0, errno=32 Broken pipe.
02/02/16 16:22:10 Buf::write(): condor_write() failed
terminate called after throwing an instance of
'boost::python::error_already_set'
Aborted
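For reference, the gap between each "Processing" line and the condor_read() timeout that follows it can be worked out from the timestamps in the log. The sketch below copies the three timestamp pairs by hand; the helper name `delay_seconds` is illustrative. The gaps come to about 80, 117 and 117 seconds. Note that each gap also includes whatever my code does between printing "Processing" and calling submit, so it is an upper bound on the time spent waiting on the schedd.

```python
from datetime import datetime

# (label, time "Processing" was printed, time of the next condor_read timeout),
# copied from the log; all entries are from 2 Feb 2016.
pairs = [
    ("A", "16:13:58", "16:15:18"),
    ("B", "16:16:46", "16:18:43"),
    ("C", "16:20:13", "16:22:10"),
]

def delay_seconds(start, end):
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

delays = {label: delay_seconds(start, end) for label, start, end in pairs}
print(delays)
```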
My initial suspicion was that I was running a lot of jobs which finished
very fast and thrashed the schedd process. But then I killed all my
workers and simply tried to queue jobs, and got the same error. This is
not a one-off occurrence; it happens pretty deterministically.
Any idea what is going on?
Both HTCondor and the Python bindings are version 8.4.3.
Installed Packages
Name : condor-python
Arch : x86_64
Version : 8.4.3
Release : 1.el7
Size : 4.8 M
Repo : installed
From repo : htcondor-stable
Summary : Python bindings for HTCondor.
URL : http://www.cs.wisc.edu/condor/
License : ASL 2.0
Description : The python bindings allow one to directly invoke the C++
            : implementations of the ClassAd library and HTCondor from python
Thanks,
S