(gdb) backtrace
#0  0x0000000000000000 in ?? ()
#1  0x00007fffe82187ba in Condor_Auth_Kerberos::init_user (this=this@entry=0x858900) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/condor_auth_kerberos.cpp:725
#2  0x00007fffe8219fbf in Condor_Auth_Kerberos::authenticate (this=0x858900) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/condor_auth_kerberos.cpp:286
#3  0x00007fffe8212985 in Authentication::authenticate_continue (this=this@entry=0x1df8190, errstack=errstack@entry=0x1c595b8, non_blocking=<optimized out>) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/authentication.cpp:331
#4  0x00007fffe8212f13 in Authentication::authenticate_inner (this=this@entry=0x1df8190, hostAddr=hostAddr@entry=0xe010d0 "<206.12.154.223:9618?addrs=206.12.154.223-9618+[2607-f8f0-c10-70f3-2--223]-9618&noUDP&sock=2297468_8802_4>", auth_methods=auth_methods@entry=0x1e43860 "FS,KERBEROS,GSI", errstack=errstack@entry=0x1c595b8, timeout=timeout@entry=20, non_blocking=non_blocking@entry=false) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/authentication.cpp:163
#5  0x00007fffe8212fba in Authentication::authenticate (this=this@entry=0x1df8190, hostAddr=0xe010d0 "<206.12.154.223:9618?addrs=206.12.154.223-9618+[2607-f8f0-c10-70f3-2--223]-9618&noUDP&sock=2297468_8802_4>", auth_methods=auth_methods@entry=0x1e43860 "FS,KERBEROS,GSI", errstack=errstack@entry=0x1c595b8, timeout=timeout@entry=20, non_blocking=non_blocking@entry=false) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/authentication.cpp:117
#6  0x00007fffe821300b in Authentication::authenticate (this=this@entry=0x1df8190, hostAddr=<optimized out>, key=@0x1c597d8: 0x0, auth_methods=auth_methods@entry=0x1e43860 "FS,KERBEROS,GSI", errstack=errstack@entry=0x1c595b8, timeout=timeout@entry=20, non_blocking=non_blocking@entry=false) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/authentication.cpp:105
#7  0x00007fffe8238337 in ReliSock::perform_authenticate (this=0x7fffffffd110, with_key=with_key@entry=true, key=@0x1c597d8: 0x0, methods=0x1e43860 "FS,KERBEROS,GSI", errstack=0x1c595b8, auth_timeout=20, non_blocking=false, method_used=method_used@entry=0x0) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/reli_sock.cpp:1181
#8  0x00007fffe82383cc in ReliSock::authenticate (this=<optimized out>, key=<optimized out>, methods=<optimized out>, errstack=<optimized out>, auth_timeout=<optimized out>, non_blocking=<optimized out>, method_used=0x0) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/reli_sock.cpp:1238
#9  0x00007fffe822e107 in SecManStartCommand::authenticate_inner (this=0x1c59570) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/condor_secman.cpp:1920
#10 0x00007fffe8233b25 in SecManStartCommand::startCommand_inner (this=this@entry=0x1c59570) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/condor_secman.cpp:1295
#11 0x00007fffe8233cda in SecManStartCommand::startCommand (this=this@entry=0x1c59570) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/condor_secman.cpp:1227
#12 0x00007fffe8233f71 in SecMan::startCommand (this=0x7fffffffd788, cmd=cmd@entry=478, sock=sock@entry=0x7fffffffd110, raw_protocol=<optimized out>, errstack=errstack@entry=0x0, subcmd=<optimized out>, callback_fn=0x0, misc_data=0x0, nonblocking=false, cmd_description=0x0, sec_session_id_hint=0x0) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/condor_secman.cpp:1119
#13 0x00007fffe824df24 in Daemon::startCommand (cmd=cmd@entry=478, sock=sock@entry=0x7fffffffd110, timeout=timeout@entry=0, errstack=errstack@entry=0x0, subcmd=subcmd@entry=0, callback_fn=callback_fn@entry=0x0, misc_data=misc_data@entry=0x0, nonblocking=nonblocking@entry=false, cmd_description=cmd_description@entry=0x0, sec_man=0x0, sec_man@entry=0x7fffffffd788, raw_protocol=raw_protocol@entry=false, sec_session_id=sec_session_id@entry=0x0) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_daemon_client/daemon.cpp:567
#14 0x00007fffe824e1eb in Daemon::startCommand (this=this@entry=0x7fffffffd700, cmd=cmd@entry=478, sock=sock@entry=0x7fffffffd110, timeout=timeout@entry=0, errstack=errstack@entry=0x0, cmd_description=cmd_description@entry=0x0, raw_protocol=raw_protocol@entry=false, sec_session_id=sec_session_id@entry=0x0) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_daemon_client/daemon.cpp:728
#15 0x00007fffe825fd78 in DCSchedd::actOnJobs (this=this@entry=0x7fffffffd700, action=action@entry=JA_HOLD_JOBS, constraint=constraint@entry=0x0, ids=ids@entry=0x7fffffffd670, reason=reason@entry=0x1b54498 "Python-initiated action.", reason_attr=reason_attr@entry=0x7fffe82e8a1b "HoldReason", reason_code=reason_code@entry=0x0, reason_code_attr=reason_code_attr@entry=0x7fffe82c0326 "HoldReasonSubCode", result_type=result_type@entry=AR_TOTALS, errstack=errstack@entry=0x0) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_daemon_client/dc_schedd.cpp:1409
#16 0x00007fffe82603ec in DCSchedd::holdJobs (this=this@entry=0x7fffffffd700, ids=ids@entry=0x7fffffffd670, reason=reason@entry=0x1b54498 "Python-initiated action.", reason_code=reason_code@entry=0x0, errstack=errstack@entry=0x0, result_type=result_type@entry=AR_TOTALS) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_daemon_client/dc_schedd.cpp:134
#17 0x00007fffe92a74b1 in Schedd::actOnJobs (this=this@entry=0x7fffe1aa3040, action=action@entry=JA_HOLD_JOBS, job_spec=..., reason=...) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/python-bindings/schedd.cpp:1399
#18 0x00007fffe92a85a0 in Schedd::actOnJobs2 (this=0x7fffe1aa3040, action=JA_HOLD_JOBS, job_spec=...) at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/python-bindings/schedd.cpp:1466
#19 0x00007fffe928a0ef in invoke<boost::python::to_python_value<boost::python::api::object const&>, boost::python::api::object (Schedd::*)(JobAction, boost::python::api::object), boost::python::arg_from_python<Schedd&>, boost::python::arg_from_python<JobAction>, boost::python::arg_from_python<boost::python::api::object> > (ac1=<synthetic pointer>, ac0=..., tc=<synthetic pointer>, f=@0xade1e8: (boost::python::api::object (Schedd::*)(Schedd * const, JobAction, boost::python::api::object)) 0x7fffe92a8530 <Schedd::actOnJobs2(JobAction, boost::python::api::object)>, rc=...) at /var/lib/condor/execute/slot1/dir_5037/htcondor_pypi_build/bld_external/boost-1.66.0/install/include/boost/python/detail/invoke.hpp:86
#20 operator() (args_=<optimized out>, this=0xade1e8) at /var/lib/condor/execute/slot1/dir_5037/htcondor_pypi_build/bld_external/boost-1.66.0/install/include/boost/python/detail/caller.hpp:221
#21 boost::python::objects::caller_py_function_impl<boost::python::detail::caller<boost::python::api::object (Schedd::*)(JobAction, boost::python::api::object), boost::python::default_call_policies, boost::mpl::vector4<boost::python::api::object, Schedd&, JobAction, boost::python::api::object> > >::operator() (this=0xade1e0, args=<optimized out>, kw=<optimized out>) at /var/lib/condor/execute/slot1/dir_5037/htcondor_pypi_build/bld_external/boost-1.66.0/install/include/boost/python/object/py_function.hpp:38
#22 0x00007fffe8f6033a in boost::python::objects::function::call(_object*, _object*) const () from /usr/lib64/python3.6/site-packages/htcondor/../htcondor.libs/libpyclassad3-8d384f47.6_8_7_9_clean.so
#23 0x00007fffe8f606a8 in boost::detail::function::void_function_ref_invoker0<boost::python::objects::(anonymous namespace)::bind_return, void>::invoke(boost::detail::function::function_buffer&) () from /usr/lib64/python3.6/site-packages/htcondor/../htcondor.libs/libpyclassad3-8d384f47.6_8_7_9_clean.so
#24 0x00007fffe8f5ac63 in boost::python::handle_exception_impl(boost::function0<void>) () from /usr/lib64/python3.6/site-packages/htcondor/../htcondor.libs/libpyclassad3-8d384f47.6_8_7_9_clean.so
#25 0x00007fffe8f5efb3 in function_call () from /usr/lib64/python3.6/site-packages/htcondor/../htcondor.libs/libpyclassad3-8d384f47.6_8_7_9_clean.so
#26 0x00007ffff795e88b in _PyObject_FastCallDict () from /lib64/libpython3.6m.so.1.0
#27 0x00007ffff7a1e244 in call_function () from /lib64/libpython3.6m.so.1.0
#28 0x00007ffff7a224c4 in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0
#29 0x00007ffff7a1d5e0 in _PyFunction_FastCall () from /lib64/libpython3.6m.so.1.0
#30 0x00007ffff7a1e2f6 in call_function () from /lib64/libpython3.6m.so.1.0
#31 0x00007ffff7a224c4 in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0
#32 0x00007ffff7a1df45 in _PyEval_EvalCodeWithName () from /lib64/libpython3.6m.so.1.0
#33 0x00007ffff7a1e47d in PyEval_EvalCodeEx () from /lib64/libpython3.6m.so.1.0
#34 0x00007ffff7a1e4cb in PyEval_EvalCode () from /lib64/libpython3.6m.so.1.0
#35 0x00007ffff7a49914 in run_mod () from /lib64/libpython3.6m.so.1.0
#36 0x00007ffff7a4bf5d in PyRun_FileExFlags () from /lib64/libpython3.6m.so.1.0
#37 0x00007ffff7a4c0c7 in PyRun_SimpleFileExFlags () from /lib64/libpython3.6m.so.1.0
#38 0x00007ffff7a62733 in Py_Main () from /lib64/libpython3.6m.so.1.0
#39 0x0000000000400a3e in main ()

-Colson
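Reading the trace: frame #0 is a call through a null function pointer reached from Condor_Auth_Kerberos::init_user, so the client is dying while attempting the KERBEROS method from the advertised list "FS,KERBEROS,GSI". A possible, untested workaround sketch is to drop KERBEROS from the client-side method list before the Schedd object is created. Whether the bindings honor SEC_CLIENT_AUTHENTICATION_METHODS set through htcondor.param like this is an assumption here, and whatever methods remain still have to authenticate successfully against the remote schedd on their own:

import htcondor

# Untested sketch: keep the client out of the crashing Kerberos code path.
# This only avoids the segfault; FS/GSI must still succeed in their own right.
htcondor.param["SEC_CLIENT_AUTHENTICATION_METHODS"] = "FS,GSI"

held_job_ids = ["16335.0", "16335.1"]   # example IDs taken from the log below
schedd = htcondor.Schedd()              # or however condor_session is built in csjobs.py
schedd.act(htcondor.JobAction.Hold, held_job_ids)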
On 03/26/2019 03:20 PM, Colson Driemel wrote:

I've tried as many things as I could come up with to try to get something back from the call, but nothing seems to generate anything useful. The subprocess is set up to pipe any output back to stdout and stderr, and nothing is coming through those channels before the crash.

I tried to use Python's fault handler to generate a trace, but this is what I got:

Fatal Python error: Segmentation fault

Current thread 0x00007f7287d31740 (most recent call first):
  File "/opt/cloudscheduler/data_collectors/condor/csjobs.py", line 309 in job_poller
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 93 in run
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 258 in _bootstrap
  File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 73 in _launch
  File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 19 in __init__
  File "/usr/lib64/python3.6/multiprocessing/context.py", line 277 in _Popen
  File "/usr/lib64/python3.6/multiprocessing/context.py", line 223 in _Popen
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 105 in start
  File "/opt/cloudscheduler/data_collectors/condor/cloudscheduler/lib/ProcessMonitor.py", line 65 in start_all
  File "/opt/cloudscheduler/data_collectors/condor/csjobs.py", line 526 in <module>

Not terribly useful, unfortunately, since csjobs.py line 309 is just:

hold_result = condor_session.act(htcondor.JobAction.Hold, held_job_ids)

I don't have a lot of experience with C extensions in Python, so if anyone knows a way I can get my hands on the core dump I'd appreciate it. I tried using gdb and backtrace, but since it was only a sub-process that died I wasn't able to come up with anything.

-Colson

On 03/26/2019 02:00 PM, John M Knoeller wrote:

Is there a way to get a stack trace for the SIGSEGV?
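One way to get more out of the child when the fault is inside a C extension is to arm faulthandler inside the child process itself and point it at a real file (the pipes back to the parent may never be flushed when the process dies), and to raise the core-dump limit in the same place so gdb can be used on the resulting core. A minimal sketch, assuming the poller entry point is job_poller() as in csjobs.py; the log path is made up:

import faulthandler
import resource

def job_poller(*args, **kwargs):
    # Python-level trace of any fatal signal goes to a file that survives
    # the crash, independent of the stdout/stderr pipes.
    crash_log = open("/tmp/csjobs_faulthandler.log", "w")   # illustrative path
    faulthandler.enable(file=crash_log, all_threads=True)

    # Raise the soft core-file limit to the hard limit so the kernel can
    # write a core for this child when it segfaults.
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

    ...  # existing polling / hold logic from csjobs.py

With a core file on disk, running gdb against the python3 binary and the core file and issuing "bt" should produce the kind of native backtrace shown at the top of this thread; where the core lands depends on the kernel's core_pattern setting.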
-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Colson Driemel
Sent: Tuesday, March 26, 2019 1:13 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Python Bindings crash without exception when remotely holding jobs

The local log initially said very little: the debug messages go to a configured log, and the only message is the main process noticing that the process in question has died and restarting it. I've added the exit code of the subprocess to the log, and it is returning -11, which is SIGSEGV (segmentation fault).

2019-03-26 10:52:47,968 - Job Poller - DEBUG - Adding job <removed>#15688.0#1553144582
2019-03-26 10:52:47,980 - Job Poller - DEBUG - No alias found in requirements expression
2019-03-26 10:52:47,981 - Job Poller - DEBUG - {'requirements': '(group_name is "test-dev2" && TARGET.Arch == "x86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer)', 'request_ram': 15000, 'request_disk': 94371840, 'q_date': 1553432582, 'proc_id': 0, 'job_status': 1, 'user': '<removed>', 'request_cpus': 4, 'job_priority': 10, 'entered_current_status': 1553432582, 'global_job_id': '<removed>#16432.0#1553432582', 'cluster_id': 16432, 'group_name': 'test-dev2'}
2019-03-26 10:52:47,982 - Job Poller - DEBUG - inventory_item_hash(old): None
2019-03-26 10:52:47,982 - Job Poller - DEBUG - inventory_item_hash(new): 97b4c6c61bad8e44a72dfd34cfe1d6f8,cluster_id=16432,entered_current_status=1553432582,global_job_id=<removed>#16432.0#1553432582,job_priority=10,job_status=1,proc_id=0,q_date=1553432582,request_cpus=4,request_disk=94371840,request_ram=15000,requirements=(group_name is "test-dev2" && TARGET.Arch == "x86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer),user=<removed>
2019-03-26 10:52:47,982 - Job Poller - DEBUG - Adding job csv2-dev2.heprc.uvic.ca#16432.0#1553432582
2019-03-26 10:52:47,988 - Job Poller - DEBUG - No alias found in requirements expression
2019-03-26 10:52:47,988 - Job Poller - DEBUG - testing is not a valid group for csv2-dev2.heprc.uvic.ca, ignoring foreign job.
2019-03-26 10:52:47,989 - Job Poller - INFO - 6845 jobs held or to be held due to invalid user or group specifications.
2019-03-26 10:52:47,992 - Job Poller - DEBUG - Holding: ['16335.0', '16335.1', '16335.2', '16335.3', '16335.4', <SHORTENED FOR READABILITY> '']
2019-03-26 10:52:47,993 - Job Poller - DEBUG - Executing job action hold on csv2-dev2.heprc.uvic.ca
2019-03-26 10:52:55,698 - MainProcess - ERROR - job process died, restarting...
2019-03-26 10:52:55,993 - MainProcess - DEBUG - exit code: -11
2019-03-26 10:52:57,158 - Job Poller - INFO - Retrieved inventory from the database.
2019-03-26 10:52:57,159 - Job Poller - DEBUG - Beginning poller cycle

-Colson

On 03/25/2019 02:51 PM, John M Knoeller wrote:

What does the local log file say? (I'm assuming ToolLog is where your logging.debug messages go?) Do you get a core file when the python script aborts? What I'm trying to get at is: is this a segfault, or is HTCondor aborting on purpose because of some failure? This will be easy to fix if we can figure out exactly where in the HTCondor code the segfault or abort is happening.

-tj
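Colson's reply above (exit code -11) answers this: a multiprocessing child that dies on a signal reports an exitcode of minus the signal number, so -11 is SIGSEGV rather than a deliberate abort. A small sketch of how the monitoring side can log that explicitly; ProcessMonitor's internals are not shown in the thread, so the structure here is assumed:

import signal
from multiprocessing import Process

def describe_exit(proc: Process) -> str:
    # Translate Process.exitcode into something readable for the log.
    code = proc.exitcode
    if code is None:
        return "still running"
    if code < 0:
        # Negative exitcode means killed by signal -code, e.g. -11 -> SIGSEGV.
        return "killed by signal %s" % signal.Signals(-code).name
    return "exited with code %d" % code

For the job poller child above, describe_exit() would report "killed by signal SIGSEGV".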
-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Colson Driemel
Sent: Monday, March 25, 2019 1:18 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Python Bindings crash without exception when remotely holding jobs

Hi All,

So the system I'm working on inspects job queues from various condor instances and provisions cloud resources to run the jobs. As a part of this process, jobs are held if they do not conform to certain conditions: a list of jobs is compiled and then held using:

condor_session.act(htcondor.JobAction.Hold, held_job_ids)

For a little more context:

try:
    logging.debug("Executing job action hold on %s" % condor_host)
    hold_result = condor_session.act(htcondor.JobAction.Hold, held_job_ids)
    logging.debug("Hold result: %s" % hold_result)
    condor_session.edit(held_job_ids, "HoldReason",
                        '"Invalid user or group name for htcondor host %s, held by job poller"' % condor_host)
except Exception as exc:
    logging.error("Failure holding jobs: %s" % exc)
    logging.error("Aborting cycle...")
    abort_cycle = True
    break

I am pretty sure the error has something to do with the configuration on the remote condor host, but my real issue is that it causes the python code to crash with no exception. This is a snapshot of the Schedd log from the remote condor in question:

03/22/19 10:53:33 (pid:2277705) AUTHENTICATE: handshake failed!
03/22/19 10:53:33 (pid:2277705) DC_AUTHENTICATE: authentication of <IPADDR:44307> did not result in a valid mapped user name, which is required for this command (478 ACT_ON_JOBS), so aborting.
03/22/19 10:53:33 (pid:2277705) DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXMc7VmW)

Any ideas on how I can stop this crash?

Thanks,
Colson
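Until the underlying bug is fixed, one way to stop the crash from killing the whole poller is to run the act() call in a short-lived child process of its own, so a segfault inside the bindings only takes down that child and surfaces as an ordinary exception in the caller. A rough sketch, assuming condor_host and held_job_ids as in the snippet above; the helper names and the way the remote schedd is located are illustrative, not from the thread:

import multiprocessing
import htcondor

def _do_hold(condor_host, held_job_ids):
    # Runs in a throwaway child: a native crash here cannot kill the poller.
    schedd = htcondor.Schedd()   # locate the remote schedd the same way csjobs.py builds condor_session
    schedd.act(htcondor.JobAction.Hold, held_job_ids)

def hold_jobs_isolated(condor_host, held_job_ids, timeout=60):
    proc = multiprocessing.Process(target=_do_hold, args=(condor_host, held_job_ids))
    proc.start()
    proc.join(timeout)
    if proc.exitcode != 0:
        # Covers a Python exception in the child, a native crash (-11),
        # and a hang past the timeout (exitcode is then None).
        raise RuntimeError("hold on %s failed, child exit code %s"
                           % (condor_host, proc.exitcode))

This does not fix the authentication failure the Schedd log is reporting, but it turns the silent segfault into a catchable error in the existing try/except.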