
Re: [HTCondor-users] Intermittent failure submitting via Python bindings



Is there a message in the SchedLog corresponding to that failure?
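
One way to check, as a rough sketch only: scan the log for lines that mention the cluster ID from the error (45263). The function name here is made up for illustration, and the path in the usage note is an assumption; `condor_config_val SCHEDD_LOG` reports the actual location on a given submit host.

```python
import re
from typing import Iterable, List


def find_cluster_lines(lines: Iterable[str], cluster_id: int) -> List[str]:
    """Return the log lines that mention cluster_id as a whole number."""
    # Match the ID only when it is not embedded in a longer run of digits,
    # so 45263 matches "45263.-1" and "45263.0" but not "145263" or "452630".
    pattern = re.compile(r"(?<!\d)" + str(cluster_id) + r"(?!\d)")
    return [line.rstrip("\n") for line in lines if pattern.search(line)]
```

For example, with a hypothetical log path:

```python
with open("/var/log/condor/SchedLog") as log:  # assumed path, check SCHEDD_LOG
    for line in find_cluster_lines(log, 45263):
        print(line)
```

Lines around the matching timestamps are worth reading too, since the schedd may log the underlying cause without repeating the job ID.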

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Michael Pelletier via HTCondor-users
Sent: Monday, November 9, 2020 9:20 AM
To: HTCondor-Users Mail List (htcondor-users@xxxxxxxxxxx) <htcondor-users@xxxxxxxxxxx>
Cc: Michael Pelletier <michael.v.pelletier@xxxxxxxxxxxx>
Subject: [HTCondor-users] Intermittent failure submitting via Python bindings

A user here is seeing intermittent job-submission failures through the Python bindings, which raise the following error:

=====
  File "/user/1148605/sandbox/hpc_interface/Resources/htcondor_compute_resource.py", line 125, in run
    self.cluster_ids.append(sub.queue(txn, jobs, ad_results))
  File "/scratch/ml/sandbox/1148605/mosp_htcondor/lib/python3.8/site-packages/htcondor/_lock.py", line 69, in wrapper
    rv = func(*args, **kwargs)
RuntimeError: job 45263.-1 failed to set CurrentHosts=0 (110)
 
During handling of the above exception, another exception occurred:
 
Traceback (most recent call last):
  File "main_python.py", line 52, in <module>
    print(hpc.run(run_dict=submit_dict))
  File "/user/1148605/sandbox/hpc_interface/hpc_interface.py", line 32, in run
    return self.compute_resource.run(run, self.run_number, jobs)
  File "/user/1148605/sandbox/hpc_interface/Resources/htcondor_compute_resource.py", line 125, in run
    self.cluster_ids.append(sub.queue(txn, jobs, ad_results))
  File "/scratch/ml/sandbox/1148605/mosp_htcondor/lib/python3.8/site-packages/htcondor/_lock.py", line 99, in __exit__
    return self.cm.__exit__(*args, **kwargs)
  File "/scratch/ml/sandbox/1148605/mosp_htcondor/lib/python3.8/site-packages/htcondor/_lock.py", line 69, in wrapper
    rv = func(*args, **kwargs)
RuntimeError: Failed to abort transaction.
terminate called after throwing an instance of 'boost::python::error_already_set'
=====

As you can see, it obtains a ClusterID (45263, from the 45263.-1 above), so the submission gets at least as far as allocating a cluster ID, but I'm having trouble pinning down the cause of the failure. Why did it fail to set CurrentHosts?
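
Until the cause is known, a possible stopgap is to retry the submission when it raises RuntimeError. The following is a minimal sketch under stated assumptions, not a fix: `retry_on_runtime_error` and its parameters are hypothetical names, and `action` stands in for a function that opens a fresh transaction and calls `sub.queue(...)` as in the traceback.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_runtime_error(
    action: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 1.0,
) -> T:
    """Call action(); on RuntimeError, wait and call it again.

    action is called at most `attempts` times. The RuntimeError from the
    final failed call is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except RuntimeError:
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")
```

This only helps if the exception actually reaches Python code. The `terminate called` line in the output above suggests the process may abort before a retry can run, and it would be worth confirming that a failed attempt left no jobs in the queue before submitting again.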

Thanks for any suggestions.

Michael V Pelletier
Principal Engineer

C: +1 339.293.9149
michael.v.pelletier@xxxxxxx

Raytheon Technologies
Information Technology
50 Apple Hill Drive
Tewksbury, MA 01876-1198 


