[Condor-devel] problems holding jobs through SOAP
Hello everyone,
I'm occasionally seeing failures when attempting to hold jobs through
the SOAP interface. The problem is fairly rare, but I think I've narrowed
down how to reproduce it, and I've included the relevant snippet of the
scheduler log below. From what I can see, the job moves from running/idle
to held while my transaction is still working through the jobs in the
cluster to hold them. As a result, when my component attempts to hold the
job, the request fails because the job is already held.
All of the following takes place in a single transaction. First, I query
Condor for all jobs that are running or idle, which you can see in the
first bolded section. The constraint is specifically
( JobStatus==1 || JobStatus==2 ). Then I iterate over the jobs, attempting
to hold each one. Before the next hold request arrives, however, you can
see in the second bolded section that the job exits and moves into the
held state. Finally, the third bolded section shows the attempt to hold
the job, which fails because the job is already held.
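The sequence described above can be sketched as a small simulation. This is only an illustration under stated assumptions: FakeSchedd, hold_all and shadow_exits are hypothetical in-memory stand-ins, not the Condor SOAP API, and the only Condor facts used are the JobStatus codes (1 = idle, 2 = running, 5 = held).

```python
# JobStatus codes used by Condor: 1 = idle, 2 = running, 5 = held.
IDLE, RUNNING, HELD = 1, 2, 5


class FakeSchedd:
    """Hypothetical in-memory stand-in for the scheduler (not the SOAP API)."""

    def __init__(self, jobs):
        self.jobs = dict(jobs)  # job id -> JobStatus

    def get_job_ads(self):
        # Mirrors the constraint ( JobStatus==1 || JobStatus==2 ).
        return [job for job, status in self.jobs.items()
                if status in (IDLE, RUNNING)]

    def hold_job(self, job):
        # Returns True on success, False if the job is already on hold.
        if self.jobs[job] == HELD:
            return False
        self.jobs[job] = HELD
        return True


def hold_all(schedd, between_requests=None):
    snapshot = schedd.get_job_ads()        # step 1: query once
    results = {}
    for job in snapshot:                   # step 2: hold each job in the snapshot
        if between_requests:
            between_requests(schedd, job)  # scheduler activity between requests
        results[job] = schedd.hold_job(job)
    return results


def shadow_exits(schedd, job):
    # Models the log: job 9.4 exits and the scheduler puts it on hold itself.
    if job == "9.4":
        schedd.jobs[job] = HELD


schedd = FakeSchedd({"9.3": IDLE, "9.4": RUNNING, "9.5": IDLE})
results = hold_all(schedd, between_requests=shadow_exits)
print(results)  # {'9.3': True, '9.4': False, '9.5': True}
```

The hold for 9.4 fails only because the snapshot taken in step 1 is stale by the time the request is made; every job still ends up held.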
...
3/6 09:14:19 -------- Done starting jobs --------
3/6 09:14:19 Received HTTP POST connection from <157.209.215.39:54312>
3/6 09:14:19 Current Socket bufsize=8k
3/6 09:14:19 Current Socket bufsize=8k
3/6 09:14:19 About to serve HTTP request...
3/6 09:14:19 SOAP entered getJobAds(), transaction: 0
3/6 09:14:19 SOAP leaving getJobAds() result=0
3/6 09:14:19 Completed servicing HTTP request
3/6 09:14:20 DaemonCore: Command received via UDP from host <172.30.90.55:2631>
3/6 09:14:20 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
3/6 09:14:20 DaemonCore: Command received via TCP from host <172.30.90.55:2634>
3/6 09:14:20 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
3/6 09:14:20 condor_read(): Socket closed when trying to read 5 bytes from <172.30.90.55:2634>
3/6 09:14:20 IO: EOF reading packet header
3/6 09:14:20 QMGR Connection closed
3/6 09:14:20 DaemonCore: Command received via UDP from host <172.30.90.55:2635>
3/6 09:14:20 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
3/6 09:14:20 Shadow pid 1124 for job 9.4 exited with status 112
3/6 09:14:20 Putting job 9.4 on hold
3/6 09:14:21 Deleting shadow rec for PID 1124, job (9.4)
3/6 09:14:21 Entered check_zombie( 1124, 0x022D8A54, st=5 )
3/6 09:14:21 Writing record to user logfile=C:\Condor\Local/spool10\cluster9.proc4.subproc0/7204.t-4.log owner=grid_qa
3/6 09:14:21 init_user_ids: want user 'grid_qa@AD2', current is '(null)@(null)'
3/6 09:14:21 init_user_ids: Already have handle for grid_qa@AD2, so returning.
3/6 09:14:21 TokenCache contents:
grid_qa@AD2
3/6 09:14:21 TokenCache contents:
grid_qa@AD2
3/6 09:14:21 Exited check_zombie( 1124, 0x022D8A54 )
3/6 09:14:21
3/6 09:14:21 ..................
3/6 09:14:21 .. Shadow Recs (0/1)
3/6 09:14:21 ..................
3/6 09:14:21 Exited delete_shadow_rec( 1124 )
3/6 09:14:21 -------- Begin starting jobs --------
3/6 09:14:21 Job 8.-1: not runnable
3/6 09:14:21 Job 9.5: is runnable
3/6 09:14:21 Scheduler::start_std - job=9.5 on <157.209.215.39:60747>
3/6 09:14:21 Queueing job 9.5 in runnable job queue
3/6 09:14:21 start next job after 3 sec, JobsThisBurst 0
3/6 09:14:21 Match (<157.209.215.39:60747>#1173132078#9) - running 9.5
3/6 09:14:21 -------- Done starting jobs --------
3/6 09:14:21 Received HTTP POST connection from <157.209.215.39:54316>
3/6 09:14:21 Current Socket bufsize=8k
3/6 09:14:21 Current Socket bufsize=8k
3/6 09:14:21 About to serve HTTP request...
3/6 09:14:21 SOAP entered holdJob(), transaction: 3984014088
3/6 09:14:21 Job 9.4 is already on hold
3/6 09:14:21 SOAP leaving holdJob() result=1
3/6 09:14:21 Completed servicing HTTP request
...
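For anyone wanting to check how often this sequence occurs, a scheduler log can be scanned for a "Putting job X on hold" entry followed later by "Job X is already on hold" for the same job. The sketch below is a rough illustration only: find_hold_races is a hypothetical helper, the patterns are taken from the two messages in the excerpt above, and it assumes each log entry sits on a single line.

```python
import re

# Patterns for the two messages seen in the excerpt (one entry per line assumed).
PUT_ON_HOLD = re.compile(r"Putting job (\d+\.\d+) on hold")
ALREADY_HELD = re.compile(r"Job (\d+\.\d+) is already on hold")


def find_hold_races(lines):
    """Return ids of jobs the scheduler held before a later hold request failed."""
    held_by_schedd = set()
    races = []
    for line in lines:
        match = PUT_ON_HOLD.search(line)
        if match:
            held_by_schedd.add(match.group(1))
            continue
        match = ALREADY_HELD.search(line)
        if match and match.group(1) in held_by_schedd:
            races.append(match.group(1))
    return races


sample = [
    "3/6 09:14:20 Shadow pid 1124 for job 9.4 exited with status 112",
    "3/6 09:14:20 Putting job 9.4 on hold",
    "3/6 09:14:21 SOAP entered holdJob(), transaction: 3984014088",
    "3/6 09:14:21 Job 9.4 is already on hold",
]
races = find_hold_races(sample)
print(races)  # ['9.4']
```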
My questions are:
1 - Is this the correct and expected behavior?
2 - If it is, is there a way to tell whether a hold failed because the
job is already held or for some other reason? If not, I'll simply log and
ignore the exception, assuming that the job was already held.
3 - What is the system hold option? I have it defaulting to true, but I'm
not sure that is correct, and I'd like to make sure my code isn't causing
the issue.
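One possible middle ground for question 2, short of ignoring every failure, is to re-read the job's status after a failed hold and swallow the error only if the job really is held. The sketch below is an assumption-laden illustration, not the Condor SOAP API: HoldFailed, hold_job and get_status are hypothetical names supplied by the caller, and the only Condor fact used is that JobStatus 5 means held.

```python
import logging

HELD = 5  # Condor JobStatus value for a held job


class HoldFailed(Exception):
    """Hypothetical exception raised by a client wrapper when a hold fails."""


def hold_tolerating_already_held(job_id, hold_job, get_status,
                                 log=logging.getLogger(__name__)):
    """Hold job_id; return 'held', or 'already_held' if something else held it.

    hold_job and get_status are caller-supplied callables (assumed names).
    Any failure that leaves the job in a non-held state is re-raised.
    """
    try:
        hold_job(job_id)
        return "held"
    except HoldFailed as err:
        if get_status(job_id) == HELD:
            log.info("job %s was already held: %s", job_id, err)
            return "already_held"
        raise


# Minimal in-memory demonstration with made-up job states.
statuses = {"9.4": HELD, "9.5": 2}


def fake_hold(job_id):
    if statuses[job_id] == HELD:
        raise HoldFailed(f"Job {job_id} is already on hold")
    statuses[job_id] = HELD


outcomes = [hold_tolerating_already_held(j, fake_hold, statuses.get)
            for j in ("9.4", "9.5")]
print(outcomes)  # ['already_held', 'held']
```

Re-checking the status costs one extra query per failure, but it keeps genuine errors from being silently discarded.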
If you would like to see the code I'm
using, or if there's anything else I can do to assist, please let me know.
Regards,
Rob