HTCondor Project List Archives




[Condor-devel] problems holding jobs through SOAP




Hello everyone,

I'm occasionally seeing failures when attempting to hold jobs through the SOAP interface.  The problem is fairly rare, but I think I've narrowed down how to reproduce it, and I've included the relevant snippet of the scheduler's log file below.  From what I can see, the job moves from running/idle to held while my transaction is still working through the jobs in the cluster to hold them.  As a result, when my component attempts to hold the job, the request fails because the job is already held.

All of the following takes place in a single transaction.  First, I query Condor for all jobs that are running or idle, which is the getJobAds() call in the log below.  The constraint is ( JobStatus==1 || JobStatus==2 ).  Then I iterate over the jobs, attempting to hold each one.  Before the next hold request arrives, however, the shadow for job 9.4 exits and the scheduler puts the job on hold itself ("Putting job 9.4 on hold").  Finally, my holdJob() request for that job fails because the job is already on hold ("Job 9.4 is already on hold").
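To make the sequence concrete, here is a minimal Python sketch of the race as I understand it. FakeSchedd and its methods are hypothetical stand-ins that only model the behaviour seen in the log; they are not the Condor SOAP API. The status codes 1, 2 and 5 are the usual JobStatus values for idle, running and held.

```python
IDLE, RUNNING, HELD = 1, 2, 5  # JobStatus codes: idle, running, held


class FakeSchedd:
    """Hypothetical in-memory stand-in for the scheduler (not the SOAP API)."""

    def __init__(self, jobs):
        self.jobs = dict(jobs)  # job id -> JobStatus

    def get_job_ads(self, statuses):
        # Returns a snapshot; later status changes are not reflected in it.
        return [job for job, status in self.jobs.items() if status in statuses]

    def hold_job(self, job):
        # Modelled on the log: a hold on an already-held job reports failure.
        if self.jobs[job] == HELD:
            return False
        self.jobs[job] = HELD
        return True


schedd = FakeSchedd({"9.4": RUNNING, "9.5": IDLE})

# Step 1: query for jobs matching ( JobStatus==1 || JobStatus==2 ).
snapshot = schedd.get_job_ads({IDLE, RUNNING})

# Step 2: between the query and the hold, the shadow for 9.4 exits and the
# scheduler puts the job on hold on its own.
schedd.jobs["9.4"] = HELD

# Step 3: iterate over the now-stale snapshot and try to hold each job.
results = {job: schedd.hold_job(job) for job in snapshot}
print(results)
```

In this model the hold on 9.4 fails even though the job ends up in exactly the state that was asked for, which matches the holdJob() result=1 line in the log.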

...
3/6 09:14:19 -------- Done starting jobs --------
3/6 09:14:19 Received HTTP POST connection from <157.209.215.39:54312>
3/6 09:14:19 Current Socket bufsize=8k
3/6 09:14:19 Current Socket bufsize=8k
3/6 09:14:19 About to serve HTTP request...
3/6 09:14:19 SOAP entered getJobAds(), transaction: 0
3/6 09:14:19 SOAP leaving getJobAds() result=0
3/6 09:14:19 Completed servicing HTTP request
3/6 09:14:20 DaemonCore: Command received via UDP from host <172.30.90.55:2631>
3/6 09:14:20 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
3/6 09:14:20 DaemonCore: Command received via TCP from host <172.30.90.55:2634>
3/6 09:14:20 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
3/6 09:14:20 condor_read(): Socket closed when trying to read 5 bytes from <172.30.90.55:2634>
3/6 09:14:20 IO: EOF reading packet header
3/6 09:14:20 QMGR Connection closed
3/6 09:14:20 DaemonCore: Command received via UDP from host <172.30.90.55:2635>
3/6 09:14:20 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
3/6 09:14:20 Shadow pid 1124 for job 9.4 exited with status 112
3/6 09:14:20 Putting job 9.4 on hold
3/6 09:14:21 Deleting shadow rec for PID 1124, job (9.4)
3/6 09:14:21 Entered check_zombie( 1124, 0x022D8A54, st=5 )
3/6 09:14:21 Writing record to user logfile=C:\Condor\Local/spool10\cluster9.proc4.subproc0/7204.t-4.log owner=grid_qa
3/6 09:14:21 init_user_ids: want user 'grid_qa@AD2', current is '(null)@(null)'
3/6 09:14:21 init_user_ids: Already have handle for grid_qa@AD2, so returning.
3/6 09:14:21 TokenCache contents:
grid_qa@AD2
3/6 09:14:21 TokenCache contents:
grid_qa@AD2
3/6 09:14:21 Exited check_zombie( 1124, 0x022D8A54 )
3/6 09:14:21
3/6 09:14:21 ..................
3/6 09:14:21 .. Shadow Recs (0/1)
3/6 09:14:21 ..................

3/6 09:14:21 Exited delete_shadow_rec( 1124 )
3/6 09:14:21 -------- Begin starting jobs --------
3/6 09:14:21 Job 8.-1: not runnable
3/6 09:14:21 Job 9.5: is runnable
3/6 09:14:21 Scheduler::start_std - job=9.5 on <157.209.215.39:60747>
3/6 09:14:21 Queueing job 9.5 in runnable job queue
3/6 09:14:21 start next job after 3 sec, JobsThisBurst 0
3/6 09:14:21 Match (<157.209.215.39:60747>#1173132078#9) - running 9.5
3/6 09:14:21 -------- Done starting jobs --------
3/6 09:14:21 Received HTTP POST connection from <157.209.215.39:54316>
3/6 09:14:21 Current Socket bufsize=8k
3/6 09:14:21 Current Socket bufsize=8k
3/6 09:14:21 About to serve HTTP request...
3/6 09:14:21 SOAP entered holdJob(), transaction: 3984014088
3/6 09:14:21 Job 9.4 is already on hold
3/6 09:14:21 SOAP leaving holdJob() result=1
3/6 09:14:21 Completed servicing HTTP request
...

My questions are:
1 - Is this the correct and expected behavior?  
2 - If it is, is there a way to tell whether a hold failed because the job was already held or for some other reason?  If not, then I'll just log and ignore the exception, assuming that the job was already held.
3 - What is the system hold option?  I have it defaulting to true, but I'm not sure that is correct, and I'd like to make sure this isn't my code causing the issue.
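For question 2, one possible workaround (an assumption on my part, not confirmed behaviour of the SOAP interface) would be to re-read the job's status after a failed hold and treat JobStatus 5 (held) as a benign outcome. The sketch below is client-agnostic: hold and get_status are hypothetical callables that would wrap whatever calls the real client makes.

```python
HELD = 5  # JobStatus code for a held job


def hold_jobs(job_ids, hold, get_status):
    """Try to hold each job and classify the outcome.

    hold(job_id) -> bool and get_status(job_id) -> int are hypothetical
    callables supplied by the caller; they are not part of any Condor API.
    """
    outcome = {"held": [], "already_held": [], "failed": []}
    for job_id in job_ids:
        if hold(job_id):
            outcome["held"].append(job_id)
        elif get_status(job_id) == HELD:
            # The hold failed, but the job is in the desired state anyway.
            outcome["already_held"].append(job_id)
        else:
            # The hold failed for some other reason; surface it to the caller.
            outcome["failed"].append(job_id)
    return outcome


# Example with stubbed callables: 9.4 was held by the scheduler first,
# 9.5 holds normally, 9.6 fails and is still running (status 2).
statuses = {"9.4": 5, "9.5": 1, "9.6": 2}
hold_ok = {"9.4": False, "9.5": True, "9.6": False}
report = hold_jobs(["9.4", "9.5", "9.6"], hold_ok.get, statuses.get)
print(report)
```

This costs one extra query per failed hold, but it avoids silently ignoring failures that have nothing to do with the job already being held.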

If you would like to see the code I'm using, or if there's anything else I can do to assist, please let me know.

Regards,
Rob


