[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Condor-users] submit jobs complete but then go back into queue
- Date: Tue, 11 May 2010 05:50:21 -0600
- From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
- Subject: [Condor-users] submit jobs complete but then go back into queue
I have a simple submit job that executes
on a client via Condor, but then returns to the Condor pool. I think the
problem is related to the Condor daemon owners, the submit owner and the
ad owner. Our daemons run under a service account because we are restricting
the configuration files on a shared drive. I am not sure this is the best
approach yet, but we are still trying to meet all of our security concerns.
We are using a small pool for testing,
SSL, NTSSPI, All condor clients and CM is Windows XP as well (SP3). SSL
is used for client authentication. If I run the Condor job under the condor
service account (the same account that the condor daemons run as, the jobs
exit the Queue properly. If I run the Condor job as my user account, the
job executes, but then returns to the Queue as 'I', and eventually it will
run again and again and so forth.
I tried to specify the owner class add
in the submit file but when I do this the job sits in idle and never runs.
Does anyone have any ideas as to what I might be doing wrong or have any
recommendations as to how we should handle running the daemons under a
different account. What does not make sense to me is that the state changes
from idle to run after the job is submitted, but then can not exit the
queue after it completes when the class ad (odonnellm) and submit owner
(igskabcb-condoradmin) are different.
PS I do not see anything else out of
the norm in the other log files (submit or execute machines).
Thank you for your help/suggestions,
Mike
Below are snapshots of the log files
indicating the errors when the state of the job is trying to be changed.
Shadowlog (submit machine):
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672
schedd at <IP:4905>,,size=595,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): SECMAN:
resume, other side is $CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173
$, NOT reauthenticating.
05/10 15:35:35 (248.0) (2088): SECMAN:
about to enable message authenticator.
05/10 15:35:35 (248.0) (2088): SECMAN:
successfully enabled message authenticator!
05/10 15:35:35 (248.0) (2088): SECMAN:
about to enable encryption.
05/10 15:35:35 (248.0) (2088): SECMAN:
successfully enabled encryption!
05/10 15:35:35 (248.0) (2088): SECMAN:
startCommand succeeded.
05/10 15:35:35 (248.0) (2088): Authorizing
server '*/IP'.
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672
schedd at <IP:4905>,,size=76,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree:
Failed SetAttribute(NumJobStarts, 1)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672
schedd at <IP:4905>,,size=92,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree:
Failed SetAttribute(LastJobLeaseRenewal, 1273527179)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672
schedd at <IP:4905>,,size=83,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree:
Failed SetAttribute(RemoteSysCpu, 0.000000)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672
schedd at <IP:4905>,,size=85,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree:
Failed SetAttribute(RemoteUserCpu, 22.000000)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672
schedd at <IP:4905>,,size=77,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree:
Failed SetAttribute(ImageSize, 69116)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672
schedd at <IP:4905>,,size=80,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree:
Failed SetAttribute(ExitBySignal, FALSE)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672
schedd at <IP:4905>,,size=72,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree:
Failed SetAttribute(ExitCode, 0)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672
schedd at <IP:4905>,,size=85,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree:
Failed SetAttribute(TerminationPending, TRUE)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672
schedd at <IP:4905>,,size=80,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree:
Failed SetAttribute(CommittedTime, 1502)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672
schedd at <IP:4905>,,size=80,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree:
Failed SetAttribute(BytesSent, 0.000000)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672
schedd at <IP:4905>,,size=81,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree:
Failed SetAttribute(BytesRecvd, 0.000000)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672
schedd at <IP:4905>,,size=29,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672
schedd at <IP:4905>,,size=8,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read():
fd=1672
05/10 15:35:35 (248.0) (2088): condor_read():
select returned 1
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672
schedd at <IP:4905>,,size=29,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): CLOSE
<IP:2912> fd=1672
05/10 15:35:35 (248.0) (2088): Failed
to perform final update to job queue!
05/10 15:35:35 (248.0) (2088): Maximum
number of job cleanup retry attempts (SHADOW_MAX_JOB_CLEANUP_RETRIES=5)
reached; Forcing job requeue!
05/10 15:35:35 (248.0) (2088): KEYCACHEENTRY:
deleted: 00D18648
05/10 15:35:35 (248.0) (2088): KEYCACHEENTRY:
deleted: 00D33CD0
05/10 15:35:35 (248.0) (2088): KEYCACHEENTRY:
deleted: 00D217A8
05/10 15:35:35 (248.0) (2088): KEYCACHE:
deleted: 00B7B6F0
05/10 15:35:35 (248.0) (2088): CLOSE
<IP:2880> fd=1716
05/10 15:35:35 (248.0) (2088): CLOSE
<127.0.0.1:2881> fd=1248
05/10 15:35:35 (248.0) (2088): CLOSE
<127.0.0.1:2882> fd=1732
05/10 15:35:35 (248.0) (2088): ****
condor_shadow (condor_SHADOW) pid 2088 EXITING WITH STATUS 107
SchedLog (Submit machine):
05/10 15:35:35 (pid:1844) ad owner:
odonnellm, queue submit owner: IGSKBACB-condoradmin
05/10 15:35:35 (pid:1844) OwnerCheck(IGSKBACB-condoradmin)
failed in SetAttribute for job 248.0
05/10 15:35:35 (pid:1844) condor_write(fd=1228
<IP:2912>,,size=37,timeout=20,flags=0)
05/10 15:35:35 (pid:1844) condor_read(fd=1228
<IP:2912>,,size=21,timeout=20,flags=0)
05/10 15:35:35 (pid:1844) condor_read():
fd=1228
05/10 15:35:35 (pid:1844) condor_read():
select returned 1
05/10 15:35:35 (pid:1844) condor_read(fd=1228
<IP:2912>,,size=71,timeout=20,flags=0)
05/10 15:35:35 (pid:1844) condor_read():
fd=1228
05/10 15:35:35 (pid:1844) condor_read():
select returned 1
...
05/10 16:10:08 (pid:1844) SECMAN: startCommand
succeeded.
05/10 16:10:08 (pid:1844) Authorizing
server '*/159.189.162.73'.
05/10 16:10:08 (pid:1844) SEND [1000]
<IP:3265> <159.189.162.73:1202>
05/10 16:10:08 (pid:1844) SEND [146]
<IP:3265> <159.189.162.73:1202>
05/10 16:10:08 (pid:1844) Completed
RELEASE_CLAIM to startd at <159.189.162.73:1202>
05/10 16:10:08 (pid:1844) Match record
(IGSKBACBLT214.gs.doi.net <IP2:1202>#1273511706#25#... for odonnellm@xxxxxxxxxx,
249.0) deleted
05/10 16:10:08 (pid:1844) IpVerify::FillHole:
removed DAEMON-level opening for 159.189.162.73
05/10 16:10:08 (pid:1844) IpVerify::FillHole:
removed WRITE-level opening for 159.189.162.73
05/10 16:10:08 (pid:1844) IpVerify::FillHole:
open count at level READ for 159.189.162.73 now 1
05/10 16:10:08 (pid:1844) IpVerify::FillHole:
removed READ-level opening for 159.189.162.73
05/10 16:10:08 (pid:1844) Deleting
shadow rec for PID 4404, job (249.0)
05/10 16:10:08 (pid:1844) Entered check_zombie(
4404, 0x00DD5824, st=2 )
05/10 16:10:08 (pid:1844) Prioritized
runnable job list will be rebuilt, because ClassAd attribute JobStatus=1
changed
05/10 16:10:09 (pid:1844) Marked job
249.0 as IDLE
05/10 16:10:09 (pid:1844) Exited check_zombie(
4404, 0x00DD5824 )
05/10 16:10:09 (pid:1844) Shadow does
not have a match record, so did not remove it from the match
05/10 16:10:09 (pid:1844)
05/10 16:10:09 (pid:1844) ..................
05/10 16:10:09 (pid:1844) .. Shadow
Recs (0/0)
05/10 16:10:09 (pid:1844) ..................