Mailing List Archives
Authenticated access
|
|
|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Condor-users] Can't find reason for eviction of vanilla jobs
- Date: Mon, 30 Mar 2009 16:02:15 +0100
- From: Adam Huffman <adam.huffman@xxxxxxxxx>
- Subject: [Condor-users] Can't find reason for eviction of vanilla jobs
I've been running some vanilla jobs in the past few days, but there has
been trouble with evictions - only on two machines within the pool.
This is with version 7.2.1. I would appreciate guidance in tracking
this down. Various log file excerpts are below.
Adam
Here's a typical example (8.018):
001 (008.018.000) 03/30 14:31:56 Job executing on host: <execute.host:48432>
...
001 (008.019.000) 03/30 14:32:01 Job executing on host: <execute.host:48432>
...
006 (008.018.000) 03/30 14:32:04 Image size of job updated: 241764
...
006 (008.019.000) 03/30 14:37:09 Image size of job updated: 241764
...
004 (008.018.000) 03/30 14:51:55 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:19:58, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
978049 - Run Bytes Received By Job
On the relevant execute hose StarterLog.slot1:
3/30 14:51:55 Got SIGQUIT. Performing fast shutdown.
3/30 14:51:55 ShutdownFast all jobs.
3/30 14:51:55 Process exited, pid=3908, signal=9
3/30 14:51:55 Last process exited, now Starter is exiting
3/30 14:51:55 **** condor_starter (condor_STARTER) pid 3906 EXITING WITH
STATUS 0
3/30 14:56:55 ******************************************************
3/30 14:56:55 ** condor_starter (CONDOR_STARTER) STARTING UP
3/30 14:56:55 ** /usr/sbin/condor_starter
3/30 14:56:55 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
3/30 14:56:55 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
3/30 14:56:55 ** $CondorVersion: 7.2.1 Mar 27 2009 BuildID:
Fedora-7.2.1-1.el5 $
3/30 14:56:55 ** $CondorPlatform: X86_64-LINUX_CENTOS52el5centos $
3/30 14:56:55 ** PID = 4003
3/30 14:56:55 ** Log last touched 3/30 14:51:55
3/30 14:56:55 ******************************************************
3/30 14:56:55 Using config source: /etc/condor/condor_config
3/30 14:56:55 Using local config sources:
3/30 14:56:55 /var/lib/condor/condor_config.local
3/30 14:56:55 DaemonCore: Command Socket at <127.0.0.1:57074>
3/30 14:56:55 Done setting resource limits
3/30 14:56:55 Communicating with shadow <central manager:60732>
3/30 14:56:55 Submitting machine is "x.x.x.x"
3/30 14:56:55 setting the orig job name in starter
3/30 14:56:55 setting the orig job iwd in starter
3/30 14:56:55 File transfer completed successfully.
3/30 14:56:56 warning: unable to chmod condor_exec.exe to ensure execute
bit is set: Operation not permitted
3/30 14:56:56 Job 8.18 set to execute immediately
3/30 14:56:56 Starting a VANILLA universe job with ID: 8.18
3/30 14:56:56 IWD: /var/lib/condor/execute/dir_4003
3/30 14:56:56 Input file:
/var/lib/condor/execute/dir_4003/dammin-Fib4nter.18.inp
3/30 14:56:56 Output file:
/var/lib/condor/execute/dir_4003/dammin-Fib4nter.18.out
3/30 14:56:56 Error file:
/var/lib/condor/execute/dir_4003/dammin-Fib4nter.18.err
3/30 14:56:56 About to exec
/var/lib/condor/execute/dir_4003/condor_exec.exe
3/30 14:56:56 Create_Process succeeded, pid=4005
3/30 15:16:55 Got SIGQUIT. Performing fast shutdown.
3/30 15:16:55 ShutdownFast all jobs.
3/30 15:16:55 Process exited, pid=4005, signal=9
3/30 15:16:55 Last process exited, now Starter is exiting
3/30 15:16:55 **** condor_starter (condor_STARTER) pid 4003 EXITING WITH
STATUS 0
and here's the StartLog from that machine:
3/30 14:51:55 slot1: State change: claim lease expired (condor_schedd gone?)
3/30 14:51:55 slot1: Changing state and activity: Claimed/Busy ->
Preempting/Killing
3/30 14:51:55 slot1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
3/30 14:51:55 Starter pid 3906 exited with status 0
3/30 14:51:55 slot1: State change: starter exited
3/30 14:51:55 slot1: State change: No preempting claim, returning to owner
3/30 14:51:55 slot1: Changing state and activity: Preempting/Killing ->
Owner/Idle
3/30 14:51:55 slot1: State change: IS_OWNER is false
3/30 14:51:55 slot1: Changing state: Owner -> Unclaimed
In the ShadowLog on the central manager:
3/30 14:51:55 (8.18) (8272): Job 8.18 is being evicted from
slot1@xxxxxxxxxxxx
3/30 14:51:55 (8.18) (8272): SECMAN: command 404 KILL_FRGN_JOB to
<execute.host:48432> from TCP port 60686 (blocking).
3/30 14:51:55 (8.18) (8272): SECMAN: using session
execute.host:2701:1238419915:42 for {<execute.host:48432>,<404>}.
3/30 14:51:55 (8.18) (8272): SECMAN: startCommand succeeded.
3/30 14:51:55 (8.18) (8272): SECMAN: command 1111 QMGMT_CMD to schedd at
<central.manager:47959> from TCP port 60687 (blocking).
3/30 14:51:55 (8.18) (8272): SECMAN: using session
central.manager:32568:1238419915:405 for {<central.manager:47959>,<1111>}.
3/30 14:51:55 (8.18) (8272): SECMAN: startCommand succeeded.
3/30 14:51:55 (8.18) (8272): HANDSHAKE: in handshake(my_methods =
'FS,KERBEROS,GSI')
3/30 14:51:55 (8.18) (8272): HANDSHAKE: handshake() - i am the client
3/30 14:51:55 (8.18) (8272): HANDSHAKE: sending (methods == 100) to server
3/30 14:51:55 (8.18) (8272): HANDSHAKE: server replied (method = 4)
3/30 14:51:55 (8.18) (8272): AUTHENTICATE_FS: used dir
/tmp/FS_XXX7autYz, status: 1
3/30 14:51:55 (8.18) (8272): Authentication was a Success.
3/30 14:51:55 (8.18) (8272): ZKM: setting default map to (null)
3/30 14:51:55 (8.18) (8272): ZKM: post-map: current user is '(null)'
3/30 14:51:55 (8.18) (8272): ZKM: post-map: current domain is '(null)'
3/30 14:51:55 (8.18) (8272): ZKM: post-map: current FQU is '(null)'
3/30 14:51:55 (8.18) (8272): **** condor_shadow (condor_SHADOW) pid 8272
EXITING WITH STATUS 107
And SchedLog on the central manager:
3/30 14:51:55 (pid:32568) DC_AUTHENTICATE: received DC_AUTHENTICATE from
<central.manager:60687>
3/30 14:51:55 (pid:32568) DC_AUTHENTICATE: resuming session id
sienna:32568:1238419915:405 with return address <central.manager:60568>:
3/30 14:51:55 (pid:32568) DC_AUTHENTICATE: Success.
3/30 14:51:55 (pid:32568) PERMISSION GRANTED to unauthenticated user
from host central.manager for command 1111 (QMGMT_CMD), access level
READ: reason: READ authorization policy allows access by anyone
3/30 14:51:55 (pid:32568) Calling authenticate(FS,KERBEROS,GSI) in
qmgmt_receivers
3/30 14:51:55 (pid:32568) HANDSHAKE: in handshake(my_methods =
'FS,KERBEROS,GSI')
3/30 14:51:55 (pid:32568) HANDSHAKE: handshake() - i am the server
3/30 14:51:55 (pid:32568) HANDSHAKE: client sent (methods == 100)
3/30 14:51:55 (pid:32568) HANDSHAKE: i picked (method == 4)
3/30 14:51:55 (pid:32568) HANDSHAKE: client received (method == 4)
3/30 14:51:55 (pid:32568) FS: client template is /tmp/FS_XXXXXXXXX
3/30 14:51:55 (pid:32568) FS: client filename is /tmp/FS_XXX7autYz
3/30 14:51:55 (pid:32568) AUTHENTICATE_FS: used dir /tmp/FS_XXX7autYz,
status: 1
3/30 14:51:55 (pid:32568) Authentication was a Success.
3/30 14:51:55 (pid:32568) ZKM: setting default map to condor@xxxxxxxxxxxxxxx
3/30 14:51:55 (pid:32568) ZKM: post-map: current user is 'condor'
3/30 14:51:55 (pid:32568) ZKM: post-map: current domain is 'central.manager'
3/30 14:51:55 (pid:32568) ZKM: post-map: current FQU is
'condor@xxxxxxxxxxxxxxx'
3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx
from host central.manager for queue management, access level WRITE:
reason: cached result for WRITE; see first case for the full reason
3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx
from host central.manager for queue management, access level WRITE:
reason: cached result for WRITE; see first case for the full reason
3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx
from host central.manager for queue management, access level WRITE:
reason: cached result for WRITE; see first case for the full reason
3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx
from host central.manager for queue management, access level WRITE:
reason: cached result for WRITE; see first case for the full reason
3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx
from host central.manager for queue management, access level WRITE:
reason: cached result for WRITE; see first case for the full reason
3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx
from host central.manager for queue management, access level WRITE:
reason: cached result for WRITE; see first case for the full reason
3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx
from host central.manager for queue management, access level WRITE:
reason: cached result for WRITE; see first case for the full reason
3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx
from host central.manager for queue management, access level WRITE:
reason: cached result for WRITE; see first case for the full reason
3/30 14:51:55 (pid:32568) DC_INVALIDATE_KEY: removed key id
sienna:32568:1238419915:404.
3/30 14:51:55 (pid:32568) DC_INVALIDATE_KEY: removed key id
sienna:32568:1238419915:405.
3/30 14:51:55 (pid:32568) Shadow pid 8272 for job 8.18 exited with
status 107
3/30 14:51:55 (pid:32568) SECMAN: command 443 RELEASE_CLAIM to startd at
<execute.host:48432> from UDP port 45734 (non-blocking).
3/30 14:51:55 (pid:32568) SECMAN: using session
marsantes:2701:1238411954:8 for {<execute.host:48432>,<443>}.
3/30 14:51:55 (pid:32568) SECMAN: UDP, m_have_session == 1,
peer_can_negotiate == 1
3/30 14:51:55 (pid:32568) SECMAN: startCommand succeeded.
3/30 14:51:55 (pid:32568) Sent RELEASE_CLAIM to startd at
<execute.host:48432>
3/30 14:51:55 (pid:32568) Match record (slot1@xxxxxxxxxxxx
<execute.host:48432> for adam@xxxxxxxxxxxxxxx, 8.18) deleted
3/30 14:51:55 (pid:32568) IpVerify::FillHole: open count at level DAEMON
for execute.host now 1
3/30 14:51:55 (pid:32568) IpVerify::FillHole: open count at level WRITE
for execute.host now 1
3/30 14:51:55 (pid:32568) IpVerify::FillHole: open count at level READ
for execute.host now 3
3/30 14:51:55 (pid:32568) IpVerify::FillHole: open count at level READ
for execute.host now 2
3/30 14:51:55 (pid:32568) Null parameter --- match not deleted