Mailing List Archives
Authenticated access
|
|
|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Condor-users] Strange condor eviction
- Date: Tue, 24 Jan 2006 08:33:49 -0600 (CST)
- From: Steven Timm <timm@xxxxxxxx>
- Subject: [Condor-users] Strange condor eviction
I have a user who is reporting that his job is getting evicted
after running for several hours. This is puzzling since I have
PREEMPT and PREEMPTION_REQUIREMENTS set to FALSE across my cluster,
e.g. if a user starts they should get to finish. The user reports
(and I confirm) that this same job has been evicted a couple times
in a row now. Nevertheless it restarted on another node after
the eviction and is still running now. It doesn't appear to be
a case of memory. This is condor 6.7.13
Here's the log of the job:
[root@fnpcsrv1 condor_log]# more lia_fd_05_11_1_4.log
000 (53837.000.000) 01/23 10:49:58 Job submitted from host:
<131.225.167.42:3908
2>
...
001 (53837.000.000) 01/23 10:50:03 Job executing on host:
<131.225.167.170:32772
...
006 (53837.000.000) 01/23 10:50:11 Image size of job updated: 6360
...
006 (53837.000.000) 01/23 11:10:11 Image size of job updated: 154944
...
004 (53837.000.000) 01/23 17:22:45 Job was evicted.
(0) Job was not checkpointed.
Usr 0 04:52:00, Sys 0 00:07:16 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
001 (53837.000.000) 01/23 18:11:30 Job executing on host:
<131.225.167.156:32772
...
006 (53837.000.000) 01/23 18:31:40 Image size of job updated: 150152
...
006 (53837.000.000) 01/23 22:51:39 Image size of job updated: 150552
...
006 (53837.000.000) 01/23 23:31:39 Image size of job updated: 153320
...
006 (53837.000.000) 01/24 03:51:41 Image size of job updated: 155168
...
006 (53837.000.000) 01/24 05:11:39 Image size of job updated: 158732
As you can see, after the job was evicted, it restarted on another
node and seems to be going fine now.
--------------------------------------
From the node in question, here's the appropriate section of the StartLog
1/23 10:49:59 DaemonCore: Command received via UDP from host
<131.225.167.42:191
77>
1/23 10:49:59 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
1/23 10:49:59 vm1: match_info called
1/23 10:49:59 vm1: Received match <131.225.167.170:32772>#1137606512#551
1/23 10:49:59 vm1: State change: match notification protocol successful
1/23 10:49:59 vm1: Changing state: Unclaimed -> Matched
1/23 10:49:59 DaemonCore: Command received via TCP from host
<131.225.167.42:177
17>
1/23 10:49:59 DaemonCore: received command 442 (REQUEST_CLAIM), calling
handler
(command_request_claim)
1/23 10:49:59 vm1: Request accepted.
1/23 10:49:59 vm1: Remote owner is rubin@xxxxxxxx
1/23 10:49:59 vm1: State change: claiming protocol successful
1/23 10:49:59 vm1: Changing state: Matched -> Claimed
1/23 10:50:03 DaemonCore: Command received via TCP from host
<131.225.167.42:177
23>
1/23 10:50:03 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler
(command_activate_claim)
1/23 10:50:03 vm1: Got activate_claim request from shadow
(<131.225.167.42:17723
)
1/23 10:50:03 vm1: Remote job ID is 53837.0
1/23 10:50:03 vm1: Got universe "VANILLA" (5) from request classad
1/23 10:50:03 vm1: State change: claim-activation protocol successful
1/23 10:50:03 vm1: Changing activity: Idle -> Busy
<clip>
1/23 17:22:44 vm1: State change: claim lease expired (condor_schedd gone?)
1/23 17:22:44 vm1: Changing state and activity: Claimed/Busy ->
Preempting/Killi
ng
1/23 17:22:44 vm2: State change: claim lease expired (condor_schedd gone?)
1/23 17:22:44 vm2: Changing state and activity: Claimed/Idle ->
Preempting/Killi
ng
1/23 17:22:44 vm2: State change: No preempting claim, returning to owner
1/23 17:22:44 vm2: Changing state and activity: Preempting/Killing ->
Owner/Idle
1/23 17:22:44 vm2: State change: IS_OWNER is false
1/23 17:22:44 vm2: Changing state: Owner -> Unclaimed
1/23 17:22:45 DaemonCore: Command received via TCP from host
<131.225.167.42:384
24>
1/23 17:22:45 DaemonCore: received command 404
(DEACTIVATE_CLAIM_FORCIBLY), call
ing handler (command_handler)
1/23 17:22:45 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
1/23 17:22:45 Starter pid 18109 exited with status 0
1/23 17:22:45 vm1: State change: starter exited
1/23 17:22:45 vm1: State change: No preempting claim, returning to owner
1/23 17:22:45 vm1: Changing state and activity: Preempting/Killing ->
Owner/Idle
1/23 17:22:45 vm1: State change: IS_OWNER is false
1/23 17:22:45 vm1: Changing state: Owner -> Unclaimed
1/23 17:26:51 State change: RunBenchmarks is TRUE
1/23 17:26:51 vm2: Changing activity: Idle -> Benchmarking
1/23 17:26:55 State change: benchmarks completed
1/23 17:26:55 vm2: Changing activity: Benchmarking -> Idle
1/23 17:26:55 State change: RunBenchmarks is TRUE
1/23 17:26:55 vm1: Changing activity: Idle -> Benchmarking
1/23 17:27:00 State change: benchmarks completed
1/23 17:27:00 vm1: Changing activity: Benchmarking -> Idle
At 17:22 the job was evicted from vm1.
---------------------------------------
Here's the log from StarterLog.vm1
1/23 10:50:03 ******************************************************
1/23 10:50:03 ** condor_starter (CONDOR_STARTER) STARTING UP
1/23 10:50:03 ** /opt/condor-6.7.13/sbin/condor_starter
1/23 10:50:03 ** $CondorVersion: 6.7.13 Nov 7 2005 $
1/23 10:50:03 ** $CondorPlatform: I386-LINUX_RH9 $
1/23 10:50:03 ** PID = 18109
1/23 10:50:03 ******************************************************
1/23 10:50:03 Using config file: /etc/condor/condor_config
1/23 10:50:03 Using local config files:
/opt/condor-6.7.13/etc/group_params.conf
ig /opt/condor-6.7.13/local/condor_config.local
1/23 10:50:03 DaemonCore: Command Socket at <131.225.167.170:37303>
1/23 10:50:03 Done setting resource limits
1/23 10:50:03 Communicating with shadow <131.225.167.42:17722>
1/23 10:50:03 Submitting machine is "fngp-osg.fnal.gov"
1/23 10:50:03 Starting a VANILLA universe job with ID: 53837.0
1/23 10:50:03 IWD: /home/minfarm/condor_submit
1/23 10:50:03 Output file:
/home/minfarm/condor_log/lia_fd_05_11_1_4.1205.1.out
1/23 10:50:03 Error file:
/home/minfarm/condor_log/lia_fd_05_11_1_4.1205.1.err
1/23 10:50:03 About to exec /home/rubin/condor/ana_li 1205 1 R1.11
fd_05_11_1_4
1/23 10:50:03 Create_Process succeeded, pid=18118
1/23 17:22:44 Got SIGQUIT. Performing fast shutdown.
1/23 17:22:44 ShutdownFast all jobs.
1/23 17:22:44 Process exited, pid=18118, signal=9
1/23 17:22:45 Last process exited, now Starter is exiting
1/23 17:22:45 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
--------------------------------------------------------------
Nothing in the system logs indicates any problems at this time.
The SchedLog has unfortunately already rotated past this time frame.
Any ideas what might be going on?
Steve Timm
--
------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525 timm@xxxxxxxx http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team