[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Condor-users] condor-g jobs failing - stuck in STAGE_OUT
- Date: Thu, 24 Jun 2004 10:32:09 -0700
- From: "Lila Klektau" <lmk@xxxxxxx>
- Subject: Re: [Condor-users] condor-g jobs failing - stuck in STAGE_OUT
On Wed, 23 Jun 2004 21:53:51 -0500, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
It looks like the globus-url-copy to stage out the job files is hanging
on the gatekeeper machine. We've seen this problem with globus-url-copy
in other situations, but haven't been able to determine the cause. If
you could add/change the following line to your condor_config file,
reproduce the problem, and send me the resulting gridmanager log file,
it'd be a great help in figuring out how to make condor-g better at
working around this problem:
GRIDMANAGER_DEBUG = D_FULLDEBUG
Thanks for the reply,
I've been trying to recreate the problem, with full debugging, but I'm
noticing two different outputs. In one case, the job is not attempting to
restart and netstat shows no connections on the condor-g machine to the
remote resource, but the remote resource shows CLOSE_WAIT connections with
the condor-g machine. This is the tail end of the GridmanagerLog file for
that:
6/24 10:16:47 [22515] Updating classad values for 859.0:
6/24 10:16:47 [22515] GlobusStatus = 128
6/24 10:16:47 [22515] leaving doContactSchedd()
6/24 10:16:52 [22515] DaemonCore::IsPidAlive(): kill returned EPERM,
assuming pid 3366 is alive.
6/24 10:17:37 [22515] GAHP <- 'RESULTS'
6/24 10:17:37 [22515] GAHP -> 'S' '0'
6/24 10:17:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:18:37 [22515] GAHP <- 'RESULTS'
6/24 10:18:37 [22515] GAHP -> 'S' '0'
6/24 10:18:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:18:52 [22515] DaemonCore::IsPidAlive(): kill returned EPERM,
assuming pid 3366 is alive.
6/24 10:19:37 [22515] GAHP <- 'RESULTS'
6/24 10:19:37 [22515] GAHP -> 'S' '0'
6/24 10:19:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:20:16 [22515] (859.0) doEvaluateState called: gmState
GM_SUBMITTED, globusState 128
6/24 10:20:16 [22515] (859.0) gm state change: GM_SUBMITTED ->
GM_PROBE_JOBMANAGER
6/24 10:20:16 [22515] GAHP <- 'GRAM_JOB_STATUS 25
https://mercury.uvic.ca:40053/9165/1088097296/'
6/24 10:20:16 [22515] GAHP -> 'S'
6/24 10:20:17 [22515] GAHP <- 'RESULTS'
6/24 10:20:17 [22515] GAHP -> 'S' '1'
6/24 10:20:17 [22515] GAHP -> '25' '0' '0' '128'
6/24 10:20:17 [22515] (859.0) doEvaluateState called: gmState
GM_PROBE_JOBMANAGER, globusState 128
6/24 10:20:17 [22515] (859.0) gm state change: GM_PROBE_JOBMANAGER ->
GM_SUBMITTED
6/24 10:20:37 [22515] GAHP <- 'RESULTS'
6/24 10:20:37 [22515] GAHP -> 'S' '0'
6/24 10:20:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:20:52 [22515] DaemonCore::IsPidAlive(): kill returned EPERM,
assuming pid 3366 is alive.
6/24 10:21:37 [22515] GAHP <- 'RESULTS'
6/24 10:21:37 [22515] GAHP -> 'S' '0'
6/24 10:21:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:22:37 [22515] GAHP <- 'RESULTS'
6/24 10:22:37 [22515] GAHP -> 'S' '0'
6/24 10:22:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:22:52 [22515] DaemonCore::IsPidAlive(): kill returned EPERM,
assuming pid 3366 is alive.
6/24 10:23:37 [22515] GAHP <- 'RESULTS'
6/24 10:23:37 [22515] GAHP -> 'S' '0'
6/24 10:23:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:24:37 [22515] GAHP <- 'RESULTS'
6/24 10:24:37 [22515] GAHP -> 'S' '0'
6/24 10:24:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:24:40 [22515] CheckProxies called
6/24 10:24:40 [22515] will call CheckProxies again in 600 seconds
6/24 10:24:52 [22515] DaemonCore::IsPidAlive(): kill returned EPERM,
assuming pid 3366 is alive.
6/24 10:25:17 [22515] (859.0) doEvaluateState called: gmState
GM_SUBMITTED, globusState 128
6/24 10:25:17 [22515] (859.0) gm state change: GM_SUBMITTED ->
GM_PROBE_JOBMANAGER
6/24 10:25:17 [22515] GAHP <- 'GRAM_JOB_STATUS 26
https://mercury.uvic.ca:40053/9165/1088097296/'
6/24 10:25:17 [22515] GAHP -> 'S'
6/24 10:25:17 [22515] GAHP <- 'RESULTS'
6/24 10:25:17 [22515] GAHP -> 'S' '1'
6/24 10:25:17 [22515] GAHP -> '26' '0' '0' '128'
6/24 10:25:17 [22515] (859.0) doEvaluateState called: gmState
GM_PROBE_JOBMANAGER, globusState 128
6/24 10:25:17 [22515] (859.0) gm state change: GM_PROBE_JOBMANAGER ->
GM_SUBMITTED
6/24 10:25:37 [22515] GAHP <- 'RESULTS'
6/24 10:25:37 [22515] GAHP -> 'S' '0'
6/24 10:25:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:26:37 [22515] GAHP <- 'RESULTS'
6/24 10:26:37 [22515] GAHP -> 'S' '0'
6/24 10:26:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:26:52 [22515] DaemonCore::IsPidAlive(): kill returned EPERM,
assuming pid 3366 is alive.
6/24 10:27:37 [22515] GAHP <- 'RESULTS'
6/24 10:27:37 [22515] GAHP -> 'S' '0'
6/24 10:27:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:28:37 [22515] GAHP <- 'RESULTS'
6/24 10:28:37 [22515] GAHP -> 'S' '0'
6/24 10:28:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:28:52 [22515] DaemonCore::IsPidAlive(): kill returned EPERM,
assuming pid 3366 is alive.
If I let it keep going, similar messages will just be repeated.
However, in other cases a job restart is attempted (this is the one I have
noticed many times before, where netstat shows connections on both sides
and a new log file on remote resource is created every minute). This is
the corresponding tail end of the GridmanagerLog file:
6/24 08:39:37 [18272] GAHP <- 'RESULTS'
6/24 08:39:37 [18272] GAHP -> 'S' '0'
6/24 08:39:38 [18272] checkResources(): 1 resources, 0 are down
6/24 08:40:20 [18272] (814.0) doEvaluateState called: gmState
GM_SUBMITTED, globusState 128
6/24 08:40:20 [18272] (814.0) gm state change: GM_SUBMITTED ->
GM_PROBE_JOBMANAGER
6/24 08:40:20 [18272] GAHP <- 'GRAM_JOB_STATUS 59
https://mercury.uvic.ca:40023/30935/1088091305/'
6/24 08:40:20 [18272] GAHP -> 'S'
6/24 08:40:20 [18272] GAHP <- 'RESULTS'
6/24 08:40:20 [18272] GAHP -> 'S' '1'
6/24 08:40:20 [18272] GAHP -> '59' '0' '0' '128'
6/24 08:40:20 [18272] (814.0) doEvaluateState called: gmState
GM_PROBE_JOBMANAGER, globusState 128
6/24 08:40:20 [18272] (814.0) gm state change: GM_PROBE_JOBMANAGER ->
GM_SUBMITTED
6/24 08:40:37 [18272] GAHP <- 'RESULTS'
6/24 08:40:37 [18272] GAHP -> 'S' '0'
6/24 08:40:38 [18272] checkResources(): 1 resources, 0 are down
6/24 08:40:52 [18272] DaemonCore::IsPidAlive(): kill returned EPERM,
assuming pid 3366 is alive.
6/24 08:41:37 [18272] GAHP <- 'RESULTS'
6/24 08:41:37 [18272] GAHP -> 'S' '0'
6/24 08:41:38 [18272] checkResources(): 1 resources, 0 are down
6/24 08:42:37 [18272] GAHP <- 'RESULTS'
6/24 08:42:37 [18272] GAHP -> 'S' '0'
6/24 08:42:38 [18272] checkResources(): 1 resources, 0 are down
6/24 08:42:52 [18272] DaemonCore::IsPidAlive(): kill returned EPERM,
assuming pid 3366 is alive.
6/24 08:43:37 [18272] GAHP <- 'RESULTS'
6/24 08:43:37 [18272] GAHP -> 'S' '0'
6/24 08:43:38 [18272] checkResources(): 1 resources, 0 are down
6/24 08:44:37 [18272] GAHP <- 'RESULTS'
6/24 08:44:37 [18272] GAHP -> 'S' '0'
6/24 08:44:38 [18272] checkResources(): 1 resources, 0 are down
6/24 08:44:40 [18272] CheckProxies called
6/24 08:44:40 [18272] (re)caching proxy 1
6/24 08:44:40 [18272] GAHP <- 'CACHE_PROXY_FROM_FILE 1 /tmp/x509up_u505'
6/24 08:44:40 [18272] GAHP -> 'S'
6/24 08:44:40 [18272] GAHP <- 'USE_CACHED_PROXY 1'
6/24 08:44:40 [18272] GAHP -> 'S'
6/24 08:44:40 [18272] will call CheckProxies again in 600 seconds
6/24 08:44:40 [18272] (814.0) doEvaluateState called: gmState
GM_SUBMITTED, globusState 128
6/24 08:44:40 [18272] (814.0) gm state change: GM_SUBMITTED ->
GM_REFRESH_PROXY
6/24 08:44:40 [18272] GAHP <- 'GRAM_JOB_REFRESH_PROXY 60
https://mercury.uvic.ca:40023/30935/1088091305/'
6/24 08:44:40 [18272] GAHP -> 'S'
6/24 08:44:40 [18272] GAHP <- 'RESULTS'
6/24 08:44:40 [18272] GAHP -> 'S' '1'
6/24 08:44:40 [18272] GAHP -> '60' '10'
6/24 08:44:40 [18272] (814.0) doEvaluateState called: gmState
GM_REFRESH_PROXY, globusState 128
6/24 08:44:40 [18272] (814.0) gmState GM_REFRESH_PROXY, globusState 128:
refresh_credentials() returned Globus error 10
6/24 08:44:40 [18272] (814.0) gm state change: GM_REFRESH_PROXY ->
GM_STOP_AND_RESTART
6/24 08:44:40 [18272] GAHP <- 'GRAM_JOB_SIGNAL 61
https://mercury.uvic.ca:40023/30935/1088091305/ 9 NULL'
6/24 08:44:40 [18272] GAHP -> 'S'
6/24 08:44:40 [18272] GAHP <- 'RESULTS'
6/24 08:44:40 [18272] GAHP -> 'S' '1'
6/24 08:44:40 [18272] GAHP -> '61' '0' '0' '128'
6/24 08:44:40 [18272] (814.0) doEvaluateState called: gmState
GM_STOP_AND_RESTART, globusState 128
6/24 08:44:40 [18272] (814.0) gm state change: GM_STOP_AND_RESTART ->
GM_RESTART
6/24 08:44:40 [18272] GAHP <- 'GRAM_JOB_REQUEST 62
mercury.uvic.ca:2119/jobmanager-pbs https://gcgate01.phys.uvic.ca:43460/ 1
&(rsl_substitution=(GRIDMANAGER_GASS_URL\ https://gcgate01.phys.uvic.ca:43461))(restart=https://mercury.uvic.ca:40023/30935/1088091305/)(remote_io_url=$(GRIDMANAGER_GASS_URL))(proxy_timeout=240)'
6/24 08:44:40 [18272] GAHP -> 'S'
6/24 08:44:41 [18272] GAHP <- 'RESULTS'
6/24 08:44:41 [18272] GAHP -> 'S' '1'
6/24 08:44:41 [18272] GAHP -> '62' '124'
'https://mercury.uvic.ca:40023/30935/1088091305/'
6/24 08:44:41 [18272] (814.0) doEvaluateState called: gmState GM_RESTART,
globusState 128
6/24 08:44:41 [18272] (814.0) gm state change: GM_RESTART -> GM_INIT
6/24 08:44:41 [18272] (814.0) gm state change: GM_INIT -> GM_REGISTER
6/24 08:44:41 [18272] GAHP <- 'GRAM_JOB_CALLBACK_REGISTER 63
https://mercury.uvic.ca:40023/30935/1088091305/
https://gcgate01.phys.uvic.ca:43460/'
6/24 08:44:41 [18272] GAHP -> 'S'
6/24 08:44:41 [18272] GAHP <- 'RESULTS'
6/24 08:44:41 [18272] GAHP -> 'S' '1'
6/24 08:44:41 [18272] GAHP -> '63' '0' '130' '4'
6/24 08:44:41 [18272] (814.0) doEvaluateState called: gmState GM_REGISTER,
globusState 128
6/24 08:44:41 [18272] (814.0) globus state change: STAGE_OUT -> FAILED
6/24 08:44:41 [18272] (814.0) gm state change: GM_REGISTER ->
GM_STDIO_UPDATE
6/24 08:44:41 [18272] GAHP <- 'GRAM_JOB_SIGNAL 64
https://mercury.uvic.ca:40023/30935/1088091305/ 7
&(remote_io_url=https://gcgate01.phys.uvic.ca:43461)(invalid=bad)(invalid=bad)'
6/24 08:44:41 [18272] GAHP -> 'S'
6/24 08:44:41 [18272] (814.0) doEvaluateState called: gmState
GM_STDIO_UPDATE, globusState 4
6/24 08:44:41 [18272] in doContactSchedd()
6/24 08:44:41 [18272] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using
default value of 0
6/24 08:44:41 [18272] SEC_DEBUG_PRINT_KEYS is undefined, using default
value of False
6/24 08:44:41 [18272] AUTHENTICATE_FS: used file /tmp/qmgr_QPq9At, status:
1
6/24 08:44:41 [18272] Updating classad values for 814.0:
6/24 08:44:41 [18272] leaving doContactSchedd()
6/24 08:44:41 [18272] GAHP <- 'RESULTS'
6/24 08:44:41 [18272] GAHP -> 'S' '1'
6/24 08:44:41 [18272] GAHP -> '64' '94' '0' '4'
6/24 08:44:41 [18272] (814.0) doEvaluateState called: gmState
GM_STDIO_UPDATE, globusState 4
6/24 08:44:41 [18272] (814.0) gm state change: GM_STDIO_UPDATE ->
GM_SUBMITTED
6/24 08:44:41 [18272] (814.0) gm state change: GM_SUBMITTED -> GM_FAILED
6/24 08:44:41 [18272] (814.0) gm state change: GM_FAILED -> GM_RESTART
6/24 08:44:52 [18272] DaemonCore::IsPidAlive(): kill returned EPERM,
assuming pid 3366 is alive.
6/24 08:45:37 [18272] GAHP <- 'RESULTS'
6/24 08:45:37 [18272] GAHP -> 'S' '0'
6/24 08:45:38 [18272] checkResources(): 1 resources, 0 are down
6/24 08:45:41 [18272] (814.0) doEvaluateState called: gmState GM_RESTART,
globusState 4
6/24 08:45:41 [18272] GAHP <- 'GRAM_JOB_REQUEST 65
mercury.uvic.ca:2119/jobmanager-pbs https://gcgate01.phys.uvic.ca:43460/ 1
&(rsl_substitution=(GRIDMANAGER_GASS_URL\ https://gcgate01.phys.uvic.ca:43461))(restart=https://mercury.uvic.ca:40023/30935/1088091305/)(remote_io_url=$(GRIDMANAGER_GASS_URL))(proxy_timeout=240)'
6/24 08:45:41 [18272] GAHP -> 'S'
6/24 08:45:41 [18272] GAHP <- 'RESULTS'
6/24 08:45:41 [18272] GAHP -> 'S' '1'
6/24 08:45:41 [18272] GAHP -> '65' '124'
'https://mercury.uvic.ca:40023/30935/1088091305/'
6/24 08:45:41 [18272] (814.0) doEvaluateState called: gmState GM_RESTART,
globusState 4
6/24 08:45:41 [18272] (814.0) gm state change: GM_RESTART -> GM_INIT
6/24 08:45:41 [18272] (814.0) gm state change: GM_INIT -> GM_REGISTER
6/24 08:45:41 [18272] GAHP <- 'GRAM_JOB_CALLBACK_REGISTER 66
https://mercury.uvic.ca:40023/30935/1088091305/
https://gcgate01.phys.uvic.ca:43460/'
6/24 08:45:41 [18272] GAHP -> 'S'
6/24 08:45:41 [18272] GAHP <- 'RESULTS'
6/24 08:45:41 [18272] GAHP -> 'S' '1'
6/24 08:45:41 [18272] GAHP -> '66' '0' '130' '4'
6/24 08:45:41 [18272] (814.0) doEvaluateState called: gmState GM_REGISTER,
globusState 4
6/24 08:45:41 [18272] (814.0) gm state change: GM_REGISTER ->
GM_STDIO_UPDATE
6/24 08:45:41 [18272] GAHP <- 'GRAM_JOB_SIGNAL 67
https://mercury.uvic.ca:40023/30935/1088091305/ 7
&(remote_io_url=https://gcgate01.phys.uvic.ca:43461)(invalid=bad)(invalid=bad)'
6/24 08:45:41 [18272] GAHP -> 'S'
6/24 08:45:42 [18272] GAHP <- 'RESULTS'
6/24 08:45:42 [18272] GAHP -> 'S' '1'
6/24 08:45:42 [18272] GAHP -> '67' '94' '0' '4'
6/24 08:45:42 [18272] (814.0) doEvaluateState called: gmState
GM_STDIO_UPDATE, globusState 4
6/24 08:45:42 [18272] (814.0) gm state change: GM_STDIO_UPDATE ->
GM_SUBMITTED
6/24 08:45:42 [18272] (814.0) gm state change: GM_SUBMITTED -> GM_FAILED
6/24 08:45:42 [18272] (814.0) gm state change: GM_FAILED -> GM_RESTART
I would have sent whole log files, but the problem only appears when
multiple jobs are run at once, so log files get quite big and they
exceeded the maximum size allowed for posting to this list. Let me know
if it would help to have them and I'll send them offline.
-Lila Klektau