[Condor-users] condor-g jobs failing - stuck in STAGE_OUT
- Date: Mon, 21 Jun 2004 10:34:41 -0700
- From: "Lila Klektau" <lmk@xxxxxxx>
- Subject: [Condor-users] condor-g jobs failing - stuck in STAGE_OUT
Hi,
We're using condor-g to submit to three different grid resources. On two of
them, jobs submit and run fine. On the third, however, a significant number
of jobs get stuck in the STAGE_OUT status. All three resources run PBS, and
for the jobs that get stuck, PBS reports that they finished executing
successfully. I suspect the trigger is two or more jobs finishing at the same
time: when I schedule test jobs so that they end at different times, I don't
see the problem.
We've tried running jobs against the problem site from several different
condor-g instances - some work fine and others don't (it has worked with
6.6.1 and 6.6.3, but we've hit the problem with a different 6.6.3 instance
and with 6.6.5). All sites are running globus 2.4.3.
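For context, the jobs themselves are plain globus-universe submissions. A
stripped-down submit description looks roughly like this (the executable and
file names here are placeholders, and the jobmanager contact is whichever of
the three sites we're targeting):

universe        = globus
globusscheduler = mercury.uvic.ca/jobmanager-pbs
executable      = test_job.sh
output          = test_job.out
error           = test_job.err
log             = test_job.log
queue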
In all the condor_config files, we have:
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 1
GRID_MONITOR = $(SBIN)/grid_monitor.sh
ENABLE_GRID_MONITOR = True
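(If it helps, the values the daemons actually loaded can be double-checked
from the command line with condor_config_val, e.g.

condor_config_val GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE \
    GRID_MONITOR ENABLE_GRID_MONITOR

in case a submit host is somehow running with a stale configuration.)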
The GridmanagerLog file shows this:
6/21 09:59:23 [11400] (694.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 128
6/21 09:59:23 [11400] (694.0) doEvaluateState called: gmState GM_PROBE_JOBMANAGER, globusState 128
6/21 10:03:06 [11400] (694.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 128
6/21 10:03:06 [11400] (694.0) doEvaluateState called: gmState GM_REFRESH_PROXY, globusState 128
6/21 10:03:06 [11400] (694.0) gmState GM_REFRESH_PROXY, globusState 128: refresh_credentials() returned Globus error 10
6/21 10:03:06 [11400] (694.0) doEvaluateState called: gmState GM_STOP_AND_RESTART, globusState 128
6/21 10:03:07 [11400] (694.0) doEvaluateState called: gmState GM_RESTART, globusState 128
6/21 10:03:07 [11400] (694.0) doEvaluateState called: gmState GM_REGISTER, globusState 128
6/21 10:03:07 [11400] (694.0) doEvaluateState called: gmState GM_STDIO_UPDATE, globusState 4
6/21 10:03:07 [11400] (694.0) doEvaluateState called: gmState GM_STDIO_UPDATE, globusState 4
and then the following lines repeated every minute:
6/21 10:04:07 [11400] (694.0) doEvaluateState called: gmState GM_RESTART, globusState 4
6/21 10:04:07 [11400] (694.0) doEvaluateState called: gmState GM_RESTART, globusState 4
6/21 10:04:07 [11400] (694.0) doEvaluateState called: gmState GM_REGISTER, globusState 4
6/21 10:04:07 [11400] (694.0) doEvaluateState called: gmState GM_STDIO_UPDATE, globusState 4
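(If I'm reading the GRAM state codes right, globusState 128 is STAGE_OUT and
4 is FAILED, so the gridmanager appears to be looping on a restart of a job
that GRAM already considers failed during stage-out. The refresh_credentials()
error could point at the proxy; on the submit side it can at least be
inspected with

grid-proxy-info -subject -timeleft

but the same setup works fine against the other two sites, so I doubt the
proxy by itself is the whole story.)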
Here's the corresponding output from the original gram_job_mgr log file on
the remote resource:
6/21 09:59:23 globus_gram_job_manager_query_callback() not a literal URI match
6/21 09:59:23 JM : in globus_l_gram_job_manager_query_callback, query=status
6/21 09:59:23 JM : reply: (status=128 failure code=0 (Success))
6/21 09:59:23 JM : sending reply:
protocol-version: 2
status: 128
failure-code: 0
job-failure-code: 0
6/21 09:59:23 -------------------
Every time those four GridmanagerLog lines repeat, a new gram_job_mgr log
file is created on the remote resource; the new job manager tries to restart
the job but fails with the following error:
6/21 10:03:06 JM: State lock file is locked, old jm is still alive
Processes still running on the remote resource are:
gcprod05 29693 0.0 0.1 5288 3336 ? S 09:54 0:00
globus-job-manager -conf
/usr/pkg/src/globus-toolkit-2.4.3/etc/globus-job-manager.conf -type pbs
-rdn jobmanager-pbs -machine-type unknown -publish-jobs
gcprod05 31268 0.0 0.1 5196 3776 ? S 09:55 0:00
/usr/bin/perl
/usr/pkg/src/globus-toolkit-2.4.3/libexec/globus-job-manager-script.pl -m
pbs -f /tmp/gram_d6KDpT -c stage_out
gcprod05 31320 0.0 0.1 10664 2996 ? S 09:55 0:00
/usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy
file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data
https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data
gcprod05 31321 0.0 0.1 10664 2996 ? S 09:55 0:00
/usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy
file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data
https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data
gcprod05 31322 0.0 0.1 10664 2996 ? S 09:55 0:00
/usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy
file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data
https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data
gcprod05 31323 0.0 0.1 10664 2996 ? S 09:55 0:00
/usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy
file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data
https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data
gcprod05 31329 0.0 0.1 10664 2996 ? S 09:55 0:00
/usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy
file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data
https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data
Processes still running on the condor-g resource are:
gcprod05 11400 3366 0 09:53 ? 00:00:00 condor_gridmanager -f -C
(Owner=?="gcprod05"&&x509userproxysubject=?="/C=CA/O=Grid/OU=phys.uvic.ca/CN=Lila_Klektau/CN=proxy/CN=proxy/CN=proxy")
-S /tmp/condor_g_scratch.0x83b9890.3366
gcprod05 11401 11400 0 09:53 ? 00:00:04
/opt/condor-6.6.5/sbin/gahp_server
gcprod05 11858 1 0 09:53 ? 00:00:00 globus-job-manager -conf
/home/globus/globus-2.4.3//etc/globus-job-manager.conf -type condorg -rdn
jobmanager-condorg -machine-type unknown -publish-jobs
netstat on the remote resource shows this:
tcp        0      0 mercury.uvic.ca:40033   gcgate01.phys.UVi:35280  TIME_WAIT
tcp        0      0 mercury.uvic.ca:40033   gcgate01.phys.UVi:35279  TIME_WAIT
tcp        1      0 mercury.u:gsigatekeeper gcgate01.phys.UVi:35078  CLOSE_WAIT
netstat on the condor-g resource shows this:
tcp        0      0 gcgate01.phys.UVi:35275 mercury.u:gsigatekeeper  TIME_WAIT
When jobs do get stuck, the only way we've found to fix things is to ssh to
the remote resource and kill the leftover processes by hand, then remove the
jobs from condor and clean up all the log files manually. We never noticed
jobs hanging like this with globus alone until condor-g was introduced into
the submission process.
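Concretely, the by-hand cleanup for the example above amounts to something
like this (694.0 is the stuck job from the logs; the pkill patterns are
illustrative, and the second one takes out all of that user's job managers on
the gatekeeper, so it is only safe when nothing else is running there):

# on the remote resource, as the mapped user (gcprod05)
pkill -u gcprod05 -f globus-url-copy      # the wedged stage-out transfers
pkill -u gcprod05 -f globus-job-manager   # the old job manager holding the state lock

# back on the condor-g submit host
condor_rm 694.0

followed by deleting the stale gram_job_mgr log files on the remote side and
the job's local log files.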
Has this problem been encountered before? Do you know if there are any
patches available for it?
Thanks for any help
-Lila