[Condor-users] condor-g jobs failing - stuck in STAGE_OUT
- Date: Mon, 21 Jun 2004 10:34:41 -0700
- From: "Lila Klektau" <lmk@xxxxxxx>
- Subject: [Condor-users] condor-g jobs failing - stuck in STAGE_OUT
Hi,
We're using condor-g to submit to three different grid resources. On two of
them, jobs submit and run fine. On the third, however, a significant number
of jobs get stuck in the STAGE_OUT status. All three resources run PBS, and
for the jobs that get stuck, PBS reports that they finished executing
successfully. I suspect the trigger is two or more jobs finishing at the same
time: when I schedule test jobs so that they end at different times, I don't
see the problem.
We've tried running jobs against the problem site from several different
condor-g instances - some work fine and others don't (it has worked with
6.6.1 and 6.6.3, but we've hit the problem with a different 6.6.3 instance
and with 6.6.5). All sites are running globus 2.4.3.
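For context, the jobs themselves are plain globus-universe submissions. A
stripped-down submit description looks roughly like this (the executable and
file names here are placeholders, and the jobmanager contact is whichever of
the three sites we're targeting):

universe        = globus
globusscheduler = mercury.uvic.ca/jobmanager-pbs
executable      = test_job.sh
output          = test_job.out
error           = test_job.err
log             = test_job.log
queue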
In all the condor_config files, we have:
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 1
GRID_MONITOR = $(SBIN)/grid_monitor.sh
ENABLE_GRID_MONITOR = True
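(If it helps, the values the daemons actually loaded can be double-checked
from the command line with condor_config_val, e.g.

condor_config_val GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE \
    GRID_MONITOR ENABLE_GRID_MONITOR

in case a submit host is somehow running with a stale configuration.)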
The GridmanagerLog file shows this:
6/21 09:59:23 [11400] (694.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 128
6/21 09:59:23 [11400] (694.0) doEvaluateState called: gmState GM_PROBE_JOBMANAGER, globusState 128
6/21 10:03:06 [11400] (694.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 128
6/21 10:03:06 [11400] (694.0) doEvaluateState called: gmState GM_REFRESH_PROXY, globusState 128
6/21 10:03:06 [11400] (694.0) gmState GM_REFRESH_PROXY, globusState 128: refresh_credentials() returned Globus error 10
6/21 10:03:06 [11400] (694.0) doEvaluateState called: gmState GM_STOP_AND_RESTART, globusState 128
6/21 10:03:07 [11400] (694.0) doEvaluateState called: gmState GM_RESTART, globusState 128
6/21 10:03:07 [11400] (694.0) doEvaluateState called: gmState GM_REGISTER, globusState 128
6/21 10:03:07 [11400] (694.0) doEvaluateState called: gmState GM_STDIO_UPDATE, globusState 4
6/21 10:03:07 [11400] (694.0) doEvaluateState called: gmState GM_STDIO_UPDATE, globusState 4
and then the following lines repeated every minute:
6/21 10:04:07 [11400] (694.0) doEvaluateState called: gmState GM_RESTART, globusState 4
6/21 10:04:07 [11400] (694.0) doEvaluateState called: gmState GM_RESTART, globusState 4
6/21 10:04:07 [11400] (694.0) doEvaluateState called: gmState GM_REGISTER, globusState 4
6/21 10:04:07 [11400] (694.0) doEvaluateState called: gmState GM_STDIO_UPDATE, globusState 4
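(If I'm reading the GRAM state codes right, globusState 128 is STAGE_OUT and
4 is FAILED, so the gridmanager appears to be looping on a restart of a job
that GRAM already considers failed during stage-out. The refresh_credentials()
error could point at the proxy; on the submit side it can at least be
inspected with

grid-proxy-info -subject -timeleft

but the same setup works fine against the other two sites, so I doubt the
proxy by itself is the whole story.)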
Here's the corresponding output from the original gram_job_mgr log file on
the remote resource:
6/21 09:59:23 globus_gram_job_manager_query_callback() not a literal URI match
6/21 09:59:23 JM : in globus_l_gram_job_manager_query_callback, query=status
6/21 09:59:23 JM : reply: (status=128 failure code=0 (Success))
6/21 09:59:23 JM : sending reply:
protocol-version: 2
status: 128
failure-code: 0
job-failure-code: 0
6/21 09:59:23 -------------------
Every time those four GridmanagerLog lines repeat, a new gram_job_mgr log
file is created on the remote resource; the new job manager tries to restart
the job but fails with the following error:
6/21 10:03:06 JM: State lock file is locked, old jm is still alive
Processes still running on the remote resource are:
gcprod05 29693 0.0 0.1 5288 3336 ? S 09:54 0:00
globus-job-manager -conf
/usr/pkg/src/globus-toolkit-2.4.3/etc/globus-job-manager.conf -type pbs
-rdn jobmanager-pbs -machine-type unknown -publish-jobs
gcprod05 31268 0.0 0.1 5196 3776 ? S 09:55 0:00
/usr/bin/perl
/usr/pkg/src/globus-toolkit-2.4.3/libexec/globus-job-manager-script.pl -m
pbs -f /tmp/gram_d6KDpT -c stage_out
gcprod05 31320 0.0 0.1 10664 2996 ? S 09:55 0:00
/usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy
file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data
https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data
gcprod05 31321 0.0 0.1 10664 2996 ? S 09:55 0:00
/usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy
file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data
https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data
gcprod05 31322 0.0 0.1 10664 2996 ? S 09:55 0:00
/usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy
file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data
https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data
gcprod05 31323 0.0 0.1 10664 2996 ? S 09:55 0:00
/usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy
file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data
https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data
gcprod05 31329 0.0 0.1 10664 2996 ? S 09:55 0:00
/usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy
file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data
https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data
Processes still running on the condor-g resource are:
gcprod05 11400 3366 0 09:53 ? 00:00:00 condor_gridmanager -f -C
(Owner=?="gcprod05"&&x509userproxysubject=?="/C=CA/O=Grid/OU=phys.uvic.ca/CN=Lila_Klektau/CN=proxy/CN=proxy/CN=proxy")
-S /tmp/condor_g_scratch.0x83b9890.3366
gcprod05 11401 11400 0 09:53 ? 00:00:04
/opt/condor-6.6.5/sbin/gahp_server
gcprod05 11858 1 0 09:53 ? 00:00:00 globus-job-manager -conf
/home/globus/globus-2.4.3//etc/globus-job-manager.conf -type condorg -rdn
jobmanager-condorg -machine-type unknown -publish-jobs
netstat on the remote resource shows this:
tcp        0      0 mercury.uvic.ca:40033   gcgate01.phys.UVi:35280  TIME_WAIT
tcp        0      0 mercury.uvic.ca:40033   gcgate01.phys.UVi:35279  TIME_WAIT
tcp        1      0 mercury.u:gsigatekeeper gcgate01.phys.UVi:35078  CLOSE_WAIT
netstat on the condor-g resource shows this:
tcp        0      0 gcgate01.phys.UVi:35275 mercury.u:gsigatekeeper  TIME_WAIT
When jobs do get stuck, the only way we've found to fix things is to ssh to
the remote resource and kill the leftover processes by hand, then remove the
jobs from condor and clean up all the log files manually. We never noticed
jobs hanging like this with globus alone until condor-g was introduced into
the submission process.
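Concretely, the by-hand cleanup for the example above amounts to something
like this (694.0 is the stuck job from the logs; the pkill patterns are
illustrative, and the second one takes out all of that user's job managers on
the gatekeeper, so it is only safe when nothing else is running there):

# on the remote resource, as the mapped user (gcprod05)
pkill -u gcprod05 -f globus-url-copy      # the wedged stage-out transfers
pkill -u gcprod05 -f globus-job-manager   # the old job manager holding the state lock

# back on the condor-g submit host
condor_rm 694.0

followed by deleting the stale gram_job_mgr log files on the remote side and
the job's local log files.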
Has this problem been encountered before? Do you know if there are any
patches available for it?
Thanks for any help
-Lila