[Condor-users] Condor-G/Globus Problem
- Date: Thu, 13 Oct 2005 12:26:30 -0400 (EDT)
- From: "James E. Dobson" <James.E.Dobson@xxxxxxxxxxxxx>
- Subject: [Condor-users] Condor-G/Globus Problem
Hi,
I have a problem with jobs going into a "Globus error 7" state after
running successfully for a while. A very long-lived proxy is in place:
[edsan@bellows-falls edsan]$ grid-proxy-info -timeleft
7128652
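For scale, grid-proxy-info -timeleft reports the remaining lifetime in seconds. The short sketch below converts the value shown above into days and hours; the function name format_timeleft is made up for illustration and is not part of any Globus or Condor tool.

```python
# Convert the seconds reported by `grid-proxy-info -timeleft`
# into a days/hours/minutes/seconds string.
# `format_timeleft` is a hypothetical helper, not a Globus API.

def format_timeleft(seconds: int) -> str:
    days, rem = divmod(seconds, 86400)   # 86400 seconds per day
    hours, rem = divmod(rem, 3600)       # 3600 seconds per hour
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"

if __name__ == "__main__":
    # Value taken from the session above.
    print(format_timeleft(7128652))  # prints "82d 12h 10m 52s"
```

So the proxy has roughly 82 days left, which suggests that plain proxy expiry on the submit side is not the cause.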
Yet the jobs go into the UNKNOWN state after a while:
[edsan@bellows-falls edsan]$ condor_q -globus |grep pbs
2373.0   edsan   UNKNOWN   condor   pbs-01.grid.dartmo   /afs/northstar.dar
2374.0   edsan   UNKNOWN   condor   pbs-01.grid.dartmo   /afs/northstar.dar
2395.0   edsan   UNKNOWN   condor   pbs-01.grid.dartmo   /afs/northstar.dar
[edsan@bellows-falls edsan]$ condor_q -l 2395.0
...
LastHoldReason = "Globus error 7: authentication with the remote server failed"
...
The job directory has been deleted, so it looks like the job is done:
[edsan@bellows-falls edsan]$ globus-job-status https://pbs-01.grid.dartmouth.edu:33674/18955/1129148987/
DONE
[edsan@bellows-falls edsan]$ globus-job-get-output https://pbs-01.grid.dartmouth.edu:33674/18955/1129148987/
Invalid job id.
On the gatekeeper itself (also running Condor) the jobs still appear to be
running:
[jed@pbs-01 jed]$ condor_q
-- Submitter: pbs-01.grid.dartmouth.edu : <129.170.30.146:32787> : pbs-01.grid.dartmouth.edu
 ID      OWNER   SUBMITTED      RUN_TIME    ST  PRI  SIZE   CMD
 29.0    grid    10/12 12:40    0+23:40:00  R   0    413.6  data
 30.0    grid    10/12 12:42    0+23:38:22  R   0    414.0  data
 38.0    grid    10/12 16:29    0+19:50:56  R   0    399.6  data
3 jobs; 0 idle, 3 running, 0 held
But the job directory doesn't exist:
[jed@pbs-01 jed]$ condor_q -l 38.0 |grep ^Err
Err = "/home/grid/.globus/job/pbs-01.grid.dartmouth.edu/18955.1129148987/stderr"
[jed@pbs-01 jed]$ sudo ls -ld /home/grid/.globus/job/pbs-01.grid.dartmouth.edu/18955.1129148987
Password:
ls: /home/grid/.globus/job/pbs-01.grid.dartmouth.edu/18955.1129148987: No such file or directory
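To make the link between the two views explicit: the job contact queried on the submit host (.../18955/1129148987/) and the directory checked on the gatekeeper (.../18955.1129148987) appear to name the same job. The sketch below derives one from the other. The layout is inferred only from the paths quoted in this message, not from Globus documentation, and job_dir_from_contact and its home parameter are hypothetical names.

```python
from urllib.parse import urlparse

# Hypothetical helper: map a job contact URL to the gatekeeper-side
# job directory. The "<home>/.globus/job/<host>/<id1>.<id2>" layout is an
# assumption inferred from the paths quoted in this message.
def job_dir_from_contact(contact: str, home: str = "/home/grid") -> str:
    url = urlparse(contact)
    # Path looks like "/18955/1129148987/"; drop the empty components.
    parts = [p for p in url.path.split("/") if p]
    return f"{home}/.globus/job/{url.hostname}/{'.'.join(parts)}"

if __name__ == "__main__":
    contact = "https://pbs-01.grid.dartmouth.edu:33674/18955/1129148987/"
    print(job_dir_from_contact(contact))
```

For the contact above this yields the same directory that ls reports as missing, so the DONE status and the missing directory refer to the job that Condor on the gatekeeper still lists as running.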
Has anyone seen this before? Any clue what is causing it? We have some
long-running jobs that are getting hit by this communication or
authentication problem and by the job directory being deleted.
[jed@pbs-01 jed]$ /opt/vdt/vdt/bin/vdt-version
You have installed the complete VDT version 1.3.5:
Condor/Condor-G 6.7.6
Globus Toolkit, pre web-services, client 3.2.1
Globus Toolkit, pre web-services, server 3.2.1
[edsan@bellows-falls edsan]$ condor_version
$CondorVersion: 6.6.7 Oct 11 2004 $
$CondorPlatform: I386-LINUX_RH9 $
Thanks,
-jed