[Condor-users] Condor-G fun
- Date: Fri, 9 Nov 2007 11:36:32 -0000
- From: "Kewley, J (John)" <j.kewley@xxxxxxxx>
- Subject: [Condor-users] Condor-G fun
I have run a few jobs using Condor-G now, but even though the jobs
sometimes run OK, I frequently get the following error. When the error
occurs, the jobs never seem to recover (although I give up after about
40 mins):
---------------
020 (023.000.000) 11/09 11:08:03 Detected Down Globus Resource
RM-Contact: <grid-resource>/jobmanager-fork
...
026 (023.000.000) 11/09 11:08:03 Detected Down Grid Resource
GridResource: gt2 <grid-resource>/jobmanager-fork
---------------
The two most obvious reasons for this are:
a) Machine is down
b) Machine never existed (i.e. name spelled wrong)
Since I can cut and paste the machine name and successfully run Grid
jobs on that machine, any ideas what else it could be?
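Reasons (a) and (b) can also be checked independently of Globus and Condor. The following is only a minimal sketch in Python: it assumes the gatekeeper listens on port 2119 (the usual GT2 default, but an assumption for any given site), and the names `split_contact` and `check_resource` are made up for illustration. It only tests name resolution and a TCP connect, so it says nothing about the callback ports that the jobmanager uses to talk back to the submit machine.

```python
import socket

# Assumption: 2119 is the usual GT2 gatekeeper port; change it if the site differs.
GATEKEEPER_PORT = 2119


def split_contact(contact, default_port=GATEKEEPER_PORT):
    """Split 'host[:port]/jobmanager-xxx' into (host, port)."""
    hostport = contact.split("/", 1)[0]
    host, sep, port = hostport.partition(":")
    return host, int(port) if sep else default_port


def check_resource(host, port=GATEKEEPER_PORT, timeout=5.0):
    """Return 'unresolvable', 'unreachable' or 'ok' for host:port."""
    try:
        # Case (b): the name does not resolve at all.
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        return "unresolvable"
    try:
        # Case (a): the name resolves but nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError:
        return "unreachable"
```

Calling `check_resource(*split_contact(contact))` with the same contact string as in the submit file, repeatedly while a job is stuck, would show whether the resource looks down from the submit machine at that moment.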
I have ruled out the following (or believe I have):
1. Firewall issue (the firewall has now been opened): this would also
prevent the globus-job-run commands from running, and in any case I'd
get an error about an inability to transfer files.
2. No valid proxy: the globus commands work.
3. Condor daemons can't see the firewall settings: I have re-run with
those settings in the daemons' environment, and some jobs do run.
Are there any DEBUG settings I can use to get further info?
condor_q -analyze doesn't help for non-matchmaking Condor-G jobs.
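Until better debug output is available, the job's user log can at least be summarised. This is a rough sketch in Python that assumes event headers have the layout shown in the excerpt above (three-digit event code, job id in parentheses, date, time, message). It only looks for the two event codes seen there, 020 and 026, and the function name `down_events` is made up for illustration.

```python
import re

# Header layout taken from the log excerpt:
#   020 (023.000.000) 11/09 11:08:03 Detected Down Globus Resource
EVENT_HEADER = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.\d+\) "
    r"(?P<date>\d\d/\d\d) (?P<time>\d\d:\d\d:\d\d) (?P<text>.*)$"
)

# Only the two "resource down" codes that appear in the excerpt.
DOWN_CODES = {"020", "026"}


def down_events(lines):
    """Return (code, job_id, date, time, message) for each 'down' event."""
    events = []
    for line in lines:
        match = EVENT_HEADER.match(line.strip())
        if match and match.group("code") in DOWN_CODES:
            job_id = f"{int(match.group('cluster'))}.{int(match.group('proc'))}"
            events.append((
                match.group("code"),
                job_id,
                match.group("date"),
                match.group("time"),
                match.group("text"),
            ))
    return events
```

Passing an open file handle for glob.log to `down_events` lists when each job saw the resource go down, which can then be compared with the times at which manual globus-job-run tests succeeded.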
Below is the submit file (hopefully there is a bug in there somewhere)
Cheers
JK
---------------------
# Maybe I should try with "old" syntax, using globus universe
universe = grid
# Just try with fork for now
grid_resource = gt2 <grid-resource>/jobmanager-fork
notification = never
# This exists in /bin on all grid resources I use
executable = /bin/hostname
transfer_executable = false
# No common storage
SHOULD_TRANSFER_FILES = YES
WHEN_TO_TRANSFER_OUTPUT = ON_EXIT
# Do these make sense in combination with previous 2 file transfer settings?
stream_input = false
stream_error = false
stream_output = false
output = glob$(PROCESS).out
error = glob$(PROCESS).err
log = glob.log
# Is something like this needed or should I just omit it?
REQUIREMENTS = (OpSys == LINUX && (Arch != "Windows51"))
queue