[Condor-users] my jobs won't run on my pool :(
- Date: Fri, 21 Oct 2005 16:38:39 +0200
- From: Nicolas GUIOT <nicolas.guiot@xxxxxxx>
- Subject: [Condor-users] my jobs won't run on my pool :(
Hi everyone
I have problems running some jobs on my pool. Here are three examples: two from the net and one from our lab.
1st job : uname.sh (from http://condor.optena.com/display/CONDOR/mail/3277)
___________
guiot@chagall:~/tmp/TestCondor/JobPerso$ more uname.cmd
Universe = vanilla
Executable = uname.sh
Output = uname.out
Error = uname.err
log = uname.log
queue
guiot@chagall:~/tmp/TestCondor/JobPerso$ more uname.sh
#!/bin/bash
# Print the machine name we ran on
uname -n
guiot@chagall:~/tmp/TestCondor/JobPerso$
--> The job does not run.
Excerpt from the uname.log file:
...
001 (110.000.000) 10/21 15:39:03 Job executing on host: <193.49.27.11:34130>
...
007 (110.000.000) 10/21 15:39:03 Shadow exception!
Error from starter on vrubel.galaxy.ibpc.fr: Failed to execute '/ibpc/chagall/guiot/tmp/TestCondor/JobPerso/uname.sh condor_exec.exe': Permission denied
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
$tail /scratch/condor/log/SchedLog
10/21 15:18:59 (pid:3971) Starting add_shadow_birthdate(108.0)
10/21 15:18:59 (pid:3971) Started shadow for job 108.0 on "<193.49.27.11:34130>", (shadow pid = 21834)
10/21 15:18:59 (pid:3971) Shadow pid 21834 for job 108.0 exited with status 4
10/21 15:18:59 (pid:3971) ERROR: Shadow exited with job exception code!
10/21 15:19:01 (pid:3971) Starting add_shadow_birthdate(108.0)
10/21 15:19:01 (pid:3971) Started shadow for job 108.0 on "<193.49.27.11:34130>", (shadow pid = 21838)
10/21 15:19:02 (pid:3971) Shadow pid 21838 for job 108.0 exited with status 4
10/21 15:19:02 (pid:3971) ERROR: Shadow exited with job exception code!
10/21 15:19:02 (pid:3971) Match for cluster 108 has had 5 shadow exceptions, relinquishing.
10/21 15:19:02 (pid:3971) Sent RELEASE_CLAIM to startd on <193.49.27.11:34130>
10/21 15:19:02 (pid:3971) Match record (<193.49.27.11:34130>, 108, 0) deleted
10/21 15:19:03 (pid:3971) Sent ad to central manager for guiot@xxxxxxxxxxxxxx
10/21 15:19:03 (pid:3971) Sent ad to 1 collectors for guiot@xxxxxxxxxxxxxx
____________________________________________
2nd job : foo.condor (from http://www.csit.fsu.edu/~burkardt/f_src/condor/)
__________
guiot@chagall:~/tmp/TestCondor/JobPerso$ more foo.condor
universe = vanilla
executable = foo.csh
log = foo.log
output = foo.out
queue
guiot@chagall:~/tmp/TestCondor/JobPerso$ more foo.csh
#!/bin/csh
#
date
echo " "
echo "FOO.CSH."
echo " A simple shell script that shows off."
#
foreach i (10 20 40)
echo $i
end
#
echo "Current directory is " $PWD "."
#
echo " "
echo "FOO.CSH."
echo " Normal end of execution."
echo " "
date
guiot@chagall:~/tmp/TestCondor/JobPerso$
--> Same result. Here is an excerpt from the foo.log file:
...
001 (109.000.000) 10/21 15:35:08 Job executing on host: <193.49.27.11:34130>
...
007 (109.000.000) 10/21 15:35:08 Shadow exception!
Error from starter on vrubel.galaxy.ibpc.fr: Failed to execute '/ibpc/chagall/guiot/tmp/TestCondor/JobPerso/foo.csh condor_exec.exe': Permission denied
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
$tail /scratch/condor/log/SchedLog
10/21 15:35:10 (pid:3971) Starting add_shadow_birthdate(109.0)
10/21 15:35:10 (pid:3971) Started shadow for job 109.0 on "<193.49.27.11:34130>", (shadow pid = 22188)
10/21 15:35:10 (pid:3971) Shadow pid 22188 for job 109.0 exited with status 4
10/21 15:35:10 (pid:3971) ERROR: Shadow exited with job exception code!
10/21 15:35:12 (pid:3971) Starting add_shadow_birthdate(109.0)
10/21 15:35:12 (pid:3971) Started shadow for job 109.0 on "<193.49.27.11:34130>", (shadow pid = 22191)
10/21 15:35:13 (pid:3971) Shadow pid 22191 for job 109.0 exited with status 4
10/21 15:35:13 (pid:3971) ERROR: Shadow exited with job exception code!
10/21 15:35:13 (pid:3971) Match for cluster 109 has had 5 shadow exceptions, relinquishing.
10/21 15:35:13 (pid:3971) Sent RELEASE_CLAIM to startd on <193.49.27.11:34130>
10/21 15:35:13 (pid:3971) Match record (<193.49.27.11:34130>, 109, 0) deleted
10/21 15:35:13 (pid:3971) DaemonCore: Command received via TCP from host <193.49.27.11:36789>
10/21 15:35:13 (pid:3971) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
10/21 15:35:13 (pid:3971) Got VACATE_SERVICE from <193.49.27.11:36789>
10/21 15:35:14 (pid:3971) Sent ad to central manager for guiot@xxxxxxxxxxxxxx
10/21 15:35:14 (pid:3971) Sent ad to 1 collectors for guiot@xxxxxxxxxxxxxx
___________________________________________________________________
These jobs were submitted as user "guiot" (me), and the daemons were started as user root. Where did I go wrong with the permissions?
3rd job: this one is a bit different. One of my users had it as a shell script, and I tried to convert it into a condor_submit description file.
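A possible cause worth ruling out (an assumption; nothing in the logs above confirms it): "Permission denied" at exec time usually means the file lacks the execute bit, a directory on its path cannot be traversed by the account the job runs under, or the filesystem is mounted noexec. The sketch below covers only the first case. `check_job_script` is a hypothetical helper name, not a Condor tool.

```shell
# Hypothetical helper: report whether a job script exists and
# carries an execute bit usable by the current user.
check_job_script() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "missing: $script"
        return 2
    fi
    if [ ! -x "$script" ]; then
        echo "not executable: $script (try: chmod +x $script)"
        return 1
    fi
    echo "executable: $script"
    return 0
}
```

Running `check_job_script uname.sh` in the submit directory before condor_submit would show whether a `chmod +x` is needed. It says nothing about the account used on the execute machine, which would have to be checked there.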
___________
Original shell file :
guiot@chagall:/run_cns_30$ more refine.csh
#!/bin/csh
## results will be stored here
setenv NEWIT /place/to/store/the/results
## project path
setenv RUN /path/to/the/project
## individual run.cns is stored here
setenv RUN_CNS /place/where/run.cns/is/located
## command line
/cns_solve_1.1/intel-i686-linux_g77/bin/cns_solve < /path/to/the/project/run1/cns/protocols/refine.inp >! refine.out
touch done
guiot@chagall:/run_cns_30$
____________
1st test: submit the shell script directly:
guiot@chagall:~/tmp/TestCondor/JobPerso$ more Benjamin1.cmd
####################
#
# Test of Benjamin's program
#
####################
Universe = vanilla
Executable = /run_cns_30/refine.csh
error = Benjamin1.err
Log = Benjamin1.log
queue
guiot@chagall:~/tmp/TestCondor/JobPerso$
It runs, but only for a few seconds, and does no computation (it should last around 10 hours...).
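Since the job ends within seconds, one hedged guess is that something the script needs is not visible on the machine where it runs. This is an assumption, not something the post establishes. Benjamin1.err is the first place to look, and because the submit file has no output line, anything the script prints to stdout is not kept. The sketch below uses a hypothetical helper, `check_paths`, that reports which of the given paths exist on the host where it is run.

```shell
# Hypothetical helper: print the status of each path and
# succeed only if all of them exist on this host.
check_paths() {
    missing=0
    for p in "$@"; do
        if [ -e "$p" ]; then
            echo "ok: $p"
        else
            echo "MISSING: $p"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example call, meant to be run on the execute machine:
# check_paths /cns_solve_1.1/intel-i686-linux_g77/bin/cns_solve \
#             /path/to/the/project/run1/cns/protocols/refine.inp
```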
___________
I tried a 2nd test with this submit file :
guiot@chagall:~/tmp/TestCondor/JobPerso$ more Benjamin2.cmd
####################
#
# Test of Benjamin's program
#
####################
Universe = vanilla
Executable = /run_cns_30/refine.csh
environment = NEWIT=/place/to/store/the/results;RUN=/path/to/the/project;RUN_CNS=/place/where/run.cns/is/located
arguments = < /path/to/the/project/run1/cns/protocols/refine.inp
error = Benjamin2.err
Log = Benjamin2.log
queue
guiot@chagall:~/tmp/TestCondor/JobPerso$
Exactly the same behavior: it "runs" for a few seconds, but still produces no results.
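One detail about this second submit file: Condor hands the arguments string to the executable as plain arguments and no shell interprets it, so `<` is not a redirection there. The submit description has an `input` command for supplying standard input instead. In addition, refine.csh as shown never reads its arguments. The sketch below only illustrates what a literal `<` looks like to the receiving program; `show_args` is a hypothetical stand-in for the job.

```shell
# Hypothetical stand-in for a job: print the arguments it receives.
show_args() {
    echo "argc=$#"
    for a in "$@"; do
        echo "arg: $a"
    done
}

# The quotes keep this shell from treating "<" as a redirection,
# which mimics a launcher that passes the argument list through untouched.
show_args '<' /path/to/the/project/run1/cns/protocols/refine.inp
# prints:
#   argc=2
#   arg: <
#   arg: /path/to/the/project/run1/cns/protocols/refine.inp
```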
________________________________
So, what could be the reason I can't run these jobs on my cluster?
I've run some other jobs perfectly fine (the examples that come with the Condor install package, and the ones at http://www.usc.edu/hpcc/systems/condorv.php, in both the standard and vanilla universes), so why can't I run my OWN (useful...) jobs?
Thanks in advance for your help
Nicolas GUIOT
-----------------------------------------------
CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
Institut de Biologie Physico-Chimique
13 rue Pierre et Marie Curie
75005 PARIS - FRANCE
Tel : +33 158 41 51 70
Fax : +33 158 41 50 26
------------------------------------------------