
[HTCondor-users] Condor_qsub condor_q is not working in "sentinel scripts"



Dear all,

I have been trying to combine the brain imaging software FSL with Condor. I've downloaded condor_qsub and fsl_sub from the NeuroDebian repository and am now trying to get them to run on a Mac (Condor installed using MacPorts).

I've gotten far enough (after changing the scripts to run on a Mac) that all jobs are queued and the first jobs run. The other jobs never run. However, if I manually release them in the correct order, they run fine, and the final result is the same as running the FSL program directly without Condor.

In trying to track down the problem further, I've noticed the following: the condor_qsub script creates "sentinel" files, e.g. cluster4158_sentinel.2gVTg6RTrthSUL. An example is shown below. Within these files, condor_q is called to check whether the first job has finished before the second one is started, and so on. Using 'echo' commands I can see that the sentinel is running; however, the condor_q pipeline always yields a count of '0', and a plain condor_q call prints nothing.

Surprisingly, when I run the sentinel script directly from the command line, condor_q produces the expected output, just as it does when I type condor_q at the prompt. I am really stuck now because I have no idea what to test next: what could cause the script itself to run when it is started by Condor, while condor_q inside it produces no output?

I'd be very grateful for any pointers you could give me.
Many thanks
Jacquie

Here is the script:
#!/bin/sh
hold_jids=" 4157"
dep_job="4158"
clean_error() {
printf "$1\n"
condor_rm $dep_job
exit 1
}
echo "in Sentinel sentinelfile" >> /tmp/sentinel_$dep_job.log

# as long as there are relevant job in the queue wait and try again
while [ $(condor_q -long -attributes Owner $hold_jids | grep "jacquelinescholl" | wc -l) -ge 0 ]; do
  echo "In while loop" >> /tmp/sentinel_$dep_job.log
  storedval=$(condor_q -long -attributes Owner $hold_jids | grep "jacquelinescholl" | wc -l)
  ownval=$(condor_q -long -attributes Owner $dep_job | grep "jacquelinescholl" | wc -l)
  tvals=$(condor_q)
  echo "JIDS is $hold_jids" >> /tmp/sentinel_$dep_job.log
  echo "Ownval is $ownval" >> /tmp/sentinel_$dep_job.log
  echo "StoreVal sleep is $storedval" >> /tmp/sentinel_$dep_job.log
  echo "$tvals" >> /tmp/sentinel_$dep_job.log
  sleep 10
done

Here is the script output when it runs automatically, started by condor:
in Sentinel sentinelfile
In while loop
JIDS is  4157
Ownval is     0
StoreVal sleep is     0

In while loop
JIDS is  4157
Ownval is     0
StoreVal sleep is     0

And it continues like this until aborted.
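
The log above records only the line count, so any error message condor_q writes to stderr is lost. As a possible next test, here is a minimal sketch of a helper that could be added to the sentinel to record the environment, exit status, stdout and stderr of a command. The name log_diag and its variables are hypothetical (they are not part of condor_qsub), and logging PATH and CONDOR_CONFIG is only an assumption about what might differ between the two runs.

```shell
#!/bin/sh
# Hypothetical diagnostic helper (not part of condor_qsub).
# Usage: log_diag LOGFILE COMMAND [ARGS...]
# Appends PATH, CONDOR_CONFIG, the command line, its exit status,
# stdout and stderr to LOGFILE, then returns the command's exit status.
log_diag() {
    diag_log=$1
    shift
    # Temporary file that receives the command's stderr.
    diag_err=$(mktemp "${TMPDIR:-/tmp}/diag.XXXXXX") || return 1
    {
        echo "PATH=$PATH"
        echo "CONDOR_CONFIG=${CONDOR_CONFIG:-<unset>}"
        echo "command: $*"
    } >> "$diag_log"
    # Capture stdout in a variable and stderr in the temporary file.
    diag_out=$("$@" 2> "$diag_err")
    diag_rc=$?
    {
        echo "exit status: $diag_rc"
        echo "stdout: $diag_out"
        echo "stderr: $(cat "$diag_err")"
    } >> "$diag_log"
    rm -f "$diag_err"
    return "$diag_rc"
}

# Example call from inside the sentinel, using the same query as the script:
# log_diag /tmp/sentinel_$dep_job.log condor_q -long -attributes Owner $hold_jids
```

Comparing the logged PATH, CONDOR_CONFIG and stderr between the run started by Condor and the run from the command line might show whether condor_q is failing rather than genuinely finding no jobs.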

When I start it from the command line (by going into the folder and typing ./cluster4158_sentinel.2gVTg6RrthSUL), it says:
in Sentinel sentinelfile
In while loop
JIDS is  4157
Ownval is     1
StoreVal sleep is     1

-- Submitter: users-mac-pro-2.local : <127.0.0.1:51199> : users-mac-pro-2.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
4157.0   jacquelinescho  1/29 15:59   0+00:01:00 R  0   1.2  bash /usr/local/fs
4158.0   jacquelinescho  1/29 15:59   0+00:00:00 H  0   1.2  bash /usr/local/fs
4159.0   jacquelinescho  1/29 15:59   0+00:00:55 R  0   0.0  cluster4158_sentin
4160.0   jacquelinescho  1/29 15:59   0+00:00:00 H  0   1.2  sh /data/jscholl/S
4161.0   jacquelinescho  1/29 15:59   0+00:00:55 R  0   0.0  cluster4160_sentin
4162.0   jacquelinescho  1/29 15:59   0+00:00:00 H  0   1.2  bash /usr/local/fs
4163.0   jacquelinescho  1/29 15:59   0+00:00:55 R  0   0.0  cluster4162_sentin
4164.0   jacquelinescho  1/29 15:59   0+00:00:00 H  0   1.2  bash /usr/local/fs
4165.0   jacquelinescho  1/29 15:59   0+00:00:55 R  0   0.0  cluster4164_sentin
4166.0   jacquelinescho  1/29 15:59   0+00:00:00 H  0   1.2  bash /usr/local/fs
4167.0   jacquelinescho  1/29 15:59   0+00:00:55 R  0   0.0  cluster4166_sentin
4168.0   jacquelinescho  1/29 15:59   0+00:00:00 H  0   1.2  bash /usr/local/fs
4169.0   jacquelinescho  1/29 15:59   0+00:00:55 R  0   0.0  cluster4168_sentin

[I've cut off the rest here]