[condor-users] Dagman script problem.
- Date: 28 Nov 2003 12:16:43 +0000
- From: Paul Wilson <p.b.wilson@xxxxxxxxx>
- Subject: [condor-users] Dagman script problem.
Hi
I'm trying out DAGMan to shorten some jobs, and I'm having problems getting
Condor to run any of my PRE/POST scripts.
The first job in the DAG runs fine, then the whole thing stops at the
first post script.
Whether it's POST A or PRE B doesn't make any difference.
The following is an example of one of the POST scripts to run between
each job.
-rwxr-xr-x 1 zhimei users 178 Nov 27 17:04 copyPreStep2.sh
#!/bin/sh
cp /home/step1/REVCON /home/step2/CONFIG
cp /home/step1/REVIVE /home/step2/REVOLD
cp /home/step1/FIELD /home/step2/FIELD
cp /home/step1/CONTROL /home/step2/CONTROL
/usr/bin/perl preStep1.pl
chmod a+w /home/zhimei/test-step/step2/*
exit 1
This script works just fine when run manually from the command line. The
perl line executes a perl script which edits the CONTROL file.
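Running cleanly by hand and succeeding from DAGMan's point of view are not the same thing: DAGMan judges a PRE or POST script by its exit status, where 0 means success and anything else means failure. A minimal sketch of how to see that status from the shell follows. `run_and_report` is a hypothetical helper written for this illustration, not part of Condor.

```shell
#!/bin/sh
# run_and_report is a hypothetical helper: it runs the given command and
# prints its exit status, the value DAGMan uses to decide pass or fail.
run_and_report() {
    "$@"
    rc=$?
    echo "exit status: $rc"
    return "$rc"
}

# Stand-in for a script whose last line is "exit 1", like copyPreStep2.sh.
run_and_report sh -c 'exit 1' || echo "a non-zero status counts as a failed script"
```

Calling `run_and_report /home/step2/copyPreStep2.sh` in the same way would show what the script itself returns, even when all of its commands appear to work.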
Condor, however, cannot run it. Here is the error from the dagman.out
file. Job A runs fine, then there's an error at the Job A POST script:
11/27 16:33:13 Submitting Job A ...
11/27 16:33:14 assigned Condor ID (4597.0.0)
11/27 16:33:15 Event: ULOG_SUBMIT for Job A (4597.0.0)
11/27 16:33:15 0/2 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
11/27 16:33:40 Event: ULOG_EXECUTE for Job A (4597.0.0)
11/27 16:33:50 Event: ULOG_IMAGE_SIZE for Job A (4597.0.0)
11/27 16:33:55 Event: ULOG_JOB_TERMINATED for Job A (4597.0.0)
11/27 16:33:55 Job A completed successfully.
11/27 16:54:17 Running POST script of Job A...
11/27 16:54:17 0/2 done, 0 failed, 0 submitted, 0 ready, 0 pre, 1 post
11/27 16:54:22 Event: ULOG_POST_SCRIPT_TERMINATED for Job A (4599.0.0)
11/27 16:54:22 POST Script of Job A failed with status 1
11/27 16:54:22 0/2 done, 1 failed, 0 submitted, 0 ready, 0 pre, 0 post
11/27 16:54:22 ERROR: the following job(s) failed:
11/27 16:54:22 ---------------------- Job ----------------------
11/27 16:54:22 Node Name: A
11/27 16:54:22 NodeID: 0
11/27 16:54:22 Node Status: STATUS_ERROR
11/27 16:54:22 Error: POST Script failed with status 1
11/27 16:54:22 Job Submit File: /home/step1/dl_1.sub
11/27 16:54:22 POST Script: /home/step2/copyPreStep2.sh
11/27 16:54:22 Condor Job ID: (4599.0.0)
11/27 16:54:22 Q_PARENTS: <END>
11/27 16:54:22 Q_WAITING: <END>
11/27 16:54:22 Q_CHILDREN: 1, <END>
11/27 16:54:22 --------------------------------------- <END>
11/27 16:54:22 Writing Rescue DAG file...
11/27 16:54:22 **** condor_scheduniv_exec.4598.0 (condor_DAGMAN) EXITING WITH STATUS 1
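The "failed with status 1" lines in this log match the final `exit 1` in copyPreStep2.sh: the script returns 1 on every run, whether or not the copies worked. Below is a sketch of the copy step that returns non-zero only when a command actually fails. `copy_step_inputs` is a hypothetical helper name, and the directories are passed as arguments purely for illustration.

```shell
#!/bin/sh
# copy_step_inputs is a hypothetical helper: it copies the step 1 results
# into the step 2 directory and returns non-zero as soon as any copy fails.
copy_step_inputs() {
    src=$1
    dst=$2
    cp "$src/REVCON"  "$dst/CONFIG"  || return 1
    cp "$src/REVIVE"  "$dst/REVOLD"  || return 1
    cp "$src/FIELD"   "$dst/FIELD"   || return 1
    cp "$src/CONTROL" "$dst/CONTROL" || return 1
    return 0
}

# In the POST script itself, the closing lines would then look like:
#   copy_step_inputs /home/step1 /home/step2 || exit 1
#   exit 0
```

The Perl step could be guarded the same way (`... || exit 1`). The relative path `preStep1.pl` is resolved against whatever directory the script is started from, which under DAGMan may differ from the directory used for the manual test, so an absolute path is the safer assumption.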
Here's the dagman submit file:
Job A /home/zhimei/test-step/step1/dl_1.sub
Job B /home/zhimei/test-step/step2/dl_2.sub
Script POST A /home/zhimei/test-step/step2/copyPreStep2.sh
Script POST B /home/zhimei/test-step/step3/copyPreStep3.sh
PARENT A CHILD B
Here's an example submit file:
Universe = vanilla
Requirements = OpSys == "WINNT50" && ARCH == "INTEL"
transfer_input_files = CONFIG FIELD CONTROL
executable = /home/zhimei/bin/dlpoly_new.exe
Error = dlDAG.err.$(cluster)
Output = dlDAG.out.$(cluster)
Log = /home/DlDAG.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
TransferFiles = ON_EXIT
Initialdir = /home/step1
notification = NEVER
queue
I think I'm missing something fundamental here about Condor, DAGMan, and
how to get my scripts to run under Condor.
Any ideas?
Paul.