Re: [Condor-users] DAG Windows Problem : Error: Unable to monitor node job log file
- Date: Tue, 20 Jul 2010 12:35:29 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] DAG Windows Problem : Error: Unable to monitor node job log file
On Tue, 20 Jul 2010, Sassy Natan wrote:
One initial question: what version of Condor are you running?
Does anyone know how to overcome this?
I have a simple DAG job file that looks like this:
JOB A A.job
JOB B B.job
PARENT A CHILD B
Job A and Job B can run on the Windows Condor Cluster without any problem.
Here is what A.job looks like:
*universe = vanilla
transfer_files=always
This looks like you are getting should_transfer_files and
when_to_transfer_output confused. I think you want:
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
I don't think this is the cause of the DAGMan problem, but you might as
well fix it...
requirements =
executable = U:\runA.bat
Arguments =
output =A.out
log = A.log
error = A.err
notification = Error
initialdir = U:
run_as_owner = True
load_profile = True
queue 4
*
Now when running the DAG job using condor_submit_dag.exe DAG.job I get the
following error:
7/20/10 10:20:25 WARNING: ProcessId not confirmed unique
You can ignore this warning.
07/20/10 10:20:25 Bootstrapping...
07/20/10 10:20:25 Number of pre-completed nodes: 0
07/20/10 10:20:25 Registering condor_event_timer...
07/20/10 10:20:26 Sleeping for one second for log file consistency
07/20/10 10:20:27 DAGMan::Job:8001:ERROR: Unable to monitor log file for
node A|ReadMultipleUserLogs:9004:Error getting file ID in
monitorLogFile()|ReadMultipleUserLogs:9004:Error initializing log file
U:\A.log|MultiLogFiles:9001:Error (2, No such file or directory) opening
file U:\A.log for creation or truncation
This is the real problem.
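The "2" in that message is the operating system's "No such file or directory" code (ENOENT in errno terms). As a rough way to see whether the same failure shows up outside DAGMan, the sketch below tries to open a log path for creation and reports the errno it gets back. The function name probe_log_path is made up for illustration, and running it from your own login session only tells you what your account sees, which may differ from what the DAGMan process sees.

```python
import errno
import os


def probe_log_path(log_path):
    """Try to open log_path for creation/append.

    Returns (True, None) on success, or (False, errno_value) on failure.
    Note: on success this leaves a (possibly empty) file behind.
    """
    try:
        fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    except OSError as exc:
        return False, exc.errno
    os.close(fd)
    return True, None


if __name__ == "__main__":
    # Path taken from the error message above.
    ok, err = probe_log_path(r"U:\A.log")
    if ok:
        print("log path can be opened for creation")
    elif err == errno.ENOENT:
        print("errno 2: the drive or directory is not visible from here")
    else:
        print("open failed with errno", err)
```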
07/20/10 10:20:27 Of 2 nodes total:
07/20/10 10:20:27 Done Pre Queued Post Ready Un-Ready Failed
07/20/10 10:20:27 === === === === === === ===
07/20/10 10:20:27 0 0 0 0 0 2 0
07/20/10 10:20:27 ERROR: a cycle exists in the DAG
DAGMan just thinks a cycle exists because of the previous error.
07/20/10 10:20:27 ---------------------- Job ----------------------
07/20/10 10:20:27 Node Name: A
07/20/10 10:20:27 Noop: false
07/20/10 10:20:27 NodeID: 0
07/20/10 10:20:27 Node Status: STATUS_ERROR
07/20/10 10:20:27 Node return val: -1003
07/20/10 10:20:27 Error: Unable to monitor node job log file
07/20/10 10:20:27 Job Submit File: A.job
07/20/10 10:20:27 Condor Job ID: [not yet submitted]
07/20/10 10:20:27 Q_PARENTS: <END>
07/20/10 10:20:27 Q_WAITING: <END>
07/20/10 10:20:27 Q_CHILDREN: B, <END>
07/20/10 10:20:27 ---------------------- Job ----------------------
07/20/10 10:20:27 Node Name: B
07/20/10 10:20:27 Noop: false
07/20/10 10:20:27 NodeID: 1
07/20/10 10:20:27 Node Status: STATUS_READY
07/20/10 10:20:27 Node return val: -1
07/20/10 10:20:27 Job Submit File: B.job
07/20/10 10:20:27 Condor Job ID: [not yet submitted]
07/20/10 10:20:27 Q_PARENTS: A, <END>
07/20/10 10:20:27 Q_WAITING: A, <END>
07/20/10 10:20:27 Q_CHILDREN: <END>
07/20/10 10:20:27 --------------------------------------- <END>
07/20/10 10:20:27 Aborting DAG...
07/20/10 10:20:27 Writing Rescue DAG to dag.dag.rescue001...
07/20/10 10:20:27 Note: 0 total job deferrals because of -MaxJobs limit (0)
07/20/10 10:20:27 Note: 0 total job deferrals because of -MaxIdle limit (0)
07/20/10 10:20:27 Note: 0 total job deferrals because of node category
throttles
07/20/10 10:20:27 Note: 0 total PRE script deferrals because of -MaxPre
limit (0)
07/20/10 10:20:27 Note: 0 total POST script deferrals because of -MaxPost
limit (0)
I found this: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=831
But it doesn't say much. Can someone please drop a comment on this?
This job is part of a Hadoop cluster that I'm trying to build.
Here are a couple of things to try, just to help diagnose the problem:
1) Create the log files for your jobs before you start the DAG. (You
shouldn't have to do this, but given the error message I'd like to see
whether things work if you do it.) You can just create zero-size files,
or whatever is easiest.
2) Try removing the initialdir specification in the submit files, and just
submit the DAG from the U: directory. I don't think this will make any
difference, but it would be interesting to find out for sure.
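For suggestion 1, a minimal sketch of one way to create the zero-size log files before running condor_submit_dag. It reads the "log" and "initialdir" lines out of each submit file and touches the resulting path. The helper names (log_path_from_submit, precreate_logs) are hypothetical, the parsing handles only simple "key = value" lines, and it assumes a relative log path is taken relative to initialdir when one is set.

```python
import os
import re

# Matches simple "key = value" lines such as "log = A.log".
_ASSIGN = re.compile(r"^\s*(\w+)\s*=\s*(.*?)\s*$")


def log_path_from_submit(submit_file):
    """Return the user log path named in a submit file, or None if absent."""
    settings = {}
    with open(submit_file) as handle:
        for line in handle:
            match = _ASSIGN.match(line)
            if match:
                # Submit file keys are treated as case-insensitive here.
                settings[match.group(1).lower()] = match.group(2)
    log = settings.get("log")
    if not log:
        return None
    # Assumption: a relative log path is resolved against initialdir.
    return os.path.join(settings.get("initialdir", ""), log)


def precreate_logs(submit_files):
    """Create each submit file's log as an empty file if it is missing."""
    created = []
    for submit_file in submit_files:
        path = log_path_from_submit(submit_file)
        if path is not None:
            # Append mode creates the file without truncating an existing log.
            with open(path, "a"):
                pass
            created.append(path)
    return created


if __name__ == "__main__":
    # Submit file names taken from the DAG file quoted above.
    print(precreate_logs(["A.job", "B.job"]))
```

If creating the files fails here too, that points at the drive or directory rather than at DAGMan itself.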
Kent Wenger
Condor Team