[Condor-users] Fwd: DAG Windows Problem : Error: Unable to monitor node job log file

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Does DAG only work in a Linux environment?

---------- Forwarded message ----------
From: Sassy Natan <sassyn@xxxxxxxxx>
Date: Tue, Jul 20, 2010 at 10:35 AM
Subject: DAG Windows Problem : Error: Unable to monitor node job log file
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>

Hi All,

Does anyone know how to overcome this?

I have a simple DAG file that looks like this:

JOB A A.job
JOB B B.job

PARENT A CHILD B

Job A and Job B can run on the Windows Condor Cluster without any problem.

Here is what A.job looks like:

universe = vanilla
transfer_files=always
requirements =
executable = U:\runA.bat
Arguments =
output =A.out
log = A.log
error = A.err
notification = Error
initialdir = U:
run_as_owner = True
load_profile = True
queue 4

Now, when running the DAG job using condor_submit_dag.exe DAG.job, I get the following error:

7/20/10 10:20:25 WARNING: ProcessId not confirmed unique
07/20/10 10:20:25 Bootstrapping...
07/20/10 10:20:25 Number of pre-completed nodes: 0
07/20/10 10:20:25 Registering condor_event_timer...
07/20/10 10:20:26 Sleeping for one second for log file consistency
07/20/10 10:20:27 DAGMan::Job:8001:ERROR: Unable to monitor log file for node A|ReadMultipleUserLogs:9004:Error getting file ID in monitorLogFile()|ReadMultipleUserLogs:9004:Error initializing log file U:\A.log|MultiLogFiles:9001:Error (2, No such file or directory) opening file U:\A.log for creation or truncation
07/20/10 10:20:27 Of 2 nodes total:
07/20/10 10:20:27 Done     Pre   Queued    Post   Ready   Un-Ready   Failed
07/20/10 10:20:27   ===     ===      ===     ===     ===        ===      ===
07/20/10 10:20:27     0       0        0       0       0          2        0
07/20/10 10:20:27 ERROR: a cycle exists in the DAG
07/20/10 10:20:27 ---------------------- Job ----------------------
07/20/10 10:20:27       Node Name: A
07/20/10 10:20:27            Noop: false
07/20/10 10:20:27          NodeID: 0
07/20/10 10:20:27     Node Status: STATUS_ERROR
07/20/10 10:20:27 Node return val: -1003
07/20/10 10:20:27           Error: Unable to monitor node job log file
07/20/10 10:20:27 Job Submit File: A.job
07/20/10 10:20:27   Condor Job ID: [not yet submitted]
07/20/10 10:20:27       Q_PARENTS: <END>
07/20/10 10:20:27       Q_WAITING: <END>
07/20/10 10:20:27      Q_CHILDREN: B, <END>
07/20/10 10:20:27 ---------------------- Job ----------------------
07/20/10 10:20:27       Node Name: B
07/20/10 10:20:27            Noop: false
07/20/10 10:20:27          NodeID: 1
07/20/10 10:20:27     Node Status: STATUS_READY
07/20/10 10:20:27 Node return val: -1
07/20/10 10:20:27 Job Submit File: B.job
07/20/10 10:20:27   Condor Job ID: [not yet submitted]
07/20/10 10:20:27       Q_PARENTS: A, <END>
07/20/10 10:20:27       Q_WAITING: A, <END>
07/20/10 10:20:27      Q_CHILDREN: <END>
07/20/10 10:20:27 --------------------------------------- <END>
07/20/10 10:20:27 Aborting DAG...
07/20/10 10:20:27 Writing Rescue DAG to dag.dag.rescue001...
07/20/10 10:20:27 Note: 0 total job deferrals because of -MaxJobs limit (0)
07/20/10 10:20:27 Note: 0 total job deferrals because of -MaxIdle limit (0)
07/20/10 10:20:27 Note: 0 total job deferrals because of node category throttles
07/20/10 10:20:27 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
07/20/10 10:20:27 Note: 0 total POST script deferrals because of -MaxPost limit (0)
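
The line that matters in the output above is the MultiLogFiles error: errno 2 while opening U:\A.log for creation or truncation. As a minimal sketch (not part of Condor; the helper names below are hypothetical), the following Python script pulls the errno and path out of that line and checks whether the directory holding the log is reachable and writable from the session that runs it. One thing worth checking, as an assumption the log does not confirm, is whether the U: drive is visible to the account and session under which DAGMan runs.

```python
import os
import re

# Error line copied from the DAGMan output above.
ERROR_LINE = (
    r"DAGMan::Job:8001:ERROR: Unable to monitor log file for node A|"
    r"ReadMultipleUserLogs:9004:Error getting file ID in monitorLogFile()|"
    r"ReadMultipleUserLogs:9004:Error initializing log file U:\A.log|"
    r"MultiLogFiles:9001:Error (2, No such file or directory) opening file "
    r"U:\A.log for creation or truncation"
)


def parse_open_failure(line):
    """Return (errno, message, path) from a MultiLogFiles open error, or None."""
    pattern = (
        r"Error \((\d+), ([^)]*)\) opening file (.+?) "
        r"for creation or truncation"
    )
    match = re.search(pattern, line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2), match.group(3)


def can_create_in_directory_of(path):
    """Hypothetical helper: True if the directory holding `path` exists
    and is writable for the current process."""
    directory = os.path.dirname(path) or "."
    return os.path.isdir(directory) and os.access(directory, os.W_OK)


if __name__ == "__main__":
    result = parse_open_failure(ERROR_LINE)
    print(result)
    if result is not None:
        # Only meaningful when run on the Windows submit machine itself.
        print("directory reachable:", can_create_in_directory_of(result[2]))
```

Running this from an ordinary logon session only shows what that session can see; a result that differs from what DAGMan reports would suggest the two run in different contexts.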

I found this: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=831

But it doesn't say much. Can someone please drop a comment on this?

This job is part of a Hadoop cluster that I'm trying to build.

Thank you

Sassy
