Hello everybody. I am having some trouble submitting a .dag file to Condor through Python. What I want to do is submit a job from a machine without a shared file system, and I am using Condor's SOAP service to do it. If I submit the .dag from the console (condor_submit_dag) everything runs fine, but when I run my Python script all the files are transferred correctly, yet in my dagman.out I get the error "failed while reading from pipe."

I have noticed two strange things. The first is that all the files sent through SOAP end up with unusual permissions:

-rw------- 1 condor condor 7180 2013-03-24 10:00 bucle
-rw-r--r-- 1 condor condor    0 2013-03-24 10:06 _bucleA.log
-rw------- 1 condor condor  141 2013-03-24 10:00 bucleA.submit
-rw------- 1 condor condor  141 2013-03-24 10:00 bucleB.submit
-rw------- 1 condor condor  141 2013-03-24 10:00 bucleC.submit
-rw------- 1 condor condor  118 2013-03-24 10:00 bucle.dag
-rw-r--r-- 1 condor condor  520 2013-03-24 10:06 bucle.dagman.log
-rw-r--r-- 1 condor condor 9402 2013-03-24 10:06 bucle.dagman.out
-rw------- 1 condor condor   29 2013-03-24 10:06 bucle.dagman.stdout
-rw-r--r-- 1 condor condor  338 2013-03-24 10:06 bucle.dag.rescue001
-rw------- 1 condor condor  141 2013-03-24 10:00 bucleD.submit
-rw------- 1 condor condor    0 2013-03-24 10:05 bucle.stderr

Notice that bucle (the executable) does not have the execute permission. That is probably expected, since there is only one way to send files through SOAP.

The second strange thing: I ran condor_q -long to compare (with meld) the job launched from condor_submit_dag against the one launched from my Python script, and I did not find any significant differences.

condor_submit_dag:

Arguments = "-f -l .
-Debug 3 -Lockfile bucle.lock -AutoRescue 1 -DoRescueFrom 0 -Dag bucle.dag -CsdVersion $CondorVersion:' '7.4.4' 'Oct' '14' '2010' 'BuildID:' '279383' '$"
BufferBlockSize = 32768
BufferSize = 524288
ClusterId = 752
Cmd = "/opt/condor/current/bin/condor_dagman"
CommittedTime = 0
CompletionDate = 1364115965
CondorPlatform = "$CondorPlatform: I386-LINUX_RHEL5 $"
CondorVersion = "$CondorVersion: 7.4.4 Oct 14 2010 BuildID: 279383 $"
CoreSize = -1
CumulativeSuspensionTime = 0
CurrentHosts = 0
EnteredCurrentStatus = 1364115965
Env = "_CONDOR_DAGMAN_LOG=bucle.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0; DAGMAN_PROHIBIT_MULTI_JOBS=True"
Err = "bucle.stderr"
ExitBySignal = FALSE
ExitCode = 1
ExitStatus = 1
FilesRetrieved = FALSE
getenv = TRUE
GlobalJobId = "c-head.micluster.com#752.0#1364115625"
ImageSize = 0
ImageSize_RAW = 0
In = "/dev/null"
Iwd = "/home/condor/hosts/c-head/spool/cluster752.proc0.subproc0"
JobCurrentStartDate = 1364115903
JobFinishedHookDone = 1364115965
JobNotification = 0
JobPrio = 0
JobRunCount = 1
JobStartDate = 1364115903
JobStatus = 4
JobUniverse = 7
KillSig = "SIGTERM"
LastJobStatus = 2
LastSuspensionTime = 0
LeaveJobInQueue = FilesRetrieved =?= FALSE
LocalSysCpu = 0.000000
LocalUserCpu = 0.000000
MaxHosts = 1
MinHosts = 1
NiceUser = FALSE
NumCkpts = 0
NumCkpts_RAW = 0
NumJobStarts = 1
NumRestarts = 0
NumSystemHolds = 0
OnExitRemove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >= 0 && ExitCode <= 2))
OrigMaxHosts = 1
Out = "bucle.dagman.stdout"
Owner = "usuario"
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
ProcId = 0
QDate = 1364115625
RemoteSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteWallClockTime = 62.000000
Requirements = TRUE
RootDir = "/"
ServerTime = 1364117955
ShouldTransferFiles = "YES"
StageInFinish = 1
StageInStart = 1
TotalSuspensions = 0
TransferFiles = "ONEXIT"
TransferInput = "bucleD.submit,bucle.dag,bucle,bucleA.submit,bucleB.submit,bucleC.submit"
UserLog = "bucle.dagman.log"
User = "usuario@xxxxxxxxxxxxx"
WantCheckpoint = FALSE
WantRemoteIO = TRUE
WantRemoteSyscalls = FALSE
WhenToTransferOutput = "ON_EXIT"

python:

Arguments = "-f -l . -Debug 3 -Lockfile bucle.lock -AutoRescue 1 -DoRescueFrom 0 -Dag bucle.dag -CsdVersion $CondorVersion:' '7.4.4' 'Oct' '14' '2010' 'BuildID:' '279383' '$"
BufferBlockSize = 32768
BufferSize = 524288
ClusterId = 752
Cmd = "/opt/condor/current/bin/condor_dagman"
CommittedTime = 0
CompletionDate = 0
CondorPlatform = "$CondorPlatform: I386-LINUX_RHEL5 $"
CondorVersion = "$CondorVersion: 7.4.4 Oct 14 2010 BuildID: 279383 $"
CoreSize = -1
CumulativeSuspensionTime = 0
CurrentHosts = 1
EnteredCurrentStatus = 1364115902
Env = "_CONDOR_DAGMAN_LOG=bucle.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0; DAGMAN_PROHIBIT_MULTI_JOBS=True"
Err = "bucle.stderr"
ExitBySignal = FALSE
ExitStatus = 0
FilesRetrieved = FALSE
getenv = TRUE
GlobalJobId = "c-head.micluster.com#752.0#1364115625"
ImageSize = 0
ImageSize_RAW = 0
In = "/dev/null"
Iwd = "/home/condor/hosts/c-head/spool/cluster752.proc0.subproc0"
JobCurrentStartDate = 1364115903
JobNotification = 0
JobPrio = 0
JobRunCount = 1
JobStartDate = 1364115903
JobStatus = 2
JobUniverse = 7
KillSig = "SIGTERM"
LastJobStatus = 1
LastSuspensionTime = 0
LeaveJobInQueue = FilesRetrieved =?= FALSE
LocalSysCpu = 0.000000
LocalUserCpu = 0.000000
MaxHosts = 1
MinHosts = 1
NiceUser = FALSE
NumCkpts = 0
NumCkpts_RAW = 0
NumJobStarts = 1
NumRestarts = 0
NumSystemHolds = 0
OnExitRemove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >= 0 && ExitCode <= 2))
OrigMaxHosts = 1
Out = "bucle.dagman.stdout"
Owner = "usuario"
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
ProcId = 0
QDate = 1364115625
RemoteSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteWallClockTime = 0.000000
Requirements = TRUE
RootDir = "/"
ServerTime = 1364115938
ShadowBday = 1364115903
ShouldTransferFiles = "YES"
StageInFinish = 1
StageInStart = 1
TotalSuspensions = 0
TransferFiles = "ONEXIT"
TransferInput =
"bucleD.submit,bucle.dag,bucle,bucleA.submit,bucleB.submit,bucleC.submit"
UserLog = "bucle.dagman.log"
User = "usuario@xxxxxxxxxxxxx"
WantCheckpoint = FALSE
WantRemoteIO = TRUE
WantRemoteSyscalls = FALSE
WhenToTransferOutput = "ON_EXIT"

Here is the bucle.dagman.out:

03/24 10:05:03 ******************************************************
03/24 10:05:03 ** condor_scheduniv_exec.752.0 (CONDOR_DAGMAN) STARTING UP
03/24 10:05:03 ** /exports/condor/condor-7.4.4/bin/condor_dagman
03/24 10:05:03 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
03/24 10:05:03 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
03/24 10:05:03 ** $CondorVersion: 7.4.4 Oct 14 2010 BuildID: 279383 $
03/24 10:05:03 ** $CondorPlatform: I386-LINUX_RHEL5 $
03/24 10:05:03 ** PID = 2174
03/24 10:05:03 ** Log last touched time unavailable (No such file or directory)
03/24 10:05:03 ******************************************************
03/24 10:05:03 Using config source: /home/condor/condor_config
03/24 10:05:03 Using local config sources:
03/24 10:05:03    /opt/condor/current/etc/condor_config.local
03/24 10:05:03    /opt/condor/etc/condor_config.cluster
03/24 10:05:03    /opt/condor/etc/condor_config.c-head
03/24 10:05:03 DaemonCore: Command Socket at <192.168.1.20:9320>
03/24 10:05:03 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880
03/24 10:05:03 DAGMAN_DEBUG_CACHE_ENABLE setting: False
03/24 10:05:03 DAGMAN_SUBMIT_DELAY setting: 0
03/24 10:05:03 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
03/24 10:05:03 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
03/24 10:05:03 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
03/24 10:05:03 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5
03/24 10:05:03 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
03/24 10:05:03 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
03/24 10:05:03 DAGMAN_RETRY_NODE_FIRST setting: 0
03/24 10:05:03 DAGMAN_MAX_JOBS_IDLE setting: 0
03/24 10:05:03 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
03/24 10:05:03 DAGMAN_MUNGE_NODE_NAMES
setting: 1
03/24 10:05:03 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
03/24 10:05:03 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
03/24 10:05:03 DAGMAN_ABORT_DUPLICATES setting: 1
03/24 10:05:03 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
03/24 10:05:03 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
03/24 10:05:03 DAGMAN_AUTO_RESCUE setting: 1
03/24 10:05:03 DAGMAN_MAX_RESCUE_NUM setting: 100
03/24 10:05:03 DAGMAN_DEFAULT_NODE_LOG setting: null
03/24 10:05:03 ALL_DEBUG setting:
03/24 10:05:03 DAGMAN_DEBUG setting:
03/24 10:05:03 argv[0] == "condor_scheduniv_exec.752.0"
03/24 10:05:03 argv[1] == "-Debug"
03/24 10:05:03 argv[2] == "3"
03/24 10:05:03 argv[3] == "-Lockfile"
03/24 10:05:03 argv[4] == "bucle.lock"
03/24 10:05:03 argv[5] == "-AutoRescue"
03/24 10:05:03 argv[6] == "1"
03/24 10:05:03 argv[7] == "-DoRescueFrom"
03/24 10:05:03 argv[8] == "0"
03/24 10:05:03 argv[9] == "-Dag"
03/24 10:05:03 argv[10] == "bucle.dag"
03/24 10:05:03 argv[11] == "-CsdVersion"
03/24 10:05:03 argv[12] == "$CondorVersion: 7.4.4 Oct 14 2010 BuildID: 279383 $"
03/24 10:05:03 Default node log file is: </home/condor/hosts/c-head/spool/cluster752.proc0.subproc0/bucle.dag.nodes.log>
03/24 10:05:03 DAG Lockfile will be written to bucle.lock
03/24 10:05:03 DAG Input file is bucle.dag
03/24 10:05:03 Parsing 1 dagfiles
03/24 10:05:03 Parsing bucle.dag ...
03/24 10:05:03 Dag contains 4 total jobs
03/24 10:05:03 Sleeping for 12 seconds to ensure ProcessId uniqueness
03/24 10:05:15 Bootstrapping...
03/24 10:05:15 Number of pre-completed nodes: 0
03/24 10:05:15 Registering condor_event_timer...
03/24 10:05:16 Sleeping for one second for log file consistency
03/24 10:05:17 Submitting Condor Node A job(s)...
03/24 10:05:17 submitting: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:17 failed while reading from pipe.
03/24 10:05:17 Read so far:
03/24 10:05:17 ERROR: submit attempt failed
03/24 10:05:17 submit command was: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:17 Job submit try 1/6 failed, will try again in >= 1 second.
03/24 10:05:17 Of 4 nodes total:
03/24 10:05:17  Done   Pre   Queued   Post   Ready   Un-Ready   Failed
03/24 10:05:17   ===   ===      ===    ===     ===        ===      ===
03/24 10:05:17     0     0        0      0       1          3        0
03/24 10:05:22 Sleeping for one second for log file consistency
03/24 10:05:23 Submitting Condor Node A job(s)...
03/24 10:05:23 submitting: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:23 failed while reading from pipe.
03/24 10:05:23 Read so far:
03/24 10:05:23 ERROR: submit attempt failed
03/24 10:05:23 submit command was: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:23 Job submit try 2/6 failed, will try again in >= 2 seconds.
03/24 10:05:28 Sleeping for one second for log file consistency
03/24 10:05:29 Submitting Condor Node A job(s)...
03/24 10:05:29 submitting: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:29 failed while reading from pipe.
03/24 10:05:29 Read so far:
03/24 10:05:29 ERROR: submit attempt failed
03/24 10:05:29 submit command was: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:29 Job submit try 3/6 failed, will try again in >= 4 seconds.
03/24 10:05:34 Sleeping for one second for log file consistency
03/24 10:05:35 Submitting Condor Node A job(s)...
03/24 10:05:35 submitting: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:35 failed while reading from pipe.
03/24 10:05:35 Read so far:
03/24 10:05:35 ERROR: submit attempt failed
03/24 10:05:35 submit command was: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:35 Job submit try 4/6 failed, will try again in >= 8 seconds.
03/24 10:05:46 Sleeping for one second for log file consistency
03/24 10:05:47 Submitting Condor Node A job(s)...
03/24 10:05:47 submitting: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:47 failed while reading from pipe.
03/24 10:05:47 Read so far:
03/24 10:05:47 ERROR: submit attempt failed
03/24 10:05:47 submit command was: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:47 Job submit try 5/6 failed, will try again in >= 16 seconds.
03/24 10:06:04 Sleeping for one second for log file consistency
03/24 10:06:05 Submitting Condor Node A job(s)...
03/24 10:06:05 submitting: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:06:05 failed while reading from pipe.
03/24 10:06:05 Read so far:
03/24 10:06:05 ERROR: submit attempt failed
03/24 10:06:05 submit command was: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:06:05 Job submit failed after 6 tries.
03/24 10:06:05 Shortcutting node A retries because of submit failure(s)
03/24 10:06:05 Of 4 nodes total:
03/24 10:06:05  Done   Pre   Queued   Post   Ready   Un-Ready   Failed
03/24 10:06:05   ===   ===      ===    ===     ===        ===      ===
03/24 10:06:05     0     0        0      0       0          3        1
03/24 10:06:05 ERROR: the following job(s) failed:
03/24 10:06:05 ---------------------- Job ----------------------
03/24 10:06:05       Node Name: A
03/24 10:06:05          NodeID: 0
03/24 10:06:05     Node Status: STATUS_ERROR
03/24 10:06:05 Node return val: -1
03/24 10:06:05           Error: Job submit failed
03/24 10:06:05 Job Submit File: bucleA.submit
03/24 10:06:05   Condor Job ID: [not yet submitted]
03/24 10:06:05       Q_PARENTS: <END>
03/24 10:06:05       Q_WAITING: <END>
03/24 10:06:05      Q_CHILDREN: B, C, <END>
03/24 10:06:05 --------------------------------------- <END>
03/24 10:06:05 Aborting DAG...
03/24 10:06:05 Writing Rescue DAG to bucle.dag.rescue001...
03/24 10:06:05 Note: 0 total job deferrals because of -MaxJobs limit (0)
03/24 10:06:05 Note: 0 total job deferrals because of -MaxIdle limit (0)
03/24 10:06:05 Note: 0 total job deferrals because of node category throttles
03/24 10:06:05 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
03/24 10:06:05 Note: 0 total POST script deferrals because of -MaxPost limit (0)
03/24 10:06:05 **** condor_scheduniv_exec.752.0 (condor_DAGMAN) pid 2174 EXITING WITH STATUS 1

If anyone could tell me what I am doing wrong, I would really appreciate it.

Fernando
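P.S. The permission check I mention above is nothing Condor-specific; in case anyone wants to reproduce it, here is a small self-contained Python sketch (it uses a throwaway temporary file, not the real spool files) showing that a 0600 mode has no owner-execute bit, and how one would add that bit with os.chmod:

```python
import os
import stat
import tempfile

# Create a scratch file and give it the same 0600 mode that the
# SOAP-transferred files have in the spool directory.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
os.chmod(path, 0o600)

mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))                  # 0o600, same as -rw-------
print(bool(mode & stat.S_IXUSR))  # False: no execute bit for the owner

# Adding the owner-execute bit back by hand would look like this:
os.chmod(path, mode | stat.S_IXUSR)
print(bool(stat.S_IMODE(os.stat(path).st_mode) & stat.S_IXUSR))  # True

os.unlink(path)
```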