Re: [Condor-users] multiple dags and schedd dieing
- Date: Thu, 2 Jun 2005 14:58:20 -0500
- From: "Peter F. Couvares" <pfc@xxxxxxxxxxx>
- Subject: Re: [Condor-users] multiple dags and schedd dieing
Pawel,
There is no problem running multiple concurrent DAGMan jobs, provided 
they each have a unique DAG input file.
Your problem below is two-fold:
1) For some reason you don't have SMTP_SERVER defined in your config 
file.  If you set this correctly, your problem should go away.
2) While the condor_master is coolly reporting this config problem and 
moving on, the condor_schedd is overreacting and exiting, which is 
dumb.  I'm considering this a bug and will fix it for the next stable & 
development releases.
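For item 1, a minimal sketch of the config entry, assuming a mail relay 
is reachable at smtp.example.com (a placeholder hostname, not one taken 
from this thread; substitute your site's server):

```
# Hypothetical hostname; replace with your site's SMTP relay.
SMTP_SERVER = smtp.example.com
```

After editing the config file, condor_config_val SMTP_SERVER should 
show the new value, and condor_reconfig should get the running daemons 
to re-read the config.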
Thanks for the report!
-Peter
On Jun 2, 2005, at 2:33 PM, Pawel.Micun@xxxxxxxxxxxxxxxx wrote:
Hello,
Should I be able to have 2 DAGMan jobs executing at the same time? I 
assumed I should, as I couldn't find any info to the contrary.
If I'm wrong then ignore my rambling....
I have 2 separate DAGs, and I submit one after another with 
condor_submit_dag. Everything starts up fine: I see two separate
jobs with condor_dagman.exe executing, and each is processing its DAG and 
submitting more jobs.
When the first DAGMan job finishes, it takes the schedd down with it.
$CondorVersion: 6.6.9 Mar 10 2005 $
$CondorPlatform: INTEL-WINNT40 $
dagman.out of first dag
6/2 14:39:49 POST Script of Job F completed successfully.
6/2 14:39:49 Of 154 nodes total:
6/2 14:39:49  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
6/2 14:39:49   ===     ===      ===     ===     ===        ===      ===
6/2 14:39:49   154       0        0       0       0          0        0
6/2 14:39:49 All jobs Completed!
6/2 14:39:49 **** condor_scheduniv_exec.2563.0 (condor_DAGMAN) EXITING 
WITH STATUS 0
MasterLog
6/2 14:39:47 ProcAPI: pid # 3356 was not found
6/2 14:39:48 ProcAPI: pid # 3388 was not found
6/2 14:39:49 MASTER_TIMEOUT_MULTIPLIER is undefined, using default 
value of 0
6/2 14:39:49 The SCHEDD (pid 5060) exited with status 4
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed 
with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed 
with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed 
with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed 
with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
....
6/2 14:39:49 ProcAPI: pid # 5060 was not found
6/2 14:39:49 ProcAPI: pid # 5268 was not found
6/2 14:39:49 ProcAPI: pid # 1716 was not found
6/2 14:39:49 ProcAPI: pid # 5060 was not found
6/2 14:44:58 Procfamily: ERROR: Could not open pid 4756 (err=87). 
Maybe it exited already?
6/2 14:44:58 ProcAPI: pid # 4756 was not found
6/2 14:45:06 Procfamily: ERROR: Could not open pid 2324 (err=87). 
Maybe it exited already?
6/2 14:45:06 ProcAPI: pid # 2324 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 2968 (err=87). 
Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 2968 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 4788 (err=87). 
Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 4788 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 5056 (err=87). 
Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 5056 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 4704 (err=87). 
Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 4704 was not found
6/2 14:45:10 Sending obituary for "C:\Condor/bin/condor_schedd.exe"
6/2 14:45:10 Trying to email, but SMTP_SERVER not specified in config 
file
6/2 14:45:10 restarting C:\Condor/bin/condor_schedd.exe in 10 seconds
SchedLog
6/2 14:39:49 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default 
value of 0
6/2 14:39:49 scheduler universe job (2563.0) pid 1716 exited with 
status 0
6/2 14:39:49 Writing record to user 
logfile=G:\models\PFP\tasks\dag\ALL.dag.dagman.log owner=user
6/2 14:39:49 init_user_ids: want user 'user@machine', current is 
'(null)@(null)'
6/2 14:39:49 init_user_ids: Already have handle for user@machine, so 
returning.
6/2 14:39:49 TokenCache contents:
user@machine
6/2 14:39:49 ENABLE_USERLOG_LOCKING is undefined, using default value 
of True
6/2 14:39:49 TokenCache contents:
user@machine
6/2 14:39:49 Unknown user notification selection
6/2 14:39:49         Notify user with subject: Condor Job 2563.0
6/2 14:39:49 Trying to email, but SMTP_SERVER not specified in config 
file
6/2 14:39:49 ERROR "Could not open mail to user!" at line 5543 in file 
..\src\condor_schedd.V6\schedd.C
Eventually the master restarts the schedd, but this wreaks havoc with 
already-running jobs.
Any help appreciated,
Pawel
--
Peter Couvares                        University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
pfc@xxxxxxxxxxx                       1210 W. Dayton St. Rm #4241
(608) 265-8936                        Madison, WI 53706-1685