Re: [Condor-users] multiple dags and schedd dieing
- Date: Thu, 2 Jun 2005 14:58:20 -0500
- From: "Peter F. Couvares" <pfc@xxxxxxxxxxx>
- Subject: Re: [Condor-users] multiple dags and schedd dieing
Pawel,
There is no problem running multiple concurrent DAGMan jobs, provided
they each have a unique DAG input file.
Your problem below is two-fold:
1) For some reason you don't have SMTP_SERVER defined in your config
file. If you set this correctly, your problem should go away.
2) While the condor_master coolly reports this config problem and
moves on, the condor_schedd overreacts and exits, which is
dumb. I'm considering this a bug and will fix it for the next stable and
development releases.
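As a rough sketch of item 1, the entry in the Condor config file would look something like the following. The host name mail.example.com is a placeholder, not taken from your setup; substitute your site's actual SMTP relay.

```
# Condor config file (global or local)
# Hypothetical host name: replace with your site's real SMTP relay.
SMTP_SERVER = mail.example.com
```

After editing the file, the daemons need to re-read their configuration (for example via condor_reconfig, or by restarting Condor) before the setting takes effect.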
Thanks for the report!
-Peter
On Jun 2, 2005, at 2:33 PM, Pawel.Micun@xxxxxxxxxxxxxxxx wrote:
Hello,
Should I be able to have 2 DAGMan jobs executing at the same time? I
assumed I should, as I couldn't find any info to the contrary.
If I'm wrong then ignore my rambling....
I have 2 separate DAGs, and I submit one after the other with
condor_submit_dag. Everything starts up fine: I see two separate
condor_dagman.exe jobs executing, each processing its DAG and
submitting more jobs.
When the first DAGMan job finishes, it takes the schedd down with it.
$CondorVersion: 6.6.9 Mar 10 2005 $
$CondorPlatform: INTEL-WINNT40 $
dagman.out of first dag
6/2 14:39:49 POST Script of Job F completed successfully.
6/2 14:39:49 Of 154 nodes total:
6/2 14:39:49  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
6/2 14:39:49   ===     ===      ===     ===     ===        ===      ===
6/2 14:39:49   154       0        0       0       0          0        0
6/2 14:39:49 All jobs Completed!
6/2 14:39:49 **** condor_scheduniv_exec.2563.0 (condor_DAGMAN) EXITING WITH STATUS 0
MasterLog
6/2 14:39:47 ProcAPI: pid # 3356 was not found
6/2 14:39:48 ProcAPI: pid # 3388 was not found
6/2 14:39:49 MASTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/2 14:39:49 The SCHEDD (pid 5060) exited with status 4
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
....
6/2 14:39:49 ProcAPI: pid # 5060 was not found
6/2 14:39:49 ProcAPI: pid # 5268 was not found
6/2 14:39:49 ProcAPI: pid # 1716 was not found
6/2 14:39:49 ProcAPI: pid # 5060 was not found
6/2 14:44:58 Procfamily: ERROR: Could not open pid 4756 (err=87). Maybe it exited already?
6/2 14:44:58 ProcAPI: pid # 4756 was not found
6/2 14:45:06 Procfamily: ERROR: Could not open pid 2324 (err=87). Maybe it exited already?
6/2 14:45:06 ProcAPI: pid # 2324 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 2968 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 2968 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 4788 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 4788 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 5056 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 5056 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 4704 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 4704 was not found
6/2 14:45:10 Sending obituary for "C:\Condor/bin/condor_schedd.exe"
6/2 14:45:10 Trying to email, but SMTP_SERVER not specified in config file
6/2 14:45:10 restarting C:\Condor/bin/condor_schedd.exe in 10 seconds
SchedLog
6/2 14:39:49 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/2 14:39:49 scheduler universe job (2563.0) pid 1716 exited with status 0
6/2 14:39:49 Writing record to user logfile=G:\models\PFP\tasks\dag\ALL.dag.dagman.log owner=user
6/2 14:39:49 init_user_ids: want user 'user@machine', current is '(null)@(null)'
6/2 14:39:49 init_user_ids: Already have handle for user@machine, so returning.
6/2 14:39:49 TokenCache contents:
user@machine
6/2 14:39:49 ENABLE_USERLOG_LOCKING is undefined, using default value of True
6/2 14:39:49 TokenCache contents:
user@machine
6/2 14:39:49 Unknown user notification selection
6/2 14:39:49 Notify user with subject: Condor Job 2563.0
6/2 14:39:49 Trying to email, but SMTP_SERVER not specified in config file
6/2 14:39:49 ERROR "Could not open mail to user!" at line 5543 in file ..\src\condor_schedd.V6\schedd.C
Eventually the master restarts the schedd, but this causes havoc with
already-running jobs.
Any help appreciated,
Pawel
--
Peter Couvares                  University of Wisconsin-Madison
Condor Project Research         Department of Computer Sciences
pfc@xxxxxxxxxxx                 1210 W. Dayton St. Rm #4241
(608) 265-8936                  Madison, WI 53706-1685