Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] multiple dags and schedd dieing

Date: Thu, 2 Jun 2005 14:58:20 -0500
From: "Peter F. Couvares" <pfc@xxxxxxxxxxx>
Subject: Re: [Condor-users] multiple dags and schedd dieing

Pawel,

There is no problem running multiple concurrent DAGMan jobs, provided they each have a unique DAG input file.

Your problem below is two-fold:

1) For some reason you don't have SMTP_SERVER defined in your config file. If you set this correctly, your problem should go away.

2) While the condor_master is coolly reporting this config problem and moving on, the condor_schedd is overreacting and exiting, which is dumb. I'm considering this a bug and will fix it for the next stable and development releases.
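
For point 1, a minimal sketch of the config entry might look like the following. The hostname smtp.example.com is only a placeholder (an assumption, not taken from your setup); substitute your site's actual mail server.

```
# In condor_config or your local config file.
# smtp.example.com is a hypothetical hostname used for illustration.
SMTP_SERVER = smtp.example.com
```

After editing, `condor_config_val SMTP_SERVER` should print the value you set, and `condor_reconfig` asks the running daemons to reread their configuration.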

Thanks for the report!

-Peter


On Jun 2, 2005, at 2:33 PM, Pawel.Micun@xxxxxxxxxxxxxxxx wrote:

Hello,
Should I be able to have 2 DAGMan jobs executing at the same time? I assumed I should, as I couldn't find any info to the contrary. If I'm wrong then ignore my rambling....

I have 2 separate DAGs, and I submit one after another with condor_submit_dag. Everything starts up fine: I see two separate jobs with condor_dagman.exe executing, each processing its DAG and submitting more jobs. When the first DAGMan job finishes, it takes the schedd down with it.
$CondorVersion: 6.6.9 Mar 10 2005 $
$CondorPlatform: INTEL-WINNT40 $
dagman.out of first dag:
6/2 14:39:49 POST Script of Job F completed successfully.
6/2 14:39:49 Of 154 nodes total:
6/2 14:39:49 Done Pre Queued Post Ready Un-Ready Failed
6/2 14:39:49 === === === === === === ===
6/2 14:39:49 154 0 0 0 0 0 0
6/2 14:39:49 All jobs Completed!
6/2 14:39:49 **** condor_scheduniv_exec.2563.0 (condor_DAGMAN) EXITING WITH STATUS 0

MasterLog:
6/2 14:39:47 ProcAPI: pid # 3356 was not found
6/2 14:39:48 ProcAPI: pid # 3388 was not found
6/2 14:39:49 MASTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/2 14:39:49 The SCHEDD (pid 5060) exited with status 4
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
6/2 14:39:49 CSysinfo::GetProcessBirthday() - OpenProcess() failed with err=87
6/2 14:39:49 Should never happen: ComparePidAge(5060) failed
....
6/2 14:39:49 ProcAPI: pid # 5060 was not found
6/2 14:39:49 ProcAPI: pid # 5268 was not found
6/2 14:39:49 ProcAPI: pid # 1716 was not found
6/2 14:39:49 ProcAPI: pid # 5060 was not found
6/2 14:44:58 Procfamily: ERROR: Could not open pid 4756 (err=87). Maybe it exited already?
6/2 14:44:58 ProcAPI: pid # 4756 was not found
6/2 14:45:06 Procfamily: ERROR: Could not open pid 2324 (err=87). Maybe it exited already?
6/2 14:45:06 ProcAPI: pid # 2324 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 2968 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 2968 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 4788 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 4788 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 5056 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 5056 was not found
6/2 14:45:10 Procfamily: ERROR: Could not open pid 4704 (err=87). Maybe it exited already?
6/2 14:45:10 ProcAPI: pid # 4704 was not found
6/2 14:45:10 Sending obituary for "C:\Condor/bin/condor_schedd.exe"
6/2 14:45:10 Trying to email, but SMTP_SERVER not specified in config file
6/2 14:45:10 restarting C:\Condor/bin/condor_schedd.exe in 10 seconds

SchedLog:
6/2 14:39:49 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/2 14:39:49 scheduler universe job (2563.0) pid 1716 exited with status 0
6/2 14:39:49 Writing record to user logfile=G:\models\PFP\tasks\dag\ALL.dag.dagman.log owner=user
6/2 14:39:49 init_user_ids: want user 'user@machine', current is '(null)@(null)'
6/2 14:39:49 init_user_ids: Already have handle for user@machine, so returning.
6/2 14:39:49 TokenCache contents: user@machine
6/2 14:39:49 ENABLE_USERLOG_LOCKING is undefined, using default value of True
6/2 14:39:49 TokenCache contents: user@machine
6/2 14:39:49 Unknown user notification selection
6/2 14:39:49 Notify user with subject: Condor Job 2563.0
6/2 14:39:49 Trying to email, but SMTP_SERVER not specified in config file
6/2 14:39:49 ERROR "Could not open mail to user!" at line 5543 in file ..\src\condor_schedd.V6\schedd.C

Eventually the master restarts the schedd, but this wreaks havoc on already-running jobs.
Any help appreciated,
Pawel

--
Peter Couvares                        University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
pfc@xxxxxxxxxxx                       1210 W. Dayton St. Rm #4241
(608) 265-8936                        Madison, WI 53706-1685
