Re: [Condor-users] Problem with multiple schedds in 7.7+
- Date: Thu, 16 Aug 2012 07:33:58 -0400
- From: Matthew Farrellee <matt@xxxxxxxxxx>
- Subject: Re: [Condor-users] Problem with multiple schedds in 7.7+
Nice find!
This looks like a bug (and regression) IMHO.
src/condor_utils/param_info.in:
[JOB_QUEUE_LOG]
default=$(SPOOL)/job_queue.log
src/condor_schedd.V6/schedd_main.cpp:
    // Initialize the job queue
    char *job_queue_param_name = param("JOB_QUEUE_LOG");
    if (job_queue_param_name == NULL) {
        // the default place for the job_queue.log is in spool
        job_queue_name.sprintf( "%s/job_queue.log", Spool);
    } else {
        job_queue_name = job_queue_param_name; // convert char * to MyString
        free(job_queue_param_name);
    }
Because of the default, the Spool-based fallback branch will never be hit.
$ env _CONDOR_MATT.SPOOL=/tmp strace -e open condor_schedd -t -f -local-name matt 2>&1 | grep -e spool -e tmp
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0644) = 7
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0644) = 7
08/16/12 07:27:27 (pid:14896) initLocalStarterDir: /home/matt/Documents/CondorInstallation/spool/local_univ_execute already exists, deleting old contents
open("/tmp/spool_version", O_RDONLY) = 11
open("/home/matt/Documents/CondorInstallation/spool/job_queue.log", O_RDWR) = 11
A workaround is to set JOB_QUEUE_LOG= (the empty string):
$ env _CONDOR_MATT.SPOOL=/tmp _CONDOR_JOB_QUEUE_LOG= strace -e open condor_schedd -t -f -local-name matt 2>&1 | grep -e spool -e tmp
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0644) = 7
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0644) = 7
08/16/12 07:28:31 (pid:14908) initLocalStarterDir: /home/matt/Documents/CondorInstallation/spool/local_univ_execute already exists, deleting old contents
open("/tmp/spool_version", O_RDONLY) = 11
open("/tmp/job_queue.log", O_RDWR) = -1 ENOENT (No such file or directory)
open("/tmp/job_queue.log", O_RDWR|O_CREAT|O_EXCL, 0600) = 11
Note, SCHEDD_ADDRESS_FILE also has a default (defined in condor_config)
of $(SPOOL)/.schedd_address
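So for a multi-schedd setup like the one quoted below, a complete per-schedd override would presumably need to cover both parameters. A hypothetical fragment, mirroring the local-name config names from this thread:

```
# Hypothetical fragment: since both parameters now have global
# defaults, each secondary schedd must override them explicitly.
SCHEDD.SCHEDDJOBS2.JOB_QUEUE_LOG       = $(SCHEDD.SCHEDDJOBS2.SPOOL)/job_queue.log
SCHEDD.SCHEDDJOBS2.SCHEDD_ADDRESS_FILE = $(SCHEDD.SCHEDDJOBS2.SPOOL)/.schedd_address
```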
Best,
matt
On 08/13/2012 11:15 AM, John Weigand wrote:
Matt,
You were correct that the problem is the job_queue.log.
A JOB_QUEUE_LOG attribute was introduced in Condor 7.7.5
.. ticket 2598 https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2598
http://research.cs.wisc.edu/condor/manual/v7.7/3_3Configuration.html#16343
Prior to the introduction of this feature, a job_queue.log was always
maintained in the spool directory of each schedd. With this change, it
appears (either a bug or by design) that the job queue log of each additional
schedd must be defined explicitly:
SCHEDD.SCHEDDJOBS2.JOB_QUEUE_LOG =
$(SCHEDD.SCHEDDJOBS2.SPOOL)/job_queue.log
If not explicitly set, all schedds share a single job_queue.log. Hence, all
jobs are assigned to all schedd queues on a restart.
John Weigand
On 6/4/2012 7:57 PM, Matthew Farrellee wrote:
On 05/21/2012 09:37 AM, John Weigand wrote:
There appears to be a change in behavior in Condor when multiple schedds
are defined. I have tested this with 7.7.5 and 7.8. It does not occur
in 7.6.6 and prior.
Test condition:
1. 3 schedds are defined
2. I submit 1 job.
3. condor_q -g shows 1 schedd queue with the job
4. I restart condor
5. condor_q -g shows the same job in all 3 schedd queues and treats
them as independent jobs.
I use the same configuration for all 3 versions of Condor for the
secondary schedds:
SCHEDDJOBS2 = $(SCHEDD)
SCHEDDJOBS2_ARGS = -local-name scheddjobs2
SCHEDD.SCHEDDJOBS2.SCHEDD_NAME = schedd_jobs2
SCHEDD.SCHEDDJOBS2.SCHEDD_LOG = $(LOG)/SchedLog.$(SCHEDD.SCHEDDJOBS2.SCHEDD_NAME)
SCHEDD.SCHEDDJOBS2.LOCAL_DIR = $(LOCAL_DIR)/$(SCHEDD.SCHEDDJOBS2.SCHEDD_NAME)
SCHEDD.SCHEDDJOBS2.EXECUTE = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/execute
SCHEDD.SCHEDDJOBS2.LOCK = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/lock
SCHEDD.SCHEDDJOBS2.PROCD_ADDRESS = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/procd_pipe
SCHEDD.SCHEDDJOBS2.SPOOL = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/spool
SCHEDD.SCHEDDJOBS2.SCHEDD_ADDRESS_FILE=$(SCHEDD.SCHEDDJOBS2.SPOOL)/.schedd_address
SCHEDD.SCHEDDJOBS2.SCHEDD_DAEMON_AD_FILE=$(SCHEDD.SCHEDDJOBS2.SPOOL)/.schedd_classad
SCHEDDJOBS2_LOCAL_DIR_STRING = "$(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)"
SCHEDD.SCHEDDJOBS2.SCHEDD_EXPRS = LOCAL_DIR_STRING
DAEMON_LIST = $(DAEMON_LIST), SCHEDDJOBS2
:
(same for schedd3)
:
DC_DAEMON_LIST = + SCHEDDJOBS2 SCHEDDJOBS3
This works in 7.6.6 and prior, just not in 7.7.5 and 7.8.
Any ideas?
John Weigand
First thought is that somehow all the schedds are using the same spool. When you
restart them they should log something like "About to rotate ClassAd log
/var/lib/condor/spool/job_queue.log". Make sure they're all processing a
different job_queue.log.
Do you happen to have a wallaby dump of your configuration to share?
Best,
matt