To add more material to this troubleshooting: I ran condor_status -l, which returned the configuration for all the machines in my pool. After reviewing each node's configuration, I noticed one thing: the only computer that is able to run the jobs is the only one that includes LocalCredd = "centralmanager.domain.com:9620" in its config, which is one of the job's requirements listed below. I then reviewed the configuration files of all the nodes, and they all have the following setting:
## Specify a remote credd server here
# Credd_Host = $(CONDOR_HOST):$(CREDD_PORT)
Credd_Host = centralmanager.domain.com:$(CREDD_PORT)

I commented out the first Credd_Host entry and substituted the second one, to more or less force the nodes to register with my central manager.
Tomorrow I will try commenting out this line on all the nodes and restoring the original Credd_Host line that I changed initially. Let's see what happens.
Is there another way to change this setting in the local configuration file?
Please advise.
Alex
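
The manual comparison described above (checking which nodes advertise LocalCredd in the condor_status -l output) can be sketched in Python. This is only an illustrative sketch: it assumes the long output consists of "Attribute = value" lines with a blank line between machines, and the node names in SAMPLE are hypothetical. Only the expected credd address comes from this thread.

```python
from typing import Dict, List


def parse_long_ads(text: str) -> List[Dict[str, str]]:
    """Split `condor_status -l` style text into one dict per machine ad."""
    ads, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line ends the current ad.
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition("=")
        if sep:
            # Attribute names are case-insensitive, so normalise them.
            current[name.strip().lower()] = value.strip().strip('"')
    if current:
        ads.append(current)
    return ads


def machines_missing_credd(text: str, expected: str) -> List[str]:
    """Return names of ads whose LocalCredd is absent or differs from expected."""
    return [
        ad.get("name", "<unnamed>")
        for ad in parse_long_ads(text)
        if ad.get("localcredd") != expected
    ]


# Hypothetical two-machine sample; real input would come from a saved file.
SAMPLE = """\
Name = "slot1@nodeA.domain.com"
OpSys = "WINNT52"
LocalCredd = "centralmanager.domain.com:9620"

Name = "slot1@nodeB.domain.com"
OpSys = "WINNT52"
"""

print(machines_missing_credd(SAMPLE, "centralmanager.domain.com:9620"))
# prints ['slot1@nodeB.domain.com']
```

To try it on a real pool, the output could first be saved with condor_status -l > ads.txt and the file contents passed in place of SAMPLE.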
From: condor-users-bounces@xxxxxxxxxxx on behalf of Alas, Alex [FEDI]
Sent: Mon 12/8/2008 6:32 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Problems matching jobs.
More to add on this troubleshooting: because I am unable to run condor_q -better, I intentionally mistyped the submit file in order to see all the requirements of my job, and I got the message below. As you can see, I never stipulate a requirement about the amount of memory in my description file. Where are these settings coming from? Any input will be much appreciated.
Alex
Please see below:

Submitting job(s)
ERROR: Parse error in expression:
    Requirements = (((Arch == "INTEL" && OpSys == "WINNT51") || (Arch == " INTEL" && OpSys == "WINNT52"))) && (Disk >= DiskUsage) && ( (Memory * 1024) >= ImageSize )&& (HasFileTransfer) && (HasWindowsRunAsOwner && (LocalCredd =?= "centralmanager.domain.com:9620"))
    ^^^
Error in submit file
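
The expanded Requirements expression in the error above contains clauses (Disk, Memory, HasFileTransfer, HasWindowsRunAsOwner, LocalCredd) that do not appear in the submit file; condor_submit adds clauses like these on its own. To make it easier to see which ones were added, here is a small Python sketch that splits the expression into its top-level && clauses. It is not a ClassAd parser: it only tracks parentheses and double-quoted strings, and it assumes no escaped quotes inside strings. The function name is made up for this illustration.

```python
from typing import List


def split_top_level(expr: str, op: str = "&&") -> List[str]:
    """Split expr on `op` wherever it occurs outside parentheses and strings."""
    clauses = []
    depth = 0          # current parenthesis nesting level
    in_string = False  # True while inside a double-quoted string
    start = 0
    i = 0
    while i < len(expr):
        ch = expr[i]
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif depth == 0 and expr.startswith(op, i):
                clauses.append(expr[start:i].strip())
                i += len(op)
                start = i
                continue
        i += 1
    clauses.append(expr[start:].strip())
    return clauses


# Expression copied verbatim from the error message in this thread.
REQUIREMENTS = (
    r'(((Arch == "INTEL" && OpSys == "WINNT51") || '
    r'(Arch == " INTEL" && OpSys == "WINNT52"))) && (Disk >= DiskUsage) && '
    r'( (Memory * 1024) >= ImageSize )&& (HasFileTransfer) && '
    r'(HasWindowsRunAsOwner && (LocalCredd =?= "centralmanager.domain.com:9620"))'
)

for number, clause in enumerate(split_top_level(REQUIREMENTS), start=1):
    print(number, clause)
```

Only the first clause comes from the requirements line written in the submit file. The last one is the clause that a machine can satisfy only if it advertises a matching LocalCredd value.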
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Alas, Alex [FEDI]
Again, hello to all of you,

In addition to my previous e-mail, I ran condor_q -analyze and the results are:

084.049:  Run analysis summary.  Of 20 machines,
     19 are rejected by your job's requirements
      0 reject your job because of their own requirements
      1 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job

When I run condor_status I get the following results:

C:\WINDOWS\system32>condor_status

Name                   OpSys   Arch  State     Activity LoadAv Mem  ActvtyTime

Computer1.domain.com   WINNT51 INTEL Unclaimed Idle     0.060  1022 0+00:45:03
Computer2.domain.com   WINNT51 INTEL Unclaimed Idle     0.230  1022 0+00:00:49
slot1@xxxxxxxxxxxxxxxx WINNT51 INTEL Unclaimed Idle     0.000  1022 5+22:33:03
slot2@xxxxxxxxxxxxxxxx WINNT51 INTEL Unclaimed Idle     0.030  1022 0+02:30:05
slot1@xxxxxxxxxxxxxxxx WINNT52 INTEL Unclaimed Idle     0.000   511 2+20:21:17
slot2@xxxxxxxxxxxxxxxx WINNT52 INTEL Unclaimed Idle     0.000   511 0+00:20:05
slot3@xxxxxxxxxxxxxxxx WINNT52 INTEL Unclaimed Idle     0.000   511 2+20:21:19
slot4@xxxxxxxxxxxxxxxx WINNT52 INTEL Unclaimed Idle     0.000   511 2+20:21:20
slot1@xxxxxxxxxxxxxxxx WINNT52 INTEL Unclaimed Idle     0.000   511 2+21:24:31
slot2@xxxxxxxxxxxxxxxx WINNT52 INTEL Unclaimed Idle     0.000   511 2+21:28:45
slot3@xxxxxxxxxxxxxxxx WINNT52 INTEL Unclaimed Idle     0.000   511 0+02:30:06
slot4@xxxxxxxxxxxxxxxx WINNT52 INTEL Unclaimed Idle     0.000   511 2+21:33:45
slot1@xxxxxxxxxxxxx    WINNT52 INTEL Unclaimed Idle     0.000   511 2+20:26:28
slot2@xxxxxxxxxxxxx    WINNT52 INTEL Unclaimed Idle     0.000   511 0+00:25:05
slot3@xxxxxxxxxxxxx    WINNT52 INTEL Unclaimed Idle     0.000   511 2+20:26:30
slot4@xxxxxxxxxxxxx    WINNT52 INTEL Unclaimed Idle     0.000   511 2+20:26:31
slot1@xxxxxxxxxxxxxxxx WINNT52 INTEL Unclaimed Idle     0.000   511 0+03:35:41
slot2@xxxxxxxxxxxxxxxx WINNT52 INTEL Unclaimed Idle     0.000   511 0+03:35:42
slot3@xxxxxxxxxxxxxxxx WINNT52 INTEL Unclaimed Idle     0.050   511 0+03:35:43
slot4@xxxxxxxxxxxxxxxx WINNT52 INTEL Unclaimed Idle     0.000   511 0+00:25:07

              Total Owner Claimed Unclaimed Matched Preempting Backfill

INTEL/WINNT51     4     0       0         4       0          0        0
INTEL/WINNT52    16     0       0        16       0          0        0

        Total    20     0       0        20       0          0        0

Unfortunately, I am not enough of a Condor expert to fully understand what this message is trying to tell me or what the best way to interpret it is. Also, when I tried to run condor_q -better I got the following message:

Sorry, the -better-analyze option is not available on this platform.

From the analysis, I now know there is something wrong with my job's requirements that is preventing the job from matching other nodes, but I don't know what. If anyone has experienced a similar issue and knows more or less how to get it to work, I would really appreciate your input.
Alex
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Alas, Alex [FEDI]
Hello to all of you,

I have a little issue with a type of job I am trying to submit. I have a Condor pool of 20 nodes. I initially upgraded the whole pool to version 7.05, but after reading about all the issues that version was having with pre-empting jobs I decided to downgrade the central manager to version 7.01. The description file is as follows:

#########################################################################################
# Description file for Batch File for TESTING purposes
#########################################################################################
universe = vanilla
requirements = (Arch == "INTEL" && OpSys == "WINNT51") || \
               (Arch == "INTEL" && OpSys == "WINNT52")
getenv = True
notify_user=usename@xxxxxxxxxx
initialdir = c:\condor\execute_bk
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Transfer_input_files = c:\windows\system32\systeminfo.exe
run_as_owner = true
executable = Batch4testv2.bat
output = Batch4testv3.out.$(Process)
error = Batch4testv3.err.$(Process)
log = Batch4testv3.log
queue 10
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
If the job is submitted like that, it will only run on one machine. If I omit the run_as_owner line, it runs fine on all the different nodes, so there is no problem once that line is removed.

But this Condor project was originally implemented to run jobs over network shares. For that I configured the pool to have a credd_host (which is the central manager), and then I created a condoruser account with read access and limited rights to run those jobs. I stored the condor_pool and condoruser credentials/passwords on all the computers set up as execute machines. When I run condor_store_cred query -c and condor_store_cred query -u condoruser, all the computers come back saying: A credential is stored and is valid.

The description file is attached next. When I try to run this type of job, it will only run on one computer, the same computer as the other jobs. If I remove the RUN_AS_OWNER line, the central manager will try to match the job with all the pool's nodes, but the job errors out with: Logon failure: unknown user name or bad password.

If anyone has ideas about which log I should look into to find answers, or any suggestions to solve this issue, they are more than welcome.

Thanks in advance for your input,
Alex
###################################################
## DESCRIPTION FILE FOR CONDOR JOBS
## PREPARED BY ALEX ALAS
###################################################
UNIVERSE = VANILLA
REQUIREMENTS = (Arch == "INTEL" && OpSys == "WINNT51") || \
               (Arch == "INTEL" && OpSys == "WINNT52")
GETENV = TRUE
NOTIFY_USER = username@xxxxxxxxxx
INITIALDIR = c:\condor\execute_bk
SHOULD_TRANSFER_FILES = YES
WHEN_TO_TRANSFER_OUTPUT = ON_EXIT
TRANSFER_INPUT_FILES = \\fileserver\Sharedfolder1\Sharedfolder2\Sharedfolder3\lasEnvelop.exe
RUN_AS_OWNER = TRUE
EXECUTABLE = \\fileserver\Sharedfolder1\Sharedfolder2\Sharedfolder3\Batchfile_lasEnvelop1.bat
OUTPUT = Batchfile_lasEnvelop1.out.$(Process)
ERROR = Batchfile_lasEnvelop1.err.$(Process)
LOG = Batchfile_lasEnvelop1.log
QUEUE 25
Respectfully,
Alex Alas
Systems Administrator
Tel. 301-948-8550 x219
Fax 301-963-2064
E-mail: aalas@xxxxxxxxxxxxx
7320 Executive Way, Frederick, MD 21704
Website: http://www.fugroearthdata.com