
Re: [HTCondor-users] Parallel Jobs



On Fri, 23 May 2025 13:30:49 +0000
Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

> Hi Enrico,
> 
> What does condor_q <Job ID> -af Scheduler say? This value is the one
> that the dedicated Schedd will use when searching for EPs assigned
> the DedicatedScheduler attribute.
> 

Thank you, I solved it. The problem was that the scheduler has two network interfaces: one connected to the private condor.net network and the other connected to the public network.

The CE machine has a public name registered in DNS, and when it submitted the job as DedicatedScheduler it used that public name rather than the private ce.condor.net name. After setting NETWORK_HOSTNAME = ce.condor.net, everything started working.
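
For anyone hitting the same issue, here is a minimal sketch of the change on the scheduler host. The file location and the comments are my assumptions; only the NETWORK_HOSTNAME line is what I actually set, and the daemons may need a restart (not just a reconfig) before the new name is picked up.

```
## Local HTCondor configuration on the scheduler host (ce.condor.net).
## Assumption: this goes in whatever local config file your site uses.

## Advertise the private-network name instead of the public DNS name,
## so the name used for the dedicated scheduler matches the
## DedicatedScheduler value advertised by the execute nodes.
NETWORK_HOSTNAME = ce.condor.net
```

To check the result, the value printed by condor_q <Job ID> -af Scheduler for a parallel job should be identical to the DedicatedScheduler string that condor_status reports for the worker nodes.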

> -Cole Bollig
> ________________________________
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf
> of Enrico Morelli <morelli@xxxxxxxxxxxxx> Sent: Friday, May 23, 2025
> 5:40 AM To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] Parallel Jobs
> 
> Dear all,
> 
> I'm trying to configure my Linux HTCondor 10.x cluster to be able to
> run parallel jobs. My scheduler server for every job (sequential or
> parallel) is ce.condor.net. Following the guide, I wrote the following
> configuration on three nodes:
> 
> EXECUTE = /home/condor/execute/
> Filesystem_Domain = condor.net
> Uid_Domain = condor.net
> TRUST_UID_DOMAIN = True
> ## Make a single partitionable slot
> SLOT_TYPE_1 = cpus=100%,ram=100%
> SLOT_TYPE_1_PARTITIONABLE = True
> NUM_SLOTS_TYPE_1 = 1
> DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxx"
> STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
> START = True
> SUSPEND   = False
> CONTINUE  = True
> PREEMPT   = False
> KILL      = False
> WANT_SUSPEND   = False
> WANT_VACATE    = False
> RANK      = Scheduler =?= $(DedicatedScheduler)
> 
> If I run the following command from the scheduler:
> 
> # condor_status -const '!isUndefined(DedicatedScheduler)' -format
> "%s\t" Machine -format "%s\n" DedicatedScheduler
> 
> I get the following output:
> 
> wn1.condor.net  DedicatedScheduler@xxxxxxxxxxxxx
> wn1.condor.net  DedicatedScheduler@xxxxxxxxxxxxx
> wn1.condor.net  DedicatedScheduler@xxxxxxxxxxxxx
> wn1.condor.net  DedicatedScheduler@xxxxxxxxxxxxx
> wn1.condor.net  DedicatedScheduler@xxxxxxxxxxxxx
> wn2.condor.net  DedicatedScheduler@xxxxxxxxxxxxx
> wn2.condor.net  DedicatedScheduler@xxxxxxxxxxxxx
> wn2.condor.net  DedicatedScheduler@xxxxxxxxxxxxx
> wn2.condor.net  DedicatedScheduler@xxxxxxxxxxxxx
> wn2.condor.net  DedicatedScheduler@xxxxxxxxxxxxx
> wn3.condor.net  DedicatedScheduler@xxxxxxxxxxxxx
> wn3.condor.net  DedicatedScheduler@xxxxxxxxxxxxx
> wn3.condor.net  DedicatedScheduler@xxxxxxxxxxxxx
> wn3.condor.net  DedicatedScheduler@xxxxxxxxxxxxx
> 
> 
> But submitting the following test job:
> 
> universe = parallel
> executable = /bin/sleep
> arguments = 30
> machine_count = 2
> log = log
> should_transfer_files = IF_NEEDED
> when_to_transfer_output = ON_EXIT
> request_cpus   = 1
> request_memory = 1024M
> request_disk   = 10240K
> 
> queue
> 
> 
> In the SchedLog I see:
> 
> (D_ALWAYS:2) Trying to find 2 resource(s) for dedicated job 21084581.0
> (D_ALWAYS) Skipping job 21084581.0 because it requests more nodes (2)
> than exist in the pool (0)
> 
> Where am I going wrong?
> 
> Thanks
> --
> -----------------------------------------------------------
>   Enrico Morelli
>   System Administrator | Programmer | Web Developer
> 
>   CERM - Polo Scientifico
>   via Sacconi, 6 - 50019 Sesto Fiorentino (FI) - ITALY
> ------------------------------------------------------------



-- 
-----------------------------------------------------------
  Enrico Morelli
  System Administrator | Programmer | Web Developer

  CERM - Polo Scientifico
  via Sacconi, 6 - 50019 Sesto Fiorentino (FI) - ITALY
------------------------------------------------------------