Re: [HTCondor-users] condor_off broken?
- Date: Wed, 27 Nov 2013 11:59:16 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] condor_off broken?
On 11/27/2013 9:29 AM, Zachary Miller wrote:
On Wed, Nov 27, 2013 at 01:05:40PM +0100, Pek Daniel wrote:
So now I have some more information:
condor_off command and friends won't work if the hostname is set to
condorworker02 on the machine. It has to be set to condorworker02.domain.tld.
The question: why is that?
In my *opinion*, this should work.
But clearly, it does not. I will need to investigate the code, but my general
feeling is that at some point, the tool (condor_off in this case) gets clever
and "promotes" the short host name to the long host name. Then, as you can see
in the collector, it doesn't match, and you get your "Daemon not found" error.
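A minimal sketch of the suspected mismatch, for illustration only. This is not HTCondor's actual code: promote() and find_master() are hypothetical names, and the "append the default domain to any name without a dot" rule is an assumption based on the behavior described above. The advertised names are the ones shown by "condor_status -master" in this thread.

```python
from typing import Optional, Set


def promote(name: str, default_domain: str) -> str:
    # Hypothetical stand-in for the tool's behavior: a name without a dot
    # gets the default domain appended; a dotted name is left alone.
    return name if "." in name else f"{name}.{default_domain}"


def find_master(advertised: Set[str], requested: str, default_domain: str) -> Optional[str]:
    # Exact string match of the promoted name against the advertised names.
    candidate = promote(requested, default_domain)
    return candidate if candidate in advertised else None


# Master names as reported by "condor_status -master" in this thread.
advertised = {
    "condormaster1",
    "condormaster2",
    "condorworker02",
    "lxbrb1815.domain.tld",
}

# The physical node advertises its long name, so the promoted name matches.
print(find_master(advertised, "lxbrb1815", "domain.tld"))

# The VM advertises only its short name, so the promoted name is not found.
print(find_master(advertised, "condorworker02", "domain.tld"))
```

Under these assumptions the first lookup returns lxbrb1815.domain.tld and the second returns None, which would correspond to the "Can't find address for master condorworker02.domain.tld" message.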
This is a known issue that should be improved in the code. For related
info/background see
1. https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3694
(in particular, remark by "tannenba", the second remark on the ticket),
2. https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=636
regards,
Todd
Cheers,
-zach
2013/11/26 Pek Daniel <pekdaniel@xxxxxxxxx>
OK, here is the problem in more detail:
I'm using this version:
[root@lxbrb1815 ~]# condor_version
$CondorVersion: 8.1.2 Oct 19 2013 BuildID: 189797 $
$CondorPlatform: x86_64_RedHat5 $
Here's a snippet from condor_status -master output:
[root@condormaster1 ~]# condor_status -master
Name
condormaster1
condormaster2
condorworker02
lxbrb1815.domain.tld
...
I have physical nodes and VMs as startd nodes. Physical nodes have more
than one core, and so more than one job slot, while VMs have only one core.
Here's a snippet from condor_status -startd | head:
Name                OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
condorworker02      LINUX  X86_64  Claimed    Busy      0.000   490   0+00:03:13
slot1@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.060   1991  0+00:11:51
slot2@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:13
...
As you can see, condorworker02 is a VM, while lxbrb1815.domain.tld is a
physical node with a lot of cores. And that's the only difference. The
config file is exactly the same for both cases, and the condor version as
well.
Now, my questions:
- Why do I see slotID@xxxxxxxxxxxxxxxxxxxxxx for physical nodes but
just the hostname for VMs?
- Why can't I query the status of a VM, when it works for a
physical node:
[root@condormaster1 ~]# condor_status -startd lxbrb1815
Name                OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
slot1@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.060   1991  0+00:11:51
slot2@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:13
slot3@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:14
slot4@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:15
slot5@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:16
slot6@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:17
slot7@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:18
slot8@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:11

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX      8      0        0          8        0           0         0
       Total      8      0        0          8        0           0         0
[root@condormaster1 ~]# condor_status -startd lxbrb1815.domain.tld
Name                OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
slot1@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.060   1991  0+00:11:51
slot2@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:13
slot3@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:14
slot4@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:15
slot5@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:16
slot6@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:17
slot7@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:18
slot8@xxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   1991  0+00:12:11

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX      8      0        0          8        0           0         0
       Total      8      0        0          8        0           0         0
[root@condormaster1 ~]# condor_status -startd condorworker02
[root@condormaster1 ~]# condor_status -startd condorworker02.domain.tld
[root@condormaster1 ~]#
- Why can't I send the condor_off command to VMs, when it works fine for
physical nodes:
[root@condormaster1 ~]# condor_off -startd lxbrb1815
Sent "Kill-Daemon" command for "startd" to master lxbrb1815.domain.tld
[root@condormaster1 ~]# condor_off -startd condorworker02
Can't find address for master condorworker02.domain.tld
Perhaps you need to query another pool.
Thanks,
Daniel
2013/11/26 Zachary Miller <zmiller@xxxxxxxxxxx>
On Tue, Nov 26, 2013 at 11:37:48AM +0100, Pek Daniel wrote:
> I'm trying to "deactivate" some startd machines:
> [root@cm1 ~]# condor_status
> Name                OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
>
> condorworker01      LINUX  X86_64  Unclaimed  Idle      0.000   2006  5+16:16:41
> condorworker03      LINUX  X86_64  Unclaimed  Idle      0.000   490   0+00:21:47
> slot1@lxbrl2305     LINUX  X86_64  Unclaimed  Idle      1.000   1991  4+18:20:46
> slot2@lxbrl2305     LINUX  X86_64  Unclaimed  Idle      1.000   1991  4+18:21:07
> slot3@lxbrl2305     LINUX  X86_64  Unclaimed  Idle      1.000   1991  4+18:21:08
> slot4@lxbrl2305     LINUX  X86_64  Unclaimed  Idle      1.000   1991  4+18:21:09
> slot5@lxbrl2305     LINUX  X86_64  Unclaimed  Idle      1.000   1991  4+18:21:10
> slot6@lxbrl2305     LINUX  X86_64  Unclaimed  Idle      0.960   1991  4+18:21:11
> slot7@lxbrl2305     LINUX  X86_64  Unclaimed  Idle      0.000   1991  4+18:21:12
> slot8@lxbrl2305     LINUX  X86_64  Unclaimed  Idle      0.000   1991  4+18:21:05
>               Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
>
> X86_64/LINUX     10      0        0         10        0           0         0
>
>        Total     10      0        0         10        0           0         0
>
> [root@condormaster1 ~]# condor_off -startd -graceful condorworker01
> Can't find address for master condorworker01
Hmmm. What does "condor_status -master" have to say?
Cheers,
-zach
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685