
Re: [HTCondor-users] Configuring a CE/Schedd



Hi Brian,

I presume it must be ARC, since the messages stopped appearing when I turned ARC off.

Just for testing purposes, I opened up SCHEDD.ALLOW_READ and SCHEDD.ALLOW_WRITE as follows:

SCHEDD.ALLOW_READ = *@cern.ch/ce-iain.cern.ch, *@cern.ch/ce501.cern.ch,central-manager@xxxxxxx/*.cern.ch,computing-element@xxxxxxx/*.cern.ch,schedd@xxxxxxx/*.cern.ch,worker-node@xxxxxxx/*.cern.ch
SCHEDD.ALLOW_WRITE = *@fsauth/ce501.cern.ch,central-manager@xxxxxxx/*.cern.ch,computing-element@xxxxxxx/*.cern.ch,*@cern.ch/ce-iain.cern.ch, *@cern.ch/ce501.cern.ch
SCHEDD.SEC_DAEMON_AUTHENTICATION_METHODS = GSI,KERBEROS,FS

Yes, that is indeed the CE's own IP.

I took a look in /tmp and there are no FS* files listed. Is it condor that creates them? Could it be a permissions issue? (See [1] for an example)

I have CONDOR_IDS set to 0.0 as well.

I'm not aware of any /tmp filesystem trickery; it's a virtual node. [2]

The [common] section of arc.conf defines the following:

x509_user_key="/etc/grid-security/hostkey.pem"
x509_user_cert="/etc/grid-security/hostcert.pem"
x509_cert_dir="/etc/grid-security/certificates"
gridmap="/etc/grid-security/voms-grid-mapfile"

Those files all have the correct permissions on them.
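
For reference, "correct" for a host key usually means that only its owner can access it, while the certificate itself may be world-readable. The snippet below is a minimal sketch of that check under this assumed convention; key_is_private is a hypothetical helper name, and the demonstration runs on a throwaway file rather than on /etc/grid-security/hostkey.pem.

```python
import os
import stat
import tempfile

def key_is_private(path):
    """Return True if no group/other permission bits are set on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0

# Demonstration on a temporary file (stand-in for a real key file).
with tempfile.NamedTemporaryFile(delete=False) as f:
    demo = f.name

os.chmod(demo, 0o400)          # owner read-only, as is typical for a host key
print(key_is_private(demo))

os.chmod(demo, 0o644)          # readable by everyone, fine for a certificate only
print(key_is_private(demo))

os.remove(demo)
```

Pointing the helper at the key path from arc.conf, as the user the daemon runs as, would confirm the same thing on the real file.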

Thanks, Iain

[1]
03/27/15 10:14:32 (pid:25852) Received a superuser command
03/27/15 10:14:32 (pid:25852) condor_write(): Socket closed when trying to write 13 bytes to <128.142.132.67:32931>, fd is 15, errno=104 Connection reset by peer
03/27/15 10:14:32 (pid:25852) Buf::write(): condor_write() failed
03/27/15 10:14:32 (pid:25852) AUTHENTICATE: handshake failed!
03/27/15 10:14:32 (pid:25852) DC_AUTHENTICATE: required authentication of 128.142.132.67 failed: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXX8epCBM)|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.

[2]
~]# cat /proc/mounts 
rootfs / rootfs rw 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,seclabel,relatime 0 0
devtmpfs /dev devtmpfs rw,seclabel,relatime,size=8154428k,nr_inodes=2038607,mode=755 0 0
devpts /dev/pts devpts rw,seclabel,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /dev/shm tmpfs rw,seclabel,relatime 0 0
/dev/mapper/VolGroup00-LogVol00 / ext4 rw,seclabel,relatime,barrier=1,data=ordered 0 0
none /selinux selinuxfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,seclabel,relatime,size=8154428k,nr_inodes=2038607,mode=755 0 0
/proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0
/dev/vda1 /boot ext4 rw,seclabel,relatime,barrier=1,data=ordered 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
AFS /afs afs rw,relatime 0 0
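
To read the listing above mechanically, one can look for the mount entry whose mount point is the longest prefix of /tmp. This is only a sketch, assuming input in /proc/mounts format; mount_for is a made-up helper name, and the sample text is a subset of the lines shown above.

```python
import os

def mount_for(path, mounts_text):
    """Return (device, mount_point, fstype) of the entry covering `path`.

    `mounts_text` is text in /proc/mounts format. If several entries share
    the same mount point, the later one wins, since it shadows the earlier.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        device, mount_point, fstype = fields[0], fields[1], fields[2]
        # An entry covers the path if it is "/", the path itself, or a parent.
        covers = (
            mount_point == "/"
            or path == mount_point
            or path.startswith(mount_point.rstrip("/") + "/")
        )
        if covers and (best is None or len(mount_point) >= len(best[1])):
            best = (device, mount_point, fstype)
    return best

sample = """\
rootfs / rootfs rw 0 0
proc /proc proc rw,relatime 0 0
/dev/mapper/VolGroup00-LogVol00 / ext4 rw,seclabel,relatime,barrier=1,data=ordered 0 0
/dev/vda1 /boot ext4 rw,seclabel,relatime,barrier=1,data=ordered 0 0
"""

print(mount_for("/tmp", sample))
```

For this listing the covering entry is the ext4 root volume, so /tmp is not a separate mount in this view. Note that the mount table is per process: comparing /proc/<pid>/mounts for the schedd and for the ARC process would be the stronger check.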

________________________________________
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Brian Bockelman [bbockelm@xxxxxxxxxxx]
Sent: 26 March 2015 15:30
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Configuring a CE/Schedd

Hi Iain,

So, if this is in the schedd log, it's likely Arc trying to contact the Schedd?

That means we're looking at the ALLOW_READ statement and the SEC_*_AUTHENTICATION_METHODS (DEFAULT or READ).

However, since the authentication itself failed, it's probably not ALLOW_READ.

Picking apart the error message:

>> 03/24/15 19:13:14 DC_AUTHENTICATE: required authentication of 128.142.132.67 failed:

I assume this is localhost, right?

>> AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|

This is a curious one.  If Arc and the schedd are on the same filesystem, they should be able to communicate via /tmp.  Are you using any "filesystem magic" that might make the schedd and arc have unique /tmp mounts?
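
For context: with FS authentication the client has to create an entry under /tmp and the daemon then checks who owns it, which is why both sides need to see the same /tmp. The Python sketch below only illustrates that idea; it is not HTCondor's code, the function names are invented, and the real exchange of the file name between daemon and client is left out.

```python
import os
import tempfile

def client_create(directory):
    """Client side: create an entry owned by the calling user, return its path."""
    return tempfile.mkdtemp(prefix="FS_", dir=directory)

def server_check(path):
    """Server side: lstat the path; return the owner's uid, or None if not visible."""
    try:
        return os.lstat(path).st_uid
    except OSError:
        # Analogous to the "Unable to lstat(...)" entry in the log: the
        # server cannot see what the client was supposed to create.
        return None

shared = tempfile.mkdtemp()        # stand-in for a /tmp that both sides can see
path = client_create(shared)
print(server_check(path) == os.getuid())   # ownership proves the client's identity

os.rmdir(path)                     # simulate the entry never appearing
print(server_check(path))
os.rmdir(shared)
```

The second call models the failure in the log: if the entry never shows up where the daemon looks, whether because it was not created or because the two processes see different /tmp directories, the lstat fails and FS authentication is abandoned.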

>> AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|

This is probably expected.

>> AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.


Possibly Arc doesn't have X509_USER_PROXY set either?
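
Since the whole error arrives as one long line, splitting it on "|" makes the per-method failures easier to read. A minimal sketch, assuming the SUBSYSTEM:CODE:message layout seen in the log line quoted above; parse_error_stack is a hypothetical helper, not part of HTCondor.

```python
def parse_error_stack(text):
    """Split a '|'-separated error stack into (subsystem, code, message) tuples."""
    entries = []
    for part in text.split("|"):
        part = part.strip()
        if not part:
            continue
        # Split on the first two colons only, so messages may contain ':'.
        subsystem, code, message = part.split(":", 2)
        entries.append((subsystem, int(code), message))
    return entries

stack = (
    "AUTHENTICATE:1002:Failure performing handshake|"
    "AUTHENTICATE:1004:Failed to authenticate using FS|"
    "FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|"
    "AUTHENTICATE:1004:Failed to authenticate using FS|"
    "AUTHENTICATE:1004:Failed to authenticate using KERBEROS|"
    "AUTHENTICATE:1004:Failed to authenticate using GSI|"
    "GSI:5002:Failed to authenticate because the remote (client) side "
    "was not able to acquire its credentials."
)

for subsystem, code, message in parse_error_stack(stack):
    print(f"{subsystem:<12} {code}  {message}")
```

Read this way, the stack shows each configured method (FS, KERBEROS, GSI) being tried and failing in turn, with the FS and GSI entries carrying the method-specific reasons.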

Brian

> On Mar 26, 2015, at 5:14 AM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
>
> Hi Brian,
>
> The messages are from /var/log/condor/SchedLog.
>
> I figured it might have something to do with local permissions to certificate files etc.
>
> It seems to be unique to the scheduler daemon, as I'm not experiencing this on the collector or the worker nodes. I've included a dump of all the schedd values in the condor config (*).
>
> The machine is definitely configured with a host certificate as it uses it when communicating with other infrastructure and the ARC CE also uses it.
>
> The permissions of the certificate and key also match up with the other machines.
>
> Thanks, Iain
>
> (*)
> ~]# condor_config_val -expand -dump | grep SCHEDD
> ALLOW_NEGOTIATOR_SCHEDD = central-manager@xxxxxxx/*.cern.ch
> COLLECTOR.ALLOW_ADVERTISE_SCHEDD = computing-element@xxxxxxx/*.cern.ch,schedd@xxxxxxx/*.cern.ch
> DAEMON_LIST = MASTER, SHARED_PORT, SCHEDD
> GRIDMANAGER_CONTACT_SCHEDD_DELAY = 5
> MAX_NUM_SCHEDD_LOG = 10
> MAX_SCHEDD_LOG = 104857600
> SCHEDD = /usr/sbin/condor_schedd
> SCHEDD.ALLOW_READ =  *@cern.ch/ce501.cern.ch,central-manager@xxxxxxx/*.cern.ch,computing-element@xxxxxxx/*.cern.ch,schedd@xxxxxxx/*.cern.ch,worker-node@xxxxxxx/*.cern.ch
> SCHEDD.ALLOW_WRITE = *@fsauth/ce501.cern.ch,central-manager@xxxxxxx/*.cern.ch,computing-element@xxxxxxx/*.cern.ch
> SCHEDD.SEC_DAEMON_AUTHENTICATION_METHODS = GSI,KERBEROS,FS
> SCHEDD_ADDRESS_FILE = /var/lib/condor/spool/.schedd_address
> SCHEDD_BACKUP_SPOOL =
> SCHEDD_CRON_NAME =
> SCHEDD_DAEMON_AD_FILE = /var/lib/condor/spool/.schedd_classad
> SCHEDD_DEBUG = D_PID
> SCHEDD_INTERVAL =
> SCHEDD_JOB_QUEUE_LOG_FLUSH_DELAY = 5
> SCHEDD_LOG = /var/log/condor/SchedLog
> SCHEDD_MAX_FILE_DESCRIPTORS = 4096
> SCHEDD_MIN_INTERVAL = 5
> SCHEDD_NAME =
> SCHEDD_PREEMPTION_RANK =
> SCHEDD_PREEMPTION_REQUIREMENTS =
> SCHEDD_QUERY_WORKERS = 6
> SCHEDD_ROUND_ATTR_ProportionalSetSizeKb = 25%
> SCHEDD_ROUND_ATTR_ResidentSetSize = 25%
> SCHEDD_SEND_VACATE_VIA_TCP = false
> SCHEDD_SUPER_ADDRESS_FILE = /var/lib/condor/spool/.schedd_address.super
> SCHEDDS = schedd@xxxxxxx/*.cern.ch
> SETTABLE_ATTRS_ADVERTISE_SCHEDD =
> STATISTICS_WINDOW_QUANTUM_SCHEDD =
>
> ________________________________________
> From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Brian Bockelman [bbockelm@xxxxxxxxxxx]
> Sent: 25 March 2015 18:37
> To: HTCondor-Users Mail List
> Subject: Re: [HTCondor-users] Configuring a CE/Schedd
>
> Hi Iain,
>
> From the message:
>
>> 03/24/15 19:13:14 DC_AUTHENTICATE: required authentication of 128.142.132.67 failed: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.
>
> The important part of the message is:
>
> "the remote (client) side was not able to acquire its credentials"
>
> This indicates that the schedd isn’t using its certificate (or isn’t configured with one).
>
> Is this from a shadow log?  If so, you don't want to be using any of these methods - you should be using match auth for your setup.  Perhaps that's something which got lost in the merge?
>
> Brian
>
>> On Mar 24, 2015, at 1:48 PM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
>>
>> Hi,
>>
>> I'm in the process of finalizing the CE/Schedd setup for our pool; we're using Puppet.
>>
>> I had the CE working and acting as a scheduler with a manual config and decided to move it to the HEP-Puppet/htcondor module.
>>
>> This is the output I get in SchedLog (*). I've removed the IP, but it's the machine's own IP in all instances.
>>
>> After this it just proceeds to spam condor_write errors until it fills the log file and starts a new one.
>>
>> The CE is in the certificate mapfile along with all the other hosts and, apart from the ordering of hostnames, a vimdiff shows no difference between the security config file for this machine and the one that the central manager uses.
>>
>> Has anyone else experienced this issue?
>>
>> Thanks, Iain
>>
>> (*)
>> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'ScheddIpAddr' <MACHINE_IP:9618?noUDP&sock=17305_aee5_3> == <MACHINE_IP:9618?noUDP&sock=17305_aee5_3>, but old logic couldn't find the command port for outbound interface MACHINE_IP.
>> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'ScheddIpAddr' address in ad (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>) == command socket (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>), but old logic couldn't find that command socket in its list.
>> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'MyAddress' <MACHINE_IP:9618?noUDP&sock=17305_aee5_3> == <MACHINE_IP:9618?noUDP&sock=17305_aee5_3>, but old logic couldn't find the command port for outbound interface MACHINE_IP.
>> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'MyAddress' address in ad (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>) == command socket (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>), but old logic couldn't find that command socket in its list.
>> 03/24/15 19:12:33 -------- Begin starting jobs --------
>> 03/24/15 19:12:33 -------- Done starting jobs --------
>> 03/24/15 19:13:14 Received a superuser command
>> 03/24/15 19:13:14 This process has a valid certificate & key
>> 03/24/15 19:13:14 Failed to read end of message from <MACHINE_IP:34711>; 1280 untouched bytes.
>> 03/24/15 19:13:14 condor_write(): Socket closed when trying to write 13 bytes to <MACHINE_IP:34711>, fd is 15, errno=104 Connection reset by peer
>> 03/24/15 19:13:14 Buf::write(): condor_write() failed
>> 03/24/15 19:13:14 condor_read(): Socket closed when trying to read 5 bytes from <MACHINE_IP:34711> in non-blocking mode
>> 03/24/15 19:13:14 IO: EOF reading packet header
>> 03/24/15 19:13:14 condor_read(): Socket closed when trying to read 5 bytes from <MACHINE_IP:34711>
>> 03/24/15 19:13:14 IO: EOF reading packet header
>> 03/24/15 19:13:14 AUTHENTICATE: handshake failed!
>> 03/24/15 19:13:14 DC_AUTHENTICATE: required authentication of 128.142.132.67 failed: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
>
>