Re: [HTCondor-users] Condor_ssh_to_job not working with Shared Port across WAN
- Date: Mon, 15 Oct 2018 21:32:24 +0000
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Condor_ssh_to_job not working with Shared Port across WAN
On 10/12/2018 12:33 PM, Edgar M Fajardo Hernandez wrote:
> Hi Todd,
>
> Thanks for the answer. But I thought there had been some rework on
> condor_ssh_to_job, because I still cannot use it over the WAN (as root or
> non-root).
Hi,
Let me see if I understand.
In your initial email on Oct 11, you could not get either condor_ssh_to_job or condor_tail to work.
Then, after I suggested you look at
https://lists.cs.wisc.edu/archive/htcondor-users/2018-February/msg00104.shtml
you got condor_tail working, but condor_ssh_to_job is still not working for you.
Perhaps the problem is that you are trying to condor_ssh_to_job into a Singularity container? condor_ssh_to_job
works with both vanilla jobs and Docker universe jobs (as of HTCondor v8.7.7), but support for condor_ssh_to_job into
a Singularity container has not yet been released.
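
As an aside for reading the debug output quoted below: the StarterIpAddr value is a "sinful string" whose query part carries the CCBID, PrivNet, addrs and noUDP parameters. A CCBID entry indicates the starter is registered with a CCB server for reverse connections. The following is only a minimal Python sketch for splitting such a string into its parts; parse_sinful is a hypothetical helper name and is not part of HTCondor or its Python bindings.

```python
from urllib.parse import unquote

# Example value copied from the StarterIpAddr line in the quoted log.
EXAMPLE = (
    "<169.228.131.243:39092?CCBID=169.228.130.106:9622%3faddrs%3d"
    "169.228.130.106-9622+[--1]-9622#1251&PrivNet=cabinet-0-0-11.t2.ucsd.edu"
    "&addrs=169.228.131.243-39092+[--1]-39092&noUDP>"
)


def parse_sinful(sinful):
    """Hypothetical helper: split an IPv4 sinful string into host, port, params."""
    # Drop surrounding whitespace and the enclosing angle brackets.
    body = sinful.strip().lstrip("<").rstrip(">")
    # Everything before the first '?' is host:port; the rest is the query part.
    hostport, _, query = body.partition("?")
    host, _, port = hostport.rpartition(":")
    params = {}
    for item in filter(None, query.split("&")):
        # Flags such as noUDP have no '=' and end up with an empty value.
        key, _, value = item.partition("=")
        # Percent-decode only; '+' is kept because it separates address entries.
        params[key] = unquote(value)
    return {"host": host, "port": int(port), "params": params}


info = parse_sinful(EXAMPLE)
print(info["host"], info["port"])
for key, value in info["params"].items():
    print(key, "=", value)
```

If CCBID shows up in the parsed parameters, the connection to the starter is expected to be brokered through that CCB server rather than made directly to the starter's own port.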
regards,
Todd
>
> [1032] alarmstr@uclhc-1 ~$ condor_ssh_to_job -debug 856033.29
> 10/12/18 10:32:56 SharedPortClient: sent connection request to schedd at
> <192.5.19.13:9615> for shared port id 1256425_f007_4
> 10/12/18 10:32:56 SharedPortClient: sent connection request to local
> schedd for shared port id 1256425_f007_4
> 10/12/18 10:32:56 Response for GET_JOB_CONNECT_INFO:
> StarterIpAddr =
> "<169.228.131.243:39092?CCBID=169.228.130.106:9622%3faddrs%3d169.228.130.106-9622+[--1]-9622#1251&PrivNet=cabinet-0-0-11.t2.ucsd.edu&addrs=169.228.131.243-39092+[--1]-39092&noUDP>"
> RemoteHost =
> "slot1_7@glidein_113404_107989154@cabinet-0-0-11.t2.ucsd.edu"
> Result = true
> ServerTime = 1539365576
> CondorVersion = "$CondorVersion: 8.6.12 Jul 31 2018 BuildID: 446077 $"
>
> 10/12/18 10:32:56 Got connect info for starter
> <169.228.131.243:39092?CCBID=169.228.130.106:9622%3faddrs%3d169.228.130.106-9622+[--1]-9622#1251&PrivNet=cabinet-0-0-11.t2.ucsd.edu&addrs=169.228.131.243-39092+[--1]-39092&noUDP>
> 10/12/18 10:32:56 No shared_port cookie available; will fall back to
> using on-disk $(DAEMON_SOCKET_DIR)
> 10/12/18 10:32:56 No shared_port cookie available; will fall back to
> using on-disk $(DAEMON_SOCKET_DIR)
> 10/12/18 10:32:56 Executing ssh command: ssh -oUser=cuser2
> -oIdentityFile=/tmp/alarmstr.condor_ssh_to_job_8d60cdd6/ssh_key
> -oStrictHostKeyChecking=yes
> -oUserKnownHostsFile=/tmp/alarmstr.condor_ssh_to_job_8d60cdd6/known_hosts -oGlobalKnownHostsFile=/tmp/alarmstr.condor_ssh_to_job_8d60cdd6/known_hosts
> -oProxyCommand="condor_ssh_to_job"' '"-debug"' '"-proxy"'
> '"/tmp/alarmstr.condor_ssh_to_job_8d60cdd6/fdpass"
> condor-job.cabinet-0-0-11.t2.ucsd.edu
> 10/12/18 10:32:56 Passed ssh connection to ssh proxy.
> 10/12/18 10:32:56 Setting up ssh proxy on file descriptor 4
> ssh_exchange_identification: Connection closed by remote host
> 10/12/18 10:32:57 Attempting to remove
> /tmp/alarmstr.condor_ssh_to_job_8d60cdd6 as unknown user
>
> However, I can run condor_tail.
>
>
> Edgar M Fajardo Hernandez
> emfajardohernandez@xxxxxxxxxxxxxxxx
>
>
>
>> On Oct 11, 2018, at 3:23 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>>
>> On 10/11/2018 3:48 PM, Edgar M Fajardo Hernandez wrote:
>>>> It would seem to me that the starter is not aware that the submit host
>>>> is using the shared port, since it is trying to connect back to it on
>>>> ephemeral ports rather than on the shared port, 9615.
>>>>
>>>> condor_tail shows a similar error:
>>>>
>>>> [1327] dantrim@uclhc-1 ~$ condor_tail -debug 856006.142
>>>> 10/10/18 13:27:22 Requesting GoAhead from the transfer queue manager.
>>>> 10/10/18 13:27:22 Received GoAhead from the transfer queue manager.
>>>> 10/10/18 13:27:22 CCBClient: received failure message from CCB server
>>>> collector 169.228.130.106:9647?addrs=169.228.130.106-9647+[--1]-9647
>>>> in response to request for reversed connection to starter at
>>>> <169.228.132.166:2574>: failed to connect
>>>> 10/10/18 13:27:22 Failed to reverse connect to starter at
>>>> <169.228.132.166:2574> via CCB.
>>>> Failed to peek at file from starter: Failed to connect to starter
>>>>
>>>> However it works when I run it as root:
>>>>
>> [snip]
>>>>
>>>> Any ideas here to try?
>>>>
>>>>
>>
>> Yep.
>>
>> My guess is you are encountering the same issue as back in Feb.
>>
>> Refer to
>>
>> https://lists.cs.wisc.edu/archive/htcondor-users/2018-February/msg00104.shtml
>>
>> for solutions.
>>
>> Best regards,
>> Todd
>>
>
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685