
Re: [HTCondor-users] Condor_ssh_to_job not working with Shared Port across WAN



On 10/12/2018 12:33 PM, Edgar M Fajardo Hernandez wrote:
> Hi Todd,
> 
> Thanks for the answer. But I thought there had been some rework on 
> condor_ssh_to_job, because I still cannot use it over the WAN (as root or 
> non-root).

Hi,

Let me see if I understand - 

In your initial email on Oct 11, you could not get condor_ssh_to_job or condor_tail to work.

Then after I suggested you look at 
  
 https://lists.cs.wisc.edu/archive/htcondor-users/2018-February/msg00104.shtml

you now have condor_tail working, but condor_ssh_to_job is still not working for you.

Perhaps the problem is that you are trying to condor_ssh_to_job into a Singularity container?  condor_ssh_to_job 
works with both vanilla jobs and Docker universe jobs (as of HTCondor v8.7.7), but support for condor_ssh_to_job into 
a Singularity container has not yet been released.
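
As a rough illustration of the version cutoff mentioned above, here is a minimal Python sketch that extracts the numeric version from a `$CondorVersion: ... $` string (the format shown in the quoted debug output) and compares it against 8.7.7. The function names and the `MIN_DOCKER_SSH` constant are hypothetical, and this thread does not say which daemon's version is the one that matters, so treat the check as an assumption rather than a definitive test.

```python
import re

# Version quoted in this thread for condor_ssh_to_job into Docker universe jobs.
MIN_DOCKER_SSH = (8, 7, 7)  # hypothetical constant name


def parse_condor_version(text):
    """Return (major, minor, patch) from a '$CondorVersion: X.Y.Z ... $' string."""
    match = re.search(r"\$CondorVersion:\s*(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no CondorVersion string found")
    return tuple(int(part) for part in match.groups())


def supports_docker_ssh(text):
    """Compare as integer tuples so that, e.g., 8.10.0 sorts after 8.7.7."""
    return parse_condor_version(text) >= MIN_DOCKER_SSH


reported = "$CondorVersion: 8.6.12 Jul 31 2018 BuildID: 446077 $"
print(parse_condor_version(reported), supports_docker_ssh(reported))
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where "8.10.0" would sort before "8.7.7".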

regards,
Todd



> 
> [1032] alarmstr@uclhc-1 ~$ condor_ssh_to_job -debug 856033.29
> 10/12/18 10:32:56 SharedPortClient: sent connection request to schedd at 
> <192.5.19.13:9615> for shared port id 1256425_f007_4
> 10/12/18 10:32:56 SharedPortClient: sent connection request to local 
> schedd for shared port id 1256425_f007_4
> 10/12/18 10:32:56 Response for GET_JOB_CONNECT_INFO:
> StarterIpAddr = 
> "<169.228.131.243:39092?CCBID=169.228.130.106:9622%3faddrs%3d169.228.130.106-9622+[--1]-9622#1251&PrivNet=cabinet-0-0-11.t2.ucsd.edu&addrs=169.228.131.243-39092+[--1]-39092&noUDP>"
> RemoteHost = 
> "slot1_7@glidein_113404_107989154@cabinet-0-0-11.t2.ucsd.edu"
> Result = true
> ServerTime = 1539365576
> CondorVersion = "$CondorVersion: 8.6.12 Jul 31 2018 BuildID: 446077 $"
> 
> 10/12/18 10:32:56 Got connect info for starter 
> <169.228.131.243:39092?CCBID=169.228.130.106:9622%3faddrs%3d169.228.130.106-9622+[--1]-9622#1251&PrivNet=cabinet-0-0-11.t2.ucsd.edu&addrs=169.228.131.243-39092+[--1]-39092&noUDP>
> 10/12/18 10:32:56 No shared_port cookie available; will fall back to 
> using on-disk $(DAEMON_SOCKET_DIR)
> 10/12/18 10:32:56 No shared_port cookie available; will fall back to 
> using on-disk $(DAEMON_SOCKET_DIR)
> 10/12/18 10:32:56 Executing ssh command: ssh -oUser=cuser2 
> -oIdentityFile=/tmp/alarmstr.condor_ssh_to_job_8d60cdd6/ssh_key 
> -oStrictHostKeyChecking=yes 
> -oUserKnownHostsFile=/tmp/alarmstr.condor_ssh_to_job_8d60cdd6/known_hosts -oGlobalKnownHostsFile=/tmp/alarmstr.condor_ssh_to_job_8d60cdd6/known_hosts 
> -oProxyCommand="condor_ssh_to_job"' '"-debug"' '"-proxy"' 
> '"/tmp/alarmstr.condor_ssh_to_job_8d60cdd6/fdpass" 
> condor-job.cabinet-0-0-11.t2.ucsd.edu
> 10/12/18 10:32:56 Passed ssh connection to ssh proxy.
> 10/12/18 10:32:56 Setting up ssh proxy on file descriptor 4
> ssh_exchange_identification: Connection closed by remote host
> 10/12/18 10:32:57 Attempting to remove 
> /tmp/alarmstr.condor_ssh_to_job_8d60cdd6 as unknown user
> 
> I can, however, run condor_tail.
> 
> 
> Edgar M Fajardo Hernandez
> emfajardohernandez@xxxxxxxxxxxxxxxx
> 
> 
> 
>> On Oct 11, 2018, at 3:23 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>>
>> On 10/11/2018 3:48 PM, Edgar M Fajardo Hernandez wrote:
>>>> It would seem to me that the starter is not aware that the submit host
>>>> is using the shared port, since it is trying to connect back to it on
>>>> ephemeral ports rather than on the shared port, 9615.
>>>>
>>>> condor_tail shows a similar error:
>>>>
>>>> [1327] dantrim@uclhc-1 ~$ condor_tail -debug 856006.142
>>>> 10/10/18 13:27:22 Requesting GoAhead from the transfer queue manager.
>>>> 10/10/18 13:27:22 Received GoAhead from the transfer queue manager.
>>>> 10/10/18 13:27:22 CCBClient: received failure message from CCB server
>>>> collector 169.228.130.106:9647?addrs=169.228.130.106-9647+[--1]-9647
>>>> in response to request for reversed connection to starter at
>>>> <169.228.132.166:2574>: failed to connect
>>>> 10/10/18 13:27:22 Failed to reverse connect to starter at
>>>> <169.228.132.166:2574> via CCB.
>>>> Failed to peek at file from starter: Failed to connect to starter
>>>>
>>>> However it works when I run it as root:
>>>>
>> [snip]
>>>>
>>>> Any ideas here to try?
>>>>
>>>>
>>
>> Yep.
>>
>> My guess is you are encountering the same issue as back in Feb.
>>
>> Refer to
>>
>> https://lists.cs.wisc.edu/archive/htcondor-users/2018-February/msg00104.shtml
>>
>> for solutions.
>>
>> Best regards,
>> Todd
>>
> 
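
For anyone reading the StarterIpAddr value in the quoted debug output, the following is a minimal Python sketch that splits such a `<host:port?key=value&...>` address into its endpoint and parameters, so the CCBID and PrivNet entries are easier to see. The function name `parse_sinful` is hypothetical and this is not HTCondor's own parser; it assumes only the layout visible in the log above, with percent-encoded values and `+` used as a literal separator.

```python
from urllib.parse import unquote


def parse_sinful(sinful):
    """Split '<host:port?key=value&...>' into (host, port, params)."""
    body = sinful.strip()
    if body.startswith("<") and body.endswith(">"):
        body = body[1:-1]
    endpoint, _, query = body.partition("?")
    host, _, port = endpoint.rpartition(":")
    params = {}
    for item in query.split("&"):
        if not item:
            continue
        key, _, value = item.partition("=")
        # unquote() decodes %3f and %3d but leaves '+' untouched,
        # which matters because '+' separates entries in 'addrs'.
        params[key] = unquote(value)
    return host, int(port), params


# Address copied from the debug output quoted in this message.
STARTER_ADDR = (
    "<169.228.131.243:39092?CCBID=169.228.130.106:9622%3faddrs%3d"
    "169.228.130.106-9622+[--1]-9622#1251&PrivNet=cabinet-0-0-11.t2.ucsd.edu"
    "&addrs=169.228.131.243-39092+[--1]-39092&noUDP>"
)

host, port, params = parse_sinful(STARTER_ADDR)
print(host, port)
for key, value in params.items():
    print(f"  {key} = {value!r}")
```

Flags without a value, such as `noUDP`, end up in the dictionary with an empty string.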


-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685