I can successfully add the machine using bosco_cluster -a mpotts@poi pbs (note: I used a fully qualified address; I just removed the trailing section for this post).
However, when I run bosco_cluster -t mpotts@poi, the test fails with the following--
Testing ssh to mpotts@xxxxxxxxxxxx!
Testing bosco submission...Passed!
Submission and log files for this job are in /data/users/mpotts/condor-scratch/bosco-test/boscotest.2IHjF
Waiting for jobmanager to accept job...Passed
Checking for submission to remote pbs cluster (could take ~30 seconds)...Failed
Showing last 5 lines of logs:
07/18/19 15:31:29 [38206] Gahp Server (pid=38233) exited with status 127 unexpectedly
07/18/19 15:31:31 [38206] gahp server not up yet, delaying ping
07/18/19 15:31:31 [38206] No jobs left, shutting down
07/18/19 15:31:31 [38206] Got SIGTERM. Performing graceful shutdown.
07/18/19 15:31:31 [38206] **** condor_gridmanager (condor_GRIDMANAGER) pid 38206 EXITING WITH STATUS 0
The gridmanager log on the submitting machine shows this--
07/18/19 15:31:23 ******************************************************
07/18/19 15:31:23 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
07/18/19 15:31:23 ** /s4data/users/mpotts/condor/sbin/condor_gridmanager
07/18/19 15:31:23 ** SubsystemInfo: name=GRIDMANAGER type=DAEMON(12) class=DAEMON(1)
07/18/19 15:31:23 ** Configuration: subsystem:GRIDMANAGER local:<NONE> class:DAEMON
07/18/19 15:31:23 ** $CondorVersion: 8.8.4 Jul 09 2019 BuildID: 474941 $
07/18/19 15:31:23 ** $CondorPlatform: x86_64_RedHat6 $
07/18/19 15:31:23 ** PID = 38206
07/18/19 15:31:23 ** Log last touched 7/18 15:26:32
07/18/19 15:31:23 ******************************************************
07/18/19 15:31:23 Using config source: /s4data/users/mpotts/condor/etc/condor_config
07/18/19 15:31:23 Using local config sources:
07/18/19 15:31:23 /data/users/mpotts/condor-scratch/config/condor_config.bosco_routing
07/18/19 15:31:23 /data/users/mpotts/condor-scratch/config/condor_config.factory
07/18/19 15:31:23 /data/users/mpotts/condor-scratch/condor_config.local
07/18/19 15:31:23 config Macros = 91, Sorted = 91, StringBytes = 3166, TablesBytes = 3340
07/18/19 15:31:23 CLASSAD_CACHING is ENABLED
07/18/19 15:31:23 Daemon Log is logging: D_ALWAYS D_ERROR
07/18/19 15:31:23 SharedPortEndpoint: waiting for connections to named socket 12713_355e_10
07/18/19 15:31:23 DaemonCore: command socket at <127.0.0.1:11000?addrs=127.0.0.1-11000&noUDP&sock=12713_355e_10>
07/18/19 15:31:23 DaemonCore: private command socket at <127.0.0.1:11000?addrs=127.0.0.1-11000&noUDP&sock=12713_355e_10>
07/18/19 15:31:26 [38206] Found job 20.0 --- inserting
07/18/19 15:31:26 [38206] gahp server not up yet, delaying ping
07/18/19 15:31:26 [38206] (20.0) doEvaluateState called: gmState GM_INIT, remoteState 0
07/18/19 15:31:26 [38206] GAHP server pid = 38233
07/18/19 15:31:29 [38206] Failed to read GAHP server version
07/18/19 15:31:29 [38206] (20.0) Error starting GAHP
07/18/19 15:31:29 [38206] Gahp Server (pid=38233) exited with status 127 unexpectedly
07/18/19 15:31:31 [38206] gahp server not up yet, delaying ping
07/18/19 15:31:31 [38206] No jobs left, shutting down
07/18/19 15:31:31 [38206] Got SIGTERM. Performing graceful shutdown.
07/18/19 15:31:31 [38206] **** condor_gridmanager (condor_GRIDMANAGER) pid 38206 EXITING WITH STATUS 0
Does anyone have any suggestions on how to get this to connect? I was able to successfully connect to a slurm-based resource using the same approach, so I am not sure what is going on or how to debug.
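One possibly relevant detail about the "exited with status 127" line: in POSIX shells, 127 is the status conventionally returned when a command cannot be found. If the GAHP is started through a shell, this might mean some executable could not be located, but that is only an assumption; the log itself does not confirm it. A minimal Python sketch of the convention (the helper name exit_status_of and the made-up command name are purely illustrative, not part of BOSCO or HTCondor):

```python
import subprocess


def exit_status_of(command: str) -> int:
    """Run a command line through sh and return its exit status."""
    result = subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


# A command name that (by assumption) does not exist on this machine.
missing = exit_status_of("definitely_not_a_real_command_xyz")
# "true" is a standard utility that always succeeds.
present = exit_status_of("true")

print("missing command ->", missing)
print("existing command ->", present)
```

On a POSIX system the first status should be 127 and the second 0, which is only a hint for where to start looking, not a diagnosis.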
Thanks!
-Mark
--
Mark A. Potts, Ph.D.
Sr. HPC Software Developer
RedLine Performance Solutions, LLC
Phone 202-744-9469
Mark.Potts@xxxxxxxx
mpotts@xxxxxxxxxxxxxxx