
Re: [HTCondor-users] condor_ssh_to_job & (remote) DAG

On 15 Jul 2025, at 23:23, Ben Jones via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Hi,

I'm curious whether this is less a consequence of a condor upgrade than of a machine reboot or new installation. Perhaps the AP does not have the Kerberos ticket stashed, and thus falls back to FS auth. If you submit a vanilla universe job locally on the AP, that should have the side effect of stashing the ticket there, and keeping it stashed. Does that fix the problem?



We never submit jobs directly on the AP, only dagman does that. There are tickets on the AP, 100% certain (condor_submit will fail if the credmon doesn't create the .cc). And this happens on all schedds of this version and above. I could maybe downgrade one tomorrow to compare.

Two APs, one downgraded, same config

[bejones@aiadm84 condor]$ condor_status -schedd -af Name CondorVersion
babybird01.cern.ch $CondorVersion: 23.0.16 2024-10-09 BuildID: 760880 PackageID: 23.0.16-1 $
babybird02.cern.ch $CondorVersion: 24.0.7 2025-04-22 BuildID: 803687 PackageID: 24.0.7-1 GitSHA: 51b71b5c $
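
For a pool with more than two APs, the same version comparison could be scripted. The following is only a minimal sketch, assuming the output format of `condor_status -schedd -af Name CondorVersion` shown above; the helper names `parse_schedd_versions` and `group_by_major` are hypothetical and not part of HTCondor, and the sample text is copied from the two lines above rather than queried live.

```python
import re

# Sample lines copied from the condor_status output above (not queried live).
SAMPLE = """\
babybird01.cern.ch $CondorVersion: 23.0.16 2024-10-09 BuildID: 760880 PackageID: 23.0.16-1 $
babybird02.cern.ch $CondorVersion: 24.0.7 2025-04-22 BuildID: 803687 PackageID: 24.0.7-1 GitSHA: 51b71b5c $
"""

# Schedd name, then the three numeric components of the version string.
LINE_RE = re.compile(r"^(\S+)\s+\$CondorVersion:\s+(\d+)\.(\d+)\.(\d+)\b")


def parse_schedd_versions(text):
    """Return {schedd_name: (major, minor, patch)} for each parseable line."""
    versions = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name = match.group(1)
            versions[name] = tuple(int(part) for part in match.groups()[1:])
    return versions


def group_by_major(versions):
    """Group schedd names by major version, e.g. {23: [...], 24: [...]}."""
    groups = {}
    for name, version in sorted(versions.items()):
        groups.setdefault(version[0], []).append(name)
    return groups


if __name__ == "__main__":
    grouped = group_by_major(parse_schedd_versions(SAMPLE))
    for major, names in sorted(grouped.items()):
        print(major, ", ".join(names))
```

Grouping by major version makes it easier to pick one schedd per version for the condor_ssh_to_job comparison that follows.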

DAG condor_ssh_to_job doesn't work on v24:

[bejones@aiadm84 condor]$ _condor_CREDD_HOST=babybird02.cern.ch _condor_SCHEDD_HOST=babybird02.cern.ch condor_q  -nobatch


-- Schedd: babybird02.cern.ch : <188.184.96.224:9618?... @ 07/16/25 12:33:02
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 188.0   bejones         7/16 12:28   0+00:03:55 R  0    3.0 long.sh
 190.0   bejones         7/16 12:30   0+00:02:37 R  0    0.7 condor_dagman -p 0 -f -l . -Lockfile test2.dag.lock -Dag test2.dag -MaxIdle 1000 -CsdVersion $CondorVersion:' '24.0.7' '2025-04-22' 'BuildID:' '803687' 'PackageID:' '24.0.7-1' 'GitSHA:' '51b71b5c' '$ -dagman /usr/bin
 191.0   bejones         7/16 12:30   0+00:02:15 R  0    0.0 long.sh

Total for query: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
Total for bejones: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for all users: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended

[bejones@aiadm84 condor]$ _condor_CREDD_HOST=babybird02.cern.ch _condor_SCHEDD_HOST=babybird02.cern.ch condor_ssh_to_job 188.0
Welcome to slot1_2@xxxxxxxxxxxxxxxxxxx!
Your condor job is running with pid(s) 3760851.
[bejones@b9jantest03 dir_3760847]$
logout
Connection to condor-job.b9jantest03.cern.ch closed.
[bejones@aiadm84 condor]$ _condor_CREDD_HOST=babybird02.cern.ch _condor_SCHEDD_HOST=babybird02.cern.ch condor_ssh_to_job 191.0
bejones is not authorized for access to the starter for job 191.0

DAG condor_ssh_to_job works on v23:

[bejones@aiadm84 condor]$ _condor_CREDD_HOST=babybird01.cern.ch _condor_SCHEDD_HOST=babybird01.cern.ch condor_q


-- Schedd: babybird01.cern.ch : <188.185.87.2:9618?... @ 07/16/25 12:34:37
OWNER   BATCH_NAME      SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
bejones ID: 932        7/16 12:26      _      1      _      1 932.0
bejones test.dag+933   7/16 12:26      _      1      _      2 934.0

Total for query: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
Total for bejones: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
Total for all users: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended

[bejones@aiadm84 condor]$ _condor_CREDD_HOST=babybird01.cern.ch _condor_SCHEDD_HOST=babybird01.cern.ch condor_ssh_to_job 932.0
Welcome to slot1_1@xxxxxxxxxxxxxxxxxxx!
Your condor job is running with pid(s) 3760490.
[bejones@b9jantest03 dir_3760486]$
logout
Connection to condor-job.b9jantest03.cern.ch closed.
[bejones@aiadm84 condor]$ _condor_CREDD_HOST=babybird01.cern.ch _condor_SCHEDD_HOST=babybird01.cern.ch condor_ssh_to_job 934.0
Welcome to slot1_1@xxxxxxxxxxxxxxxxxxxx!
Your condor job is running with pid(s) 3181247.
[bejones@b9jantest662 dir_3181244]$
logout
Connection to condor-job.b9jantest662.cern.ch closed.