|
This is a bug introduced in the 24.X series, which I've now identified. I'm working on a fix.
- Jaime
On May 20, 2026, at 10:01 AM, Nicholas Peregonow via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
FNAL is also seeing this problem after recently upgrading one of our CEs from condor 24.0.9 to condor 25.0.10 and htcondor-ce-25.0.1. I haven't seen this behavior at all on the 24 LTS series.
Nick
________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Antonio Delgado Peris via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, May 20, 2026 2:13 AM
To: Cole Bollig; HTCondor-Users Mail List
Cc: Antonio Delgado Peris
Subject: Re: [HTCondor-users] OsUser job-router crashes
Dear Dan, Cole,
Did you learn anything more about the issue in this thread?
At CERN, we have observed the same problem since upgrading some CEs to condor-24.12.20 and htcondor-ce-24.2.0. Most jobs run and complete fine, but a few cause the Job Router to crash with "Failed to find OsUser or User in job ad", and the batch job is then removed with RemoveReason = "JobRouter aborted job (by user condor)". The ClassAd printed just before the error message indeed contains no User attribute, but the ClassAds of the CE and batch jobs retrieved from the history do contain that attribute, just as for other jobs of the same users (where User is equal to Owner@xxxxxxx). We have seen this happen mostly to CMS jobs but also to other users (VOs), and all of these users run many other jobs without issues.
Based on previous comments in this thread, we tried updating the CE to condor-25.0.10 and htcondor-ce-25.0.1, but it hasn't helped. The same error occurs (randomly). Maybe 25.10 (or another version) would help?
Do you have any other insight into this problem?
Regards,
Antonio
________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Whitehouse, Dan <d.whitehouse@xxxxxxxxxxxxxx>
Sent: Thursday, March 26, 2026 12:51
To: Cole Bollig <cabollig@xxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] OsUser job-router crashes
Hi Cole,
We get job submissions from a few schedd servers, but we believe that we have tracked the problem down to one running version 24.9.2. We also believe that the problem has reduced a bit, presumably because one or more large sites have updated to 25. We do see jobs from other schedd servers running 24.9.2 that don't appear to cause problems, so we suspect that another condition may be in play. We'll keep investigating on our side.
Thanks for your help.
Dan
From: Cole Bollig <cabollig@xxxxxxxx>
Date: Wednesday, 25 March 2026 at 18:20
To: Whitehouse, Dan <d.whitehouse@xxxxxxxxxxxxxx>, HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: OsUser job-router crashes
CAUTION: This message came from outside Imperial. Do not click links or open attachments unless you recognise the sender and were expecting this email.
Hi Dan,
Another question: What is the version of the HTCondor Schedd that is sending jobs to your CE? Is this also v25 or something older like v24?
-Cole Bollig
________________________________
From: Whitehouse, Dan <d.whitehouse@xxxxxxxxxxxxxx>
Sent: Wednesday, March 25, 2026 6:11 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Re: OsUser job-router crashes
Hi Cole,
Thanks very much for your assistance with this.
The additional diagnostics are as follows:
Arguments = ""
BatchQueue = ""
BatchRuntime = 259200
BytesRecvd = 28044.0
BytesSent = 73780.0
CERequirements = "CondorCE"
ClusterId = 1639290
Cmd = "DIRAC_9xq1tffd_pilotwrapper.py"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0
CompletionDate = 0
CondorCE = 1
CpusProvisioned = 8
CpusUsage = 0.1141048636842512
CumulativeRemoteSysCpu = 0.0
CumulativeRemoteUserCpu = 0.0
CumulativeSlotTime = 0
CumulativeSuspensionTime = 0
CurrentHosts = 0
DiskProvisioned = 7671
DiskUsage = 2250000
DiskUsage_RAW = 2250000
EnteredCurrentStatus = 1774431475
Environment = "CONDORCE_COLLECTOR_HOST=<...>:9619 DIRAC_PILOT_STAMP=<...> HTCONDOR_JOBID=1328108.2 LANG=en_GB.UTF-8"
Err = "1328108.2.err"
ExecutableSize = 30
ExecutableSize_RAW = 28
ExitBySignal = false
ExitCode = 1
ExitStatus = 0
HoldReason = "The job attribute OnExitHold expression 'ExitCode =!= 0' evaluated to TRUE"
HoldReasonCode = 3
HoldReasonSubCode = 55
ImageSize = 500000
ImageSize_RAW = 500000
In = "/dev/null"
Iwd = "/var/lib/condor-ce/spool/8108/2/cluster1328108.proc2.subproc0"
JobCurrentStartDate = 1774344542
JobCurrentStartExecutingDate = 1774344542
JobIsRunning = (JobStatus =!= 1) && (JobStatus =!= 5)
JobLeaseDuration = 2400
JobMemory = RequestMemory
JobNotification = 0
JobPrio = 0
JobRunCount = 1
JobStartDate = 1774344542
JobStatus = 1
JobSubmitFile = "/opt/dirac/data/HTCondor/work/HTCondorCE_uzpqa_kg.sub"
JobSubmitMethod = 0
JobUniverse = 5
KillSig = "SIGTERM"
LastHoldReason = "Spooling input data files"
LastHoldReasonCode = 16
LastJobStatus = 5
LastReleaseReason = "Data files spooled"
LastSuspensionTime = 0
LeaveJobInQueue = JobStatus == 4
Managed = "Schedd"
ManagedManager = ""
MaxHosts = 1
MemoryProvisioned = 2048
MemoryUsage = ((ResidentSetSize + 1023) / 1024)
MinHosts = 1
MyType = "Job"
NumCkpts = 0
NumCkpts_RAW = 0
NumHolds = 1
NumHoldsByReason = [ JobPolicy = 1 ]
NumJobCompletions = 0
NumJobMatches = 1
NumJobStarts = 1
NumRestarts = 0
NumShadowStarts = 1
NumSystemHolds = 0
=!= 0
>
Out = "1328108.2.out"
Owner = "<...>"
ProcId = 0
QDate = 1774431475
Rank = 0.0
ReleaseReason = "Data files spooled"
RemoteSysCpu = 0.0
RemoteUserCpu = 0.0
Remote_JobUniverse = 5
RequestCpus = 8
RequestDisk = DiskUsage
RequestMemory = 2000
Requirements = (NumJobStarts == 0) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer)
ResidentSetSize = 47500
ResidentSetSize_RAW = 47500
RouteName = "Condor_Pool"
RoutedBy = "htcondor-ce"
RoutedFromJobId = "1328108.2"
RoutedJob = true
SUBMIT_Cmd = "/opt/dirac/data/HTCondor/work/DIRAC_9xq1tffd_pilotwrapper.py"
SUBMIT_UserLog = "/opt/dirac/data/HTCondor/work/<...>/3/17/1328108.2.log"
SUBMIT_x509userproxy = "/tmp/tmpg2sfoji_"
ScitokensFile = "/opt/dirac/data/HTCondor/work/HTCondorCE_yxnffj4i.token"
ScratchDirFileCount = 55899
ShadowBday = 1774344542
ShouldTransferFiles = "YES"
SpooledOutputFiles = ""
StreamErr = false
StreamOut = false
TargetType = "Machine"
TotalSuspensions = 0
TransferIn = false
TransferInputSizeMB = 0
TransferOutput = ""
TransferOutputRemaps = undefined
WhenToTransferOutput = "ON_EXIT_OR_EVICT"
orig_AuthTokenId = "<...>"
orig_AuthTokenIssuer = "https://dteam-auth.cern.ch/"
orig_AuthTokenScopes = "compute.create,compute.read,compute.cancel,compute.modify"
orig_AuthTokenSubject = "<...>"
orig_OnExitHold = ExitCode =!= 0
orig_OnExitHoldSubCode = 55
orig_environment = "DIRAC_PILOT_STAMP=<...> HTCONDOR_JOBID=1328108.2"
osg_environment = ""
remote_NodeNumber = 1
remote_SMPGranularity = 1
x509UserProxyEmail = "<...>"
x509UserProxyExpiration = 1774392262
x509UserProxyFQAN = "<...>"
x509UserProxyFirstFQAN = "/dteam/Role=NULL/Capability=NULL"
x509UserProxyVOName = "dteam"
x509userproxy = "tmpg2sfoji_"
x509userproxysubject = "<...>"
03/25/26 09:38:20 Failed to find OsUser or User in job ad.
03/25/26 09:38:20 ERROR "Failed to initialize user ids." at line 64 in file /var/lib/condor/execute/slot1/dir_3041422/scratch/userdir/build-b0FsGA/BUILD/condor-25.0.8/src/condor_utils/set_user_priv_from_ad.cpp
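The missing-attribute condition reported in these log lines can be checked offline against a saved copy of the ad. What follows is a minimal sketch in Python, assuming the ad has been saved as plain text with one `Name = value` attribute per line; the function names are hypothetical, and the code neither uses nor mirrors HTCondor's own implementation. It only reports which of the two attributes named in the error message are absent, comparing names case-insensitively as ClassAd attribute names are.

```python
import re

# The two attribute names are taken from the error text in the log;
# everything else in this sketch is an illustrative assumption.
IDENTITY_ATTRS = ("OsUser", "User")

# One attribute per line: a name, an equals sign, then the raw value.
ATTR_LINE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*)$")


def parse_ad_dump(text):
    """Parse 'Name = value' lines into a dict of raw value strings.

    Lines that do not start with an attribute name (for example the
    timestamped log lines) are skipped.
    """
    ad = {}
    for line in text.splitlines():
        match = ATTR_LINE.match(line)
        if match:
            ad[match.group(1)] = match.group(2).strip()
    return ad


def missing_identity_attrs(ad):
    """Return the names in IDENTITY_ATTRS that the ad does not contain."""
    present = {name.lower() for name in ad}
    return [attr for attr in IDENTITY_ATTRS if attr.lower() not in present]


if __name__ == "__main__":
    sample = 'Owner = "<...>"\nRoutedJob = true\nJobStatus = 1\n'
    print(missing_identity_attrs(parse_ad_dump(sample)))
```

An ad that carries Owner but neither identity attribute, like the one in this message, would be expected to come back with both names listed as missing.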
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, 24 March 2026 at 21:09
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Re: [HTCondor-users] OsUser job-router crashes
Hi Dan,
This issue seems different from (but possibly related to) the issue in the previous threads. Looking at the code, this error should print out a ClassAd (key = value pairs) before the provided debug lines. Could you share those lines? Feel free to clean them up as needed (remove usernames, hostnames, etc.) or send them directly to me.
-Cole Bollig
________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Whitehouse, Dan <d.whitehouse@xxxxxxxxxxxxxx>
Sent: Tuesday, March 24, 2026 11:43 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] OsUser job-router crashes
Hi,
I know there have been some other threads about this, but we are still seeing regular and frequent job-router crashes with the following:
03/22/26 03:14:12 Failed to find OsUser or User in job ad.
03/22/26 03:14:12 ERROR "Failed to initialize user ids." at line 64 in file /var/lib/condor/execute/slot1/dir_3041422/scratch/userdir/build-b0FsGA/BUILD/condor-25.0.8/src/condor_utils/set_user_priv_from_ad.cpp
Looking back through the list, I note that at least one other user appears to have experienced a similar issue, but that it was apparently resolved when they updated to version 25. Our entire estate (worker nodes and CEs) is now running 25, but we haven't seen any reduction in these errors.
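To put a number on "regular and frequent", the crash lines can be counted per day from a copy of the job-router log. This is a rough sketch in Python, assuming each log line begins with an `MM/DD/YY HH:MM:SS` timestamp as in the excerpt above; the function name is hypothetical, and the log file location is left to the caller because it depends on the site's configuration.

```python
import re
from collections import Counter

# Error text copied from the log excerpt in this message.
ERROR_TEXT = "Failed to find OsUser or User in job ad."

# Assumed line layout: date, time, then the rest of the message.
LOG_LINE = re.compile(r"^(\d{2}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$")


def count_failures_per_day(lines):
    """Count occurrences of ERROR_TEXT, keyed by the date string of each line."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        # A substring test tolerates extra fields between timestamp and message.
        if match and ERROR_TEXT in match.group(3):
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_failures.py <path to a copy of the job-router log>
    with open(sys.argv[1], errors="replace") as handle:
        for day, n in sorted(count_failures_per_day(handle).items()):
            print(day, n)
```

Comparing the per-day counts before and after an upgrade gives a simple way to tell whether a given version changed the crash rate.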
Any assistance would be greatly appreciated.
Thanks,
--
Dan Whitehouse
Imperial College HEP
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/
|