
Re: [HTCondor-users] OsUser job-router crashes



Hi Cole,

Thanks for looking into this.

Currently, we don't have any running job with undefined User in any of our CEs. And, to the best of my knowledge, we don't have any job routes or hooks that try to modify User or Owner.

Cheers,
    Antonio

From: Cole Bollig <cabollig@xxxxxxxx>
Sent: Wednesday, May 20, 2026 16:57
To: Antonio Delgado Peris <Antonio.Delgado.Peris@xxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: OsUser job-router crashes
 
Hi Antonio, Dan,

I did some code diving, and it seems that the code that is likely blowing up was removed from the job router in HTCondor v25.5.1; see HTCONDOR-3364. Granted, I have not confirmed this via testing. Still, I am currently unsure why the job router has ClassAds with no User attribute, which is likely the true issue. Some things to look into that may help find the true culprit are the following:


-Cole Bollig

From: Antonio Delgado Peris <Antonio.Delgado.Peris@xxxxxxx>
Sent: Wednesday, May 20, 2026 2:13 AM
To: Cole Bollig <cabollig@xxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: OsUser job-router crashes
 
Dear Dan, Cole,

Did you learn anything more about the issue in this thread?

At CERN, we have observed the same problem ever since we upgraded some CEs to condor-24.12.20 and htcondor-ce-24.2.0. Most jobs run and complete fine, but a few cause the Job Router to crash with "Failed to find OsUser or User in job ad." and the batch job to be removed with RemoveReason = "JobRouter aborted job (by user condor)". The ClassAd shown just before the error message indeed has no User attribute, but the ClassAds of the CE and batch jobs retrieved by history do contain that attribute, just like other jobs of the same users (where User is equal to Owner@xxxxxxx). We have seen this happen mostly to CMS jobs but also to other users (VOs), and all of these users run many other jobs without issues.
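To make the comparison described above concrete, here is a minimal Python sketch that lists the attributes present in a history ad but absent from the ad dumped just before the crash. It assumes both ads have already been loaded into plain dicts; the function name `attrs_missing_from_dump` and the sample values are hypothetical, not taken from a real job.

```python
def attrs_missing_from_dump(dump_ad, history_ad):
    """Return attribute names found in history_ad but not in dump_ad.

    ClassAd attribute names are case-insensitive, so names are compared
    in lower case. Both arguments are plain dicts (name -> value).
    """
    dump_names = {name.lower() for name in dump_ad}
    return sorted(name for name in history_ad if name.lower() not in dump_names)


# Hypothetical, made-up ads for illustration only.
dump_ad = {"Owner": '"someuser"', "ClusterId": "1"}
history_ad = {"Owner": '"someuser"', "ClusterId": "1", "User": '"someuser@example"'}

print(attrs_missing_from_dump(dump_ad, history_ad))
```

If User is the only name reported for the affected jobs, that would support the idea that the attribute is lost somewhere between the schedd and the job router rather than never being set.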

Based on previous comments in this thread, we tried updating the CE to condor-25.0.10 and htcondor-ce-25.0.1, but it hasn't helped: the same error still occurs at random. Might 25.10 (or another version) help?

Do you have any other insight about this problem?

Regards,
    Antonio



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Whitehouse, Dan <d.whitehouse@xxxxxxxxxxxxxx>
Sent: Thursday, March 26, 2026 12:51
To: Cole Bollig <cabollig@xxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] OsUser job-router crashes
 
Hi Cole,
We get job submissions from a few schedd servers, but we believe that we have tracked the problem down to one running version 24.9.2. We also believe that the problem has reduced a bit, presumably because one or more large sites have updated to 25. We do see jobs from other schedd servers running 24.9.2 that don’t appear to cause problems, so we suspect that another condition may be in play. We’ll keep investigating on our side.

Thanks for your help.

Dan 

From: Cole Bollig <cabollig@xxxxxxxx>
Date: Wednesday, 25 March 2026 at 18:20
To: Whitehouse, Dan <d.whitehouse@xxxxxxxxxxxxxx>, HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: OsUser job-router crashes

CAUTION: This message came from outside Imperial. Do not click links or open attachments unless you recognise the sender and were expecting this email.

Hi Dan,

Another question: What is the version of the HTCondor Schedd that is sending jobs to your CE? Is this also v25 or something older like v24?

-Cole Bollig

From: Whitehouse, Dan <d.whitehouse@xxxxxxxxxxxxxx>
Sent: Wednesday, March 25, 2026 6:11 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Re: OsUser job-router crashes
 
Hi Cole,
Thanks very much for your assistance with this.

The additional diagnostics are as follows:

Arguments = ""
BatchQueue = ""
BatchRuntime = 259200
BytesRecvd = 28044.0
BytesSent = 73780.0
CERequirements = "CondorCE"
ClusterId = 1639290
Cmd = "DIRAC_9xq1tffd_pilotwrapper.py"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0
CompletionDate = 0
CondorCE = 1
CpusProvisioned = 8
CpusUsage = 0.1141048636842512
CumulativeRemoteSysCpu = 0.0
CumulativeRemoteUserCpu = 0.0
CumulativeSlotTime = 0
CumulativeSuspensionTime = 0
CurrentHosts = 0
DiskProvisioned = 7671
DiskUsage = 2250000
DiskUsage_RAW = 2250000
EnteredCurrentStatus = 1774431475
Environment = "CONDORCE_COLLECTOR_HOST=<...>:9619 DIRAC_PILOT_STAMP=<...> HTCONDOR_JOBID=1328108.2 LANG=en_GB.UTF-8"
Err = "1328108.2.err"
ExecutableSize = 30
ExecutableSize_RAW = 28
ExitBySignal = false
ExitCode = 1
ExitStatus = 0
HoldReason = "The job attribute OnExitHold expression 'ExitCode =!= 0' evaluated to TRUE"
HoldReasonCode = 3
HoldReasonSubCode = 55
ImageSize = 500000
ImageSize_RAW = 500000
In = "/dev/null"
Iwd = "/var/lib/condor-ce/spool/8108/2/cluster1328108.proc2.subproc0"
JobCurrentStartDate = 1774344542
JobCurrentStartExecutingDate = 1774344542
JobIsRunning = (JobStatus =!= 1) && (JobStatus =!= 5)
JobLeaseDuration = 2400
JobMemory = RequestMemory
JobNotification = 0
JobPrio = 0
JobRunCount = 1
JobStartDate = 1774344542
JobStatus = 1
JobSubmitFile = "/opt/dirac/data/HTCondor/work/HTCondorCE_uzpqa_kg.sub"
JobSubmitMethod = 0
JobUniverse = 5
KillSig = "SIGTERM"
LastHoldReason = "Spooling input data files"
LastHoldReasonCode = 16
LastJobStatus = 5
LastReleaseReason = "Data files spooled"
LastSuspensionTime = 0
LeaveJobInQueue = JobStatus == 4
Managed = "Schedd"
ManagedManager = ""
MaxHosts = 1
MemoryProvisioned = 2048
MemoryUsage = ((ResidentSetSize + 1023) / 1024)
MinHosts = 1
MyType = "Job"
NumCkpts = 0
NumCkpts_RAW = 0
NumHolds = 1
NumHoldsByReason = [ JobPolicy = 1 ]
NumJobCompletions = 0
NumJobMatches = 1
NumJobStarts = 1
NumRestarts = 0
NumShadowStarts = 1
NumSystemHolds = 0
OnExitHold = ExitCode =!= 0
>
Out = "1328108.2.out"
Owner = "<...>"
ProcId = 0
QDate = 1774431475
Rank = 0.0
ReleaseReason = "Data files spooled"
RemoteSysCpu = 0.0
RemoteUserCpu = 0.0
Remote_JobUniverse = 5
RequestCpus = 8
RequestDisk = DiskUsage
RequestMemory = 2000
Requirements = (NumJobStarts == 0) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer)
ResidentSetSize = 47500
ResidentSetSize_RAW = 47500
RouteName = "Condor_Pool"
RoutedBy = "htcondor-ce"
RoutedFromJobId = "1328108.2"
RoutedJob = true
SUBMIT_Cmd = "/opt/dirac/data/HTCondor/work/DIRAC_9xq1tffd_pilotwrapper.py"
SUBMIT_UserLog = "/opt/dirac/data/HTCondor/work/<...>/3/17/1328108.2.log"
SUBMIT_x509userproxy = "/tmp/tmpg2sfoji_"
ScitokensFile = "/opt/dirac/data/HTCondor/work/HTCondorCE_yxnffj4i.token"
ScratchDirFileCount = 55899
ShadowBday = 1774344542
ShouldTransferFiles = "YES"
SpooledOutputFiles = ""
StreamErr = false
StreamOut = false
TargetType = "Machine"
TotalSuspensions = 0
TransferIn = false
TransferInputSizeMB = 0
TransferOutput = ""
TransferOutputRemaps = undefined
WhenToTransferOutput = "ON_EXIT_OR_EVICT"
orig_AuthTokenId = "<...>"
orig_AuthTokenIssuer = "https://dteam-auth.cern.ch/"
orig_AuthTokenScopes = "compute.create,compute.read,compute.cancel,compute.modify"
orig_AuthTokenSubject = "<...>"
orig_OnExitHold = ExitCode =!= 0
orig_OnExitHoldSubCode = 55
orig_environment = "DIRAC_PILOT_STAMP=<...> HTCONDOR_JOBID=1328108.2"
osg_environment = ""
remote_NodeNumber = 1
remote_SMPGranularity = 1
x509UserProxyEmail = "<...>"
x509UserProxyExpiration = 1774392262
x509UserProxyFQAN = "<...>"
x509UserProxyFirstFQAN = "/dteam/Role=NULL/Capability=NULL"
x509UserProxyVOName = "dteam"
x509userproxy = "tmpg2sfoji_"
x509userproxysubject = "<...>"
03/25/26 09:38:20 Failed to find OsUser or User in job ad.
03/25/26 09:38:20 ERROR "Failed to initialize user ids." at line 64 in file /var/lib/condor/execute/slot1/dir_3041422/scratch/userdir/build-b0FsGA/BUILD/condor-25.0.8/src/condor_utils/set_user_priv_from_ad.cpp
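For anyone checking similar dumps, the following is a rough Python sketch that parses "Name = value" lines like the ones above and reports which of the two identity attributes named in the error message are absent. It uses only the standard library and makes no use of the HTCondor bindings; the helper names `parse_ad_dump` and `missing_identity_attrs` are made up for this illustration, and the sample is a trimmed excerpt of the dump above.

```python
import re

# The two attributes named in the job router error message.
IDENTITY_ATTRS = ("OsUser", "User")

# A line counts as an attribute only if it starts with an identifier
# followed by '='; log lines and stray fragments are skipped.
_ATTR_LINE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*)$")


def parse_ad_dump(text):
    """Parse 'Name = value' lines into a dict of raw (unevaluated) values."""
    ad = {}
    for line in text.splitlines():
        match = _ATTR_LINE.match(line.strip())
        if match:
            ad[match.group(1)] = match.group(2)
    return ad


def missing_identity_attrs(ad):
    """Return the identity attributes absent from the ad (case-insensitive)."""
    present = {name.lower() for name in ad}
    return [attr for attr in IDENTITY_ATTRS if attr.lower() not in present]


sample = '''Owner = "<...>"
ProcId = 0
RoutedJob = true
03/25/26 09:38:20 Failed to find OsUser or User in job ad.
'''

print(missing_identity_attrs(parse_ad_dump(sample)))
```

For the dump above, both OsUser and User come back as missing while Owner is present, which matches the error message.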



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, 24 March 2026 at 21:09
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Re: [HTCondor-users] OsUser job-router crashes


Hi Dan,

This issue seems different from (but possibly related to) the previous threads' issue. Looking at the code, this error should print out a ClassAd (key = value pairs) before the provided debug lines. Could you share those lines? Feel free to clean them up as needed (remove usernames, hostnames, etc.) or send them directly to me.

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Whitehouse, Dan <d.whitehouse@xxxxxxxxxxxxxx>
Sent: Tuesday, March 24, 2026 11:43 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] OsUser job-router crashes
 
Hi,
I know there have been some other threads about this, but we are still seeing regular and frequent job-router crashes with the following:
03/22/26 03:14:12 Failed to find OsUser or User in job ad.
03/22/26 03:14:12 ERROR "Failed to initialize user ids." at line 64 in file /var/lib/condor/execute/slot1/dir_3041422/scratch/userdir/build-b0FsGA/BUILD/condor-25.0.8/src/condor_utils/set_user_priv_from_ad.cpp

Looking back through the list, I note that at least one other user appears to have experienced a similar issue, but that it was apparently resolved when they updated to version 25. Our entire estate (worker nodes and CEs) is now running 25, but we haven't seen any reduction in these errors.
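One way to quantify whether the error rate changes after an upgrade is to count the crash lines per day in the job router log. The Python sketch below assumes each log line starts with an MM/DD/YY date followed by a space, as in the two lines quoted above; the function name `crashes_per_day` and the sample lines are illustrative only.

```python
from collections import Counter

# The message logged just before the job router aborts.
MARKER = "Failed to find OsUser or User in job ad."


def crashes_per_day(lines):
    """Count marker lines per leading date field (assumed MM/DD/YY)."""
    counts = Counter()
    for line in lines:
        if MARKER in line:
            date = line.split(" ", 1)[0]
            counts[date] += 1
    return counts


# Illustrative lines in the same format as the excerpt above.
sample_lines = [
    "03/22/26 03:14:12 Failed to find OsUser or User in job ad.",
    '03/22/26 03:14:12 ERROR "Failed to initialize user ids." at line 64',
    "03/22/26 11:02:40 Failed to find OsUser or User in job ad.",
    "03/23/26 07:45:01 Failed to find OsUser or User in job ad.",
]

for date, count in sorted(crashes_per_day(sample_lines).items()):
    print(date, count)
```

In practice the list would be replaced by an open file handle on the job router log, whose location depends on the local configuration.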

Any assistance would be greatly appreciated.

Thanks,

-- 

 

Dan Whitehouse

Imperial College HEP