
Re: [HTCondor-users] Migrating to htcondor2 -> pypi LTS supported version?



Hi Stefano,

We create the classad My.OriginalCpus:

https://github.com/dmwm/WMCore/blob/4a707ce1b289688c99d29d820de6150bcb71a215/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L628

If you look at the job1 and job2 dictionaries, you will see:

'My.OriginalCpus': '8', 'Requirements': 'MY.OriginalCpus >= 1 && stringListMember(TARGET.Arch, REQUIRED_ARCH)'

With the HTCondor v1 python bindings, these classads would be created sequentially and in order, so by the time you get to Requirements, the classad My.OriginalCpus already exists.
I think htcondor v2 does the same, because submitting only 1 job produces good results. The issue seems to come from submitting more than one job with some variation in the number of classads for each.
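
To see where the rows differ before submitting, a plain-Python check along these lines might help. This is only a sketch: `differing_keys` and `pad_custom_attributes` are hypothetical helper names, nothing here calls HTCondor, and padding the missing `My.` keys with `UNDEFINED` is an untested idea, not a confirmed workaround for the parse error.

```python
def differing_keys(rows):
    """Map each row index to the keys that other rows have but this one lacks."""
    all_keys = set()
    for row in rows:
        all_keys.update(row)
    return {i: sorted(all_keys - set(row)) for i, row in enumerate(rows)}


def pad_custom_attributes(rows, filler="UNDEFINED"):
    """Return copies of rows in which every row carries the same 'My.*' keys.

    Assumption: 'UNDEFINED' is an acceptable value for a missing custom
    attribute. Ordinary submit commands (request_cpus, ...) are left alone.
    """
    # First-seen order of every custom ("My."-prefixed) key across all rows.
    custom = []
    for row in rows:
        for key in row:
            if key.lower().startswith("my.") and key not in custom:
                custom.append(key)
    padded = []
    for row in rows:
        new_row = dict(row)  # keep the row's own keys and their order
        for key in custom:
            new_row.setdefault(key, filler)  # append only what is missing
        padded.append(new_row)
    return padded


# Cut-down stand-ins for job1 and job2 from the reproducer.
job1 = {
    "My.OriginalCpus": "1",
    "Requirements": "stringListMember(TARGET.Arch, REQUIRED_ARCH)",
}
job2 = {
    "My.DESIRED_ExtraMatchRequirements": "MY.OriginalCpus >= 1",
    "My.OriginalCpus": "8",
    "Requirements": "MY.OriginalCpus >= 1 && stringListMember(TARGET.Arch, REQUIRED_ARCH)",
}

print(differing_keys([job1, job2]))
padded = pad_custom_attributes([job1, job2])
print(padded[0]["My.DESIRED_ExtraMatchRequirements"])
```

With the cut-down rows, the first call reports that only job1 is missing `My.DESIRED_ExtraMatchRequirements`, which matches the difference between the two jobs in the reproducer.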

Best regards,
Kenyi

On Fri, Oct 24, 2025 at 12:34 PM Stefano Belforte via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Hi Ken,

Maybe unrelated, but why

'My.DESIRED_ExtraMatchRequirements': 'MY.OriginalCpus >= 1'

instead of

'My.DESIRED_ExtraMatchRequirements': 'OriginalCpus >= 1'

?
IIUC `My.X` tells condor to add a classAd called X among the submitted job
classAds, but there will be no `My.X` classAd in there, so whatever your
expression evaluates to, it is not likely to be what you want.

Cfr. e.g. with your expression for `'request_memory'`
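
The key side of this can be sketched in plain Python. As a submit key, `My.X` (like `+X`) asks for a job attribute named `X`. Inside expression text, `MY.` is the ClassAd scope prefix for the ad being evaluated. The helper name `job_attribute_name` is hypothetical and only mimics the naming rule; it does not call HTCondor.

```python
def job_attribute_name(submit_key):
    """Return the job-ad attribute a custom submit key defines, else None."""
    if submit_key.startswith("+"):
        return submit_key[1:]
    if submit_key[:3].lower() == "my.":
        return submit_key[3:]
    return None  # an ordinary submit command such as request_memory


for key in ("My.OriginalCpus", "+OriginalCpus", "request_memory"):
    print(key, "->", job_attribute_name(key))
```

Under this rule the job ad ends up with an attribute called `OriginalCpus`, never one literally called `My.OriginalCpus`.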

Stefano

On 24/10/2025 00:09, Kenyi Hurtado Anampa via HTCondor-users wrote:
Dear HTCondor team,

We found a new bug in htcondor2 that does not occur with the HTCondor v1 python bindings.
We are using HTCondor 25.0.2.

The bug seems to be related to the fact that we have 2 jobs, one with DESIRED_ExtraMatchRequirements, and another without it.
I could not easily reproduce the bug just by adding that extra classad in one of the jobs, but we managed to find a way to reproduce the issue in an isolated environment. Could you please help us fix this?

This is the error and how to reproduce it:

$ mkdir -p /tmp/job_2221 /tmp/job_2231 ; touch /tmp/test.txt
$ python3 test4_simple.py
HTCondor2 version = $CondorVersion: 25.0.2 2025-10-09 BuildID: UW_Python_Wheel_Build $
jobAds len = 2
Traceback (most recent call last):
  File "/home/khurtado/new-condor/test4_simple.py", line 25, in <module>
    submitRes = schedd.submit(sub, itemdata=iter(jobAds))
  File "/home/khurtado/.local/lib/python3.9/site-packages/htcondor2/_schedd.py", line 646, in submit
    return _schedd_submit(self._addr, real._handle, count, spool)
htcondor2_impl.HTCondorException: Failed to create job ad, errmsg=Submit:-1:Parse error in expression:
	Requirements = (0MY.OriginalCpus >= 1 && stringListMember(TARGET.Arch, REQUIRED_ARCH)) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))

And this is the script (this script does submit just fine with htcondor v1 python bindings):
$ cat test4_simple.py
import classad2 as classad
import htcondor2 as htcondor
#import classad as classad
#import htcondor as htcondor

sub = htcondor.Submit("""
        universe = vanilla
        executable = test.sh
        output = out.$(Cluster)-$(Process)
        log = log.$(Cluster).log
        """)

schedd = htcondor.Schedd()

jobAds = []
job1 = {'initial_Dir': '/tmp/job_2221', 'transfer_input_files': '/tmp/test.txt', 'Arguments': 'amaltaro_DQMHarvest_RunWhitelist_Oct2025_Val_251022_213529_2035-Sandbox.tar.bz2 2221 1', 'transfer_output_files': 'Report.1.pkl,wmagentJob.log', 'My.x509userproxy': '"/data/certs/myproxy.pem"', 'My.DESIRED_Sites': '"T2_CH_CERN"', 'My.ExtDESIRED_Sites': '"T2_CH_CERN"', 'My.CMS_JobRetryCount': '1', 'My.WMAgent_RequestName': '"amaltaro_DQMHarvest_RunWhitelist_Oct2025_Val_251022_213529_2035"', 'My.CMSGroups': 'UNDEFINED', 'My.WMAgent_JobID': '2221', 'My.WMAgent_SubTaskName': '"/amaltaro_DQMHarvest_RunWhitelist_Oct2025_Val_251022_213529_2035/EndOfRunDQMHarvest"', 'My.CMS_JobType': '"Harvesting"', 'My.CMS_Type': '"test"', 'My.CMS_RequestType': '"DQMHarvest"', 'My.CMS_extendedJobType': '"Harvesting"', 'My.CMS_CampaignName': 'UNDEFINED', 'My.AllowOpportunistic': 'False', 'My.DESIRED_CMSDataset': '"/JetMET0/Run2024D-2024CDEReprocessing-v1/DQMIO"', 'My.DESIRED_CMSDataLocations': '"T1_US_FNAL,T2_CH_CERN,T2_CH_CERN_HLT,T2_CH_CERN_P5"', 'My.DESIRED_CMSPileups': 'UNDEFINED', 'My.Requestioslots': '0', 'My.RequiresGPU': '0', 'request_gpus': '0', 'My.DESIRED_GPUMemoryMB': 'UNDEFINED', 'My.DESIRED_GPUMinimumCapability': 'UNDEFINED', 'My.DESIRED_GPUMaximumCapability': 'UNDEFINED', 'My.DESIRED_GPURuntime': 'UNDEFINED', 'gpus_minimum_memory': 'UNDEFINED', 'gpus_minimum_capability': 'UNDEFINED', 'gpus_maximum_capability': 'UNDEFINED', 'gpus_minimum_runtime': 'UNDEFINED', 'My.EstimatedSingleCoreMins': '60', 'My.OriginalMaxWallTimeMins': '60', 'My.MaxWallTimeMins': 'WMCore_ResizeJob ? (EstimatedSingleCoreMins/RequestCpus + 15) : OriginalMaxWallTimeMins', 'My.OriginalMemory': '3000', 'My.ExtraMemory': '500', 'request_memory': 'OriginalMemory + ExtraMemory * (WMCore_ResizeJob ? (RequestCpus-OriginalCpus) : 0)', 'request_disk': '1000000', 'My.MinCores': '1', 'My.MaxCores': '1', 'My.OriginalCpus': '1', 'Rank': 'isUndefined(Cpus) ? 0 : ifThenElse(Cpus > MaxCores, -Cpus, Cpus)', 'My.JOB_GLIDEIN_Cpus': '"$$(Cpus:0)"', 'My.RequestResizedCpus': '(Cpus>MaxCores) ? MaxCores : ((Cpus < MinCores) ? MinCores : Cpus)', 'My.JobCpus': '((JobStatus =!= 1) && (JobStatus =!= 5) && !isUndefined(MATCH_EXP_JOB_GLIDEIN_Cpus) && (int(MATCH_EXP_JOB_GLIDEIN_Cpus) isnt error)) ? int(MATCH_EXP_JOB_GLIDEIN_Cpus) : OriginalCpus', 'request_cpus': 'WMCore_ResizeJob ? (!isUndefined(Cpus) ? RequestResizedCpus : JobCpus) : OriginalCpus', 'My.WMCore_ResizeJob': 'False', 'My.JobPrio': '50999999', 'My.PostJobPrio1': '-1', 'My.PostJobPrio2': '-827', 'My.REQUIRED_OS': '"rhel8"', 'My.CMSSW_Versions': '"CMSSW_14_0_19_patch2"', 'My.REQUIRED_ARCH': '"X86_64"', 'My.REQUIRED_MINIMUM_MICROARCH': '0', 'Requirements': 'stringListMember(TARGET.Arch, REQUIRED_ARCH)'}

job2 = {'initial_Dir': '/tmp/job_2231', 'transfer_input_files': '/tmp/test.txt', 'Arguments': 'amaltaro_SC_ProdPsi_Oct2025_Val_251022_213530_6474-Sandbox.tar.bz2 2231 1', 'transfer_output_files': 'Report.1.pkl,wmagentJob.log', 'My.x509userproxy': '"/data/certs/myproxy.pem"', 'My.DESIRED_Sites': '"T1_US_FNAL,T2_CH_CERN"', 'My.ExtDESIRED_Sites': '"T1_US_FNAL,T2_CH_CERN"', 'My.CMS_JobRetryCount': '1', 'My.WMAgent_RequestName': '"amaltaro_SC_ProdPsi_Oct2025_Val_251022_213530_6474"', 'My.CMSGroups': 'UNDEFINED', 'My.WMAgent_JobID': '2231', 'My.WMAgent_SubTaskName': '"/amaltaro_SC_ProdPsi_Oct2025_Val_251022_213530_6474/GenSimFull"', 'My.CMS_JobType': '"Production"', 'My.CMS_Type': '"test"', 'My.CMS_RequestType': '"StepChain"', 'My.CMS_extendedJobType': '"GEN,SIM,DIGI_nopileup,RECO,MINIAOD"', 'My.CMS_CampaignName': '"RelVal_Generic_Campaign"', 'My.AllowOpportunistic': 'False', 'My.DESIRED_CMSDataset': 'UNDEFINED', 'My.DESIRED_CMSDataLocations': '"T1_US_FNAL,T2_CH_CERN,T2_CH_CERN_HLT,T2_CH_CERN_P5"', 'My.DESIRED_CMSPileups': 'UNDEFINED', 'My.Requestioslots': '0', 'My.RequiresGPU': '0', 'request_gpus': '0', 'My.DESIRED_GPUMemoryMB': 'UNDEFINED', 'My.DESIRED_GPUMinimumCapability': 'UNDEFINED', 'My.DESIRED_GPUMaximumCapability': 'UNDEFINED', 'My.DESIRED_GPURuntime': 'UNDEFINED', 'My.DESIRED_ExtraMatchRequirements': 'MY.OriginalCpus >= 1', 'gpus_minimum_memory': 'UNDEFINED', 'gpus_minimum_capability': 'UNDEFINED', 'gpus_maximum_capability': 'UNDEFINED', 'gpus_minimum_runtime': 'UNDEFINED', 'My.EstimatedSingleCoreMins': '6240', 'My.OriginalMaxWallTimeMins': '780', 'My.MaxWallTimeMins': 'WMCore_ResizeJob ? (EstimatedSingleCoreMins/RequestCpus + 15) : OriginalMaxWallTimeMins', 'My.OriginalMemory': '8000', 'My.ExtraMemory': '500', 'request_memory': 'OriginalMemory + ExtraMemory * (WMCore_ResizeJob ? (RequestCpus-OriginalCpus) : 0)', 'request_disk': '1000000', 'My.MinCores': '4.0', 'My.MaxCores': '8', 'My.OriginalCpus': '8', 'Rank': 'isUndefined(Cpus) ? 0 : ifThenElse(Cpus > MaxCores, -Cpus, Cpus)', 'My.JOB_GLIDEIN_Cpus': '"$$(Cpus:0)"', 'My.RequestResizedCpus': '(Cpus>MaxCores) ? MaxCores : ((Cpus < MinCores) ? MinCores : Cpus)', 'My.JobCpus': '((JobStatus =!= 1) && (JobStatus =!= 5) && !isUndefined(MATCH_EXP_JOB_GLIDEIN_Cpus) && (int(MATCH_EXP_JOB_GLIDEIN_Cpus) isnt error)) ? int(MATCH_EXP_JOB_GLIDEIN_Cpus) : OriginalCpus', 'request_cpus': 'WMCore_ResizeJob ? (!isUndefined(Cpus) ? RequestResizedCpus : JobCpus) : OriginalCpus', 'My.WMCore_ResizeJob': 'False', 'My.JobPrio': '600000', 'My.PostJobPrio1': '-2', 'My.PostJobPrio2': '-829', 'My.REQUIRED_OS': '"rhel7"', 'My.CMSSW_Versions': '"CMSSW_12_0_0,CMSSW_12_0_0,CMSSW_12_0_0"', 'My.REQUIRED_ARCH': '"X86_64"', 'My.REQUIRED_MINIMUM_MICROARCH': '0', 'Requirements': 'MY.OriginalCpus >= 1 && stringListMember(TARGET.Arch, REQUIRED_ARCH)'}

jobAds = [job1, job2]


print("HTCondor2 version = %s" % htcondor.version())
print("jobAds len = %s" % len(jobAds))
submitRes = schedd.submit(sub, itemdata=iter(jobAds))
clusterId = submitRes.cluster()
print("ClusterId = %s" % clusterId)
print("submitRes: {0}".format(submitRes))
print("submitRes numprocs: {0}".format(submitRes.num_procs()))
print("submitRes first_proc: {0}".format(submitRes.first_proc()))
print("dir(submitRes): {0}".format(dir(submitRes)))
print("ClusterId: {}".format(submitRes.cluster()))

On Fri, Sep 26, 2025 at 5:05 PM Kenyi Hurtado Anampa <khurtado@xxxxxx> wrote:
Awesome! Thank you so much.

I will let the CMS submission infrastructure and workflow management teams know.

Best regards,
Kenyi

On Fri, Sep 26, 2025 at 4:09 PM Cole Bollig <cabollig@xxxxxxxx> wrote:
Hi Kenyi,

We have decided to backport the fix. It should be available in v25.0.2 (planned release on October 9th).
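
A script could check the installed bindings against that release before relying on the fix. The sketch below parses the `$CondorVersion: ...$` banner format shown in the tracebacks in this thread. The helper names are hypothetical, the 25.0.2 threshold is taken from this message, and the check only covers the 25.0.x series.

```python
import re


def parse_condor_version(banner):
    """Extract (major, minor, patch) from a '$CondorVersion: X.Y.Z ...' string."""
    match = re.search(r"\$CondorVersion:\s*(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("unrecognized version banner: %r" % banner)
    return tuple(int(part) for part in match.groups())


def has_25_0_backport(banner):
    """True for 25.0.x releases at or above 25.0.2 (assumed backport target)."""
    major, minor, patch = parse_condor_version(banner)
    return (major, minor) == (25, 0) and patch >= 2


# Banners copied from the output quoted in this thread.
old = "$CondorVersion: 25.0.1 2025-09-24 BuildID: UW_Python_Wheel_Build RC $"
new = "$CondorVersion: 25.0.2 2025-10-09 BuildID: UW_Python_Wheel_Build $"
print(has_25_0_backport(old))  # False
print(has_25_0_backport(new))  # True
```

In a real script the banner would come from `htcondor.version()`, as in the reproducers quoted in this thread.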

Cheers,
Cole Bollig

From: Kenyi Hurtado Anampa <khurtado@xxxxxx>
Sent: Thursday, September 25, 2025 1:10 PM
To: Cole Bollig <cabollig@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Alan Malta Rodrigues <amaltar2@xxxxxx>
Subject: Re: [HTCondor-users] Migrating to htcondor2 -> pypi LTS supported version?

Thank you so much for this information, Cole.
Yes, if this could be included in 25.0.2, that would be awesome.

Best regards,
Kenyi

On Thu, Sep 25, 2025 at 1:25 PM Cole Bollig <cabollig@xxxxxxxx> wrote:
Hi Kenyi,

V25.0.3 (and friends) are scheduled to be released November 13th. I was chatting with our release manager, and we will talk to Todd to see if it is feasible to cherry-pick the patch into V25.0.2, which is planned for release much sooner (October 9th).

Cheers,
Cole Bollig

From: Kenyi Hurtado Anampa <khurtado@xxxxxx>
Sent: Thursday, September 25, 2025 12:08 PM
To: Cole Bollig <cabollig@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Alan Malta Rodrigues <amaltar2@xxxxxx>
Subject: Re: [HTCondor-users] Migrating to htcondor2 -> pypi LTS supported version?

Hi Cole,

Thank you for the prompt reply! Do you have an estimated release date for version 25.0.3?

Best regards,
Kenyi

On Thu, Sep 25, 2025 at 1:06 PM Cole Bollig <cabollig@xxxxxxxx> wrote:
Hi Kenyi,

I believe the fix will be released in the following versions:
V24.0.14
V24.14.0
V25.0.3
V25.3.0

Cheers,
Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Kenyi Hurtado Anampa via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, September 25, 2025 12:01 PM
To: Todd L Miller <tlmiller@xxxxxxxxxxx>
Cc: Kenyi Hurtado Anampa <khurtado@xxxxxx>; Alan Malta Rodrigues <amaltar2@xxxxxx>; Kenyi Hurtado Anampa via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Migrating to htcondor2 -> pypi LTS supported version?

Hi Todd,

I just installed 25.0.1 and tested this same example, with the same results. Is it possible the fix did not make it to this version?

[1]
(WMAgent-2.4.3) [cmst1@vocms0263:simple]$ python3 submit.py
HTCondor2 version: $CondorVersion: 25.0.1 2025-09-24 BuildID: UW_Python_Wheel_Build RC $
Traceback (most recent call last):
 File "/tmp/condor-gpus/simple/submit.py", line 28, in <module>
  result = htcondor.Schedd().submit(submit, itemdata=iter(jobAds))
      Â^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/site-packages/htcondor2/_schedd.py", line 646, in submit
  return _schedd_submit(self._addr, real._handle, count, spool)
     Â^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
htcondor2_impl.HTCondorException: Failed to create job ad, errmsg=Submit:-1:Parse error in expression:
Rank = 2 38

(WMAgent-2.4.3) [cmst1@vocms0263:simple]$ cat submit.py
#!/usr/bin/env python3
import htcondor2 as htcondor
import classad2 as classad

print(f"HTCondor2 version: {htcondor.version()}")

# Create a job description. It _must_ set `log` to create a job event log.
logFileName = "sleep.log"
submit = htcondor.Submit(
  f"""
  executable = /bin/sleep
  transfer_executable = false

  log = {logFileName}
  """
)

jobAds = []
for name in range(2):
  name = str(name)
  ad = {}
  ad['My.MyJobName'] = classad.quote(name)
  ad['Arguments'] = '1 2 3'
  ad['Rank'] = '8'
  jobAds.append(ad)

# Submit the job description, creating the job.
result = htcondor.Schedd().submit(submit, itemdata=iter(jobAds))
clusterID = result.cluster()

On Tue, Sep 16, 2025 at 3:41 PM Todd L Miller <tlmiller@xxxxxxxxxxx> wrote:
>     You're correct; this is a bug, most likely introduced when we added
> support for the TABLE keyword.

    Also: thanks for filing all these bug reports and putting up with
these problems during this transition! It's very helpful to us and to our
other customers to find these as early as possible.

-- ToddM

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/