Hi Brian,
I've checked the PBS Professional user manuals and this restriction [1] is mentioned in all releases starting with release 13 [2-3].
I hope this helps,
Lukas
[1]
[3] https://www.pbsworks.com/pdfs/PBSUserGuide18.2.pdf
Hi Brian,
I was trying to build HTCondor from source first but I ended up using Tim Theisen's BoSCO build [1].
Yes, there is at least one additional incompatibility. It's related to the job description file though. In PBS Pro, you're not allowed to set the output/error file to /dev/null (-o/-e option). According to the documentation [2], users are supposed to redirect stdout/stderr in their batch scripts to avoid the creation of the corresponding files.
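A minimal sketch of that workaround, written as a Python helper that generates the batch script. The helper name `make_quiet_pbs_script` and the default job name are illustrative assumptions and not part of BLAH or PBS Pro; the only point is that the redirection to /dev/null happens inside the script instead of through the -o/-e options.

```python
import shlex


def make_quiet_pbs_script(argv, job_name="quiet_job"):
    """Return PBS batch script text that discards the payload's stdout/stderr.

    Hypothetical helper: the shell inside the script performs the
    redirection, so no -o/-e option pointing at /dev/null is needed.
    """
    # Quote each argument so spaces and shell metacharacters survive.
    command = " ".join(shlex.quote(arg) for arg in argv)
    lines = [
        "#!/bin/sh",
        "#PBS -N " + shlex.quote(job_name),
        # Redirect both streams in the script itself.
        command + " > /dev/null 2>&1",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_quiet_pbs_script(["./simulate", "--input", "run 1.dat"]))
```

The generated text can be written to a file and submitted as usual; whether PBS still creates empty output files in that case depends on the site configuration and should be checked against the documentation [2].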
Best,
Lukas
[1] condor-8.6.11-opensuse_11.4-stripped.tar.gz
[2] https://www.pbsworks.com/pdfs/PBSUserGuide18.2.pdf - section 3.3.3
From: Brian Lin <blin@xxxxxxxxxxx>
Sent: Thursday, March 5, 2020 6:54:33 PM
To: Koschmieder, Lukas Michael; HTCondor-Users Mail List
Cc: Carl Edquist
Subject: Re: [HTCondor-users] Unable to find/track submitted PBS batch jobs
Hi Lukas,
In the latest versions of the BLAH, we gave up on auto-detection of PBS Pro vs OpenPBS and instead depend on the user to differentiate between the two via configuration (https://github.com/htcondor/BLAH/blob/devel/config/blah.config.template#L102-L103). Are there other qstat incompatibility issues between different versions of the same flavor of PBS?
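A rough sketch of what configuration-driven selection can look like, assuming a shell-style KEY=VALUE config file. The key name `pbs_pro` and both function names are assumptions made for illustration; the linked template is the authoritative source for the real setting.

```python
def read_shell_style_config(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def is_pbs_pro(settings, key="pbs_pro"):
    """Return True if the (assumed) flag marks the site as PBS Pro."""
    return settings.get(key, "no").lower() in ("yes", "true", "1")


if __name__ == "__main__":
    sample = '# site settings\npbs_binpath=/usr/bin\npbs_pro="yes"\n'
    print(is_pbs_pro(read_shell_style_config(sample)))
```

With an explicit flag, the qstat wrapper never has to interpret version strings, which is the part that proved fragile.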
I saw in an earlier thread that you were building HTCondor from scratch with BLAH support; where did you get your BLAH source?
Thanks,
Brian
On 3/5/20 11:46 AM, Koschmieder, Lukas Michael wrote:
Thank you Brian,
Enabling the debug log helped me to find the reason why pbs_status.py fails.
There are multiple PBS versions and the corresponding command-line interfaces are not fully compatible. pbs_status.py attempts to auto-detect the installed PBS version based on the output of "qstat --version". And this mechanism fails for newer PBS versions. For instance, PBS Professional 14.2.6.20180327140341 will be misidentified as OpenPBS.
Fixing the issue for a particular PBS version is as easy as replacing a single line in pbs_status.py. Fixing the auto-detection in general would require some extra work, because in newer PBS versions you can't rely on the "qstat --version" output anymore.
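To illustrate why a marker-based check is fragile, here is a hypothetical version of such a heuristic. It is not the actual code from pbs_status.py, and both sample version strings are assumptions about what "qstat --version" might print; only the version number 14.2.6.20180327140341 comes from the system described above.

```python
import re


def looks_like_pbs_pro(version_output):
    """Naive check: report PBS Pro only if a literal marker is present."""
    return re.search(r"PBSPro", version_output) is not None


# Assumed sample outputs, for illustration only.
old_style = "pbs_version = PBSPro_13.0.1"
new_style = "pbs_version = 14.2.6.20180327140341"

if __name__ == "__main__":
    for sample in (old_style, new_style):
        flavor = "PBS Pro" if looks_like_pbs_pro(sample) else "OpenPBS"
        print(sample, "->", flavor)
```

Under these assumptions the second string carries no marker, so the heuristic falls through to the OpenPBS branch even though the installation is PBS Professional.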
Lukas
From: Brian Lin <blin@xxxxxxxxxxx>
Sent: Tuesday, March 3, 2020 5:54:51 PM
To: Koschmieder, Lukas Michael; HTCondor-Users Mail List
Cc: Carl Edquist
Subject: Re: [HTCondor-users] Unable to find/track submitted PBS batch jobs
If you touch /var/tmp/qstat_cache_lukask/pbs_status.debug, that will enable the debug log (pbs_status.log in the same dir), which may contain more information.
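The same step as a small Python sketch, assuming the cache directory already exists. The function name `enable_pbs_status_debug` is made up for illustration; the file names are the ones mentioned above.

```python
import tempfile
from pathlib import Path


def enable_pbs_status_debug(cache_dir):
    """Create the marker file and return where the log is expected to appear."""
    cache_dir = Path(cache_dir)
    # Equivalent to `touch <cache_dir>/pbs_status.debug`; the directory must exist.
    (cache_dir / "pbs_status.debug").touch()
    return cache_dir / "pbs_status.log"


if __name__ == "__main__":
    # Demo against a throwaway directory instead of the real cache directory.
    with tempfile.TemporaryDirectory() as scratch:
        log_path = enable_pbs_status_debug(scratch)
        print("debug marker created; expect log at", log_path)
```

For the real setup the call would be `enable_pbs_status_debug("/var/tmp/qstat_cache_lukask")`, run as the user who owns that directory.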
- Brian
On 3/3/20 10:47 AM, Koschmieder, Lukas Michael wrote:
Thank you Brian,
This was really helpful. I've checked all your points and everything is okay except for the last one. If I run pbs_status.py manually, it will return "1ERROR: [Errno 2] No such file or directory". I still have to find the actual cause but now at least I know where to look.
acsrvcl02 lukask/condor> qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
10058.acsrvcl02   bl_bda180de9040  lukask            00:00:00 R work
acsrvcl02 lukask/condor> date
Tue Mar 3 17:32:43 CET 2020
acsrvcl02 lukask/condor> python /home/condor/htcondor/libexec/glite/bin/pbs_status.py pbs/20200303/10058.acsrvcl02
1ERROR: [Errno 2] No such file or directory
Cheers,
Lukas
From: Brian Lin <blin@xxxxxxxxxxx>
Sent: Tuesday, March 3, 2020 4:37:10 PM
To: HTCondor-Users Mail List; Koschmieder, Lukas Michael
Cc: Carl Edquist
Subject: Re: [HTCondor-users] Unable to find/track submitted PBS batch jobs
Hi Lukas,
What's the value of PBS_GAHP (condor_config_val -v PBS_GAHP)? I would unset it so that your setup uses the generic BATCH_GAHP (aka Bosco or BLAH), which should be set to "$(GLITE_LOCATION)/bin/batch_gahp".
What's the value of pbs_binpath (grep pbs_binpath `condor_config_val GLITE_LOCATION`/etc/batch_gahp.config)? Is it the directory that contains "qstat"?
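The pbs_binpath check can also be scripted. The sketch below assumes batch_gahp.config uses plain KEY=VALUE lines; the function names are hypothetical and the demo runs against a throwaway directory with a fake qstat file.

```python
import os
import re
import tempfile


def find_pbs_binpath(config_text):
    """Return the pbs_binpath value from KEY=VALUE config text, or None."""
    match = re.search(r"^\s*pbs_binpath\s*=\s*(.+?)\s*$", config_text, re.MULTILINE)
    if match is None:
        return None
    # Drop optional surrounding quotes around the value.
    return match.group(1).strip("\"'")


def binpath_has_qstat(binpath):
    """Return True if an executable file named qstat exists in binpath."""
    candidate = os.path.join(binpath, "qstat")
    return os.path.isfile(candidate) and os.access(candidate, os.X_OK)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as fake_bin:
        # Create an executable stand-in for qstat.
        fake_qstat = os.path.join(fake_bin, "qstat")
        with open(fake_qstat, "w") as handle:
            handle.write("#!/bin/sh\n")
        os.chmod(fake_qstat, 0o755)
        config_text = "# demo config\npbs_binpath=" + fake_bin + "\n"
        binpath = find_pbs_binpath(config_text)
        print(binpath, binpath_has_qstat(binpath))
```

In practice the config text would be read from the etc/batch_gahp.config file under GLITE_LOCATION.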
Under the hood, the BLAH is running a qstat wrapper so you should
1) Verify that running qstat as a non-privileged user works from the host in question
2) Run the qstat wrapper for a job that's currently in the PBS queue:
$ python `condor_ce_config_val GLITE_LOCATION`/bin/pbs_status.py <BLAH JOB ID>
Where <BLAH JOB ID> has the following format: pbs/<YYYYMMDD>/<PBS JOB ID>
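A short sketch of building and splitting such an ID, in case it helps with scripting the check. Both function names are hypothetical, and treating the middle field as a calendar date in YYYYMMDD form is an assumption based on the format above.

```python
from datetime import date, datetime


def format_blah_job_id(pbs_job_id, day, lrms="pbs"):
    """Build '<lrms>/<YYYYMMDD>/<PBS JOB ID>' from its parts."""
    return "{}/{}/{}".format(lrms, day.strftime("%Y%m%d"), pbs_job_id)


def parse_blah_job_id(blah_job_id):
    """Split a BLAH job ID into (lrms, date, PBS job ID)."""
    lrms, day, pbs_job_id = blah_job_id.split("/", 2)
    return lrms, datetime.strptime(day, "%Y%m%d").date(), pbs_job_id


if __name__ == "__main__":
    job_id = format_blah_job_id("10058.acsrvcl02", date(2020, 3, 3))
    print(job_id)
    print(parse_blah_job_id(job_id))
```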
- Brian
On 3/2/20 1:01 PM, Koschmieder, Lukas Michael wrote:
Hi,
I'm trying to set up Condor as an alternative interface to our PBS cluster.
This is my setup so far:
- I've installed Condor (BoSCO) on our PBS login/submit node.
- I've enabled MASTER, COLLECTOR, NEGOTIATOR, and SCHEDD.
- I've set GLITE_LOCATION and PBS_GAHP in condor_config.
- I've set pbs_binpath and pbs_spoolpath in GLITE_LOCATION/etc/batch_gahp.config.
With this setup, I can submit jobs to our PBS cluster using `condor_submit`. But for some reason, Condor won't be able to find/track the submitted jobs. While the actual PBS jobs will keep running (and eventually terminate), the corresponding Condor "meta jobs" will remain IDLE for a few minutes and finally change their status to HELD.
Do you have an idea what might cause this behavior or how to debug it?
Cheers,
Lukas
User LOG:
027 (001.000.000) 03/02 18:47:52 Job submitted to grid resource
GridResource: batch pbs
GridJobId: batch pbs acsrvcl02.gi.rwth-aachen.de_9618_acsrvcl02.gi.rwth-aachen.de#1.0#1583171263 pbs/20200302/10044
...
012 (001.000.000) 03/02 18:53:01 Job was held.
Error parsing classad or job not found
Code 0 Subcode 0
GridmanagerLog.lukask (D_FULLDEBUG):
03/02/20 18:50:43 [2578688] Received CHECK_LEASES signal
03/02/20 18:50:43 [2578688] in doContactSchedd()
03/02/20 18:50:43 [2578688] querying for renewed leases
03/02/20 18:50:43 [2578688] querying for removed/held jobs
03/02/20 18:50:43 [2578688] Using constraint ((Owner=?="lukask"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
03/02/20 18:50:43 [2578688] Fetched 0 job ads from schedd
03/02/20 18:50:43 [2578688] leaving doContactSchedd()
03/02/20 18:50:45 [2578688] Evaluating periodic job policy expressions.
03/02/20 18:50:46 [2578688] GAHP[2578692] <- 'RESULTS'
03/02/20 18:50:46 [2578688] GAHP[2578692] -> 'S' '0'
03/02/20 18:50:48 [2578688] Evaluating staleness of remote job statuses.
03/02/20 18:50:58 [2578688] (1.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
03/02/20 18:50:58 [2578688] (1.0) gm state change: GM_SUBMITTED -> GM_POLL_ACTIVE
03/02/20 18:50:58 [2578688] GAHP[2578692] <- 'BLAH_JOB_STATUS 5 pbs/20200302/10044'
03/02/20 18:50:58 [2578688] GAHP[2578692] -> 'S'
03/02/20 18:50:59 [2578688] GAHP[2578692] <- 'RESULTS'
03/02/20 18:50:59 [2578688] GAHP[2578692] -> 'R'
03/02/20 18:50:59 [2578688] GAHP[2578692] -> 'S' '1'
03/02/20 18:50:59 [2578688] GAHP[2578692] -> '5' '1' 'Error parsing classad or job not found' '0' 'N/A'
03/02/20 18:50:59 [2578688] (1.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
03/02/20 18:50:59 [2578688] (1.0) gm state change: GM_POLL_ACTIVE -> GM_SUBMITTED