Hello,
condorSchedd.newJob() fails to create a new job. The scheduler logs show an unset owner:
passwd_cache::cache_uid(): getpwnam("") failed: user not found
(1.0) Failed to find UID and GID for user . Cannot chown /path/to/spool/1/0/cluster0.proc0.subproc0 to user.
Despite the above log message, newJob() returns a successful response. The issue manifests later as an "Unknown cluster or job" error when condorSchedd.submit() is invoked.
I am probably missing something basic. What am I doing wrong, or what am I missing, that causes this problem?
My setup is as follows:
* HTCondor version 8.7.7, built from the GitHub source in a CentOS 7 Docker container, with the -DWITH_CREAM:BOOL=FALSE -DWITH_GLOBUS=FALSE -D_DEBUG:BOOL=TRUE cmake options
* The same CentOS 7 Docker container running all daemons as user "condor", with the following permissive configuration:
USE_SHARED_PORT = FALSE
SCHEDD_ARGS = -p 8080
ENABLE_SOAP = TRUE
ENABLE_WEB_SERVER = TRUE
WEB_ROOT_DIR=$(RELEASE_DIR)/lib/webservice
ALLOW_SOAP = */*
QUEUE_ALL_USERS_TRUSTED = TRUE
HOSTALLOW_WRITE = *
ALLOW_WRITE = *
ALL_DEBUG = D_FULLDEBUG
* Python client using suds:
from suds.client import Client
# condor_schedd is a suds Client for the schedd's SOAP interface (its construction is not shown here)
transaction = condor_schedd.service.beginTransaction(300)
print "Transaction", transaction
cluster = condor_schedd.service.newCluster(transaction.transaction)
print "Cluster", cluster
job = condor_schedd.service.newJob(transaction.transaction, cluster.integer)
print "Job", job
# Prints:
# Job (IntAndStatus){
#   status =
#      (Status){
#         code = "SUCCESS"
#         message = "MESSAGE-NULL"
#      }
#   integer = 0
# }
jobAd = condor_schedd.service.createJobTemplate(cluster.integer,
     job.integer, "condor", 5, "/bin/sleep", "30", "")
result = condor_schedd.service.submit(transaction.transaction,
     cluster.integer, job.integer, jobAd.classAd[0])
print result
# Prints:
# (RequirementsAndStatus){
#   status =
#      (Status){
#         code = "UNKNOWNJOB"
#         message = "Unknown cluster or job"
#      }
#   requirements = None
# }
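For anyone reproducing this, a small helper along the following lines makes a failure surface at the call that reports it, rather than several calls later. It is only a sketch: check_status, SoapCallError and the _Stub stand-in are hypothetical names, not part of suds or HTCondor, and it assumes every reply carries a status.code / status.message pair shaped like the ones printed above. Note that it would not have flagged newJob() here, since that call reports SUCCESS despite the schedd log error.

```python
class SoapCallError(Exception):
    """Raised when a schedd SOAP reply carries a non-SUCCESS status."""


def check_status(response, step):
    """Return `response` unchanged if its status code is SUCCESS, else raise.

    Assumes `response.status.code` and `response.status.message` exist,
    as in the replies printed above.
    """
    status = response.status
    if status.code != "SUCCESS":
        raise SoapCallError(
            "{0} failed: {1} ({2})".format(step, status.code, status.message))
    return response


class _Stub(object):
    """Stand-in for a suds reply object, used only to exercise check_status."""
    def __init__(self, **fields):
        self.__dict__.update(fields)


# Replies mirroring the two printed above.
ok = _Stub(status=_Stub(code="SUCCESS", message="MESSAGE-NULL"), integer=0)
bad = _Stub(status=_Stub(code="UNKNOWNJOB", message="Unknown cluster or job"),
            requirements=None)
```

In the client above, each call would be wrapped, for example check_status(condor_schedd.service.newJob(transaction.transaction, cluster.integer), "newJob").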
My sources:
--------------------
My attempt at debugging:
I tried to blindly follow the log messages to the source, without much knowledge of the condor internals :). I noticed that the dummy classAd passed into createJobSpoolDirectory() from createJobSpoolDirectory_PRIV_CONDOR() doesn't have the "Owner" attribute set, which results in an empty string for the owner. Adding the following line in createJobSpoolDirectory_PRIV_CONDOR(), before the call to createJobSpoolDirectory(), got me past the first issue. However, I then ran into another issue when invoking condorSchedd.commitTransaction() later:
dummy_ad.InsertAttr(ATTR_OWNER,get_condor_username());
Did the hack above circumvent some authorization feature, or did I stumble into a bug?
Thanks!
Biruk