Re: [HTCondor-users] zero cputime reported by startd 10.5.0



Hi Todd,
I'm attaching the StarterLog for the job,
the output of condor_config_val -summary from the WN,
and the output of:
[root@ce06-htc condor]# condor_history -lim 1 9613046.0 -l > history.9613046.0
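
For reference, the accounting-relevant attributes can also be pulled straight out of the history with autoformat; e.g. (attribute names as they appear in the ad below):

[root@ce06-htc condor]# condor_history -lim 1 9613046.0 -af RemoteUserCpu RemoteSysCpu CumulativeRemoteUserCpu CumulativeRemoteSysCpu RemoteWallClockTime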


Notes:
1) this is an accounting issue; the cputime of the job while it is running is of no interest here. I only consider values from the history logfile.
2) 9.0.17 may be out of support, but every WLCG site has to provide GSI support for quite some time, so the most recent HTCondor-CE with GSI support (i.e. condor 9.0.17, afaik) must stay in production.
3) Because of point 2 above, it is important that a 9.0.17 SCHEDD works well with a 10.5 STARTD, COLLECTOR and NEGOTIATOR.
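
A quick way to confirm the version mix across the pool, using condor_status autoformat (first column is the machine/daemon name, second the version string):

condor_status -startd -af Machine CondorVersion
condor_status -schedd -af Name CondorVersion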

Thanks, Regards

Stefano


On 26/06/23 20:39, Todd L Miller via HTCondor-users wrote:
    One possibility is some sort of cgroup problem. Please send along the starter log for the test job.

To verify a little further, I executed a known test program that just crunches integer numbers for about 5 minutes.

    Another possibility might be that your test was hitting a sampling error. (I don't know how often the 10.5 startd checks/reports accumulated CPU usage.)
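
    For what it's worth, the knob that I believe governs how often the starter reports accumulated usage back to the shadow is STARTER_UPDATE_INTERVAL; the value below is purely illustrative, and the actual default should be verified with condor_config_val:

STARTER_UPDATE_INTERVAL = 60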

    So I would try a wrapper script like the following to eliminate some possibilities:

#!/bin/bash
# Run the known CPU-bound test five times in a row, so the bash
# built-in `time` reports user/sys time separately for each run.
for i in 1 2 3 4 5; do
    time known-test-program
done

Be sure to capture the output and error logs. This will verify (a) that the bash built-in `time` sees CPU usage and (b) that you're not somehow missing a sampling window.
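
    One way to capture them is in the submit description (file names here are illustrative):

executable = wrapper.sh
output     = wrapper.$(Cluster).$(Process).out
error      = wrapper.$(Cluster).$(Process).err
log        = wrapper.$(Cluster).$(Process).log
queue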

Is that a communication problem with the 9.0.17 schedd?

    I would be very surprised, but I'm pretty sure 9.0.x is out of support at this point.

- ToddM
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


06/27/23 16:08:46 (pid:3201) ******************************************************
06/27/23 16:08:46 (pid:3201) ** condor_starter (CONDOR_STARTER) STARTING UP
06/27/23 16:08:46 (pid:3201) ** /usr/sbin/condor_starter
06/27/23 16:08:46 (pid:3201) ** SubsystemInfo: name=STARTER type=STARTER(7) class=DAEMON(1)
06/27/23 16:08:46 (pid:3201) ** Configuration: subsystem:STARTER local:slot_type_1 class:DAEMON
06/27/23 16:08:46 (pid:3201) ** $CondorVersion: 10.5.0 2023-06-05 BuildID: 650732 PackageID: 10.5.0-1 $
06/27/23 16:08:46 (pid:3201) ** $CondorPlatform: x86_64_CentOS7 $
06/27/23 16:08:46 (pid:3201) ** PID = 3201
06/27/23 16:08:46 (pid:3201) ** Log last touched 6/26 15:01:29
06/27/23 16:08:46 (pid:3201) ******************************************************
06/27/23 16:08:46 (pid:3201) Using config source: /etc/condor/condor_config
06/27/23 16:08:46 (pid:3201) Using local config sources: 
06/27/23 16:08:46 (pid:3201)    /usr/share/htc/90/00_common.conf
06/27/23 16:08:46 (pid:3201)    /usr/share/htc/90/10_security.conf
06/27/23 16:08:46 (pid:3201)    /usr/share/htc/90/py/t1_htconf.py|
06/27/23 16:08:46 (pid:3201)    /usr/share/htc/90/py/conf/wn-200-10-11-02-a.conf
06/27/23 16:08:46 (pid:3201)    /usr/share/htc/90/pyconf/htc_py.conf
06/27/23 16:08:46 (pid:3201)    /etc/condor/condor_config.local
06/27/23 16:08:46 (pid:3201) config Macros = 139, Sorted = 137, StringBytes = 5163, TablesBytes = 5092
06/27/23 16:08:46 (pid:3201) CLASSAD_CACHING is OFF
06/27/23 16:08:46 (pid:3201) Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
06/27/23 16:08:46 (pid:3201) SharedPortEndpoint: waiting for connections to named socket slot1_1_19070_4a41_14850
06/27/23 16:08:46 (pid:3201) DaemonCore: command socket at <131.154.197.210:9618?addrs=131.154.197.210-9618&alias=wn-200-10-11-02-a.cr.cnaf.infn.it&noUDP&sock=slot1_1_19070_4a41_14850>
06/27/23 16:08:46 (pid:3201) DaemonCore: private command socket at <131.154.197.210:9618?addrs=131.154.197.210-9618&alias=wn-200-10-11-02-a.cr.cnaf.infn.it&noUDP&sock=slot1_1_19070_4a41_14850>
06/27/23 16:08:46 (pid:3201) Communicating with shadow <131.154.192.55:9618?addrs=131.154.192.55-9618&alias=ce06-htc.cr.cnaf.infn.it&noUDP&sock=shadow_1802436_f7c9_372316>
06/27/23 16:08:46 (pid:3201) Submitting machine is "ce06-htc.cr.cnaf.infn.it"
06/27/23 16:08:46 (pid:3201) setting the orig job name in starter
06/27/23 16:08:46 (pid:3201) setting the orig job iwd in starter
06/27/23 16:08:46 (pid:3201) Chirp config summary: IO false, Updates false, Delayed updates true.
06/27/23 16:08:46 (pid:3201) Initialized IO Proxy.
06/27/23 16:08:46 (pid:3201) Done setting resource limits
06/27/23 16:08:46 (pid:3201) Set filetransfer runtime ads to /home/condor/execute//dir_3201/.job.ad and /home/condor/execute//dir_3201/.machine.ad.
06/27/23 16:08:46 (pid:3201) File transfer completed successfully.
06/27/23 16:08:47 (pid:3201) Job 9613046.0 set to execute immediately
06/27/23 16:08:47 (pid:3201) Starting a VANILLA universe job with ID: 9613046.0
06/27/23 16:08:47 (pid:3201) Checking to see if htcondor is a writeable cgroup
06/27/23 16:08:47 (pid:3201)     Cgroup memory/htcondor is useable
06/27/23 16:08:47 (pid:3201)     Cgroup cpu,cpuacct/htcondor is useable
06/27/23 16:08:47 (pid:3201)     Cgroup freezer/htcondor is useable
06/27/23 16:08:47 (pid:3201) Current mount, /tmp, is shared.
06/27/23 16:08:47 (pid:3201) Current mount, /var, is shared.
06/27/23 16:08:47 (pid:3201) IWD: /home/condor/execute//dir_3201
06/27/23 16:08:47 (pid:3201) Output file: /home/condor/execute//dir_3201/_condor_stdout
06/27/23 16:08:47 (pid:3201) Error file: /home/condor/execute//dir_3201/_condor_stderr
06/27/23 16:08:47 (pid:3201) Renice expr "0" evaluated to 0
06/27/23 16:08:47 (pid:3201) Running job as user herd006
06/27/23 16:08:47 (pid:3201) About to exec /home/condor/execute//dir_3201/condor_exec.exe 0 0 0 10001
06/27/23 16:08:47 (pid:3201)     Cgroup memory/htcondor is useable
06/27/23 16:08:47 (pid:3201)     Cgroup cpu,cpuacct/htcondor is useable
06/27/23 16:08:47 (pid:3201)     Cgroup freezer/htcondor is useable
06/27/23 16:08:47 (pid:3205) Calling sched_setaffinity for cpus 0 
06/27/23 16:08:47 (pid:3201) Moved process 3205 to cgroup /sys/fs/cgroup/memory/htcondor/condor_home_condor_execute__slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
06/27/23 16:08:47 (pid:3201) Moved process 3205 to cgroup /sys/fs/cgroup/cpu,cpuacct/htcondor/condor_home_condor_execute__slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
06/27/23 16:08:47 (pid:3201) Moved process 3205 to cgroup /sys/fs/cgroup/freezer/htcondor/condor_home_condor_execute__slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
06/27/23 16:08:47 (pid:3201) Create_Process succeeded, pid=3205
06/27/23 16:08:55 (pid:3201) Failed to open '.update.ad' to read update ad: No such file or directory (2).
06/27/23 16:08:55 (pid:3201) Failed to open '.update.ad' to read update ad: No such file or directory (2).
06/27/23 16:15:50 (pid:3201) Process exited, pid=3205, status=0
06/27/23 16:15:50 (pid:3201) Failed to write ToE tag to .job.ad file (13): Permission denied
06/27/23 16:15:50 (pid:3201) All jobs have exited... starter exiting
06/27/23 16:15:50 (pid:3201) **** condor_starter (condor_STARTER) pid 3201 EXITING WITH STATUS 0
# condor_config_val $CondorVersion: 10.5.0 2023-06-05 BuildID: 650732 PackageID: 10.5.0-1 $

#
# from /etc/condor/condor_config
#
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
REQUIRE_LOCAL_CONFIG_FILE = false
LOCAL_CONFIG_DIR = /usr/share/condor/config.d,/usr/share/htc/90/pyconf
RUN = $(LOCAL_DIR)/run/condor
LOG = $(LOCAL_DIR)/log/condor
LOCK = $(LOCAL_DIR)/lock/condor
LIB = $(RELEASE_DIR)/lib64/condor
INCLUDE = $(RELEASE_DIR)/include/condor
LIBEXEC = $(RELEASE_DIR)/libexec/condor
SHARE = $(RELEASE_DIR)/share/condor
PROCD_ADDRESS = $(RUN)/procd_pipe
JAVA_CLASSPATH_DEFAULT = $(SHARE) .

#
# from /usr/share/htc/90/pyconf/htc_py.conf
#
T1_SHARED = /usr/share/htc
T1_HTC = 90
T1_SHARED_DIR = $(T1_SHARED)/$(T1_HTC)
T1_SHARED_PY_CONF_DIR = $(T1_SHARED_DIR)/py/conf
T1_SHARED_SCRIPT_DIR = $(T1_SHARED_DIR)/conf/scripts
T1_SHARED_TOOL_DIR = $(T1_SHARED)/cnaf/bin

#
# from /usr/share/htc/90/00_common.conf
#
CENTRAL_MANAGER_1 = htc-1.cr.cnaf.infn.it
CENTRAL_MANAGER_2 = htc-2.cr.cnaf.infn.it
HAD_PORT = 32700
HAD_ARGS = -p $(HAD_PORT)
CONDOR_HOST = $(CENTRAL_MANAGER_1),$(CENTRAL_MANAGER_2)
HAD_LIST = $(CENTRAL_MANAGER_1):$(HAD_PORT), $(CENTRAL_MANAGER_2):$(HAD_PORT)
HAD_USE_PRIMARY = TRUE
HAD_CONNECTION_TIMEOUT = 2
CENTRAL_MANAGER = $(CONDOR_HOST)
COLLECTOR_NAME = T1_HTC_90
UID_DOMAIN = t1htc_90
TRUST_UID_DOMAIN = true
SOFT_UID_DOMAIN = true
LOCAL_DIR = /var
ENABLE_IPV6 = false
DEFAULT_DOMAIN_NAME = cr.cnaf.infn.it
SEC_DEFAULT_AUTHENTICATION_METHODS = IDTOKENS
STARTD_CRON_SHAREDFS_EXECUTABLE = 
HAS_SHAREDFS = True

#
# from /usr/share/htc/90/10_security.conf
#
ALLOW_NEGOTIATOR = condor@* condor_pool@*
SEC_DEFAULT_AUTHENTICATION = required
SEC_DEFAULT_ENCRYPTION = required
SEC_DEFAULT_INTEGRITY = required
SEC_READ_AUTHENTICATION = OPTIONAL
SEC_READ_ENCRYPTION = OPTIONAL
SEC_READ_INTEGRITY = OPTIONAL
SECURITY_MODEL = 9.0
SEC_PASSWORD_FILE = $(SEC_PASSWORD_DIRECTORY)/pool_passwordb
SEC_DAEMON_AUTHENTICATION = REQUIRED
ALLOW_DAEMON = condor_pool@*.$(DEFAULT_DOMAIN_NAME), *.$(DEFAULT_DOMAIN_NAME)
ALLOW_READ = *.$(DEFAULT_DOMAIN_NAME)
ALLOW_WRITE = *.$(DEFAULT_DOMAIN_NAME)
ALLOW_ADMINISTRATOR = $(CONDOR_HOST), root@farm-ops.$(DEFAULT_DOMAIN_NAME)
SIGN_S3_URLS = False

#
# from /usr/share/htc/90/py/t1_htconf.py|
#
DAEMON_LIST = MASTER, STARTD
PUBLISH_OBITUARIES = False
EXECUTE = /home/condor/execute/
SPOOL = /home/condor/spool
MAXJOBRETIREMENTTIME = 86400 * 2
UPDATE_INTERVAL = $RANDOM_INTEGER(300, 750, 1)
MASTER_UPDATE_INTERVAL = $RANDOM_INTEGER(300, 750, 1)
UPDATE_OFFSET = $RANDOM_INTEGER(0,300)
MAX_DISK_USAGE_KB = 160000000
DISK_EXCEEDED = DiskUsage_RAW > $(MAX_DISK_USAGE_KB)
MEMORY = 1.2 * quantize( $(DETECTED_MEMORY), 1000 )
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory,100)
t1_allow_sam = false
t1_wnprod = true
STARTD_CRON_GPFS_OK_EXECUTABLE = /usr/share/htc/scripts/hcron/check_gpfs_daemonized_listdir.py
STARTD_CRON_GPFS_OK_PERIOD = 5m
STARTD_CRON_GPFS_OK_MODE = Periodic
STARTD_CRON_MC_GRACE_EXECUTABLE = /usr/share/htc/scripts/hcron/check_mcrank.py
STARTD_CRON_MC_GRACE_PERIOD = 1m
STARTD_CRON_MC_GRACE_MODE = Periodic
STARTD_CRON_MEMCHECK_EXECUTABLE = /usr/share/htc/scripts/hcron/check_mem.py
STARTD_CRON_MEMCHECK_PERIOD = 1m
STARTD_CRON_MEMCHECK_MODE = Periodic
RANK = TotalCpus - Cpus
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%,mem=90%,auto
SLOT_TYPE_1_PARTITIONABLE = true
CONSUMPTION_POLICY = true
CONSUMPTION_CPUS = TARGET.RequestCpus
CGROUP_MEMORY_LIMIT_POLICY = soft
ASSIGN_CPU_AFFINITY = True
t1_overheat = ((t1_Ambient_T ?: 20) > 30 || ((t1_Inlet_T ?: 40) > 50) || max({t1_CPU1_T ?: 40,t1_CPU2_T ?: 40}) > 85) ?: False
t1_mc_grace = ( (TARGET.RequestCpus > 1) || ((TARGET.RequestCpus == 1) && !(MC_GRACE ?: False)) )
STARTD_CRON_JOBLIST =  GPFS_OK MC_GRACE MEMCHECK JOBCTL
STARTD_CRON_JOBCTL_EXECUTABLE = /usr/share/htc/90/conf/scripts/wn_jobs.py
STARTD_CRON_JOBCTL_PERIOD = 3m
STARTD_CRON_JOBCTL_MODE = Periodic
STARTD_ATTRS =  StartJobs t1_allow_sam t1_wnprod t1_wn_hs06 t1_wn_hepscore t1_overheat t1_mc_grace t1_MemTotal t1_MemAvailable t1_SwapFree t1_SwapTotal t1_sharectl
t1_sharectl = ( (t1_Targetcores[0] =?= 0) || (( split(AcctGroup,".")[0] =?= t1_TargetGroups[1] && RequestCpus =?= t1_Targetcores[1] ) || ( AcctGroup =?= t1_TargetGroups[0] && (t1_Targetcores[0] ?: 0) > int(split(t1_CurrentJobs ?: "none:0",":")[1]))))
StartJobs = True && (!t1_overheat) && (t1_mc_grace) && t1_sharectl
t1_wn_hs06 = 354

#
# from /usr/share/htc/90/py/conf/wn-200-10-11-02-a.conf
#
START = (TARGET.WantRoute =?= "htc_10.5")

# condor_history -lim 1 9613046.0 -l
AccountingGroup = "herd.herd006"
AcctGroup = "herd"
AcctGroupUser = "herd006"
Arguments = "0 0 0 10001"
BatchRuntime = 259200
BlockReadKbytes = 0
BlockReads = 0
BlockWriteKbytes = 0
BlockWrites = 0
BytesRecvd = 18056.0
BytesSent = 61.0
CERequirements = "MY.default_CERequirements,CondorCE"
CPUsUsage = 1.000038717904688
ClusterId = 9613046
Cmd = "htcp308"
CommittedSlotTime = 425.0
CommittedSuspensionTime = 0
CommittedTime = 425
CompletionDate = 1687875350
CondorCE = 1
CpusProvisioned = 1
CumulativeRemoteSysCpu = 0.0
CumulativeRemoteUserCpu = 0.0
CumulativeSlotTime = 425.0
CumulativeSuspensionTime = 0
CurrentHosts = 0
DiskProvisioned = 775753
DiskUsage = 40
DiskUsage_RAW = 40
EncryptExecuteDirectory = false
EnteredCurrentStatus = 1687875350
Environment = "CONDORCE_COLLECTOR_HOST=ce06-htc.cr.cnaf.infn.it:9619"
Err = "htcp308_5846925.0.err"
ExecutableSize = 20
ExecutableSize_RAW = 20
ExitBySignal = false
ExitCode = 0
ExitStatus = 0
GlobalJobId = "ce06-htc.cr.cnaf.infn.it#9613046.0#1687874893"
HepScore = "$$(t1_wn_hepscore:0)"
HoldReason = undefined
HoldReasonCode = undefined
HostFactor = "$$(t1_wn_hs06:0)"
ImageSize = 150
ImageSize_RAW = 128
In = "/dev/null"
Iwd = "/var/lib/condor-ce/spool/6925/0/cluster5846925.proc0.subproc0"
JobCurrentFinishTransferInputDate = 1687874926
JobCurrentFinishTransferOutputDate = 1687875350
JobCurrentStartDate = 1687874925
JobCurrentStartExecutingDate = 1687874927
JobCurrentStartTransferInputDate = 1687874926
JobCurrentStartTransferOutputDate = 1687875350
JobFinishedHookDone = 1687875350
JobIsRunning = (JobStatus =!= 1) && (JobStatus =!= 5)
JobLeaseDuration = 2400
JobMemory = RequestMemory
JobNotification = 0
JobPrio = 0
JobRunCount = 1
JobStartDate = 1687874925
JobStatus = 4
JobUniverse = 5
LastHoldReason = "Spooling input data files"
LastHoldReasonCode = 16
LastJobLeaseRenewal = 1687875350
LastJobStatus = 2
LastMatchTime = 1687874925
LastPublicClaimId = "<131.154.197.210:9618?addrs=131.154.197.210-9618&alias=wn-200-10-11-02-a.cr.cnaf.infn.it&noUDP&sock=startd_19016_f908>#1687528007#1022#..."
LastRemoteHost = "slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
LastSuspensionTime = 0
LeaveJobInQueue = false
MATCH_EXP_HepScore = "0"
MATCH_EXP_HostFactor = "354"
MATCH_EXP_numcpus = "32"
MATCH_TotalSlotCpus = 32
MATCH_t1_wn_hepscore = "0"
MATCH_t1_wn_hs06 = 354
MachineAttrCpus0 = 1
MachineAttrSlotWeight0 = 1
MaxHosts = 1
MaxIdleJobs = 12
MaxJobs = 35
MemoryProvisioned = 2048
MemoryUsage = ((ResidentSetSize + 1023) / 1024)
MinHosts = 1
MyType = "Job"
NumCkpts = 0
NumCkpts_RAW = 0
NumJobCompletions = 1
NumJobMatches = 1
NumJobStarts = 1
NumRestarts = 0
NumShadowStarts = 1
NumSystemHolds = 0
OnExitHold = false
OnExitRemove = true
OrigMaxHosts = 1
Out = "htcp308_5846925.0.out"
Owner = "herd006"
PeriodicHold = false
PeriodicRelease = false
Periodic_Hold = (NumJobStarts >= 1 && JobStatus =?= 1) || NumJobStarts > 1
ProcId = 0
QDate = 1687874893
Rank = 0.0
RecentBlockReadKbytes = 0
RecentBlockReads = 0
RecentBlockWriteKbytes = 0
RecentBlockWrites = 0
RecentStatsLifetimeStarter = 415
ReleaseReason = "Data files spooled"
RemoteSysCpu = 0.0
RemoteUserCpu = 0.0
RemoteWallClockTime = 425.0
Remote_JobUniverse = 5
RequestCpus = 1
RequestDisk = DiskUsage
RequestMemory = 2000
Requirements = (My.NumJobStarts == 0) && ((TARGET.Machine =?= "wn-200-10-11-02-a.cr.cnaf.infn.it"))
ResidentSetSize = 150
ResidentSetSize_RAW = 128
RootDir = "/"
RouteName = "testhtc10"
RoutedBy = "htcondor-ce"
RoutedFromJobId = "5846925.0"
RoutedJob = true
SUBMIT_Cmd = "/home/TIER1/sdalpra/htjobs/CE5/p308/htcp308"
SUBMIT_UserLog = "/home/TIER1/sdalpra/htjobs/CE5/htcp308_5846925.0.log"
ScheddHostName = "ce06-htc"
SciTokensFile = "/tmp/bt_u23031"
ScratchDirFileCount = 10
ShouldTransferFiles = "YES"
SpooledOutputFiles = ""
StartdPrincipal = "execute-side@matchsession/131.154.197.210"
StatsLifetimeStarter = 423
StreamErr = false
StreamOut = false
TargetType = "Machine"
TerminationPending = true
ToE = [ When = 1687875350; ExitCode = 0; Who = "itself"; How = "OF_ITS_OWN_ACCORD"; HowCode = 0; ExitBySignal = false ]
TotalSubmitProcs = 1
TotalSuspensions = 0
TransferIn = false
TransferInFinished = 1687874926
TransferInStarted = 1687874926
TransferInputSizeMB = 0
TransferOutFinished = 1687875350
TransferOutStarted = 1687875350
TransferOutputRemaps = undefined
User = "herd006@t1htc_90"
WantCheckpoint = false
WantGPU = true
WantRemoteIO = true
WantRemoteSyscalls = false
WantRoute = "htc_10.5"
WhenToTransferOutput = "ON_EXIT"
numcpus = "$$(TotalSlotCpus:16)"
orig_AuthTokenId = "09047625-1519-4475-a914-0dd6a76f2dd4"
orig_AuthTokenIssuer = "https://iam-herd.cloud.cnaf.infn.it/";
orig_AuthTokenScopes = "openid,compute.create,offline_access,compute.read,compute.cancel,compute.modify"
orig_AuthTokenSubject = "6f925657-f9aa-4cb6-b264-a3b1ee78df57"
orig_OnExitHold = false
orig_environment = ""
osg_environment = ""
remote_NodeNumber = 1
remote_SMPGranularity = 1
remote_queue = ""