just another update: we have been eyeballed a node when/if its resources/slot get considered for a job. For ~45m the node's master/startd daemons are up after a reload [1.a,1.b]. Since then on the Negotiator/Collector not much happened (both in full debug). The node has not appeared in the NegotiatorLog [2]; according to the CollectorLog the collector noticed the node disappearing, when the local master got restarted, and reporting abck again shortly afterwards [3] - since then not much happened except for occasional Stard class ad updates. At the same time there are still about 3k idle jobs waiting to be considered by the matchmaker - now a number of them being idle for ~100m. Our current impression is, that the jobs and nodes both announce their wishes/offers, but both are never really get matched to each other (in a negotiation cycle) [4]. We have been trying to saturate the cluster with jobs running under the DESY accounting group, which has a nominally small share but with surplus allowed should be able to get quite a number of unallocated slots GROUP_ACCEPT_SURPLUS = TRUE but we are not sure, if/how the nagotiator actually brokers the jobs to the available resources [4.b] Cheers, Thomas [1.a] [root@batch0350 /var/log/condor]# tail -f MasterLog 02/05/20 14:45:12 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s. 02/05/20 14:45:12 DaemonCore: private command socket at <131.169.161.229:0?sock=25772_06df> 02/05/20 14:45:12 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False) 02/05/20 14:45:12 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1577238779) 02/05/20 14:45:12 Collector port not defined, will use default: 9618 02/05/20 14:45:12 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 25804 02/05/20 14:45:12 Waiting for /var/lock/condor/shared_port_ad to appear. 02/05/20 14:45:13 Found /var/lock/condor/shared_port_ad. 02/05/20 14:45:13 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 25808 02/05/20 14:45:23 Setting ready state 'Ready' for STARTD [1.b] [root@batch0350 /var/log/condor]# tail -n 20 -f StartLog 02/05/20 14:45:20 Running: /usr/bin/docker container prune -f --filter=label=org.htcondorproject=True 02/05/20 14:45:20 Failed to read results from '/usr/bin/docker container prune -f --filter=label=org.htcondorproject=True': 'No such file or directory' (2) 02/05/20 14:45:20 my_popenv: Failed to exec in child, errno=2 (No such file or directory) 02/05/20 14:45:20 my_popenv failed 02/05/20 14:45:20 Unable to calculate keyboard/mouse idle time due to them both being USB or not present, assuming infinite idle time for these devices. 02/05/20 14:45:20 CronJobList: Adding job 'NODEHEALTH' 02/05/20 14:45:20 CronJob: Initializing job 'NODEHEALTH' (/etc/condor/tests/healthcheck_wn_condor.sh) 02/05/20 14:45:20 CronJobList: Adding job 'kflops' 02/05/20 14:45:20 CronJobList: Adding job 'mips' 02/05/20 14:45:20 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops) 02/05/20 14:45:20 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips) 02/05/20 14:45:20 slot1: State change: IS_OWNER is false 02/05/20 14:45:20 slot1: Changing state: Owner -> Unclaimed 02/05/20 14:45:20 State change: RunBenchmarks is TRUE 02/05/20 14:45:20 slot1: Changing activity: Idle -> Benchmarking 02/05/20 14:45:20 BenchMgr:StartBenchmarks() 02/05/20 14:45:23 Initial update sent to collector(s) 02/05/20 14:45:23 Sending DC_SET_READY message to master <131.169.161.229:9620?addrs=131.169.161.229-9620&noUDP&sock=25772_06df> 02/05/20 14:45:47 State change: benchmarks completed 02/05/20 14:45:47 slot1: Changing activity: Benchmarking -> Idle [2] [root@condor01 condor]# condor_status -negotiator Name LastCycleEnd (Sec) Slots Submitrs Schedds Jobs Matches Rejections condor01.desy.de 2/5 15:40 2 2984 1 1 3 5 1 [root@condor01 condor]# condor_status -collector Name Machine RunningJobs IdleJobs HostsTotal DESY Condor Pool - condor01.desy.de@xxxxxxxxxxxxxxxx condor01.desy.de 2613 3690 2994 [root@condor01 condor]# grep 131.169.161.229 NegotiatorLog | grep "02/05/20" | tail [root@condor01 condor]# echo $? 0 [3] [root@condor01 condor]# grep 131.169.161.229 CollectorLog | grep "02/05/20" | tail -n 30 02/05/20 13:39:59 **** Removed(1) ad(s): "< slot1_8@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 13:39:59 **** Removed(1) ad(s): "< slot1_8@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 13:39:59 **** Removed(1) ad(s): "< slot1_2@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 13:39:59 **** Removed(1) ad(s): "< slot1_2@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 13:39:59 **** Removed(1) ad(s): "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 13:39:59 **** Removed(1) ad(s): "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 13:39:59 DaemonCore: Can't receive command request from 131.169.161.229 (perhaps a timeout?) 02/05/20 13:41:59 DaemonCore: Can't receive command request from 131.169.161.229 (perhaps a timeout?) 02/05/20 13:42:11 StartdAd : Inserting ** "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 13:42:11 StartdPvtAd : Inserting ** "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 14:45:09 **** Removed(1) ad(s): "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 14:45:09 **** Removed(1) ad(s): "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 14:45:09 DaemonCore: Can't receive command request from 131.169.161.229 (perhaps a timeout?) 02/05/20 14:45:09 DaemonCore: Can't receive command request from 131.169.161.229 (perhaps a timeout?) 02/05/20 14:45:23 StartdAd : Inserting ** "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 14:45:23 StartdPvtAd : Inserting ** "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 15:00:23 StartdAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 15:00:23 StartdPvtAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 15:05:23 StartdAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 15:05:23 StartdPvtAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 15:10:24 StartdAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 15:10:24 StartdPvtAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 15:15:23 StartdAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 15:15:23 StartdPvtAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 15:20:23 StartdAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 15:20:23 StartdPvtAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 15:25:23 StartdAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 15:25:23 StartdPvtAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 15:30:23 StartdAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" 02/05/20 15:30:23 StartdPvtAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxx , 131.169.161.229 >" [4] Negotiator and Collector Logs as compressed tarball under https://desycloud.desy.de/index.php/s/ja7q2g5rcyqn3GC/download [4.b] 02/05/20 15:47:56 Group group_DESY - BEGIN NEGOTIATION 02/05/20 15:46:57 NEGOTIATOR_STRICT_ENFORCE_QUOTA is true, current proposed allocation for group_DESY is 3656 ... 02/05/20 15:47:56 Round 1 totals: allocated= 6169 usage= 6168 ... 02/05/20 15:47:56 group quotas: allocate-surplus (4): group group_DESY allocated surplus= 2184.66 allocated= 3655 requested= 0 ... 02/05/20 15:47:56 group quotas: groups= 9 requesting= 5 served= 5 unserved= 0 slots= 7270 requested= 6168 allocated= 6168 surplus= 10888 maxdelta= 0
AcceptedWhileDraining = false Activity = "Idle" AddressV1 = "{[ p=\"primary\"; a=\"131.169.161.229\"; port=9620; n=\"Internet\"; spid=\"25772_06df_3\"; noUDP=true; ], [ p=\"IPv4\"; a=\"131.169.161.229\"; port=9620; n=\"Internet\"; spid=\"25772_06df_3\"; noUDP=true; ]}" Arch = "X86_64" AuthenticatedIdentity = "unauthenticated@unmapped" AuthenticationMethod = undefined CanHibernate = true CheckpointPlatform = "LINUX X86_64 3.10.0-1062.7.1.el7.x86_64 normal N/A avx ssse3 sse4_1 sse4_2" ChildAccountingGroup = { } ChildActivity = { } ChildCpus = { } ChildCurrentRank = { } ChildDisk = { } ChildEnteredCurrentState = { } ChildMemory = { } ChildName = { } ChildRemoteOwner = { } ChildRemoteUser = { } ChildRetirementTimeRemaining = { } ChildState = { } ClockDay = 3 ClockMin = 930 COLLECTOR_HOST_STRING = "condor01.desy.de:9618" CondorLoadAvg = 0.0 CondorPlatform = "$CondorPlatform: x86_64_RedHat7 $" CondorVersion = "$CondorVersion: 8.8.7 Dec 24 2019 BuildID: 493225 PackageID: 8.8.7-1 $" ConsoleIdle = 5332576 ConsumptionCpus = quantize(target.RequestCpus,{ 1 }) ConsumptionDisk = quantize(target.RequestDisk,{ 1024 }) ConsumptionMemory = quantize(target.RequestMemory,{ 128 }) CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.5) CpuBusyTime = 0 CpuCacheSize = 2048 CpuFamily = 21 CpuIsBusy = false CpuModelNumber = 2 Cpus = 64 CurrentRank = 0.0 DaemonCoreDutyCycle = 0.0001180435969121429 DaemonLastReconfigTime = 1580910313 DaemonStartTime = 1580910313 DetectedCpus = 64 DetectedMemory = 257757 DEV_RESOURCE = false Disk = 958636676 EnteredCurrentActivity = 1580910347 EnteredCurrentState = 1580910320 ExpectedMachineGracefulDrainingBadput = 0 ExpectedMachineGracefulDrainingCompletion = 1580910320 ExpectedMachineQuickDrainingBadput = 0 ExpectedMachineQuickDrainingCompletion = 1580910320 FileSystemDomain = "desy.de" GRID_RESOURCE = true HardwareAddress = "00:8c:fa:56:d7:64" has_avx = true has_sse4_1 = true has_sse4_2 = true has_ssse3 = true HasFileTransfer = true HasFileTransferPluginMethods = "file,ftp,http,data,https" HasIOProxy = true HasJava = true HasJICLocalConfig = true HasJICLocalStdin = true HasJobDeferral = true HasMPI = true HasPerFileEncryption = true HasReconnect = true HasSelfCheckpointTransfers = true HasSingularity = true HasTDP = true HasTransferInputRemaps = true HasVM = false HibernationLevel = 0 HibernationState = "NONE" HibernationSupportedStates = "S3,S4" IsLocalStartd = false IsValidCheckpointPlatform = (TARGET.JobUniverse =!= 1 || ((MY.CheckpointPlatform =!= undefined) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0)))) IsWakeAble = true IsWakeOnLanEnabled = true IsWakeOnLanSupported = true JavaMFlops = 881.762756 JavaSpecificationVersion = "1.8" JavaVendor = "Oracle Corporation" JavaVersion = "1.8.0_242" JobPreemptions = 0 JobRankPreemptions = 0 JobStarts = 0 JobUserPrioPreemptions = 0 KeyboardIdle = 44 KFlops = 998883 LastBenchmark = 1580910347 LastFetchWorkCompleted = 0 LastFetchWorkSpawned = 0 LastHeardFrom = 1580913023 LastUpdate = 1580910347 LoadAvg = 0.3 Machine = "batch0350.desy.de" MachineMaxVacateTime = 1800 MachineResources = "Cpus Memory Disk Swap" MaxJobRetirementTime = 259200 Memory = 257757 Mips = 17744 MonitorSelfAge = 2648 MonitorSelfCPUUsage = 0.0 MonitorSelfImageSize = 67400 MonitorSelfRegisteredSocketCount = 0 MonitorSelfResidentSetSize = 7312 MonitorSelfSecuritySessions = 67 MonitorSelfTime = 1580912960 MyAddress = "<131.169.161.229:9620?addrs=131.169.161.229-9620&noUDP&sock=25772_06df_3>" MyCurrentTime = 1580913023 MyType = "Machine" NAF_RESOURCE = false Name = "slot1@xxxxxxxxxxxxxxxxx" NextFetchWorkDelay = -1 NODE_IS_HEALTHY = true NODE_STATUS = "All_OK" NumClaims = 64 NumDynamicSlots = 0 NumPids = 0 OLD_RESOURCE = false OnlyMulticore = false OpSys = "LINUX" OpSysAndVer = "CentOS7" OpSysLegacy = "LINUX" OpSysLongName = "CentOS Linux release 7.7.1908 (Core)" OpSysMajorVer = 7 OpSysName = "CentOS" OpSysShortName = "CentOS" OpSysVer = 707 PartitionableSlot = true PslotRollupInformation = true Rank = 0.0 RecentDaemonCoreDutyCycle = 0.0001122386815403109 RecentJobPreemptions = 0 RecentJobRankPreemptions = 0 RecentJobStarts = 0 RecentJobUserPrioPreemptions = 0 Requirements = (START) && (IsValidCheckpointPlatform) && (WithinResourceLimits) RetirementTimeRemaining = 0 SingularityVersion = "singularity version 3.5.2-1.1.el7" SlotID = 1 SlotType = "Partitionable" SlotTypeID = 1 SlotWeight = Cpus Start = (NODE_IS_HEALTHY =?= true) && (StartJobs =?= true) StartdIpAddr = "<131.169.161.229:9620?addrs=131.169.161.229-9620&noUDP&sock=25772_06df_3>" StarterAbilityList = "HasJava,HasJICLocalStdin,HasJICLocalConfig,HasTDP,HasSingularity,HasPerFileEncryption,HasFileTransfer,HasTransferInputRemaps,HasVM,HasReconnect,HasMPI,HasFileTransferPluginMethods,HasJobDeferral,HasSelfCheckpointTransfers" StartJobs = true State = "Unclaimed" SubnetMask = "255.255.248.0" TargetType = "Job" TEST_RESOURCE = true TimeToLive = 2147483647 TotalCondorLoadAvg = 0.0 TotalCpus = 64.0 TotalDisk = 958636676 TotalLoadAvg = 0.3 TotalMemory = 257757 TotalSlotCpus = 64 TotalSlotDisk = 958636676.0 TotalSlotMemory = 257757 TotalSlots = 1 TotalTimeUnclaimedBenchmarking = 27 TotalTimeUnclaimedIdle = 2676 TotalVirtualMemory = 272064580 UidDomain = "desy.de" Unhibernate = MY.MachineLastMatchTime =!= undefined UpdateSequenceNumber = 11 UpdatesHistory = "00000000000000000000000000000000" UpdatesLost = 0 UpdatesSequenced = 54 UpdatesTotal = 57 UtsnameMachine = "x86_64" UtsnameNodename = "batch0350.desy.de" UtsnameRelease = "3.10.0-1062.7.1.el7.x86_64" UtsnameSysname = "Linux" UtsnameVersion = "#1 SMP Mon Dec 2 17:33:29 UTC 2019" VirtualMemory = 272064580 WakeOnLanEnabledFlags = "Magic Packet" WakeOnLanSupportedFlags = "Physical Packet,UniCast Packet,MultiCast Packet,BroadCast Packet,Magic Packet" WithinResourceLimits = (ifThenElse(TARGET._cp_orig_RequestCpus =!= undefined,TARGET.RequestCpus <= MY.Cpus,MY.ConsumptionCpus <= MY.Cpus) && ifThenElse(TARGET._cp_orig_RequestDisk =!= undefined,TARGET.RequestDisk <= MY.Disk,MY.ConsumptionDisk <= MY.Disk) && ifThenElse(TARGET._cp_orig_RequestMemory =!= undefined,TARGET.RequestMemory <= MY.Memory,MY.ConsumptionMemory <= MY.Memory))
Attachment:
binNjRtg1gseF.bin
Description: Binary data
Attachment:
smime.p7s
Description: S/MIME Cryptographic Signature