Hello,

We are using HTCondor 8.4.4 and are having trouble with jobs that get evicted, either because the computing machine is turned off or because the scheduling machine loses contact with the computing instance and cannot reconnect before the job is rescheduled. After eviction the jobs go idle and stay idle, even though there are computing instances available to take them. We are using partitionable slots on our computing machines.

From investigating, it appears that matching is failing on the requirement TARGET.Disk >= RequestDisk. The condor.submit for this job does not specify RequestDisk; it only specifies the required number of CPUs and the memory. The jobs run on EC2 instances that each have a 1 TB local disk and share an 8 EB EFS volume, and the jobs write to both of these locations.

My main confusion is that the post-eviction job_ad now has DiskUsage = 42500000 and RequestDisk = 42500096, while the slot_ad advertises Disk = 42498100 (the job_ad and slot_ad were taken from a worker that normally runs the job correctly). This is clearly where the requirements fail, but I have checked the machine and almost the entire 1 TB of its local drive is free, so I don’t understand why the slot_ad is limited to a Disk value that is just slightly lower than what the job needs.

I have provided output below:

condor_q -better-analyze:

-- Schedd: htcondorscheduler1.localdomain : <
User priority for
condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx is not available, attempting to analyze without it.
---
5521.000:  Run analysis summary.  Of 19 machines,
     16 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job

Last successful match: Wed Apr 19 14:15:23 2017
Last failed match: Wed Apr 19 14:34:43 2017

Reason for last match failure: no match found

The Requirements expression for your job is:

    ( HAS_DOCKER && HAS_RCP_DFS && target.machine isnt MachineAttrMachine1 &&
      target.machine isnt MachineAttrMachine2 ) &&
    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( ( TARGET.HasFileTransfer ) ||
      ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )

Your job defines the following attributes:

    DiskUsage = 42500000
    FileSystemDomain = "htcondorscheduler1.localdomain"
    RequestDisk = 42500000
    RequestMemory = 20480

The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          18  HAS_DOCKER
[1]          18  HAS_RCP_DFS
[9]          18  TARGET.OpSys == "LINUX"
[11]          8  TARGET.Disk >= RequestDisk
[13]          3  TARGET.Memory >= RequestMemory
[15]         19  TARGET.HasFileTransfer

Suggestions:

    Condition                                    Machines Matched  Suggestion
    ---------                                    ----------------  ----------
1   HAS_DOCKER                                   0                 REMOVE
2   HAS_RCP_DFS                                  0                 REMOVE
3   ( TARGET.Memory >= 20480 )                   3
4   ( TARGET.Disk >= 42500000 )                  8
5   ( TARGET.OpSys == "LINUX" )                  18
6   target.machine isnt MachineAttrMachine1      19
7   target.machine isnt MachineAttrMachine2      19
8   ( TARGET.Arch == "X86_64" )                  19
9   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "htcondorscheduler1.localdomain" ) )

Job_ad (From the worker machine):

Arguments = "426465790066.dkr.ecr.us-east-1.amazonaws.com/ai-terrestrial_pipeline_all:4.1.17 terrestrial-stage_1400 --workflow=workflow.json"
AutoClusterAttrs = "ConcurrencyLimits,NiceUser,Rank,Requirements,_condor_RequestCpus,_condor_RequestDisk,_condor_RequestMemory,JobUniverse,LastCheckpointPlatform,NumCkpts,RequestCpus,RequestDisk,RequestMemory,MachineLastMatchTime,DiskUsage,FileSystemDomain"
AutoClusterId = 13
BufferBlockSize = 32768
BufferSize = 524288
BytesRecvd = 9386.0
BytesSent = 0.0
ClusterId = 5521
Cmd = "/usr/local/bin/condor-docker"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0
CompletionDate = 0
CondorPlatform = "$CondorPlatform: x86_64_Ubuntu14 $"
CondorVersion = "$CondorVersion: 8.4.4 Feb 03 2016 BuildID: 355883 $"
CoreSize = 0
CpusProvisioned = 36
CumulativeSlotTime = 492.0
CumulativeSuspensionTime = 0
CurrentHosts = 0
DAGManJobId = 5519
DAGManNodesLog = "/disk-root/condor/execute/HT028_1407138585/260/./HT028_1407138585.dagman.nodes.log"
DAGManNodesMask = "0,1,2,4,5,7,9,10,11,12,13,16,17,24,27"
DAGNodeName = "stage_1400"
DAGParentNodeNames = "stage_1100"
DiskProvisioned = 1073184440
DiskUsage = 42500000
DiskUsage_RAW = 40584772
EncryptExecuteDirectory = false
EnteredCurrentStatus = 1492611815
Environment = ""
Err = "job.stderr.5521"
ExecutableSize = 0
ExecutableSize_RAW = 0
ExitBySignal = false
ExitStatus = 0
FileSystemDomain = "htcondorscheduler1.localdomain"
GlobalJobId = "htcondorscheduler1.localdomain#5521.0#1492611317"
ImageSize = 750000
ImageSize_RAW = 623124
In = "/dev/null"
Iwd = "/disk-root/condor/execute/HT028_1407138585/260/stage_1400"
JobCurrentStartDate = 1492611323
JobCurrentStartExecutingDate = 1492611323
JobLeaseDuration = 600
JobMachineAttrs = "Machine"
JobMachineAttrsHistoryLength = 5
JobNotification = 0
JobPrio = 1400
JobRunCount = 1
JobStartDate = 1492611323
JobStatus = 1
JobUniverse = 5
KeepClaimIdle = 20
LastJobLeaseRenewal = 1492611814
LastJobStatus = 2
LastMatchTime = 1492611323
LastPublicClaimId = "<>#1492611105#1#..."
LastRejMatchReason = "no match found "
LastRejMatchTime = 1492612483
LastRemoteHost = "slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
LastSuspensionTime = 0
LastVacateTime = 1492611815
LeaveJobInQueue = false
LocalSysCpu = 0.0
LocalUserCpu = 0.0
MachineAttrCpus0 = 1
MachineAttrMachine0 = "ip-10-122-226-188.localdomain"
MachineAttrSlotWeight0 = 1
MaxHosts = 1
MemoryProvisioned = 60387
MemoryUsage = ( ( ResidentSetSize + 1023 ) / 1024 )
MinHosts = 1
MyType = "Job"
NiceUser = false
NumCkpts = 0
NumCkpts_RAW = 0
NumJobMatches = 1
NumJobStarts = 1
NumRestarts = 0
NumShadowStarts = 1
NumSystemHolds = 0
OrigMaxHosts = 1
Out = "job.stdout.5521"
Owner = "condor"
PeriodicHold = false
PeriodicRelease = false
PeriodicRemove = ( ( JobStatus == 5 ) && ( CurrentTime - EnteredCurrentStatus ) > 300 )
ProcId = 0
ProvisionedResources = "Cpus Memory Disk Swap"
QDate = 1492611317
Rank = 0.0
RemoteAutoregroup = false
RemoteNegotiatingGroup = "<none>"
RemoteSysCpu = 0.0
RemoteUserCpu = 0.0
RemoteWallClockTime = 492.0
RequestCpus = 1
RequestDisk = 42500096
RequestMemory = 20480
Requirements = ( HAS_DOCKER && HAS_RCP_DFS && target.machine =!= MachineAttrMachine1 && target.machine =!= MachineAttrMachine2 ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )
ResidentSetSize = 12500
ResidentSetSize_RAW = 12192
RootDir = "/"
ServerTime = 1492612963
ShouldTransferFiles = "IF_NEEDED"
StartdPrincipal = "execute-side@matchsession/"
StartdSendsAlives = true
StreamErr = false
StreamOut = false
SubmitEventNotes = "DAG Node: stage_1400"
TargetType = "Machine"
TotalSuspensions = 0
TransferExecutable = false
TransferIn = false
TransferInput = "../workflow.json"
TransferInputSizeMB = 0
TransferOutput = "OUT"
User = "condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
UserLog = "/disk-root/condor/execute/HT028_1407138585/260/stage_1400/job.log"
WantCheckpoint = false
WantRemoteIO = true
WantRemoteSyscalls = false
WhenToTransferOutput = "ON_EXIT"
_condor_SEND_LEFTOVERS = false
_condor_SEND_PAIRED_SLOT = true
_condor_StartdHandlesAlives = true

Slot_ad (From the worker machine):

Activity = "Idle"
AddressV1 = "{[ p=\"primary\"; a=\"\"; port=37559; n=\"Internet\"; ], [ p=\"IPv4\"; a=\"\"; port=37559; n=\"Internet\"; ]}"
Arch = "X86_64"
CLAIM_WORKLIFE = 1200
COLLECTOR_HOST_STRING = ""
CONTINUE = true
CanHibernate = true
CheckpointPlatform = "LINUX X86_64 3.13.0-91-generic normal 0x2aaaaaaab000 ssse3 sse4_1 sse4_2"
ClockDay = 3
ClockMin = 882
CondorLoadAvg = 0.0
CondorPlatform = "$CondorPlatform: x86_64_Ubuntu14 $"
CondorVersion = "$CondorVersion: 8.4.4 Feb 03 2016 BuildID: 355883 $"
ConsoleIdle = 359
CpuBusy = ( ( LoadAvg - CondorLoadAvg ) >= 0.5 )
CpuBusyTime = 0
CpuIsBusy = false
Cpus = 1
CurrentRank = 0.0
DaemonCoreDutyCycle = -0.1805678648829709
DetectedCpus = 36
DetectedMemory = 60387
Disk = 42498100
DynamicSlot = true
EnteredCurrentActivity = 1492612964
EnteredCurrentState = 1492612964
ExpectedMachineGracefulDrainingBadput = 0
ExpectedMachineGracefulDrainingCompletion = 1492612605
ExpectedMachineQuickDrainingBadput = 0
ExpectedMachineQuickDrainingCompletion = 1492612605
FileSystemDomain = "ip-10-122-225-241.localdomain"
HAS_AWS = true
HAS_DOCKER = true
HAS_RCP_DFS = true
HardwareAddress = "12:fc:4f:64:cc:26"
HasCheckpointing = true
HasEncryptExecuteDirectory = true
HasFileTransfer = true
HasFileTransferPluginMethods = "file,ftp,http,data"
HasIOProxy = true
HasJICLocalConfig = true
HasJICLocalStdin = true
HasJobDeferral = true
HasMPI = true
HasPerFileEncryption = true
HasReconnect = true
HasRemoteSyscalls = true
HasTDP = true
HasVM = false
HibernationLevel = 0
HibernationState = "NONE"
HibernationSupportedStates = "S3,S4,S5"
IsLocalStartd = false
IsOwner = ( START =?= false )
IsValidCheckpointPlatform = ( TARGET.JobUniverse =!= 1 || ( ( MY.CheckpointPlatform =!= undefined ) && ( ( TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform ) || ( TARGET.NumCkpts == 0 ) ) ) )
IsWakeAble = false
IsWakeOnLanEnabled = false
IsWakeOnLanSupported = false
JobPreemptions = 0
JobRankPreemptions = 0
JobStarts = 0
JobUserPrioPreemptions = 0
KFlops = 1750755
KILL = false
KeyboardIdle = 359
LastBenchmark = 1492612631
LastFetchWorkCompleted = 0
LastFetchWorkSpawned = 0
LastUpdate = 1492612631
LoadAvg = 0.0
Machine = "ip-10-122-225-241.localdomain"
MachineMaxVacateTime = 10 * 60
MachineResources = "Cpus Memory Disk Swap"
MaxJobRetirementTime = 0
Memory = 20480
Mips = 24337
MonitorSelfAge = 241
MonitorSelfCPUUsage = 0.008310156277141429
MonitorSelfImageSize = 45312
MonitorSelfRegisteredSocketCount = 1
MonitorSelfResidentSetSize = 6212
MonitorSelfSecuritySessions = 3
MonitorSelfTime = 1492612845
MyAddress = "<>"
MyCurrentTime = 1492612964
MyType = "Machine"
Name = "slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
NextFetchWorkDelay = -1
NumPids = 0
OpSys = "LINUX"
OpSysAndVer = "Ubuntu14"
OpSysLegacy = "LINUX"
OpSysLongName = "Ubuntu 14.04.4 LTS"
OpSysMajorVer = 14
OpSysName = "Ubuntu"
OpSysShortName = "Ubuntu"
OpSysVer = 1404
PERIODIC_CHECKPOINT = ( ( time() - LastPeriodicCheckpoint ) / 60.0 ) > ( 180.0 + -7 )
PREEMPT = ( false ) || ( TotalDisk < 1000000 )
ParentSlotId = 1
PrivateNetworkName = "ip-10-122-225-241.localdomain"
PslotRollupInformation = true
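
As a sanity check on the numbers in the job_ad and slot_ad above, here is a minimal Python sketch of the arithmetic behind the failing clause TARGET.Disk >= RequestDisk. The three constants are copied from the ads; the helper names round_up and disk_clause are made up for illustration and are not HTCondor functions. The sketch assumes the values are in KiB. It also shows that 42500096 is exactly 42500000 rounded up to the next multiple of 1024. That is only an observation about the numbers; I have not confirmed which setting, if any, produces that rounding.

```python
# Values copied from the job_ad and slot_ad above (assumed to be KiB).
DISK_USAGE = 42500000      # job_ad DiskUsage
REQUEST_DISK = 42500096    # job_ad RequestDisk after eviction
SLOT_DISK = 42498100       # slot_ad Disk on the dynamic slot


def round_up(value: int, multiple: int) -> int:
    """Round value up to the next multiple of `multiple` (hypothetical helper)."""
    return -(-value // multiple) * multiple


def disk_clause(slot_disk: int, request_disk: int) -> bool:
    """Plain-Python mirror of the clause TARGET.Disk >= RequestDisk."""
    return slot_disk >= request_disk


# Does RequestDisk equal DiskUsage rounded up to a multiple of 1024?
rounded = round_up(DISK_USAGE, 1024)
# How far short of the request is the slot?
shortfall = REQUEST_DISK - SLOT_DISK

print("DiskUsage rounded up to 1024:", rounded)
print("matches RequestDisk:", rounded == REQUEST_DISK)
print("disk clause satisfied:", disk_clause(SLOT_DISK, REQUEST_DISK))
print("shortfall:", shortfall)
```

With these inputs the clause evaluates to false, and the slot falls short of the request by only 1996 out of roughly 42.5 million, which is why the match fails even though the machine has almost 1 TB free.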