One of the folks here tried to get this posted to the mailing list but was having some problems doing so, so I'm forwarding it:

Hello htcondor-users,

In our scaling studies within the LSST data management project we have started to encounter several error modes while using HTCondor and HTCondor DAGMan. In our tests we use DAGMan to run a collection of identical jobs, each of which takes about 6 minutes to execute. We are processing on the TACC Lonestar cluster, using Glidein to add Lonestar compute nodes to our working pool; our central manager runs on a machine at NCSA. We are using CCB, with the collector on our central manager acting as the CCB server (a rough sketch of the glidein-side configuration appears at the end of this message). We are running 8064 test jobs at scales of 504, 1008, and 2016 cores / HTCondor slots, using 42, 84, and 168 nodes at 12 cores per node, respectively.

1) One collection of error modes consists of messages written to the xxxxx.dag.nodes.log file, of the form:

...
007 (92602.000.000) 04/07 19:49:35 Shadow exception!
    Error from slot11@17868@xxxxxxxxxxxxxxxxxxxxxxxxxxxx: FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <206.76.195.45:33839>
...
007 (92299.000.000) 04/07 19:49:35 Shadow exception!
    Error from slot9@17868@xxxxxxxxxxxxxxxxxxxxxxxxxxxx: ProcD has failed
...
024 (95543.000.000) 04/07 20:19:43 Job reconnection failed
    Job disconnected too long: JobLeaseDuration (1200 seconds) expired
    Can not reconnect to slot8@3632@xxxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...

We observed these errors at the 2016-core scale, but not at the 504- or 1008-core scales. We have made some progress in mitigating them but would like to understand further: with DAGMAN_USER_LOG_SCAN_INTERVAL=5 such errors occur at the 2016-core scale, but with the higher setting DAGMAN_USER_LOG_SCAN_INTERVAL=40 we do not observe them (the settings we are comparing are also sketched at the end of this message). Our hypothesis is that the errors result from a busy/backed-up schedd, and we suspect the parameter change has lightened the load on the schedd, or perhaps on the collector, and thereby diminished the load on the central manager. Can Condor experts shed light on the results we see with these various parameter settings and on the errors observed? Is there a more direct parameter change that might affect these errors?

2) At any of these scales, but primarily at the high end of 2016 cores, we observe JobEvictedEvents in xxxxx.dag.nodes.log like:

...
028 (95502.000.000) 04/07 20:24:32 Job ad information event triggered.
Proc = 0
EventTime = "2013-04-07T20:24:32"
TriggerEventTypeName = "ULOG_JOB_EVICTED"
TriggerEventTypeNumber = 4
RunRemoteUsage = "Usr 0 00:03:20, Sys 0 00:00:01"
RunLocalUsage = "Usr 0 00:00:00, Sys 0 00:00:00"
SentBytes = 0.0
MyType = "JobEvictedEvent"
Checkpointed = false
TerminatedAndRequeued = false
Cluster = 95502
MachineSlotName = "slot2@18857@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
Subproc = 0
EventTypeNumber = 28
CurrentTime = time()
ReceivedBytes = 3242.000000
TerminatedNormally = false
...
004 (95403.000.000) 04/07 20:24:32 Job was evicted.
...

We have very little visibility into the cause of these job evictions. Can the Condor team advise on how to debug the cause of these evictions, and how they might be addressed?
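For reference, the glidein-side (execute node) configuration we believe to be in effect looks roughly like the sketch below. The parameter names are standard HTCondor configuration knobs, but the host name is a placeholder and the actual glidein configuration is generated for us, so treat this as an approximation rather than our literal config files:

# Sketch of the glidein startd configuration (approximate; host name is a placeholder)
CONDOR_HOST    = cm.ncsa.example.edu        # central manager at NCSA
COLLECTOR_HOST = $(CONDOR_HOST)
CCB_ADDRESS    = $(COLLECTOR_HOST)          # route startd <-> shadow connections through the collector's CCB server
DAEMON_LIST    = MASTER, STARTD
NUM_CPUS       = 12                         # 12 cores per Lonestar node -> 12 slots per node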
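The DAGMAN_USER_LOG_SCAN_INTERVAL settings we are comparing in (1) amount to the following. The dagman.config file name is our own choice; DAGMAN_USER_LOG_SCAN_INTERVAL is a standard HTCondor configuration parameter controlling how often, in seconds, condor_dagman scans the node job log for new events:

# dagman.config, referenced from the .dag file with a line such as
#   CONFIG dagman.config
# (or passed with condor_submit_dag -config dagman.config)
DAGMAN_USER_LOG_SCAN_INTERVAL = 40    # was 5; with 40 we no longer see the shadow exceptions at 2016 cores

The reconnection failures above reference JobLeaseDuration, which corresponds to the per-job lease set in the submit description; a longer value (2400 below is only illustrative, not what we currently use) would give a disconnected job more time to reconnect before being rescheduled:

# per-job submit description (illustrative value)
job_lease_duration = 2400    # seconds the claim is kept alive while waiting for the shadow to reconnect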
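On (2), to be concrete about what we have to work with, the obvious starting points on our side are the job history and the daemon logs, along the lines below. The log paths are placeholders for whatever $(LOG) is on the relevant machine, and 95403 is one of the evicted jobs from the excerpt above:

condor_history -l 95403                   # full ClassAd of the evicted job once it leaves the queue
grep 95403 /var/log/condor/ShadowLog      # shadow's view on the submit machine (path is a placeholder)
grep 95403 /var/log/condor/SchedLog       # schedd's view of the disconnect/eviction (path is a placeholder)

The StarterLog inside the glidein's scratch directory on the Lonestar node would presumably show the execute-side story, but those directories may already be cleaned up by the time we look.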