To clarify "we would rather prefer jobs to restart": I would expect shadows to restart and reconnect to jobs. Jobs on the startd should keep on running for some time without the shadow.

Cheerio,
Max

> On 18.01.2017, at 11:45, Fischer, Max (SCC) <max.fischer@xxxxxxx> wrote:
>
> Hi all,
>
> We've seen a massive number of jobs vacating after a schedd restart (via Puppet). This seems to have affected many, but *not all*, jobs.
> While we know the trigger, the behaviour doesn't make sense to us. On another cluster used for testing, schedds restart frequently without evicting jobs.
> Since such change-config-and-restart cycles are not uncommon, we would rather prefer jobs to restart...
>
> Doing a manual restart via systemctl/service leads to the same result. The main difference between the clusters is the OS.
> I'm not aware of any configuration switch for this, so I cannot rule out that one exists and is set differently.
>
> Failing schedd:
>   OS: SL7
>   Condor: 8.4.10
>   restarted via: systemctl
> Testing schedd:
>   OS: SL6
>   Condor: 8.4.10
>   restarted via: service
>
> Short story of what happened, logs below:
> - Puppet was told to implement config changes related to history rotation [1]
> - Puppet ran, changed the configs, and notified/restarted the condor service [2]
> - the local master is told to shut down, and shuts down the schedd [3]
>   + we should probably change the permissions here
> - the schedd is told to shut down, and shuts down the shadows [4]
>   + the line "Called preempt( 1 ) for a graceful shutdown" is not present on our test cluster when restarting
> - the shadows stop [5]
>   + our test cluster has no equivalent of the eviction messages
>   + the message indicates that the eviction is unexpected
>   + the signal sent to the shadows is the same on both clusters
> - the remote startd disables the claim and evicts the job [6]
>   + obviously, we do NOT see this on the test cluster
>   + this happens at the same time as the schedd/shadow dies, not after any delay or timeout
>   + the starter just dies
> - the schedd restarts, and shadows try to reconnect [7]
> - the schedd writes its restart report for the reconnects [8]
>   + extracted from another host, but same behaviour
>   + lots of jobs seem to have survived
>
> I'm rather confused that the schedd/shadow seem not to expect the jobs to fail, and even try to reconnect to them. As far as I can tell, it is the schedd/shadow that triggers the jobs to vacate in the first place.
>
> Cheers,
> Max
>
> [1]
> SCHEDD.ROTATE_HISTORY_DAILY = True
> SCHEDD.MAX_HISTORY_LOG = 100 * $(MEGABYTE)
> SCHEDD.MAX_HISTORY_ROTATIONS = 4
>
> [2]
> file {'/etc/condor/config.d/060.condor_schedd.cfg':
>   content => template('middleware_arc_ce/condor/condor_schedd.erb'),
>   mode    => '0644',
>   notify  => Service['condor'],
> }
> # ....
> service {'condor':
>   ensure => 'running',
>   enable => true,
> }
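A side note on [2]: if a change like the history-rotation knobs in [1] does not strictly need a restart, Puppet could notify a reconfig instead of the service, so the schedd never gets stopped for it. A minimal sketch (the resource name and binary path are my guesses, and I have not checked whether these particular knobs are actually picked up on reconfig):

exec { 'condor_reconfig':
  command     => '/usr/bin/condor_reconfig',
  refreshonly => true,
}

file { '/etc/condor/config.d/060.condor_schedd.cfg':
  content => template('middleware_arc_ce/condor/condor_schedd.erb'),
  mode    => '0644',
  notify  => Exec['condor_reconfig'],  # instead of Service['condor']
}

That only sidesteps the issue for routine config changes, of course; it does not explain the evictions on a real restart.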
> [3]
> 01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) PERMISSION DENIED to condor@xxxxxxxxx from host 10.97.13.133 for command 60005 (DC_OFF_GRACEFUL), access level ADMINISTRATOR: reason: ADMINISTRATOR authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.97.13.133,arc-1-kit.gridka.de, hostname size = 1, original ip address = 10.97.13.133
> 01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) DC_AUTHENTICATE: Command not authorized, done!
> 01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) Got SIGTERM. Performing graceful shutdown.
> 01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) Sent SIGTERM to SCHEDD (pid 1712004)
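On the "change the permissions" point: the DC_OFF_GRACEFUL denial in [3] just means there is no ADMINISTRATOR ALLOW entry matching the requesting host (presumably the machine itself); the restart still proceeds via the plain SIGTERM the master receives right afterwards. A minimal sketch of such an entry (the user/domain part is a placeholder and depends on your security setup):

ALLOW_ADMINISTRATOR = $(ALLOW_ADMINISTRATOR), condor@$(UID_DOMAIN)/$(FULL_HOSTNAME)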
> [4]
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Got SIGTERM. Performing graceful shutdown.
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Cron: Killing all jobs
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Called preempt( 1 ) for a graceful shutdown
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Sent signal 9 to <10.1.106.176:17048?addrs=10.1.106.176-17048> [pid 3694145] for job 592252.0
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Sent signal 9 to <10.1.106.117:20778?addrs=10.1.106.117-20778> [pid 3694147] for job 592234.0
>
> [5] Failed Job
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Sent signal 9 to <10.1.103.101:20740?addrs=10.1.103.101-20740> [pid 2312872] for job 598067.0
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS|D_FAILURE) Shadow pid 2312872 died with signal 9 (Killed)
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Match record (slot1@xxxxxxxxxxxxxxxxxxxxx <10.1.103.101:20740?addrs=10.1.103.101-20740> for SGM.default.SC.arc, 598067.0) deleted
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:54:29 (pid:2312872) (D_ALWAYS) (598067.0) (2312872): Got SIGTERM. Performing graceful shutdown.
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:54:29 (pid:2312872) (D_ALWAYS) (598067.0) (2312872): Job 598067.0 is being evicted from slot1@xxxxxxxxxxxxxxxxxxxxx
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:54:29 (pid:2312872) (D_ALWAYS) (598067.0) (2312872): logEvictEvent with unknown reason (-1), not logging.
>
> [6]
> 01/16/17 13:54:29 (pid:23886) (D_ALWAYS) slot1_11: Called deactivate_claim()
> 01/16/17 13:54:29 (pid:23886) (D_ALWAYS) slot1_11: Changing state and activity: Claimed/Busy -> Preempting/Vacating
> 01/16/17 13:54:59 (pid:23886) (D_ALWAYS) slot1_11: Got DEACTIVATE_CLAIM while in Preempting state, ignoring.
> 01/16/17 13:54:59 (pid:23886) (D_ALWAYS) slot1_11: Got RELEASE_CLAIM while in Preempting state, ignoring.
> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: State change: starter exited
> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: State change: No preempting claim, returning to owner
> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: State change: IS_OWNER is false
> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Changing state: Owner -> Unclaimed
> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Changing state: Unclaimed -> Delete
> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Resource no longer needed, deleting
>
> [7] Failed Job Reconnect
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) Initializing a VANILLA shadow for job 598067.0
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Trying to reconnect to disconnected job
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): LastJobLeaseRenewal: 1484571266 Mon Jan 16 13:54:26 2017
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): JobLeaseDuration: 2400 seconds
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): JobLeaseDuration remaining: 2346
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Attempting to locate disconnected starter
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Found starter: <10.1.103.101:20845?addrs=10.1.103.101-20845>
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Attempting to reconnect to starter <10.1.103.101:20845?addrs=10.1.103.101-20845>
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Reconnect SUCCESS: connection re-established
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): File transfer completed successfully.
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Can no longer talk to condor_starter <10.1.103.101:20845>
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): JobLeaseDuration remaining: 2400
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Attempting to locate disconnected starter
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): locateStarter(): ClaimId (<10.1.103.101:20740>#1480163328#11511#[Integrity="YES";Encryption="NO";CryptoMethods="3DES";]d48495f13fb528b07349833922f07601c37379d9) and GlobalJobId ( arc-1-kit.gridka.de#598067.0#1484571243 ) not found
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Reconnect FAILED: Job not found at execution machine
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Exiting with JOB_SHOULD_REQUEUE
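Two remarks on [7]: the window in which a reconnect is possible at all is the job lease (JobLeaseDuration, here the 2400 second default), so if schedd restarts can take longer than that, the lease can be raised per job or pool-wide. A minimal sketch, with the value purely as an example:

# in the submit description file
job_lease_duration = 7200

# or as a schedd-side default
JOB_DEFAULT_LEASE_DURATION = 7200

In this case, though, the lease clearly had not expired (2346 seconds remaining), so the failed reconnects are not a lease problem: the starter was already gone, matching the "Job not found at execution machine" line.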
> [8] /var/log/condor/ScheddRestartReport
> The schedd arc-2-kit.gridka.de restarted at 01/16/17 14:01:02.
> It attempted to reconnect to machines where its jobs may still be running.
> All reconnect attempts have finished.
> Here is a summary of the reconnect attempts:
>
> 1425 total jobs where reconnecting is possible
> 0 reconnects are still being attempted
> 0 reconnects weren't attempted because the lease expired before the schedd restarted
> 377 reconnects failed
> 0 reconnects were interrupted by job removal or other event
> 1048 reconnects succeeded
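One thing I would try to rule out, since the visible difference is SL7/systemctl versus SL6/service: with systemd's default KillMode (control-group), stopping the unit sends the TERM signal to every process in the service's cgroup, i.e. also to the shadows directly, which would match the unexpected "Got SIGTERM ... Job ... is being evicted" lines in [5]. This is an assumption on my side, not something your logs prove; if it turns out to apply, a drop-in along these lines (untested sketch) would limit the stop signal to the master:

# /etc/systemd/system/condor.service.d/killmode.conf
[Service]
KillMode=process

followed by a "systemctl daemon-reload".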