Hi,
Claim leases expire frequently in my pool. When that happens, the
condor_startd drops the claim and the running jobs are killed.
1. As per the manual, "The length of the claim lease is the job's ClassAd
attribute JobLeaseDuration." For my job JobLeaseDuration = 1200, but the
claim lease duration is set to 2400.
2. Is the claim dropped based on MAX_CLAIM_ALIVES_MISSED, or because the
startd did not receive any ALIVE packets within the claim lease duration?
My pool has the following configuration:
ALIVE_INTERVAL = 600 (default 300)
REQUEST_CLAIM_TIMEOUT = default (30 min)
MAX_CLAIM_ALIVES_MISSED = default (6), at the startd
I have copied the relevant part of the StartLog below.
8/6 16:27:55 slot1.50: State change: claiming protocol successful
8/6 16:27:55 slot1.50: Changing state: Owner -> Claimed
8/6 16:27:55 slot1.50: Started ClaimLease timer (172447) w/ 2400 second
lease duration
8/6 16:27:55 slot1.50: Got activate_claim request from shadow
(<10.207.100.66:9978>)
8/6 16:27:55 slot1.50: Read request ad and starter from shadow.
8/6 16:27:56 slot1.50: JobLeaseDuration defined in job ClassAd: 1200
8/6 16:27:56 slot1.50: Resetting ClaimLease timer (172447) with new duration
8/6 16:27:56 slot1.50: About to Create_Process "condor_starter -f -a
slot1.50 gridprime.pesgrid.wipro.com"
8/6 16:27:56 slot1.50: State change: claim-activation protocol successful
8/6 16:27:56 slot1.50: Changing activity: Idle -> Busy
8/6 17:22:05 slot1.50: State change: claim lease expired (condor_schedd
gone?)
8/6 17:22:05 slot1.50: Changing state and activity: Claimed/Busy ->
Preempting/Killing
8/6 17:22:05 slot1.50: In Starter::kill() with pid 15687, sig 3 (SIGQUIT)
8/6 17:22:05 slot1.50: Got ALIVE while in Preempting state, ignoring.
8/6 17:23:11 slot1.50: State change: No preempting claim, returning to owner
8/6 17:23:11 slot1.50: Changing state and activity: Preempting/Killing
-> Owner/Idle
8/6 17:23:11 slot1.50: State change: IS_OWNER is false
8/6 17:23:11 slot1.50: Changing state: Owner -> Unclaimed
8/6 17:23:11 slot1.50: Changing state: Unclaimed -> Delete
8/6 17:23:11 slot1.50: Resource no longer needed, deleting
8/6 17:25:27 slot1.50: New machine resource of type -1 allocated
8/6 17:25:29 slot1.50: Rank of this claim is: 0.000000
8/6 17:25:29 slot1.50: Request accepted.
8/6 17:25:29 slot1.50: State change: claiming protocol successful
8/6 17:25:29 slot1.50: Changing state: Owner -> Claimed
8/6 17:25:29 slot1.50: Started ClaimLease timer (176480) w/ 2400 second
lease duration
8/6 17:25:30 slot1.50: Got activate_claim request from shadow
(<10.207.100.66:9845>)
8/6 17:25:30 slot1.50: Read request ad and starter from shadow.
8/6 17:25:31 slot1.50: JobLeaseDuration defined in job ClassAd: 1200
8/6 17:25:31 slot1.50: Resetting ClaimLease timer (176480) with new duration
8/6 17:25:31 slot1.50: About to Create_Process "condor_starter -f -a
slot1.50 gridprime.pesgrid.wipro.com"
8/6 17:25:32 slot1.50: State change: claim-activation protocol successful
8/6 17:25:32 slot1.50: Changing activity: Idle -> Busy
8/6 18:15:28 slot1.50: State change: claim lease expired (condor_schedd
gone?)
8/6 18:15:28 slot1.50: Changing state and activity: Claimed/Busy ->
Preempting/Killing
8/6 18:15:28 slot1.50: In Starter::kill() with pid 17698, sig 3 (SIGQUIT)
8/6 18:15:30 slot1.50: Got ALIVE while in Preempting state, ignoring.
8/6 18:17:12 slot1.50: State change: No preempting claim, returning to owner
8/6 18:17:12 slot1.50: Changing state and activity: Preempting/Killing
-> Owner/Idle
8/6 18:17:12 slot1.50: State change: IS_OWNER is false
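
To show how long each claim actually survived, here is a rough sketch that computes the gap between the "Resetting ClaimLease timer" line and the matching "claim lease expired" line for the two claims above. The helper name elapsed_seconds is my own, and the year is a placeholder because the StartLog timestamps do not include one.

```python
from datetime import datetime

# Placeholder year: the StartLog timestamps carry only month/day and time.
ASSUMED_YEAR = "2000"


def elapsed_seconds(start, end):
    """Return whole seconds between two StartLog timestamps like '8/6 16:27:56'."""
    fmt = "%Y %m/%d %H:%M:%S"
    t0 = datetime.strptime(ASSUMED_YEAR + " " + start, fmt)
    t1 = datetime.strptime(ASSUMED_YEAR + " " + end, fmt)
    return int((t1 - t0).total_seconds())


# (timer reset, lease expired) pairs copied from the log excerpt above.
claims = [
    ("8/6 16:27:56", "8/6 17:22:05"),
    ("8/6 17:25:31", "8/6 18:15:28"),
]

for reset, expired in claims:
    print(reset, "->", expired, ":", elapsed_seconds(reset, expired), "s")
```

This gives 3249 s for the first claim and 2997 s for the second, so in both cases the lease expired later than 2400 seconds after the timer was reset.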
by
Johnson