Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] What makes the condor_startd stucked?

Date: Wed, 11 Feb 2015 10:46:27 +0000
From: qing <gang.qin@xxxxxxxxxxxxx>
Subject: [HTCondor-users] What makes the condor_startd stucked?

Dear Condor Expert:

Recently we found that from time to time 'condor_status -state' does not report the memory of an execute node correctly: the last digit of the memory value is missing. In the following example, 'condor_status -state -wide' reports 21175 while 'condor_status' reports 211750, so a '0' is missing. Meanwhile, the free partitionable slot stays in the 'Matched' state, when it should be 'Unclaimed'.


  node064:~# condor_status -state -wide | grep node064

slot1@xxxxxxxxxxxxxxxxxxxxxxx 54 21175 0.180 0+00:01:20 Matched 0+05:37:30 Idle 0+05:37:30
slot1_12@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+09:17:05 Busy 0+05:37:10
slot1_17@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+11:31:54 Busy 0+05:37:10
slot1_2@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+05:36:52 Busy 0+05:36:11
slot1_36@xxxxxxxxxxxxxxxxxxxxxxx 1 4000 1.000 53+22:26:55 Claimed 0+16:26:48 Busy 0+05:37:10
slot1_47@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+09:16:30 Busy 0+05:37:10
slot1_53@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+09:16:30 Busy 0+05:37:10
slot1_57@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+10:58:15 Busy 0+05:37:10
slot1_61@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+11:00:09 Busy 0+05:37:10
slot1_64@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+10:59:43 Busy 0+05:37:10
slot1_8@xxxxxxxxxxxxxxxxxxxxxxx 1 4000 1.000 53+22:26:55 Claimed 0+21:33:27 Busy 0+05:37:10


node064:~# condor_status  | grep node064

slot1@xxxxxxxxxxxx LINUX X86_64 Matched Idle 0.270 211750 0+05:40:40
slot1_12@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40
slot1_17@xxxxxxxxx LINUX X86_64 Claimed Busy 0.380 3000 0+05:41:40
slot1_2@xxxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:40:41
slot1_36@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 4000 0+05:41:40
slot1_47@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40
slot1_53@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40
slot1_57@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40
slot1_61@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40
slot1_64@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40
slot1_8@xxxxxxxxxx LINUX X86_64 Claimed Busy 1.000 4000 0+05:41:40

From the StartLog we can see that the last state change of slot1 is at about 04:22.

node064:~# cat /var/log/condor/StartLog | grep slot1 | grep -v '_' | tail -n 15

02/11/15 02:56:30 slot1: State change: match notification protocol successful
02/11/15 02:56:30 slot1: Changing state: Unclaimed -> Matched
02/11/15 02:56:30 slot1: Changing state: Matched -> Unclaimed
02/11/15 03:01:36 slot1: Received match <10.141.0.64:43678>#1418988885#44480#...
02/11/15 03:01:36 slot1: State change: match notification protocol successful
02/11/15 03:01:36 slot1: Changing state: Unclaimed -> Matched
02/11/15 03:01:36 slot1: Changing state: Matched -> Unclaimed
02/11/15 03:01:41 slot1: State change: entering Drained state
02/11/15 03:01:41 slot1: Changing state and activity: Unclaimed/Idle -> Drained/Retiring
02/11/15 04:21:46 slot1: State change: slot is no longer draining.
02/11/15 04:21:46 slot1: Changing state and activity: Drained/Retiring -> Owner/Idle
02/11/15 04:21:46 slot1: Changing state: Owner -> Unclaimed
02/11/15 04:22:46 slot1: Received match <10.141.0.64:43678>#1418988885#44490#...
02/11/15 04:22:46 slot1: State change: match notification protocol successful
02/11/15 04:22:46 slot1: Changing state: Unclaimed -> Matched
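
For anyone who wants to repeat this check on other nodes, here is a rough Python sketch that pulls the last state transition of a slot out of StartLog-style lines. It assumes the timestamp layout seen in the excerpt above (MM/DD/YY HH:MM:SS); the function name last_transition and the sample list are only illustrative and not part of HTCondor.

```python
import re
from datetime import datetime

# Matches StartLog lines of the two shapes seen in the excerpt, e.g.
#   02/11/15 04:22:46 slot1: Changing state: Unclaimed -> Matched
#   02/11/15 04:21:46 slot1: Changing state and activity: Drained/Retiring -> Owner/Idle
TRANSITION = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\S+): "
    r"Changing state(?: and activity)?: (\S+)\s*->\s*(\S+)"
)


def last_transition(lines, slot="slot1"):
    """Return (timestamp, old_state, new_state) of the slot's final transition, or None."""
    last = None
    for line in lines:
        m = TRANSITION.match(line.strip())
        # Exact name comparison, so dynamic slots such as slot1_2 are ignored.
        if m and m.group(2) == slot:
            stamp = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
            last = (stamp, m.group(3), m.group(4))
    return last


# Sample lines taken from the log excerpt in this message.
sample = [
    "02/11/15 04:21:46 slot1: Changing state and activity: Drained/Retiring -> Owner/Idle",
    "02/11/15 04:21:46 slot1: Changing state: Owner -> Unclaimed",
    "02/11/15 04:22:46 slot1: State change: match notification protocol successful",
    "02/11/15 04:22:46 slot1: Changing state: Unclaimed -> Matched",
]

when, old, new = last_transition(sample)
print(when, old, "->", new)
```

To run it against a real log, the lines of /var/log/condor/StartLog can be passed in place of the sample list.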

On the central service side we also see:

svr021:~# grep node064 /var/log/condor/CollectorLog | grep Inserting | tail -n 10

02/11/15 02:56:35 StartdAd : Inserting ** "<slot1_58@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 02:56:35 StartdPvtAd : Inserting ** "<slot1_58@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 03:01:41 StartdAd : Inserting ** "<slot1_59@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 03:01:41 StartdPvtAd : Inserting ** "<slot1_59@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:25 StartdAd : Inserting ** "<slot1_2@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:25 StartdPvtAd : Inserting ** "<slot1_2@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:45 StartdAd : Inserting ** "<slot1_4@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:45 StartdPvtAd : Inserting ** "<slot1_4@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:49 StartdAd : Inserting ** "<slot1_5@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:49 StartdPvtAd : Inserting ** "<slot1_5@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
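
The same kind of check can be sketched for the collector side. The Python snippet below records, per ad name, the time of the last "Inserting" line for a StartdAd, so that the newest insert can be compared with the last transition seen in the StartLog. It assumes the line layout shown in the excerpt above; the name last_inserts and the sample list are illustrative only.

```python
import re
from datetime import datetime

# Matches CollectorLog lines of the shape seen in the excerpt, e.g.
#   02/11/15 04:22:49 StartdAd : Inserting ** "<slot1_5@... , 10.141.0.64 >"
# StartdPvtAd lines are deliberately not matched.
INSERT = re.compile(
    r'^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) StartdAd\s*: '
    r'Inserting \*\* "<\s*(\S+?)\s*,'
)


def last_inserts(lines):
    """Map each ad name to the timestamp of its most recent 'Inserting' line."""
    latest = {}
    for line in lines:
        m = INSERT.match(line.strip())
        if m:
            latest[m.group(2)] = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
    return latest


# Sample lines taken from the log excerpt in this message (host names are masked).
sample = [
    '02/11/15 04:22:45 StartdAd : Inserting ** "<slot1_4@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"',
    '02/11/15 04:22:45 StartdPvtAd : Inserting ** "<slot1_4@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"',
    '02/11/15 04:22:49 StartdAd : Inserting ** "<slot1_5@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"',
    '02/11/15 04:22:49 StartdPvtAd : Inserting ** "<slot1_5@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"',
]

latest = last_inserts(sample)
newest = max(latest.values())
print(newest)
```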

It seems to me that the startd stopped sending information to the collector at 04:22. Usually this can be fixed by simply restarting the startd daemon. But what could lead to such behavior?


Cheers,
Gang
