Mailing List Archives
	Authenticated access
	
	
     | 
    
	 
	 
     | 
    
	
	 
     | 
  
 
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[HTCondor-users] What makes the condor_startd stucked?
- Date: Wed, 11 Feb 2015 10:46:27 +0000
 
- From: qing <gang.qin@xxxxxxxxxxxxx>
 
- Subject: [HTCondor-users] What makes the condor_startd stucked?
 
Dear Condor Expert:
  Recently we found that from time to time 'condor_status -state' does 
not report the memory of a execute node correctly, the last number in 
the memory is missing. In the following example you will see that 
'condor_status -state -wide' reports that it's 21175 but 'condor_status' 
says it's 211750, a '0' is missing.   Meanwhile the free partionable  
slot is always at 'Matched' status, while it should be 'Unclaimed'.
  node064:~# condor_status -state -wide | grep node064
slot1@xxxxxxxxxxxxxxxxxxxxxxx       54 21175  0.180   0+00:01:20 
Matched   0+05:37:30 Idle    0+05:37:30
slot1_12@xxxxxxxxxxxxxxxxxxxxxxx     1  3000  1.000  53+22:26:55 
Claimed   0+09:17:05 Busy    0+05:37:10
slot1_17@xxxxxxxxxxxxxxxxxxxxxxx     1  3000  1.000  53+22:26:55 
Claimed   0+11:31:54 Busy    0+05:37:10
slot1_2@xxxxxxxxxxxxxxxxxxxxxxx      1  3000  1.000  53+22:26:55 
Claimed   0+05:36:52 Busy    0+05:36:11
slot1_36@xxxxxxxxxxxxxxxxxxxxxxx     1  4000  1.000  53+22:26:55 
Claimed   0+16:26:48 Busy    0+05:37:10
slot1_47@xxxxxxxxxxxxxxxxxxxxxxx     1  3000  1.000  53+22:26:55 
Claimed   0+09:16:30 Busy    0+05:37:10
slot1_53@xxxxxxxxxxxxxxxxxxxxxxx     1  3000  1.000  53+22:26:55 
Claimed   0+09:16:30 Busy    0+05:37:10
slot1_57@xxxxxxxxxxxxxxxxxxxxxxx     1  3000  1.000  53+22:26:55 
Claimed   0+10:58:15 Busy    0+05:37:10
slot1_61@xxxxxxxxxxxxxxxxxxxxxxx     1  3000  1.000  53+22:26:55 
Claimed   0+11:00:09 Busy    0+05:37:10
slot1_64@xxxxxxxxxxxxxxxxxxxxxxx     1  3000  1.000  53+22:26:55 
Claimed   0+10:59:43 Busy    0+05:37:10
slot1_8@xxxxxxxxxxxxxxxxxxxxxxx      1  4000  1.000  53+22:26:55 
Claimed   0+21:33:27 Busy    0+05:37:10
node064:~# condor_status  | grep node064
slot1@xxxxxxxxxxxx LINUX      X86_64 Matched   Idle      0.270 211750  
0+05:40:40
slot1_12@xxxxxxxxx LINUX      X86_64 Claimed   Busy      1.000 3000 
0+05:41:40
slot1_17@xxxxxxxxx LINUX      X86_64 Claimed   Busy      0.380 3000 
0+05:41:40
slot1_2@xxxxxxxxxx LINUX      X86_64 Claimed   Busy      1.000 3000 
0+05:40:41
slot1_36@xxxxxxxxx LINUX      X86_64 Claimed   Busy      1.000 4000 
0+05:41:40
slot1_47@xxxxxxxxx LINUX      X86_64 Claimed   Busy      1.000 3000 
0+05:41:40
slot1_53@xxxxxxxxx LINUX      X86_64 Claimed   Busy      1.000 3000 
0+05:41:40
slot1_57@xxxxxxxxx LINUX      X86_64 Claimed   Busy      1.000 3000 
0+05:41:40
slot1_61@xxxxxxxxx LINUX      X86_64 Claimed   Busy      1.000 3000 
0+05:41:40
slot1_64@xxxxxxxxx LINUX      X86_64 Claimed   Busy      1.000 3000 
0+05:41:40
slot1_8@xxxxxxxxxx LINUX      X86_64 Claimed   Busy      1.000 4000 
0+05:41:40
 From the StartLog we can see that the last status change of slot1 one 
is at ~ 4:22.
node064:~# cat /var/log/condor/StartLog | grep slot1 | grep -v '_' | 
tail -n 15
02/11/15 02:56:30 slot1: State change: match notification protocol 
successful
02/11/15 02:56:30 slot1: Changing state: Unclaimed -> Matched
02/11/15 02:56:30 slot1: Changing state: Matched -> Unclaimed
02/11/15 03:01:36 slot1: Received match 
<10.141.0.64:43678>#1418988885#44480#...
02/11/15 03:01:36 slot1: State change: match notification protocol 
successful
02/11/15 03:01:36 slot1: Changing state: Unclaimed -> Matched
02/11/15 03:01:36 slot1: Changing state: Matched -> Unclaimed
02/11/15 03:01:41 slot1: State change: entering Drained state
02/11/15 03:01:41 slot1: Changing state and activity: Unclaimed/Idle -> 
Drained/Retiring
02/11/15 04:21:46 slot1: State change: slot is no longer draining.
02/11/15 04:21:46 slot1: Changing state and activity: Drained/Retiring 
-> Owner/Idle
02/11/15 04:21:46 slot1: Changing state: Owner -> Unclaimed
02/11/15 04:22:46 slot1: Received match 
<10.141.0.64:43678>#1418988885#44490#...
02/11/15 04:22:46 slot1: State change: match notification protocol 
successful
02/11/15 04:22:46 slot1: Changing state: Unclaimed -> Matched
  On the central service side we also have.
svr021:~# grep node064 /var/log/condor/CollectorLog | grep Inserting | 
tail -n 10
02/11/15 02:56:35 StartdAd     : Inserting ** "< 
slot1_58@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 02:56:35 StartdPvtAd  : Inserting ** "< 
slot1_58@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 03:01:41 StartdAd     : Inserting ** "< 
slot1_59@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 03:01:41 StartdPvtAd  : Inserting ** "< 
slot1_59@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:25 StartdAd     : Inserting ** "< 
slot1_2@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:25 StartdPvtAd  : Inserting ** "< 
slot1_2@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:45 StartdAd     : Inserting ** "< 
slot1_4@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:45 StartdPvtAd  : Inserting ** "< 
slot1_4@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:49 StartdAd     : Inserting ** "< 
slot1_5@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:49 StartdPvtAd  : Inserting ** "< 
slot1_5@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
  Seems to me that the startd stopped sending information to collector 
at 04:22. Usually this can be fixed by a simple restart of the startd 
daemon. But what could lead to such a behavior?
  Cheers,Gang