[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Hibernate and Cron interference



Dear fellow scavengers,

I followed the instructions (thanks for your time !), and it still
fails, but in an interresting way.


Here is what I have setup on the CM :

ABSENT_REQUIREMENTS = (HibernationLevel?:0) == 0
EXPIRE_INVALIDATED_ADS = True
# 604800s = 7 days
ABSENT_EXPIRE_ADS_AFTER = 604800
OFFLINE_EXPIRE_ADS_AFTER = 604800
COLLECTOR_PERSISTENT_AD_LOG = /vol/condor/offline_ads/PersistentAdLog
ROOSTER_INTERVAL = 180
ROOSTER_UNHIBERNATE = (Offline && Unhibernate)


By looking closer at the collector (ie making requests every 0.5
seconds), I found that the Collector first holds (say at 10:10:10) a
ClassAd for my test EP with :

(Absent is not showing)
HibernationLevel = 5
HibernationState = "S5"
HibernationSupportedStates = "S3,S4,S5"
Offline = true

and then, 10 seconds later, the ClassAd changes to :

Absent = true
HibernationLevel = 0
HibernationState = "NONE"
HibernationSupportedStates = "S3,S4,S5"
(Offline is not present anymore)


If I look at the network traces, there is a ClassAd containing HibernationLevel = 5 and
HibernationState = "S5" at, says 10:10:10, and the TCP connection gets
closed at 10:10:20 (10 seconds later).


So I understand that what happens at 10:10:10 is the expected behavior,
but I don't understand why the Collector changes its mind and finally
marks the EP as Absent 10 seconds later.


I was able to reproduce this like 5-6 times today.


Below is what it looks like at the CollectorLog level.

At 12:14:24, the EP is showing as Offline, and at 12:14:34 is shows as
Absent.

The lines "condor_read(): Socket closed when trying to read 5 bytes from
<172.22.0.145:46381> in non-blocking mode" suggest the Collector maybe
expect the EP to be reachable (but it's powered off at this point)


Here "htcondor-slots" is my test CM, and render0412 is my test EP.


12:14:14 AccountingAd  : Updating ... "< <none>htcondor-slots.sta.buf.com >"
12:14:14 In OfflineCollectorPlugin::update ( 77 )
12:14:23 ScheddAd     : Updating ... "< htcondor-slots.sta.buf.com , 172.22.0.3 >"
12:14:23 In OfflineCollectorPlugin::update ( 1 )
12:14:23 Getting monitoring info for pid 8735
12:14:24 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxxxxx , 172.22.0.145 >"
12:14:24 Want private ads, but no socket given!
12:14:24 In OfflineCollectorPlugin::update ( 60 )
12:14:24 Machine ad lifetime: 604800
12:14:24 Added ad to persistent store key=<slot1@xxxxxxxxxxxxxxxxxxxxxx,172.22.0.145>
12:14:24 Got QUERY_STARTD_PVT_ADS
12:14:24 QueryWorker: forked new high priority worker with id 9049 ( max 16 active 1 pending 0 )
12:14:24 Query after modification: *(true) && (Absent =!= True)*
12:14:24 (Sending 1 ads in response to query)
12:14:24 Query info: matched=1; skipped=0; query_time=0.001208; send_time=0.000454; type=MachinePrivate; requirements={(true) && (Absent =!= true)}; locate=0; limit=0; from=COLLECTOR; peer=<172.22.0.3:16509>; projection={}; filter_private_attrs=0
12:14:24 QueryWorker: Child 9049 done
12:14:24 Got QUERY_ANY_ADS
12:14:24 QueryWorker: forked new high priority worker with id 9050 ( max 16 active 1 pending 0 )
12:14:24 Query after modification: *((((MyType == "Submitter")) || ((MyType == "Machine")))) && (Absent =!= True)*
12:14:24 (Sending 2 ads in response to query)
12:14:24 Query info: matched=2; skipped=6; query_time=0.001359; send_time=0.001153; type=Any; requirements={((((MyType == "Submitter")) || ((MyType == "Machine")))) && (Absent =!= true)}; locate=0; limit=0; from=COLLECTOR; peer=<172.22.0.3:19011>; projection={}; filter_private_attrs=0
12:14:24 QueryWorker: Child 9050 done
12:14:24 AccountingAd  : Updating ... "< <none>htcondor-slots.sta.buf.com >"
12:14:24 In OfflineCollectorPlugin::update ( 77 )
12:14:25 Got INVALIDATE_MASTER_ADS
12:14:25 In OfflineCollectorPlugin::expire()
12:14:25               **** Removed(1) stale ad(s): "< render0412.sta.buf.com >"
12:14:25 (Invalidated 1 ads)
12:14:25 In OfflineCollectorPlugin::update ( 15 )
12:14:33 ScheddAd     : Updating ... "< htcondor-slots.sta.buf.com , 172.22.0.3 >"
12:14:33 In OfflineCollectorPlugin::update ( 1 )
12:14:34 Got QUERY_STARTD_PVT_ADS
12:14:34 QueryWorker: forked new high priority worker with id 9051 ( max 16 active 1 pending 0 )
12:14:34 Query after modification: *(true) && (Absent =!= True)*
12:14:34 (Sending 1 ads in response to query)
12:14:34 Query info: matched=1; skipped=0; query_time=0.001779; send_time=0.001715; type=MachinePrivate; requirements={(true) && (Absent =!= true)}; locate=0; limit=0; from=COLLECTOR; peer=<172.22.0.3:11769>; projection={}; filter_private_attrs=0
12:14:34 QueryWorker: Child 9051 done
12:14:34 Got QUERY_ANY_ADS
12:14:34 QueryWorker: forked new high priority worker with id 9052 ( max 16 active 1 pending 0 )
12:14:34 Query after modification: *((((MyType == "Submitter")) || ((MyType == "Machine")))) && (Absent =!= True)*
12:14:34 (Sending 2 ads in response to query)
12:14:34 Query info: matched=2; skipped=5; query_time=0.001341; send_time=0.001606; type=Any; requirements={((((MyType == "Submitter")) || ((MyType == "Machine")))) && (Absent =!= true)}; locate=0; limit=0; from=COLLECTOR; peer=<172.22.0.3:6439>; projection={}; filter_private_attrs=0
12:14:34 QueryWorker: Child 9052 done
12:14:34 AccountingAd  : Updating ... "< <none>htcondor-slots.sta.buf.com >"
12:14:34 In OfflineCollectorPlugin::update ( 77 )
12:14:35 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxxxxx , 172.22.0.145 >"
12:14:35 StartdPvtAd  : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxxxxx , 172.22.0.145 >"
12:14:35 In OfflineCollectorPlugin::update ( 0 )
12:14:35 Removed ad from persistent store key=<slot1@xxxxxxxxxxxxxxxxxxxxxx,172.22.0.145>
12:14:35 Got INVALIDATE_STARTD_ADS
12:14:35 In OfflineCollectorPlugin::expire()
12:14:35 Added ad to persistent store key=<slot1@xxxxxxxxxxxxxxxxxxxxxx,172.22.0.145>
12:14:35 (Invalidated 0 ads)
12:14:35 In OfflineCollectorPlugin::expire()
12:14:35 OfflineCollectorPlugin::persistentStoreRemove: Replacing existing offline ad.
12:14:35 Added ad to persistent store key=<slot1@xxxxxxxxxxxxxxxxxxxxxx,172.22.0.145>
12:14:35 (Invalidated 0 ads)
12:14:35 In OfflineCollectorPlugin::update ( 13 )
12:14:35 Removed ad from persistent store key=<slot1@xxxxxxxxxxxxxxxxxxxxxx,172.22.0.145>
12:14:35 condor_read(): Socket closed when trying to read 5 bytes from <172.22.0.145:46381> in non-blocking mode
12:14:35 IO: EOF reading packet header
12:14:35 DaemonCore: Can't receive command request from 172.22.0.145 (perhaps a timeout?)
12:14:35 condor_read(): Socket closed when trying to read 5 bytes from <172.22.0.145:43421> in non-blocking mode
12:14:35 IO: EOF reading packet header
12:14:35 DaemonCore: Can't receive command request from 172.22.0.145 (perhaps a timeout?)


I'm totally lost, it feels I'm real close to a working setup.

I have full network traces at hand if that sounds useful :-/.


Thanks,

-- 
Charles