
Re: [HTCondor-users] Hibernate and Cron interference



Hi Todd and Christoph,

Thanks for confirming that it should work. I set all communications to
TCP; at least it made packet sniffing easier ;).


I tried various combinations, and there are still problems waking up
machines. Or, as I suspect, the problem lies in how machines sign off at
hibernation time.

Here is what I have:

render0412 - dynamicslots enabled, cron enabled
render0413 - dynamicslots disabled, cron enabled
render0415 - dynamicslots enabled, cron disabled
render1702 - dynamicslots enabled, cron disabled


With these settings:

ABSENT_REQUIREMENTS = True
EXPIRE_INVALIDATED_ADS = True
ROOSTER_UNHIBERNATE = ( Offline && Unhibernate )

When a computer goes to sleep, the CM adds the Absent flag, and I can
then clear it later and wake the computer. (This is the setup I have in
production, where DynamicSlots and Cron are not used.)

This yields these ClassAds:


## render0412
Absent = true
HibernationLevel = 0
HibernationState = "NONE"
HibernationSupportedStates = "S3,S4,S5"
Name = "slot1@xxxxxxxxxxxxxxxxxxxxxx"
SlotID = 1
SlotType = "Partitionable"
SlotTypeID = 1
SlotWeight = Cpus
State = "Unclaimed"

## render0413
Absent = true
HibernationLevel = 0
HibernationState = "NONE"
HibernationSupportedStates = "S3,S4,S5"
Name = "slot1@xxxxxxxxxxxxxxxxxxxxxx"
SlotID = 1
SlotType = "Static"
SlotTypeID = 0
SlotWeight = Cpus
State = "Unclaimed"

## render0415
Absent = true
HibernationLevel = 5
HibernationState = "S5"
HibernationSupportedStates = "S3,S4,S5"
Name = "slot1@xxxxxxxxxxxxxxxxxxxxxx"
Offline = true
SlotID = 1
SlotType = "Partitionable"
SlotTypeID = 1
SlotWeight = Cpus
State = "Unclaimed"

## render1702
Absent = true
HibernationLevel = 5
HibernationState = "S5"
HibernationSupportedStates = "S3,S4,S5"
Name = "slot1@xxxxxxxxxxxxxxxxxxxxxx"
Offline = true
SlotID = 1
SlotType = "Partitionable"
SlotTypeID = 1
SlotWeight = Cpus
State = "Unclaimed"


After wiping the Absent flag, the Negotiator finds render0415 and
render1702, and Rooster wakes them up. But not the first two EPs, because
they are not marked as Offline :-/.
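
To make the filter concrete, here is a rough Python sketch (not HTCondor code) that mimics ROOSTER_UNHIBERNATE = ( Offline && Unhibernate ) against the four ads above. It assumes Unhibernate is true for every machine, which in practice depends on the Negotiator finding a match, and it treats a missing Offline attribute as "not true". The function name is made up for illustration.

```python
# Illustrative model only: plain dicts stand in for the machine ClassAds above.
ads = {
    "render0412": {"HibernationLevel": 0, "HibernationState": "NONE"},
    "render0413": {"HibernationLevel": 0, "HibernationState": "NONE"},
    "render0415": {"HibernationLevel": 5, "HibernationState": "S5", "Offline": True},
    "render1702": {"HibernationLevel": 5, "HibernationState": "S5", "Offline": True},
}


def rooster_would_wake(ad, unhibernate=True):
    """Hypothetical stand-in for ( Offline && Unhibernate ).

    Assumption: Unhibernate defaults to True here; a missing Offline
    attribute never satisfies the expression.
    """
    return ad.get("Offline") is True and unhibernate


wakeable = sorted(name for name, ad in ads.items() if rooster_would_wake(ad))
print(wakeable)  # ['render0415', 'render1702']
```

Only the two ads that carry Offline = true pass the filter, which matches what I see.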


But with these settings:

ABSENT_REQUIREMENTS = ( (HibernationLevel?:0) == 0 )
EXPIRE_INVALIDATED_ADS = True
ROOSTER_UNHIBERNATE = ( Offline && Unhibernate )

the EPs disappear from both "condor_status -absent" and "condor_status".
The collector log says it adds the ad to the persistent store, but also says it
removes it right after:

15:39:35 Got QUERY_ANY_ADS
15:39:35 QueryWorker: forked new high priority worker with id 1838207 ( max 16 active 2 pending 0 )
15:39:35 QueryWorker: Child 1838206 done
15:39:35 Query after modification: *((((MyType == "Submitter")) || ((MyType == "Machine")))) && (Absent =!= True)*
15:39:35 (Sending 4 ads in response to query)
15:39:35 Query info: matched=4; skipped=8; query_time=0.002018; send_time=0.007333; type=Any; requirements={((((MyType == "Submitter")) || ((MyType == "Machine")))) && (Absent =!= true)}; locate=0; limit=0; from=COLLECTOR; peer=<172.22.0.3:5861>; projection={}; filter_private_attrs=0
15:39:35 QueryWorker: Child 1838207 done
15:39:35 AccountingAd  : Updating ... "< <none>htcondor-slots.sta.buf.com >"
15:39:35 In OfflineCollectorPlugin::update ( 77 )
15:39:35 ScheddAd     : Updating ... "< htcondor-slots.sta.buf.com , 172.22.0.3 >"
15:39:35 In OfflineCollectorPlugin::update ( 1 )
15:39:36 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxxxxx , 172.22.1.174 >"
15:39:36 Want private ads, but no socket given!
15:39:36 In OfflineCollectorPlugin::update ( 60 )
15:39:36 Machine ad lifetime: 2147483647
15:39:36 Added ad to persistent store key=<slot1@xxxxxxxxxxxxxxxxxxxxxx,172.22.1.174>
15:39:36 Got INVALIDATE_MASTER_ADS
15:39:36 In OfflineCollectorPlugin::expire()
15:39:36               **** Removed(1) stale ad(s): "< render1702.sta.buf.com >"
15:39:36 (Invalidated 1 ads)
15:39:36 In OfflineCollectorPlugin::update ( 15 )

After this, there is no further information about the EP in the log file.
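
For reference, this is how I read the second ABSENT_REQUIREMENTS expression, written as a small Python sketch rather than real ClassAd evaluation. It assumes "?:" falls back to 0 when HibernationLevel is undefined, and it reuses the HibernationLevel values from the ads listed earlier; the values actually advertised under this second configuration may differ.

```python
# Illustrative model of ( (HibernationLevel?:0) == 0 ); not HTCondor code.
def absent_requirements(ad):
    # Assumption: "?:" yields 0 when HibernationLevel is missing from the ad.
    return ad.get("HibernationLevel", 0) == 0


# HibernationLevel values copied from the ClassAds shown earlier.
levels = {"render0412": 0, "render0413": 0, "render0415": 5, "render1702": 5}

result = {
    name: absent_requirements({"HibernationLevel": level})
    for name, level in levels.items()
}
for name, value in result.items():
    print(name, value)
```

Under these assumptions the expression is true only for ads whose HibernationLevel is 0 or undefined, so an ad advertising level 5 would not satisfy it.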


I don't know where to look next :-/. Maybe there's something obvious I'm
missing!


I will try to write a more accurate summary of the settings; I understand
this message is difficult to follow.


Thanks for any clues!

-- 
Charles