
Re: [Condor-users] Condor with power saving PCs



As promised some info on how we use Condor with power saving PCs
(long).

The setup here is that we have around 300 PCs from our managed
Windows (Win XP) service in our pool. There is also a Solaris
central manager and a Solaris submit host. The manager can be
used for submitting jobs via a web interface. It is designed for
running specific codes and handles relatively small numbers
of long jobs (possibly weeks). The submit host is intended for
"power" users who want to submit large numbers (20,000+) of
short jobs from the command line. The execution hosts can run
Condor jobs at any time provided there is no recent keyboard/mouse
activity.

Many of the PCs in the Teaching Centres (classrooms) use power saving,
and up to recently we've turned this off for PCs used in the Condor
pool. This is obviously wasteful, particularly as now if the machines
aren't running Condor jobs then they are lying idle (since most of
the students are away).

I'm not 100% up on how the Windows side of things works, but as I
understand it there is a system process running on the PCs which
can signal when there has been more than 30 min of inactivity and run
a batch script as the system user. We use this script to check if anyone
is logged in and shut down the PC if not. As far as I know the only part of
the PC that remains powered up is the NIC. If a local user wants to log in
to the machine they have to hit the power button to start a boot from
cold.

Crucially the NICs have a "wake on LAN" function, which means that
PCs can be started up remotely simply by sending them a special "magic"
packet over the network (this is useful in getting them to install
patches etc). We've used this functionality to wake up machines in the
pool when there is demand from Condor users.
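
As an illustration only (this is not the wakeonlan program mentioned
in this post, whose internals are not shown here), the following Python
sketch builds the standard wake-on-LAN "magic packet": six 0xFF bytes
followed by the target MAC address repeated sixteen times. The function
names and the choice of UDP port 9 are assumptions made for the example.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte wake-on-LAN payload for a MAC such as '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Six 0xFF bytes, then the MAC address repeated sixteen times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast_address: str, port: int = 9) -> None:
    """Send the magic packet as a UDP broadcast (port 9 is an assumed, common choice)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast_address, port))
```

The packet is broadcast rather than sent to the PC's own IP address
because a powered-down machine cannot answer ARP requests, which is why
the subnet broadcast address is needed as well as the MAC address.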

One of the first problems we encountered was that the shutdown script
does not seem to detect Condor as a logged in user. So if Condor jobs
are running it will simply shut the machine down. To get around this we checked
for the presence of the condor job executable in the spool area (under c:\tmp
in our case). 

Now we had a script that didn't shut down the machine while a job was running
but equally didn't shut it down once the Condor job running on it had gone
away. To get around that we made the script loop, checking for the executable
until it disappeared (signalling the end of the job). Then it does a shutdown.

The problem with this is that it doesn't take account of situations where
a job is evicted or the spool area doesn't get wiped properly at the end
of the job. To get around that we do a safety check to see if there is a real
logged-on user before shutting down. If there is a real user logged in then the
script exits without taking any action.
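
The decision made on each pass of the loop can be summarised in a few
lines. This is only a Python sketch of the logic described above, not
the batch script itself; the names Action and next_action are made up
for the example.

```python
from enum import Enum


class Action(Enum):
    EXIT = "exit without shutting down"
    WAIT = "sleep, then check again"
    SHUTDOWN = "power off"


def next_action(user_logged_in: bool, condor_job_present: bool) -> Action:
    """Decide what one pass of the power-saving check should do."""
    if user_logged_in:
        # A real user always wins, even if a stale job executable is lying around.
        return Action.EXIT
    if condor_job_present:
        # A job appears to be running: leave the machine up and look again later.
        return Action.WAIT
    return Action.SHUTDOWN
```

Checking for a logged-on user first is what stops a stale executable in
the spool area from mattering once someone sits down at the machine.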

I've enclosed the batch script at the end.

That dealt with (most of) the shutting down bit. To get the machines to wake up we
have a cron job which runs on the submit host, checking the queue state
periodically. If there are more idle jobs than there are idle execute hosts
we wake up all the hosts that have shut down. This is a bit of a brute force
method but seems to work here. Most of our power users saturate the pool quickly
so we may as well wake up the whole lot rather than try to wake up a subset of
them to satisfy the demand. Of course if they don't get used they will
automatically shut down after 30 min anyway.
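
The wake-up test itself is a single comparison. The Python sketch below
is illustrative (should_wake_pool and its margin parameter are names
invented here); note that the enclosed Perl script applies a margin of
10 rather than waking the pool as soon as idle jobs outnumber idle hosts.

```python
def should_wake_pool(idle_jobs: int, idle_hosts: int, margin: int = 0) -> bool:
    """True when idle jobs outnumber idle execute hosts by more than `margin`."""
    return idle_jobs - idle_hosts > margin
```

A non-zero margin avoids waking every teaching centre for a handful of
queued jobs that the hosts already running could probably absorb.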

I've enclosed the perl script for this at the end. The wakeonlan executable
which sends out the wake up packets was compiled from C source off the 
web (just google).

To make things a bit easier we wake up the machines in batches corresponding
to each teaching centre. For each centre there is a file containing the MAC
addresses of all the machines in it (you will need this along with the broadcast
address of the PCs).
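
The layout of the MAC address files is not spelled out in this post. As
a sketch, assuming one address per line (with blank lines and lines
starting with # ignored, both assumptions of this example), a file
could be read and checked in Python like this; parse_mac_addresses is a
hypothetical helper name.

```python
import re

# Six pairs of hex digits separated by ':' or '-'.
MAC_PATTERN = re.compile(r"^[0-9A-Fa-f]{2}([:-][0-9A-Fa-f]{2}){5}$")


def parse_mac_addresses(lines):
    """Return normalised MAC addresses from an iterable of text lines."""
    macs = []
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comments
        if not MAC_PATTERN.match(entry):
            raise ValueError(f"malformed MAC address: {entry!r}")
        # Normalise to lower case with ':' separators.
        macs.append(entry.lower().replace("-", ":"))
    return macs
```

An open file object can be passed straight in, since iterating over a
file yields its lines.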

Having said all of that, there was still another niggling problem. We originally
had Condor set up to run jobs after > 15 mins of keyboard/mouse inactivity. Since
the timer only seems to increment every 5 minutes this is actually more like 20 mins.
When you add in the time it takes for the machines to boot up and for Condor to
do the matchmaking, we found that the machines were shutting down again before
jobs had a chance to run. When the inactivity time was reduced to >= 10 mins
things worked fine (note the inequalities).
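
The effect of the inequalities can be worked through with a small
Python sketch. It assumes the idle timer only ever reads exact
multiples of 5 minutes, which is a simplification of "seems to
increment every 5 minutes"; the function name is invented here.

```python
def first_qualifying_reading(threshold_min: int, strict: bool, tick_min: int = 5) -> int:
    """Smallest timer reading (a multiple of tick_min) that passes the idle test."""
    reading = 0
    while True:
        passes = reading > threshold_min if strict else reading >= threshold_min
        if passes:
            return reading
        reading += tick_min


# "> 15 mins" is not satisfied at a reading of 15, so jobs start at 20 mins.
print(first_qualifying_reading(15, strict=True))
# ">= 10 mins" is satisfied as soon as the timer reads 10.
print(first_qualifying_reading(10, strict=False))
```

If both timers start at roughly the same moment, the first setting
leaves only about 10 of the 30 minutes before shutdown for booting and
matchmaking, while the second leaves about 20.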

Let me know if you have any comments/queries.

regards,

-ian.

------------------------------
Dr Ian C. Smith
e-Science Team,
University of Liverpool,
Computing Services Department.

------------- powersaving .bat file ---------------------------------
@echo off
if exist %systemdrive%\nopowersave.dat goto :EOF

set shutdown=%systemroot%\system32\psshutdown.exe
set loggedon=%systemroot%\system32\psloggedon.exe
set powersavelog=%systemroot%\powersave.txt
set find=%systemroot%\system32\find.exe
set sleep=%systemroot%\system32\sleep.exe

if not exist %shutdown% goto :EOF
if not exist %loggedon% goto :EOF

REM loop around until condor disappears or user logs in

:checkagain
%loggedon% -l -x >>%powersavelog%
%loggedon% -l -x | %find% "%computername%"
if errorlevel 1 goto nouser
REM someone logged in - exit out
echo %date% %time%: user logged in, not powering off.>>%powersavelog%
goto :EOF

:nouser
REM no one logged in - check whether condor appears to be running
echo Batch files in tmp structure: >>%powersavelog%
dir %systemdrive%\tmp\*.bat /s >>%powersavelog%
echo Instances of condor_exec found within it: >>%powersavelog%
dir %systemdrive%\tmp\*.bat /s | %find% "condor_exec" /i >>%powersavelog% 2>&1
if errorlevel 1 goto nocondor

echo "Condor appears to be running; waiting 30 minutes before checking again." >>%powersavelog%
REM use the full path set above rather than relying on sleep being on the PATH
%sleep% 1800
goto checkagain

:nocondor
REM no user logged in or condor job running - shutdown
echo set lastpowersave=down,%date% %time:~0,8% >%systemroot%\lastpowersave.bat
%shutdown% -c -k -m "This computer is about to power itself off to save energy." -t 10
echo %date% %time%: no user, powering off.>>%powersavelog%

------------------- cron job perl script ----------------------------------
#!/usr/local/bin/perl

use strict;

my $condor_bin    = '/opt1/condor/bin';
my $queue_args    = ' -constraint "JobStatus==1" -f "%d\\n" clusterid ';
my $condor_q      = "$condor_bin/condor_q $queue_args"; 
my $status_args   = '-constraint \'State=="Unclaimed"\' -f "%s\\n" Name';
my $condor_status = "$condor_bin/condor_status $status_args";

my $get_idle_jobs     = "$condor_q | wc -l | tr -d ' ' ";
my $get_idle_machines = "$condor_status | wc -l | tr -d ' ' ";

my $no_of_idle_jobs;
my $IP_address;
my $MAC_address_file;
my $no_of_idle_machines;
my $email_file;
my $centre;
my $broadcast_address;

# actual broadcast addresses removed for security reasons 

my %all_centres = ( "ROTC"=>"138.xxx.xxx.255", 
                    "ARC2"=>"138.xxx.xxx.255", 
                    "CDTC"=>"138.xxx.xxx.255", 
                    "ERTC"=>"138.xxx.xxx.255" );

my $wakeup_root =  "/opt1/condor_wakeup";
my $wakeup = "$wakeup_root/wakeonlan ";
$email_file = "$wakeup_root/status";

$no_of_idle_jobs = `$get_idle_jobs`;
$no_of_idle_machines = `$get_idle_machines`;

open( EMAIL, ">", $email_file ) or die "cannot write $email_file: $!";
print EMAIL "no of idle jobs = $no_of_idle_jobs";
print EMAIL "no of idle machines = $no_of_idle_machines";
close( EMAIL );

if( $no_of_idle_jobs - $no_of_idle_machines > 10 )
{
   # testing only
   # `/usr/bin/mailx -s "woke up Condor pool" asdasuyuy\@liv.ac.uk < $email_file`;   
   
   while( ( $centre, $broadcast_address ) = each %all_centres )
   {
     $MAC_address_file = "$wakeup_root/MAC_addresses/" . $centre;
     print "$wakeup -i $broadcast_address -f $MAC_address_file\n" ;
     # send to the centre's broadcast address ($IP_address was never assigned)
     print `$wakeup -i $broadcast_address -f $MAC_address_file`;
   }
}

> 
> > -----Original Message-----
> > From: condor-users-bounces@xxxxxxxxxxx 
> > [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Smith, Ian
> > Sent: Monday, July 02, 2007 2:38 PM
> > To: Condor-Users Mail List
> > Subject: [Condor-users] Condor with power saving PCs
> > 
> > Dear All,
> > 
> > I've spent quite a bit of time recently getting Condor to work
> > with our Win XP pool on PCs that employ power saving (that is 
> > they shutdown after 30 min of inactivity if no one is logged on). 
> > This was a good deal more fiddly than I first anticipated. If anyone
> > is interested I'll post the details here. With full economic costing
> > coming in in UK academia I think this could be a growing concern.
> > 
> > regards,
> > 
> > -ian.
> > 
> > ------------------------------
> > Dr Ian C. Smith
> > e-Science Team,
> > University of Liverpool,
> > Computing Services Department.
> > 
> > _______________________________________________
> > Condor-users mailing list
> > To unsubscribe, send a message to 
> > condor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> > 
> > The archives can be found at: 
> > https://lists.cs.wisc.edu/archive/condor-users/
> > 
> 