[Condor-users] Idle Jobs & an Authentication Issue
- Date: Tue, 14 Nov 2006 19:01:52 -0600 (CST)
- From: Smith Denvil L <dls1115@xxxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] Idle Jobs & an Authentication Issue
Hello - I am setting up a Condor pool to do a demo for our Grid class this
Friday afternoon. I am currently experimenting with 3 laptops in my pool.
All 3 laptops see each other.
I have 2 issues:
1) jobs submitted to run locally remain idle for about 20 minutes
before executing.
2) jobs submitted to run remotely receive a warning that the output
and error files are not writable by condor.
Below is the lab handout which has the changes I have made so far to the
config files.
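For issue 1, the quickest way to see why a queued job is not matching is to ask the scheduler directly. This is a diagnostic sketch, not part of the handout; it assumes Condor's bin directory is on PATH and degrades gracefully where it is not:

```shell
# Sketch: ask Condor why a queued job has not been matched yet.
if command -v condor_q >/dev/null 2>&1; then
    condor_q -analyze          # explains, per job, why no match was made
    condor_status -available   # lists slots currently willing to run jobs
else
    echo "condor_q not found"
fi
```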
Condor Scheduler
Install Procedure
1. As user root
Turn off torque
$ /etc/init.d/pbs stop
$ chkconfig --level 3 pbs off
$ chkconfig --level 4 pbs off
$ chkconfig --level 5 pbs off
2. As user root
Copy the rpm for the install over to your local machine
$ scp root@xxxxxxxxxxxxxxxxx:/globus_ins/condor/ \
condor-6.8.2-linux-x86-rhel3-dynamic-1.i386.rpm \
/globus_ins/
3. As user root
Create the condor user
$ useradd condor
4. As user root
Create a directory for condor
$ mkdir /usr/local/condor
5. As user root
Move to the directory that has the condor rpm
$ cd /globus_ins
6. As user root
Install the condor package
$ rpm -i condor-6.8.2-linux-x86-rhel3-dynamic-1.i386.rpm \
--prefix=/usr/local/condor
Unable to find a valid Java installation
Java Universe will not work properly until the JAVA
(and JAVA_MAXHEAP_ARGUMENT) parameters are set in the configuration file!
Condor has been installed into:
/usr/local/condor
In order for Condor to work properly you must set your
CONDOR_CONFIG environment variable to point to your
Condor configuration file:
/usr/local/condor/etc/condor_config
before running Condor commands/daemons.
7. As user root
Set the CONDOR & CONDOR_CONFIG environment variables.
(remember to add the new environment variable CONDOR to your
PATH)
(remember to add this new environment variable to the list of export
variables)
$ cd ~
$ vi .bash_profile
CONDOR=/usr/local/condor
CONDOR_CONFIG=/usr/local/condor/etc/condor_config
PATH=$PATH:$CONDOR/bin:$CONDOR/sbin:$TORQUE_ROOT/bin:$TORQUE_ROOT/sbin
export CONDOR CONDOR_CONFIG PATH
$ source .bash_profile
$ cp .bash_profile /home/globus/.bash_profile
cp: overwrite `/home/globus/.bash_profile'? y
$ cp .bash_profile /home/griduserxx/.bash_profile
cp: overwrite `/home/griduserxx/.bash_profile'? y
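A quick sanity check after sourcing the profile (a sketch, not part of the handout; condor_config_val only produces output once CONDOR_CONFIG points at a readable config file):

```shell
# Sanity-check the environment before starting any Condor daemons.
echo "CONDOR=$CONDOR"
echo "CONDOR_CONFIG=$CONDOR_CONFIG"
# If the config file is being found, this prints /usr/local/condor:
if command -v condor_config_val >/dev/null 2>&1; then
    condor_config_val RELEASE_DIR
fi
```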
8. As user root
Modify the Condor configuration file to allow other hosts to submit
jobs to your local host
$ cp $CONDOR/etc/condor_config $CONDOR/etc/condor_config.SAV
$ vi $CONDOR/etc/condor_config
a) remove the comment on the below line of code on line 212
212 ## HOSTALLOW_WRITE = *
212 HOSTALLOW_WRITE = *
b) comment out the below line of code on line 215
215 HOSTALLOW_WRITE = YOU_MUST_CHANGE_THIS_INVALID_CONDOR_CONFIGURATION_VALUE
215 ##HOSTALLOW_WRITE = YOU_MUST_CHANGE_THIS_INVALID_CONDOR_CONFIGURATION_VALUE
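The same two edits can be scripted instead of done in vi. This is a sketch; the sed patterns assume the stock 6.8.2 config wording shown above, and HOSTALLOW_WRITE = * lets any host submit, which is fine for a closed classroom pool but not for production:

```shell
# Back up the config, then apply the two HOSTALLOW_WRITE edits in place.
CFG=${CONDOR:-/usr/local/condor}/etc/condor_config
cp "$CFG" "$CFG.SAV"
# Uncomment the permissive line (any host may write to this pool) ...
sed -i 's|^## HOSTALLOW_WRITE = \*|HOSTALLOW_WRITE = *|' "$CFG"
# ... and comment out the invalid placeholder value.
sed -i 's|^HOSTALLOW_WRITE = YOU_MUST_CHANGE|##&|' "$CFG"
```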
9. As user root
Start Condor
$ condor_master
$ ps ax | grep condor
5204 ? Ss 0:00 condor_master
5205 ? Ss 0:00 condor_collector -f
5206 ? Ss 0:00 condor_negotiator -f
5207 ? Ss 0:00 condor_schedd -f
5208 ? Ss 0:05 condor_startd -f
5217 pts/1 S+ 0:00 grep condor
You should see the following condor daemons executing:
master, collector,
negotiator, schedd, & startd
10. As user root
Check the status of your local condor pool - wait a few minutes after
initially starting the Condor
master before checking the status to allow all the daemons to initialize
$ condor_status
Name          OpSys  Arch   State      Activity  LoadAv  Mem   ActvtyTime
gridxx.local  LINUX  INTEL  Unclaimed  Idle      0.180   1002  0+00:00:04

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
 INTEL/LINUX      1      0        0          1        0           0         0
       Total      1      0        0          1        0           0         0
11. As user griduserxx
Create a user working directory for condor and copy some
condor files
from gridpresent
$ mkdir condor
$ cd condor
$ scp root@xxxxxxxxxxxxxxxxx:/globus_ins/condor/runner* .
$ scp root@xxxxxxxxxxxxxxxxx:/globus_ins/condor/looper* .
12. As user griduserxx
Notice the command script used by condor
$ more runner.cmd
####################
##
## Test Condor command file
##
####################
universe = vanilla
executable = runner.exe
output = runner.out
error = runner.err
log = runner.log
queue
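A submit file for the second program would look the same with the names swapped. This is a sketch and an assumption: looper.cmd is not shown in the handout, so the file below is just runner.cmd with the names substituted. (Note also that condor_compile links binaries for checkpointing, which is normally paired with universe = standard; vanilla works but forgoes checkpointing.)

```shell
# Write a submit description for looper.exe, mirroring runner.cmd above.
cat > looper.cmd <<'EOF'
####################
##
## Test Condor command file
##
####################
universe = vanilla
executable = looper.exe
output = looper.out
error = looper.err
log = looper.log
queue
EOF
```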
13. As user griduserxx
Compile 2 Fortran 77 programs in condor
$ condor_compile f77 -o runner.exe runner.f
LINKING FOR CONDOR : /usr/bin/ld -L/usr/local/condor/lib -Bstatic
--eh-frame-hdr . . .
$ condor_compile f77 -o looper.exe looper.f
LINKING FOR CONDOR : /usr/bin/ld -L/usr/local/condor/lib -Bstatic
--eh-frame-hdr -m elf_i386 -dyn . . .
14. As user griduserxx
Submit a job in condor
$ condor_submit runner.cmd
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 4.
15. As user griduserxx
Notice the files generated by the job
$ more runner.out
OUTPUT FROM PGM RUNNER
$ more runner.err
Condor: Notice: Will checkpoint to condor_exec.exe.ckpt
Condor: Notice: Remote system calls disabled.
$ more runner.log
000 (004.000.000) 11/06 16:59:16 Job submitted from host: <130.70.83.6:33007>
...
001 (004.000.000) 11/06 16:59:18 Job executing on host: <130.70.83.6:33008>
...
005 (004.000.000) 11/06 16:59:18 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
16. As user root
Stop Condor
$ condor_off -master
Check and make sure all the condor daemons were stopped
$ ps ax | grep condor
5204 ? Ss 0:00 condor_master
5205 ? Ss 0:00 condor_collector -f
5206 ? Ss 0:00 condor_negotiator -f
5207 ? Ss 0:00 condor_schedd -f
5208 ? Ss 0:05 condor_startd -f
5217 pts/1 S+ 0:00 grep condor
If you notice that the condor daemons are still running, as in above,
simply kill the master
$ kill 5204
All the condor daemons should now be off
$ ps ax | grep condor
5537 pts/1 S+ 0:00 grep condor
Notice that none of the daemons are now listed as active
17. As user root
Modify the global config file to acknowledge gridpresent.local as the
pool manager
$ vi $CONDOR/etc/condor_config
a) add 1 line on line 51 for CONDOR_HOST = gridpresent.local
## What machine is your central manager?
51 CONDOR_HOST = gridpresent.local
##--------------------------------------------------------------------
## Pathnames:
##--------------------------------------------------------------------
## Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR = /usr/local/condor
18. As user root
Modify the local config file to acknowledge gridpresent.local as the
pool manager
$ vi $CONDOR/local.gridxx/condor_config.local
a) Modify line 4
2 ## What machine is your central manager?
3
4 CONDOR_HOST = gridxx.local
and replace gridxx.local with gridpresent.local
4 CONDOR_HOST = gridpresent.local
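Step 18 amounts to replacing one value, which can also be done non-interactively. A sketch; it assumes the CONDOR_HOST line is uncommented and formatted exactly as shown above:

```shell
# Swap the per-node manager for the class-wide one in the local config.
LOCAL=${CONDOR:-/usr/local/condor}/local.gridxx/condor_config.local
sed -i 's|^CONDOR_HOST = .*|CONDOR_HOST = gridpresent.local|' "$LOCAL"
grep '^CONDOR_HOST' "$LOCAL"   # should now print gridpresent.local
```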
19. As user root
Start Condor
$ condor_master
$ ps ax | grep condor
5204 ? Ss 0:00 condor_master
5205 ? Ss 0:00 condor_collector -f
5206 ? Ss 0:00 condor_negotiator -f
5207 ? Ss 0:00 condor_schedd -f
5208 ? Ss 0:05 condor_startd -f
5217 pts/1 S+ 0:00 grep condor
You should see the following condor daemons executing:
master, collector,
negotiator, schedd, & startd
20. As user root
Check the status of the global class condor pool - wait a few minutes
after initially starting the Condor
master before checking the status to allow all the daemons to initialize
$ condor_status
Name               OpSys  Arch   State      Activity  LoadAv  Mem   ActvtyTime
gridxx.local       LINUX  INTEL  Unclaimed  Idle      0.180   1002  0+00:00:04
gridxx.local       LINUX  INTEL  Unclaimed  Idle      0.180   1002  0+00:00:04
. . .
gridpresent.local  LINUX  INTEL  Unclaimed  Idle      0.180   1002  0+00:00:04

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
 INTEL/LINUX      x      0        0          x        0           0         0
       Total      x      0        0          x        0           0         0
You should now see all the other laptops in the class which have joined
the pool
21. As user root
Open up the necessary ports for the Condor scheduler by adding 2
lines to iptables
$ cp /etc/sysconfig/iptables /etc/sysconfig/iptables.SAV
$ vi /etc/sysconfig/iptables
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 60000:60500 -j ACCEPT
# Condor Scheduler
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 9618 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 32000:33000 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
COMMIT
22. As user root
Apply the changes in iptables to the firewall
$ service iptables stop
$ service iptables start
More information may be found in the Condor Version 6.8.2 manuals :
http://www.cs.wisc.edu/condor/manual/v6.8/
CHANGES PENDING :
1. As user root
Modify the global config file
a) uncomment UID_DOMAIN = $(FULL_HOSTNAME)
b) comment out UID_DOMAIN = your.domain
c) uncomment FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
d) comment out FILESYSTEM_DOMAIN = your.domain
e) comment out COLLECTOR_NAME = My Pool
f) add COLLECTOR_NAME = gridpresent.local
g) modify JAVA = /opt/jdk1.5.0_08/bin/java
h) uncomment REQUIRE_LOCAL_CONFIG_FILE = TRUE
2. As user root
Configure the local node as a submit and execution node and point to
the central manager
$ condor_configure --type=submit,execute
--central-manager=gridpresent.local --owner=condor
3. As user root
BOTTOM OF CONDOR_CONFIG
TESTING SECTION
4. As user root
$ chmod o+rwx /home/griduser06/condor
$ chmod o+rwx /home/griduser06/condor/runner.err
$ chmod o+rwx /home/griduser06/condor/runner.out
-- NOTES --
>> use condor_reschedule when a job remains in idle status (it updates the
collector and triggers a new negotiation cycle instead of waiting for the
next scheduling interval)
Thanks for the help,
Denvil...