[Condor-users] Problems running a test PVM job with Condor
- Date: Fri, 3 Dec 2004 17:15:22 +0000
- From: Angel de Vicente <angelv@xxxxxx>
- Subject: [Condor-users] Problems running a test PVM job with Condor
Hi,
I'm starting to play with PVM and Condor-PVM, but no success yet.
I am running Condor-6.6.7 on ~200 machines with no problems (around 100 of them
Linux, which is where I am submitting the job from).
I downloaded the latest version of PVM (3.4.4) today, compiled it, and
successfully compiled and ran one of the example programs that come with the
PVM distribution: master1/slave1.
But when I try with Condor-PVM, no success. Does anybody with experience with
Condor-PVM know what could be happening? I include below all the details.
Thanks a lot,
Angel de Vicente
PS. By the way, is there an MW mailing list? The information at
http://www.cs.wisc.edu/condor/mw/ does not seem to be up-to-date.
--------------------------------------
With the pvm console, the master1 program works OK:
[angelv@guinda PVM]$ pvm
pvm> add filomena
add filomena
1 successful
HOST DTID
filomena 80000
pvm> conf
conf
2 hosts, 1 data format
HOST DTID ARCH SPEED DSIG
guinda 40000 LINUX 1000 0x00408841
filomena 80000 LINUX 1000 0x00408841
pvm> spawn -> master1
spawn -> master1
[1]
1 successful
t80001
pvm> [1:t40003] EOF
[1:t40004] EOF
[1:t40002] EOF
[1:t80003] EOF
[1:t80004] EOF
[1:t80002] EOF
[1:t80001] Spawning 6 worker tasks ... SUCCESSFUL
[1:t80001] I got 700.000000 from 4; (expecting 700.000000)
[1:t80001] I got 900.000000 from 5; (expecting 900.000000)
[1:t80001] I got 500.000000 from 3; (expecting 500.000000)
[1:t80001] I got 100.000000 from 1; (expecting 100.000000)
[1:t80001] I got 300.000000 from 2; (expecting 300.000000)
[1:t80001] I got 500.000000 from 0; (expecting 500.000000)
[1:t80001] EOF
[1] finished
pvm>
Since the Condor documentation (section 2.9.2) says that PVM and Condor-PVM
are binary compatible, I tried to run the master1/slave1 program under
Condor-PVM. My submit file is:
-----------------
universe = PVM
executable = master1
output = out.dat
error = err.dat
log = pvm.log
Requirements = (Arch == "INTEL") && (OpSys == "LINUX")
machine_count = 1..2
queue
I submit it to the queue and condor_q says that it is running, but it sits
there for a long time and nothing appears in the output or error files.
[angelv@guinda PVM]$ condor_q
-- Submitter: guinda.iac.es : <161.72.81.187:30045> : guinda.iac.es
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
457.0 angelv 12/3 17:05 0+00:01:43 R 0 0.1 master1
1 jobs; 0 idle, 1 running, 0 held
[angelv@guinda PVM]$
It seems that something started OK:
[angelv@guinda PVM]$ ps -aux | grep pvm
angelv 23873 0.0 0.5 7964 2720 ? S 17:06 0:00 condor_shadow.pvm <161.72.81.187:30064>
angelv 23874 0.0 0.1 1704 728 ? S 17:06 0:00 /usr/pkg/condor/condor/sbin/condor_pvmd -d0x11c
The ShadowLog on guinda shows the shadow trying to start pvmds on the other
machines, but I do not know why they fail. Below are the last lines of the
ShadowLog on guinda. Everything looks fine at first, but then there is a line
like
12/3 17:05:16 (457.0) (23809):Can't start new machines now{ filomena.iac.es}
Any ideas what could be wrong?
12/3 17:05:16 (?.?) (23809):********** Multi_Shadow starting up **********
12/3 17:05:16 (?.?) (23809):uid=0, euid=2120, gid=0, egid=20
12/3 17:05:16 (?.?) (23809):My_Filesystem_Domain = "iac.es"
12/3 17:05:16 (?.?) (23809):My_UID_Domain = "iac.es"
12/3 17:05:16 (?.?) (23809):Shadow reading via ASCII
12/3 17:05:16 (?.?) (23809):First Line: 457 0 1
12/3 17:05:16 (457.0) (23809):Created class:
12/3 17:05:16 (457.0) (23809):#0: 0 (1, 2) has 0
12/3 17:05:16 (457.0) (23809):New process for proc 0
12/3 17:05:16 (457.0) (23809):AllocProc() returning 0
12/3 17:05:16 (457.0) (23809):Machine from schedd: <161.72.80.41:46440> <161.72.80.41:46440>#1825676418 0
12/3 17:05:16 (457.0) (23809):Machine Line: filomena.iac.es 0
12/3 17:05:16 (457.0) (23809):Machines now cur = 1 desire = 2
12/3 17:05:16 (457.0) (23809):Updated class:
12/3 17:05:16 (457.0) (23809):#0: 0 (1, 2) has 1
12/3 17:05:16 (457.0) (23809):Starting pvmd: /usr/pkg/condor/condor/sbin/condor_pvmd -d0x11c
12/3 17:05:16 (457.0) (23809):PVM is pid 23810
12/3 17:05:16 (457.0) (23809):pvmd response: /tmp/fileMQop00
12/3 17:05:16 (457.0) (23809):PVMSOCK=/tmp/fileMQop00
12/3 17:05:16 (457.0) (23809):pvm_fd = 4, mytid = t40001
12/3 17:05:16 (457.0) (23809):Entered StartWaitingHosts()
12/3 17:05:16 (457.0) (23809):Ok to start waiting hosts
12/3 17:05:16 (457.0) (23809):PVMd message is SM_STHOST from t80000000
12/3 17:05:16 (457.0) (23809):SM_STHOST: 80000 "" "161.72.80.41" "$PVM_ROOT/lib/pvmd -s -d0x11c -nfilomena.iac.es 1 a14851bb:8043 4080 2 a1485029:0000"
12/3 17:05:16 (457.0) (23809):New process for proc 0
12/3 17:05:16 (457.0) (23809):AllocProc() returning 0
12/3 17:05:16 (457.0) (23809):Shadow: Entering multi_send_job(filomena.iac.es)
12/3 17:05:16 (457.0) (23809):Requesting Alternate Starter 1
12/3 17:05:16 (457.0) (23809):Shadow: Request to run a job was REFUSED
12/3 17:05:16 (457.0) (23809):RemoveHost: Sending HostDelete notify on t80080000
12/3 17:05:16 (457.0) (23809):SendNotification(kind = 2, tid = t80080000)
12/3 17:05:16 (457.0) (23809):signal_startd( filomena.iac.es, 443 )
12/3 17:05:16 (457.0) (23809):Adding host filomena.iac.es to STARTACK msg.
12/3 17:05:16 (457.0) (23809):Num Hosts to pack = 1
12/3 17:05:16 (457.0) (23809):Packing tid t80000 with reply PvmNoHost
12/3 17:05:16 (457.0) (23809):Sending SM_STHOSTACK to PVMd
12/3 17:05:16 (457.0) (23809):Entered StartWaitingHosts()
12/3 17:05:16 (457.0) (23809):Can't start new machines now{ filomena.iac.es}
12/3 17:05:16 (457.0) (23809):Updated class:
12/3 17:05:16 (457.0) (23809):#0: 0 (1, 2) has 0
12/3 17:05:16 (457.0) (23809):PVMd message is SM_ADDACK from t80000000
12/3 17:05:16 (457.0) (23809):pvmd reports error -6 on SM_ADDACK: PvmNoHost
12/3 17:05:16 (457.0) (23809):pvm_machines_starting = 0(should be 0)
12/3 17:05:16 (457.0) (23809):StartLocalProcess: = /home/angelv/SCRIPTS/CONDOR/PVM/master1 < /dev/null > out.dat >& err.dat
12/3 17:05:16 (457.0) (23809):Not enough machines in class 0 to start local proc.
12/3 17:05:16 (457.0) (23809):Entered StartWaitingHosts()
12/3 17:05:16 (457.0) (23809):Ok to start waiting hosts
12/3 17:06:13 (457.0) (23809):Shadow reading via ASCII
12/3 17:06:13 (457.0) (23809):First Line: 457 0 1
12/3 17:06:13 (457.0) (23809):Machine from schedd: <161.72.81.147:32804> <161.72.81.147:32804>#1709993205 0
12/3 17:06:13 (457.0) (23809):Machine Line: calendula.iac.es 0
12/3 17:06:13 (457.0) (23809):Machines now cur = 1 desire = 2
12/3 17:06:13 (457.0) (23809):Updated class:
12/3 17:06:13 (457.0) (23809):#0: 0 (1, 2) has 1
12/3 17:06:13 (457.0) (23809):Entered StartWaitingHosts()
12/3 17:06:13 (457.0) (23809):Ok to start waiting hosts
12/3 17:06:13 (457.0) (23809):PVMd message is SM_STHOST from t80000000
12/3 17:06:13 (457.0) (23809):SM_STHOST: c0000 "" "161.72.81.147" "$PVM_ROOT/lib/pvmd -s -d0x11c -ncalendula.iac.es 1 a14851bb:8043 4080 3 a1485193:0000"
12/3 17:06:13 (457.0) (23809):New process for proc 0
12/3 17:06:13 (457.0) (23809):AllocProc() returning 0
12/3 17:06:13 (457.0) (23809):Shadow: Entering multi_send_job(calendula.iac.es)
12/3 17:06:13 (457.0) (23809):Requesting Alternate Starter 1
12/3 17:06:13 (457.0) (23809):Shadow: Request to run a job was ACCEPTED
12/3 17:06:13 (457.0) (23809):Shadow: RSC_SOCK connected, fd = 6
12/3 17:06:13 (457.0) (23809):Multi_Shadow: CLIENT_LOG connected, fd = 7
12/3 17:06:13 (457.0) (23809):in new_timer()
12/3 17:06:13 (457.0) (23809):Timer List
12/3 17:06:13 (457.0) (23809):^^^^^ ^^^^
12/3 17:06:13 (457.0) (23809):id = 0, when = 180
12/3 17:06:14 (457.0) (23809):Shadow: send_pvm_job_info
12/3 17:06:14 (457.0) (23809):send_pvm_job_info: arg = -s -d0x11c -ncalendula.iac.es 1 a14851bb:8043 4080 3 a1485193:0000 -f
12/3 17:06:14 (457.0) (23809):On LogSock for host calendula.iac.es:
-> [pvmd pid20947]
12/3 17:06:14 (457.0) (23809):On LogSock for host calendula.iac.es:
-> 12/03 17:06:38 version 3.4.2
-> [pvmd pid20947] 12/03 17:06:38 ddpro 2316 tdpro 1318
-> [pvmd pid20947] 12/03 17:06:38 main() debug mask is 0x11c (tsk,slv,hst,sch)
12/3 17:06:14 (457.0) (23809):In cancel_timer()
12/3 17:06:14 (457.0) (23809):Timer List
12/3 17:06:14 (457.0) (23809):^^^^^ ^^^^
12/3 17:06:14 (457.0) (23809):Received PVM info from calendula.iac.es
12/3 17:06:14 (457.0) (23809):Adding host calendula.iac.es to STARTACK msg.
12/3 17:06:14 (457.0) (23809):Num Hosts to pack = 1
12/3 17:06:14 (457.0) (23809):Packing tid tc0000 with reply ddpro<2316> arch<LINUX> ip<a1485193:80d5> mtu<4080> dsig<4229185>
12/3 17:06:14 (457.0) (23809):Sending SM_STHOSTACK to PVMd
12/3 17:06:14 (457.0) (23809):PVMd message is SM_ADDACK from t80000000
12/3 17:06:14 (457.0) (23809):Host #1(calendula.iac.es) has been added to PVM, pvmd_tid = 800c0000
12/3 17:06:14 (457.0) (23809):SendNotification(kind = 3, tid = t800c0000)
12/3 17:06:14 (457.0) (23809):pvm_machines_starting = 0(should be 0)
12/3 17:06:14 (457.0) (23809):StartLocalProcess: = /home/angelv/SCRIPTS/CONDOR/PVM/master1 < /dev/null > out.dat >& err.dat
12/3 17:06:14 (457.0) (23809):open_max = 1024
12/3 17:06:14 (457.0) (23809):Local PVM process pid = 23872
12/3 17:06:14 (457.0) (23809):Entered StartWaitingHosts()
12/3 17:06:14 (457.0) (23809):Ok to start waiting hosts
12/3 17:06:14 (457.0) (23809):PVMd message is SM_EXECACK from t80000000
12/3 17:06:14 (457.0) (23809):Setting local tid to t40002
12/3 17:06:14 (457.0) (23809):PVMd message is SM_CONFIG from t40002
12/3 17:06:14 (457.0) (23809):PVMd message is SM_SPAWN from t40002
12/3 17:06:14 (457.0) (23809):ERROR "Assertion ERROR on (count == 1)" at line 591 in file pvm_emulation.C
12/3 17:06:14 (457.0) (23809):Multi_Shadow: Shutting down...
12/3 17:06:14 (457.0) (23809):Updated class:
12/3 17:06:14 (457.0) (23809):#0: 0 (1, 2) has 1
12/3 17:06:14 (457.0) (23809):signal_startd( calendula.iac.es, 443 )
12/3 17:06:14 (457.0) (23809):in new_timer()
12/3 17:06:14 (457.0) (23809):Timer List
12/3 17:06:14 (457.0) (23809):^^^^^ ^^^^
12/3 17:06:14 (457.0) (23809):id = 1, when = 300
12/3 17:06:15 (?.?) (23873):********** Multi_Shadow starting up **********
12/3 17:06:15 (?.?) (23873):uid=0, euid=2120, gid=0, egid=20
12/3 17:06:15 (?.?) (23873):My_Filesystem_Domain = "iac.es"
12/3 17:06:15 (?.?) (23873):My_UID_Domain = "iac.es"
12/3 17:06:15 (?.?) (23873):Shadow reading via ASCII
12/3 17:06:15 (?.?) (23873):First Line: 457 0 1
12/3 17:06:15 (457.0) (23873):Created class:
12/3 17:06:15 (457.0) (23873):#0: 0 (1, 2) has 0
12/3 17:06:15 (457.0) (23873):New process for proc 0
12/3 17:06:15 (457.0) (23873):AllocProc() returning 0
12/3 17:06:15 (457.0) (23873):Machine from schedd: <161.72.81.147:32804> <161.72.81.147:32804>#1709993205 0
12/3 17:06:15 (457.0) (23873):Machine Line: calendula.iac.es 0
12/3 17:06:15 (457.0) (23873):Machines now cur = 1 desire = 2
12/3 17:06:15 (457.0) (23873):Updated class:
12/3 17:06:15 (457.0) (23873):#0: 0 (1, 2) has 1
12/3 17:06:15 (457.0) (23873):Starting pvmd: /usr/pkg/condor/condor/sbin/condor_pvmd -d0x11c
12/3 17:06:15 (457.0) (23873):PVM is pid 23874
12/3 17:06:15 (457.0) (23873):pvmd response: /tmp/fileFJ8XV4
12/3 17:06:15 (457.0) (23873):PVMSOCK=/tmp/fileFJ8XV4
12/3 17:06:15 (457.0) (23873):pvm_fd = 4, mytid = t40001
12/3 17:06:15 (457.0) (23873):Entered StartWaitingHosts()
12/3 17:06:15 (457.0) (23873):Ok to start waiting hosts
12/3 17:06:15 (457.0) (23873):PVMd message is SM_STHOST from t80000000
12/3 17:06:15 (457.0) (23873):SM_STHOST: 80000 "" "161.72.81.147" "$PVM_ROOT/lib/pvmd -s -d0x11c -ncalendula.iac.es 1 a14851bb:8045 4080 2 a1485193:0000"
12/3 17:06:15 (457.0) (23873):New process for proc 0
12/3 17:06:15 (457.0) (23873):AllocProc() returning 0
12/3 17:06:15 (457.0) (23873):Shadow: Entering multi_send_job(calendula.iac.es)
12/3 17:06:15 (457.0) (23873):Requesting Alternate Starter 1
12/3 17:06:15 (457.0) (23873):Shadow: Request to run a job was REFUSED
12/3 17:06:15 (457.0) (23873):RemoveHost: Sending HostDelete notify on t80080000
12/3 17:06:15 (457.0) (23873):SendNotification(kind = 2, tid = t80080000)
12/3 17:06:15 (457.0) (23873):signal_startd( calendula.iac.es, 443 )
12/3 17:06:15 (457.0) (23873):Adding host calendula.iac.es to STARTACK msg.
12/3 17:06:15 (457.0) (23873):Num Hosts to pack = 1
12/3 17:06:15 (457.0) (23873):Packing tid t80000 with reply PvmNoHost
12/3 17:06:15 (457.0) (23873):Sending SM_STHOSTACK to PVMd
12/3 17:06:15 (457.0) (23873):Entered StartWaitingHosts()
12/3 17:06:15 (457.0) (23873):Can't start new machines now{ calendula.iac.es}
12/3 17:06:15 (457.0) (23873):Updated class:
12/3 17:06:15 (457.0) (23873):#0: 0 (1, 2) has 0
12/3 17:06:15 (457.0) (23873):PVMd message is SM_ADDACK from t80000000
12/3 17:06:15 (457.0) (23873):pvmd reports error -6 on SM_ADDACK: PvmNoHost
12/3 17:06:15 (457.0) (23873):pvm_machines_starting = 0(should be 0)
12/3 17:06:15 (457.0) (23873):StartLocalProcess: = /home/angelv/SCRIPTS/CONDOR/PVM/master1 < /dev/null > out.dat >& err.dat
12/3 17:06:15 (457.0) (23873):Not enough machines in class 0 to start local proc.
12/3 17:06:15 (457.0) (23873):Entered StartWaitingHosts()
12/3 17:06:15 (457.0) (23873):Ok to start waiting hosts
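A log this long is easier to read once it is reduced to the lines that mark a decision or a failure. The following is only a rough sketch of such a filter, not a Condor tool: the line layout in LINE_RE is inferred from the excerpt above, the keyword list is simply the set of strings that stand out in this particular log, and the function name summarize_shadow_log is hypothetical.

```python
import re

# Assumed layout of a ShadowLog line, inferred from the excerpt above:
#   "12/3 17:05:16 (457.0) (23809):message text"
LINE_RE = re.compile(
    r"^(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+) "
    r"\((?P<job>[^)]*)\) \((?P<pid>\d+)\):(?P<msg>.*)$"
)

# Substrings taken from the log above that mark the interesting events.
KEYWORDS = (
    "REFUSED",
    "ACCEPTED",
    "ERROR",
    "Can't start new machines",
    "PvmNoHost",
)


def summarize_shadow_log(lines):
    """Return (time, pid, message) tuples for lines that contain a keyword."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            # Continuation lines such as "-> [pvmd pid20947]" are skipped.
            continue
        msg = match.group("msg")
        if any(keyword in msg for keyword in KEYWORDS):
            events.append((match.group("time"), int(match.group("pid")), msg))
    return events


# A few lines copied from the log above, used as sample input.
sample = [
    "12/3 17:05:16 (457.0) (23809):Shadow: Request to run a job was REFUSED",
    "12/3 17:05:16 (457.0) (23809):Entered StartWaitingHosts()",
    "12/3 17:06:13 (457.0) (23809):Shadow: Request to run a job was ACCEPTED",
    '12/3 17:06:14 (457.0) (23809):ERROR "Assertion ERROR on (count == 1)" '
    "at line 591 in file pvm_emulation.C",
    "-> [pvmd pid20947]",
]

for time, pid, msg in summarize_shadow_log(sample):
    print(time, pid, msg)
```

Run against the full ShadowLog (for example by passing an open file object instead of the sample list), this would keep the REFUSED/ACCEPTED replies, the PvmNoHost errors, and the assertion failure, grouped by shadow pid, and drop the routine bookkeeping lines.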
--
----------------------------------
http://www.iac.es/galeria/angelv/
PostDoc Software Support
Instituto de Astrofisica de Canarias