Re: [Condor-users] Condor on X86_64 no run works
- Date: Mon, 05 Nov 2007 15:26:44 +0100
- From: jmferrer <jmferrer@xxxxxxxx>
- Subject: Re: [Condor-users] Condor on X86_64 no run works
Hi.
Kewley, J (John) wrote:
> Some thoughts:
>
> 1. You mention "flock". You shouldn't need this if you just have a
> single pool.
>
Yes, I know; there are 2 pools.
But I know nothing about the second pool. I only know that I have to
enable flocking.
> 2. I notice you have vm1, vm2 ... vm5 mentioned, that implies more than
> 4 processors per node, you might have hyperthreading turned on, in
> which case condor will register (possibly) 8 slots per node.
>
OK: 2 quad-core processors = 8 CPUs per node.
> 3. Have you tried
> condor_q -anal
> or
> condor_q -better-anal
> to see why it isn't matching?
>
gargamel:~ # condor_q -analyze
Error: Could not connect to negotiator ((null))
This worked before but no longer does; I'm searching Google for the error.
But condor_q shows:
-- Submitter: gargamel.localdomain : <XXXXXXXXXX:38974> : gargamel.localdomain
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
41.0 condor 11/5 13:20 0+00:00:00 I 0 9.8 for
41.1 condor 11/5 13:20 0+00:00:00 I 0 9.8 for
41.2 condor 11/5 13:20 0+00:00:00 I 0 9.8 for
41.3 condor 11/5 13:20 0+00:00:00 I 0 9.8 for
41.4 condor 11/5 13:20 0+00:00:00 I 0 9.8 for
42.0 condor 11/5 15:21 0+00:00:00 I 0 9.8 loop
42.1 condor 11/5 15:21 0+00:00:00 I 0 9.8 loop
42.2 condor 11/5 15:21 0+00:00:00 I 0 9.8 loop
42.3 condor 11/5 15:21 0+00:00:00 I 0 9.8 loop
42.4 condor 11/5 15:21 0+00:00:00 I 0 9.8 loop
> 4. You do a "queue 5", but all the jobs write to the same error and
> output files, this may not be what is desired. To write to different
> ones, use something like
> output = loop$(PROCESS).out
> error = loop$(PROCESS).err
>
OK, thanks.
After submitting loop.sub I have 5 .err files and 5 .out files,
all of them empty.
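
The suggestion above can be sketched as a small Python script that prints a submit description with one output and one error file per process, plus a single shared user log. This is only an illustration, not the file actually used in this thread: the function name `build_submit_description` is made up here, the `Requirements` line from the original submit file is left out for brevity, and the keywords (`universe`, `executable`, `output`, `error`, `log`, `queue`, `$(PROCESS)`) are the ones already quoted in this message.

```python
def build_submit_description(executable: str, count: int) -> str:
    """Return a vanilla-universe submit description with per-process files.

    Hypothetical helper for illustration; it only builds text and does not
    talk to Condor in any way.
    """
    lines = [
        "universe   = vanilla",
        f"executable = {executable}",
        # $(PROCESS) expands to 0, 1, ..., count-1, so each job gets its own files.
        f"output     = {executable}$(PROCESS).out",
        f"error      = {executable}$(PROCESS).err",
        # One shared user log for the whole cluster: no $(PROCESS) here.
        f"log        = {executable}.log",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the description; redirect it to a file and pass that to condor_submit.
    print(build_submit_description("loop", 5), end="")
```

With `queue 5` this should give loop0.out to loop4.out and loop0.err to loop4.err, which matches the five files of each kind reported above.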
> 5. I can't see a
> log = loop.log
> line, this is useful - have a look in there to see what is produced.
> [Note: don't use $(PROCESS) for this one]
>
OK, thanks
000 (042.000.000) 11/05 15:21:08 Job submitted from host: <MY_IP:38974>
...
000 (042.001.000) 11/05 15:21:08 Job submitted from host: <MY_IP:38974>
...
000 (042.002.000) 11/05 15:21:08 Job submitted from host: <MY_IP:38974>
...
000 (042.003.000) 11/05 15:21:08 Job submitted from host: <MY_IP:38974>
...
000 (042.004.000) 11/05 15:21:08 Job submitted from host: <MY_IP:38974>
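
To read a longer user log without scanning it by eye, the event lines can be tallied per job. The sketch below assumes the header format seen in the excerpt above (a three-digit event code, then cluster.proc.subproc in parentheses) and assumes that code 000 marks submission and code 001 marks the start of execution. The function names and the sample text are only for illustration.

```python
import re
from collections import defaultdict

# Event header lines look like:
#   000 (042.000.000) 11/05 15:21:08 Job submitted from host: <...>
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) ")


def events_per_job(log_text: str) -> dict:
    """Map 'cluster.proc' to the list of event codes seen for that job."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match:
            code, cluster, proc = match.groups()
            events[f"{int(cluster)}.{int(proc)}"].append(code)
    return dict(events)


def never_started(log_text: str) -> list:
    """Return jobs with a submit event (000) but no execute event (001)."""
    return sorted(
        job
        for job, codes in events_per_job(log_text).items()
        if "000" in codes and "001" not in codes
    )


# Two entries copied from the log excerpt above.
SAMPLE = """\
000 (042.000.000) 11/05 15:21:08 Job submitted from host: <MY_IP:38974>
...
000 (042.001.000) 11/05 15:21:08 Job submitted from host: <MY_IP:38974>
...
"""

if __name__ == "__main__":
    print(never_started(SAMPLE))
```

For the log shown here every job has only a 000 event, so all five would be reported as never started, which agrees with the idle (I) state in condor_q.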
> 6. Have a look in the SchedLog of your submit node to see what is in
> there
>
Here are the last 50 lines after running loop.sub:
11/5 15:18:32 (pid:4513) ******************************************************
11/5 15:18:33 (pid:4513) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
11/5 15:18:33 (pid:4513) ** /home/condor/sbin/condor_schedd
11/5 15:18:33 (pid:4513) ** $CondorVersion: 6.8.6 Sep 13 2007 $
11/5 15:18:33 (pid:4513) ** $CondorPlatform: I386-LINUX_DEBIAN40 $
11/5 15:18:33 (pid:4513) ** PID = 4513
11/5 15:18:33 (pid:4513) ** Log last touched 11/5 15:09:11
11/5 15:18:34 (pid:4513) ******************************************************
11/5 15:18:34 (pid:4513) Using config source: /home/condor/condor_config
11/5 15:18:34 (pid:4513) Using local config sources:
11/5 15:18:34 (pid:4513) /home/condor/etc/gargamel.local
11/5 15:18:34 (pid:4513) DaemonCore: Command Socket at <MY_IP:38974>
11/5 15:18:35 (pid:4513) History file rotation is enabled.
11/5 15:18:35 (pid:4513) Maximum history file size is: 20971520 bytes
11/5 15:18:35 (pid:4513) Number of rotated history files is: 2
11/5 15:18:36 (pid:4513) Sent ad to central manager for condor@localdomain
11/5 15:18:37 (pid:4513) Sent ad to 3 collectors for condor@localdomain
11/5 15:18:41 (pid:4513) GCB: [GCB_connect(17)]<192.168.3.100:9618>: direct connect using _CB_do_connect failed
11/5 15:18:41 (pid:4513) attempt to connect to <192.168.3.100:9618> failed: Transport endpoint is already connected (connect errno = 106). Will keep trying for 20 total seconds (15 to go).
11/5 15:21:11 (pid:4513) DaemonCore: Command received via UDP from host <MY_IP:32788>
11/5 15:21:11 (pid:4513) DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
11/5 15:21:11 (pid:4513) Sent ad to central manager for condor@localdomain
11/5 15:21:12 (pid:4513) Sent ad to 3 collectors for condor@localdomain
11/5 15:21:12 (pid:4513) Called reschedule_negotiator()
11/5 15:21:12 (pid:4513) failed to send RESCHEDULE command to negotiator
11/5 15:23:39 (pid:4513) DaemonCore: PERMISSION DENIED to unknown user from host <MY_IP:59285> for command 493 (NEGOTIATE_WITH_SIGATTRS)
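
The interesting lines in a SchedLog like the one above are the failures and the permission errors. A minimal sketch for pulling them out is shown below; the two search phrases are simply the ones that appear in this excerpt, and the function name `find_problems` and the sample list are assumptions for illustration, not part of Condor.

```python
import re

# Phrases taken from the SchedLog excerpt above; extend the list as needed.
PROBLEM_PATTERNS = [
    re.compile(r"PERMISSION DENIED"),
    re.compile(r"failed", re.IGNORECASE),
]


def find_problems(log_lines):
    """Return (line_number, line) pairs for lines matching any problem pattern."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if any(pattern.search(line) for pattern in PROBLEM_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


# Four lines copied from the excerpt above (wrapped lines joined).
SAMPLE = [
    "11/5 15:21:12 (pid:4513) Sent ad to 3 collectors for condor@localdomain",
    "11/5 15:21:12 (pid:4513) Called reschedule_negotiator()",
    "11/5 15:21:12 (pid:4513) failed to send RESCHEDULE command to negotiator",
    "11/5 15:23:39 (pid:4513) DaemonCore: PERMISSION DENIED to unknown user "
    "from host <MY_IP:59285> for command 493 (NEGOTIATE_WITH_SIGATTRS)",
]

if __name__ == "__main__":
    for number, line in find_problems(SAMPLE):
        print(f"{number}: {line}")
```

To run it on a real log, pass an open file object instead of `SAMPLE`; where the SchedLog lives depends on the local configuration, so no path is assumed here.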
> 7. Are these nodes on a cluster, i.e. on a private network? If so, you
> will need full connectivity between all submit nodes and all execute
> nodes.
> See paper and presentation on
> http://epubs.cclrc.ac.uk/work-details?w=34452
> for more details
>
> Good luck
>
I'm reading it now, thanks.
> JK
>
>
>> -----Original Message-----
>> From: condor-users-bounces@xxxxxxxxxxx
>> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of jmferrer
>> Sent: Monday, November 05, 2007 12:34 PM
>> To: condor-users@xxxxxxxxxxx
>> Subject: [Condor-users] Condor on X86_64 no run works
>>
>> Hi.
>>
>> I'm trying to build a cluster with:
>>
>> OpenSuse 10.2
>> Condor-6.8.6
>> Kernel suse 2.6.18.2-34-default
>>
>>
>> System:
>>
>> 1 central manager, 1 CPU (P4): does not execute jobs, flocking enabled
>> 19 nodes, each with 2 quad-core Intel X86_64 CPUs
>>
>> I share /home from the central manager with all nodes over NFS.
>>
>> If I run condor_status, I get:
>>
>> gargamel:/home/condor # condor_status
>>
>> Name        OpSys  Arch    State      Activity  LoadAv  Mem  ActvtyTime
>>
>> vm1@smurf0  LINUX  X86_64  Owner      Idle      0.000   996  0+00:06:45
>> vm2@smurf0  LINUX  X86_64  Unclaimed  Idle      0.000   996  4+23:45:04
>> vm3@smurf0  LINUX  X86_64  Unclaimed  Idle      0.000   996  4+23:45:05
>> vm4@smurf0  LINUX  X86_64  Unclaimed  Idle      0.000   996  4+23:45:07
>> vm5@smurf0  LINUX  X86_64  Unclaimed  Idle      0.000   996  4+23:45:08
>> ..............................
>> Total 87 1 0 86 0 0 0
>>
>> Some nodes are powered off.
>>
>> My submit file:
>> gargamel:/home/condor # cat /home/pepe/test_condor/loop.submit
>> # automatically generated description file
>> universe = vanilla
>> executable = loop
>> output = loop.out
>> error = loop.err
>> Requirements = (Arch == "INTEL" && OpSys == "LINUX") || \
>>                (Arch == "X86_64" && OpSys == "LINUX")
>> queue 5
>>
>>
>>
>>
>> Can somebody show me how to make this work?
>>
>>
>>
>> Sorry for my English, I'm from Almeria IR.
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to
>> condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/condor-users/
>>
>>