Hermann,
I think you are about right, but I found the evidence in the job output file rather than in the log files. I probably missed it earlier because I was not waiting five minutes before looking.
I am still not sure why this is happening. One more point: the network is a simple Windows workgroup, not a Windows domain, and there is no DNS.
It appears that I have some connection issue between the two machines. Condor appears to be rescheduling the job every five minutes.
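Since the pool is a workgroup without DNS, one thing that may be worth ruling out is whether each machine can resolve the other's hostname at all. The following is only a minimal Python sketch of such a check; `check_hosts` is a hypothetical helper name (not part of Condor), and the hostnames in the usage comment are the ones that appear elsewhere in this thread.

```python
import socket


def check_hosts(hostnames, resolve=socket.gethostbyname):
    """Try to resolve each hostname; map it to its IP, or None on failure.

    `resolve` is injectable so the check can be exercised without a network.
    """
    results = {}
    for name in hostnames:
        try:
            results[name] = resolve(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError.
            results[name] = None
    return results


# Example (run on each machine in turn):
#   print(check_hosts(["HPTlaptop", "jhowes-HPT"]))
```

A `None` for either name would suggest that name resolution, rather than Condor itself, deserves a closer look. Successful resolution does not by itself prove that the daemons can reach each other.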
______________________________________________________________________________
000 (008.000.000) 04/16 14:27:04 Job submitted from host: <192.168.50.1:54597>
...
001 (008.000.000) 04/16 14:47:07 Job executing on host: <192.168.50.1:54599>
...
022 (008.000.000) 04/16 14:47:08 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot3@jhowes-HPT <192.168.50.1:54599>
...
024 (008.000.000) 04/16 14:47:08 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot3@jhowes-HPT, rescheduling job
...
001 (008.000.000) 04/16 14:52:08 Job executing on host: <192.168.50.1:54599>
...
022 (008.000.000) 04/16 14:52:08 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot3@jhowes-HPT <192.168.50.1:54599>
...
024 (008.000.000) 04/16 14:52:08 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot3@jhowes-HPT, rescheduling job
...
001 (008.000.000) 04/16 14:57:08 Job executing on host: <192.168.50.1:54599>
...
022 (008.000.000) 04/16 14:57:08 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot3@jhowes-HPT <192.168.50.1:54599>
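To confirm the five-minute cycle without reading the file by eye, the "Job executing" (001) events can be pulled out of the job log and the gaps between them computed. This is a rough Python sketch, assuming the event header format shown in the excerpt above. The names `parse_events` and `execute_intervals` are hypothetical, and because the log lines carry no year, the year defaults to 2012 (taken from the date of this thread) as an assumption.

```python
import re
from datetime import datetime

# Event header as it appears in the excerpt, e.g.
# "001 (008.000.000) 04/16 14:47:07 Job executing on host: ..."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)


def parse_events(text, year=2012):
    """Return (code, cluster, proc, timestamp, message) for each event header.

    The log has no year field; `year` is an assumption supplied by the caller.
    """
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line.strip())
        if not m:
            continue  # continuation lines and "..." separators
        when = datetime.strptime(
            "%d/%s" % (year, m.group(5)), "%Y/%m/%d %H:%M:%S"
        )
        events.append(
            (m.group(1), int(m.group(2)), int(m.group(3)), when, m.group(6))
        )
    return events


def execute_intervals(events):
    """Seconds between consecutive 'Job executing' (001) events."""
    starts = [e[3] for e in events if e[0] == "001"]
    return [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]
```

Run against the excerpt above, the gaps between 14:47:07, 14:52:08 and 14:57:08 come out at roughly 300 seconds each, which matches the five-minute rescheduling described.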
Best,
John L. (Jack) Howes
On 16.04.2012 09:21, Hermann Fuchs wrote:
Hi
You should have a look at the Negotiator log on the master server as well as the StartLog on the execute node.
I had a similar case where the master matched the job while the execute node rejected it for some reason. Then the master matched it again, the execute node rejected it again, and so on...
The message
    Request has not yet been considered by the matchmaker.
means in this case that after the master matches a job, it forgets all about it. If the job comes back again (e.g. because it was rejected by the execute node a split second later), the master thinks it is a new job.
Cheers,
Hermann
On Mon, 2012-04-16 at 08:16 -0400, jhowes@xxxxxxxxxxxxxxxx wrote:
I am looking for some help debugging something that seems like it should work without trouble [but does not for me].
I set up a personal Condor on my laptop under Win7 with no trouble and also added the extra configuration to enable RunAsOwner. I tested it with a simple Perl script job and it works as expected.
The next step was to add another node to create a real pool. So I installed the same version (7.6.6) on the desktop. I just used the MSI installer and pointed it at my laptop as the pool's central manager. I also added the credd changes to the config to allow RunAsOwner. Both machines are quad-cores running 64-bit Win7.
But jobs just sit in the queue, with no difference in behavior whether or not I submit with RunAsOwner.
condor_status looks right: two machines with four slots each. The daemons that are running look right, and nothing jumps out at me in the logs.
It seems like this should be dead simple, but I am stuck. Any insight into where to look would be appreciated.
_____________________________________________________________________________________
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Users\jhowes>condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@HPTlaptop    WINNT61    X86_64 Unclaimed Idle     0.070   973  0+00:15:04
slot2@HPTlaptop    WINNT61    X86_64 Unclaimed Idle     0.000   973  0+00:14:46
slot3@HPTlaptop    WINNT61    X86_64 Unclaimed Idle     0.000   973  0+00:15:06
slot4@HPTlaptop    WINNT61    X86_64 Unclaimed Idle     0.000   973  0+00:15:07
slot1@jhowes-HPT   WINNT61    X86_64 Unclaimed Idle     0.090  2026  0+00:35:31
slot2@jhowes-HPT   WINNT61    X86_64 Unclaimed Idle     0.000  2026  0+00:34:40
slot3@jhowes-HPT   WINNT61    X86_64 Unclaimed Idle     0.000  2026  0+00:35:33
slot4@jhowes-HPT   WINNT61    X86_64 Unclaimed Idle     0.000  2026  0+00:35:34
                     Total Owner Claimed Unclaimed Matched Preempting Backfill
      X86_64/WINNT61     8     0       0         8       0          0        0
               Total     8     0       0         8       0          0        0
C:\Users\jhowes>condor_q
-- Submitter: jhowes-HPT : <192.168.50.1:52258> : jhowes-HPT
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   7.0   jhowes          4/16 07:49   0+00:00:00 I  0   0.0  TimeStamp.pl
1 jobs; 1 idle, 0 running, 0 held
C:\Users\jhowes>condor_q -analyze
-- Submitter: jhowes-HPT : <192.168.50.1:52258> : jhowes-HPT
---
007.000: Request has not yet been considered by the matchmaker.
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
--
-------------
DI Hermann Fuchs
Christian Doppler Laboratory for Medical Radiation Research for Radiation Oncology
Department of Radiation Oncology
Medical University Vienna
Währinger Gürtel 18-20
A-1090 Wien
Tel. + 43 / 1 / 40 400 7271
Mail. hermann.fuchs@xxxxxxxxxxxxxxxx