
[Condor-users] OSX Submit Woes




More OSX Woes

If I submit Condor jobs from a Mac in our Condor pool, no jobs execute
on the host from which they were submitted.  Condor makes matches
for this host, but the machine never proceeds from Matched to Claimed.

The SchedLog on the submit host shows:

  1/26 17:23:11 condor_write(): Socket closed when trying to write
    buffer
  1/26 17:23:11 Buf::write(): condor_write() failed
  1/26 17:23:11 Couldn't send eom to startd.
  1/26 17:23:11 Sent RELEASE_CLAIM to startd on <130.60.165.121:50543>
  1/26 17:23:11 Match record (<130.60.165.121:50543>, 35, 2) deleted
  1/26 17:23:11 condor_read(): recv() returned -1, errno = 54, assuming
    failure.
  1/26 17:23:11 Response problem from startd.
  1/26 17:23:11 Sent RELEASE_CLAIM to startd on <130.60.165.121:50543>
  1/26 17:23:11 Match record (<130.60.165.121:50543>, 35, 3) deleted

The StartLog on the submit host shows:

  1/26 17:23:11 DaemonCore: PERMISSION DENIED to unknown user from host
    <130.60.165.121:50559> for command 442 (REQUEST_CLAIM)
  1/26 17:23:11 DaemonCore: PERMISSION DENIED to unknown user from host
    <130.60.165.121:52145> for command 443 (RELEASE_CLAIM)
  1/26 17:23:11 DaemonCore: PERMISSION DENIED to unknown user from host
    <130.60.165.121:50560> for command 442 (REQUEST_CLAIM)
  1/26 17:23:11 DaemonCore: PERMISSION DENIED to unknown user from host
    <130.60.165.121:52146> for command 443 (RELEASE_CLAIM)
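
To make the pattern in the StartLog excerpt easier to see, here is a minimal Python sketch that tallies the PERMISSION DENIED lines by source address and command name. The regular expression is inferred only from the lines quoted above, and the function name summarize_denials and the SAMPLE text are illustrative assumptions, not part of Condor. It rejoins mail-wrapped lines before matching.

```python
import re
from collections import Counter

# Pattern inferred from the quoted StartLog lines (an assumption, not a
# documented Condor log format).
DENIED = re.compile(
    r"PERMISSION DENIED to (?P<user>.+?) from host "
    r"<(?P<ip>[\d.]+):(?P<port>\d+)> "
    r"for command (?P<num>\d+) \((?P<name>\w+)\)"
)


def summarize_denials(log_text):
    """Count denied commands per (source IP, command name)."""
    # Collapse all whitespace so entries wrapped across lines match again.
    flat = " ".join(log_text.split())
    return Counter((m["ip"], m["name"]) for m in DENIED.finditer(flat))


# Sample copied from the StartLog excerpt above.
SAMPLE = """\
1/26 17:23:11 DaemonCore: PERMISSION DENIED to unknown user from host
    <130.60.165.121:50559> for command 442 (REQUEST_CLAIM)
1/26 17:23:11 DaemonCore: PERMISSION DENIED to unknown user from host
    <130.60.165.121:52145> for command 443 (RELEASE_CLAIM)
1/26 17:23:11 DaemonCore: PERMISSION DENIED to unknown user from host
    <130.60.165.121:50560> for command 442 (REQUEST_CLAIM)
1/26 17:23:11 DaemonCore: PERMISSION DENIED to unknown user from host
    <130.60.165.121:52146> for command 443 (RELEASE_CLAIM)
"""

if __name__ == "__main__":
    for (ip, command), count in sorted(summarize_denials(SAMPLE).items()):
        print(f"{ip}  {command}  {count}")
```

On this excerpt every denied request comes from 130.60.165.121, the same address the SchedLog reports for the startd it is trying to claim. So the startd on the submit host appears to be refusing its own schedd.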

The complete relevant sections of the log files are included below.

Our pool has just three hosts:

Mite: x86_64 master
Mote: OSX (the submit host in this example)
Speck: OSX (the other Mac in this example)

If I submit from the master, which is an x86_64 Linux host, jobs run on
mote.  If I submit on speck, jobs run on mote, but not on speck.

We're running Condor 6.6.10 on the AMD and PPC boxes.  The Macs run
OSX 10.4.4 with all patches applied.  The Linux box runs SuSE 9.3 with
the 2.6.11.4-21.8-smp kernel.

We're using NIS for authentication and NFS via the automounter for
shared filesystem access.

Any ideas?

Thanks,
        Chance


--
Chance Reschke
University of Zurich
Institute for Theoretical Physics
044 63 56192




======================================================================== =

condor_status snippet immediately after submitting on MOTE

vm1@xxxxxxxxx OSX   PPC    Matched    Idle  0.000  4000  0+00:00:03
vm2@xxxxxxxxx OSX   PPC    Matched    Idle  0.000  4000  0+00:00:04
vm1@xxxxxxxxx OSX   PPC    Claimed    Busy  0.250  4000  0+00:00:02
vm2@xxxxxxxxx OSX   PPC    Claimed    Busy  0.000  4000  0+00:00:03

condor_status snippet a couple of minutes later

vm1@xxxxxxxxx OSX   PPC    Unclaimed  Idle  0.000  4000  0+00:00:47
vm2@xxxxxxxxx OSX   PPC    Unclaimed  Idle  0.000  4000  0+00:00:48
vm1@xxxxxxxxx OSX   PPC    Claimed    Busy  0.100  4000  0+00:02:45
vm2@xxxxxxxxx OSX   PPC    Claimed    Busy  0.000  4000  0+00:02:43

======================================================================== =
NegotiatorLog on Master

1/26 17:23:08 ---------- Started Negotiation Cycle ----------
1/26 17:23:08 Phase 1:  Obtaining ads from collector ...
1/26 17:23:08   Getting all public ads ...
1/26 17:23:08   Sorting 20 ads ...
1/26 17:23:08   Getting startd private ads ...
1/26 17:23:08 Got ads: 20 public and 12 private
1/26 17:23:08 Public ads include 2 submitter, 12 startd
1/26 17:23:08 Phase 2:  Performing accounting ...
1/26 17:23:08 Phase 3:  Sorting submitter ads by priority ...
1/26 17:23:08 Phase 4.1:  Negotiating with schedds ...
1/26 17:23:08   Negotiating with reschke@xxxxxxxxxxxxxxx at
<130.60.165.121:50544>
1/26 17:23:09     Request 00035.00000:
1/26 17:23:09       Matched 35.0 reschke@xxxxxxxxxxxxxxx
<130.60.165.121:50544> preempting none <130.60.165.122:52158>
1/26 17:23:09       Successfully matched with vm1@xxxxxxxxxxx
1/26 17:23:09     Request 00035.00001:
1/26 17:23:09       Matched 35.1 reschke@xxxxxxxxxxxxxxx
<130.60.165.121:50544> preempting none <130.60.165.122:52158>
1/26 17:23:09       Successfully matched with vm2@xxxxxxxxxxx
1/26 17:23:09     Request 00035.00002:
1/26 17:23:09       Matched 35.2 reschke@xxxxxxxxxxxxxxx
<130.60.165.121:50544> preempting none <130.60.165.121:50543>
1/26 17:23:09       Successfully matched with vm1@xxxxxxxxxx
1/26 17:23:09     Request 00035.00003:
1/26 17:23:09       Matched 35.3 reschke@xxxxxxxxxxxxxxx
<130.60.165.121:50544> preempting none <130.60.165.121:50543>
1/26 17:23:09       Successfully matched with vm2@xxxxxxxxxx
1/26 17:23:09     Request 00035.00004:
1/26 17:23:09       Rejected 35.4 reschke@xxxxxxxxxxxxxxx
<130.60.165.121:50544>: no match found
1/26 17:23:10     Got NO_MORE_JOBS;  done negotiating
1/26 17:23:10 ---------- Finished Negotiation Cycle ----------

======================================================================== =
MatchLog on Master

1/26 17:23:09       Matched 35.0 reschke@xxxxxxxxxxxxxxx
<130.60.165.121:50544> preempting none <130.60.165.122:52158>
1/26 17:23:09       Matched 35.1 reschke@xxxxxxxxxxxxxxx
<130.60.165.121:50544> preempting none <130.60.165.122:52158>
1/26 17:23:09       Matched 35.2 reschke@xxxxxxxxxxxxxxx
<130.60.165.121:50544> preempting none <130.60.165.121:50543>
1/26 17:23:09       Matched 35.3 reschke@xxxxxxxxxxxxxxx
<130.60.165.121:50544> preempting none <130.60.165.121:50543>
1/26 17:23:09       Rejected 35.4 reschke@xxxxxxxxxxxxxxx
<130.60.165.121:50544>: no match found

======================================================================== =
SchedLog on Submitter

1/26 17:23:09 DaemonCore: Command received via UDP from host
<130.60.165.121:52141>
1/26 17:23:09 DaemonCore: received command 421 (RESCHEDULE), calling
handler (reschedule_negotiator)
1/26 17:23:10 Sent ad to central manager for reschke@xxxxxxxxxxxxxxx
1/26 17:23:10 Called reschedule_negotiator()
1/26 17:23:10 DaemonCore: Command received via TCP from host
<172.19.2.50:39435>
1/26 17:23:10 DaemonCore: received command 416 (NEGOTIATE), calling
handler (negotiate)
1/26 17:23:10 Negotiating for owner: reschke@xxxxxxxxxxxxxxx
1/26 17:23:10 Checking consistency running and runnable jobs
1/26 17:23:10 Tables are consistent
1/26 17:23:11 Out of servers - 4 jobs matched, 4 jobs idle, 1 jobs
rejected
1/26 17:23:11 condor_write(): Socket closed when trying to write
buffer
1/26 17:23:11 Buf::write(): condor_write() failed
1/26 17:23:11 Couldn't send eom to startd.
1/26 17:23:11 Sent RELEASE_CLAIM to startd on <130.60.165.121:50543>
1/26 17:23:11 Match record (<130.60.165.121:50543>, 35, 2) deleted
1/26 17:23:11 condor_read(): recv() returned -1, errno = 54, assuming
failure.
1/26 17:23:11 Response problem from startd.
1/26 17:23:11 Sent RELEASE_CLAIM to startd on <130.60.165.121:50543>
1/26 17:23:11 Match record (<130.60.165.121:50543>, 35, 3) deleted
1/26 17:23:13 Started shadow for job 35.0 on "<130.60.165.122:52158>",
(shadow pid = 4653)
1/26 17:23:16 Started shadow for job 35.1 on "<130.60.165.122:52158>",
(shadow pid = 4655)
1/26 17:23:17 Sent ad to central manager for reschke@xxxxxxxxxxxxxxx

======================================================================== =
StartLog on Submitter

1/26 17:23:10 DaemonCore: Command received via UDP from host
<172.19.2.50:38254>
1/26 17:23:10 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
1/26 17:23:10 vm1: match_info called
1/26 17:23:10 vm1: Received match <130.60.165.121:50543>#1863951920
1/26 17:23:10 vm1: State change: match notification protocol
successful
1/26 17:23:10 vm1: Changing state: Unclaimed -> Matched
1/26 17:23:10 DaemonCore: Command received via UDP from host
<172.19.2.50:38254>
1/26 17:23:10 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
1/26 17:23:10 vm2: match_info called
1/26 17:23:10 vm2: Received match <130.60.165.121:50543>#1970742417
1/26 17:23:10 vm2: State change: match notification protocol
successful
1/26 17:23:10 vm2: Changing state: Unclaimed -> Matched
1/26 17:23:11 DaemonCore: PERMISSION DENIED to unknown user from host
<130.60.165.121:50559> for command 442 (REQUEST_CLAIM)
1/26 17:23:11 DaemonCore: PERMISSION DENIED to unknown user from host
<130.60.165.121:52145> for command 443 (RELEASE_CLAIM)
1/26 17:23:11 DaemonCore: PERMISSION DENIED to unknown user from host
<130.60.165.121:50560> for command 442 (REQUEST_CLAIM)
1/26 17:23:11 DaemonCore: PERMISSION DENIED to unknown user from host
<130.60.165.121:52146> for command 443 (RELEASE_CLAIM)


======================================================================== =
ShadowLog on Submitter

1/26 17:23:14 ******************************************************
1/26 17:23:14 ** condor_shadow (CONDOR_SHADOW) STARTING UP
1/26 17:23:14 ** /qgd/condor/sbin/condor_shadow
1/26 17:23:14 ** $CondorVersion: 6.6.10 Jun 13 2005 $
1/26 17:23:14 ** $CondorPlatform: PPC-OSX_10_3 $
1/26 17:23:14 ** PID = 4653
1/26 17:23:14 ******************************************************
1/26 17:23:14 Using config file: /qgd/home/condor/condor_config
1/26 17:23:14 Using local config files: /qgd/condor/etc/mote.local
1/26 17:23:14 DaemonCore: Command Socket at <130.60.165.121:50561>
1/26 17:23:15 Initializing a VANILLA shadow
1/26 17:23:15 (35.0) (4653): Request to run on <130.60.165.122:52158>
was ACCEPTED
1/26 17:23:16 ******************************************************
1/26 17:23:16 ** condor_shadow (CONDOR_SHADOW) STARTING UP
1/26 17:23:16 ** /qgd/condor/sbin/condor_shadow
1/26 17:23:16 ** $CondorVersion: 6.6.10 Jun 13 2005 $
1/26 17:23:16 ** $CondorPlatform: PPC-OSX_10_3 $
1/26 17:23:16 ** PID = 4655
1/26 17:23:16 ******************************************************
1/26 17:23:16 Using config file: /qgd/home/condor/condor_config
1/26 17:23:16 Using local config files: /qgd/condor/etc/mote.local
1/26 17:23:16 DaemonCore: Command Socket at <130.60.165.121:50565>
1/26 17:23:17 Initializing a VANILLA shadow
1/26 17:23:18 (35.1) (4655): Request to run on <130.60.165.122:52158>
was ACCEPTED

======================================================================== =