Hi All,
(Apologies if you receive multiple copies of this post. The
camgrid-users mailing list appears to be blocking another of my email
addresses.)
We currently run several pools (all Linux) with v7.0.5 and are looking
to upgrade piecemeal to v7.2.2. Encouraged by the entry in section 8.2
of the v7.2.2 manual, namely "We believe that Condor 7.2.x and 7.0.x
are wire-compatible, and can be freely mixed between computers in a
Condor pool.", we've been testing upgrades on some machines. However,
we're seeing jobs rejected when the schedd is running 7.0.5 and the
startd is running 7.2.2. No other changes have been made, i.e. the
configuration files have remained the same.
Before I paste in the relevant parts of the log files, a bit of
background: many of our machines have multiple IP addresses, but Condor
is forced to operate on one specific address, selected by the
NETWORK_INTERFACE value in the machine's condor_config.local file. This
address is always a "private" (RFC 1918) address in the range
172.24.xxx.xxx.
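To make the intended selection concrete, here is a minimal Python sketch of the rule we expect NETWORK_INTERFACE to enforce: of all the addresses a host has, only the one in 172.24.0.0/16 should be used. The function name `pick_condor_address` and the example public address 192.0.2.10 (a documentation placeholder) are assumptions for illustration; nothing here is part of Condor.

```python
import ipaddress

# Private range used by the pools, taken from the description above.
CONDOR_NET = ipaddress.ip_network("172.24.0.0/16")

def pick_condor_address(addresses):
    """Return the single address inside CONDOR_NET (hypothetical helper).

    Raises ValueError if zero or several addresses fall in the range,
    since the expected address would then be ambiguous.
    """
    matches = [a for a in addresses if ipaddress.ip_address(a) in CONDOR_NET]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one address in {CONDOR_NET}, got {matches}"
        )
    return matches[0]

# A dual-homed execute host: one public placeholder address, one private.
chosen = pick_condor_address(["192.0.2.10", "172.24.116.4"])
```

On a dual-homed host, the address this picks is the one that should appear in the NETWORK_INTERFACE setting and in every connection Condor makes.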
Here's an example. The submit host has IP address 172.24.252.25 only,
whereas the execute host has two addresses: 131.111.xxx.xxx (which
should *not* be used by Condor) and 172.24.116.4 (which should). So,
here's the SchedLog from the submit host when both the submit and
execute hosts are running 7.0.5 (job completes correctly):
4/20 17:45:08 Using config source: /etc/condor/condor_config
4/20 17:45:08 Using local config sources:
4/20 17:45:08 /usr/local/condor/local/condor_config.local
4/20 17:45:08 /usr/local/condor/local/condor_config.flocking
4/20 17:45:08 DaemonCore: Command Socket at <172.24.252.25:13743>
4/20 17:45:08 Initializing a VANILLA shadow for job 8.0
4/20 17:45:08 (8.0) (3799): Request to run on <172.24.116.4:9692> was ACCEPTED
4/20 17:45:09 (8.0) (3799): ZKM: setting default map to (null)
4/20 17:45:09 (8.0) (3799): Job 8.0 terminated: exited with status 0
4/20 17:45:09 (8.0) (3799): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
Now the corresponding relevant snippet for when the execute host has
been upgraded to 7.2.2 (job fails as file transfer does not take place):
4/18 06:19:52 Using config source: /etc/condor/condor_config
4/18 06:19:52 Using local config sources:
4/18 06:19:52 /usr/local/condor/local/condor_config.local
4/18 06:19:52 /usr/local/condor/local/condor_config.flocking
4/18 06:19:52 DaemonCore: Command Socket at <172.24.252.25:14228>
4/18 06:19:52 Initializing a VANILLA shadow for job 6.0
4/18 06:19:52 (6.0) (3719): Request to run on <172.24.116.4:9668> was ACCEPTED
4/18 06:19:52 (6.0) (3719): DaemonCore: PERMISSION DENIED to unknown user from host <131.111.xxx.xxx:9633> for command 61000 (FILETRANS_UPLOAD), access level WRITE
4/18 06:19:52 (6.0) (3719): ERROR "Error from starter on XXXX.escience.cam.ac.uk: Failed to transfer files" at line 649 in file pseudo_ops.C
It would appear that in 7.2.2 Condor is trying to use an interface on
the execute host other than the one nominated in NETWORK_INTERFACE (in
this case the canonical, globally routable address). Is there any
reason why this has changed from 7.0.5? And is there any way of getting
7.2.2 to conform to the desired 7.0.5 behaviour?
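For anyone wanting to check their own logs for the same symptom, here is a rough Python sketch that scans SchedLog lines for PERMISSION DENIED entries whose peer address lies outside the 172.24.0.0/16 range. It assumes each log entry sits on a single line in the format shown above; the function name `unexpected_denials` and the sample address 192.0.2.10 (a documentation placeholder, not our real address) are made up for illustration.

```python
import ipaddress
import re

# Network the pool is expected to use, per the description above.
EXPECTED_NET = ipaddress.ip_network("172.24.0.0/16")

# Captures the peer IP and port from lines such as:
#   ... PERMISSION DENIED to unknown user from host <a.b.c.d:port> ...
DENIED_RE = re.compile(
    r"PERMISSION DENIED .*?from host <(\d+\.\d+\.\d+\.\d+):(\d+)>"
)

def unexpected_denials(lines):
    """Return (ip, port) for denied peers outside EXPECTED_NET (hypothetical helper)."""
    hits = []
    for line in lines:
        match = DENIED_RE.search(line)
        if match is None:
            continue  # not a PERMISSION DENIED entry
        ip = ipaddress.ip_address(match.group(1))
        if ip not in EXPECTED_NET:
            hits.append((str(ip), int(match.group(2))))
    return hits

sample = [
    "4/18 06:19:52 (6.0) (3719): Request to run on <172.24.116.4:9668> was ACCEPTED",
    "4/18 06:19:52 (6.0) (3719): DaemonCore: PERMISSION DENIED to unknown user "
    "from host <192.0.2.10:9633> for command 61000 (FILETRANS_UPLOAD), access level WRITE",
]
denied = unexpected_denials(sample)
```

Any address this reports is one that the execute host used for the connection in place of the interface nominated in NETWORK_INTERFACE.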
Best regards,
Mark