Hi all,
I'm having trouble debugging a cluster that is meant to run MPI jobs. The
jobs are failing in the sshd.sh script that ships with Condor, with the
following in the job's stderr:
chirp: couldn't putfile: No such file or directory
/usr/libexec/condor/sshd.sh: line 69: 23981 Aborted
$CONDOR_CHIRP put -perm 0700 $idkey $_CONDOR_REMOTE_SPOOL_DIR/$_CONDOR_PROCNO.key
Tracing the relevant processes, I see chirp send the following to the
starter:
"putfile /var/spool/condor/astro/30/0/cluster30.proc0.subproc0/1.key 448 1675"
The starter then sends
"\1\0\0\0S\0\0\0\0\0\0\1&var/spool/condor/astro/32/0/cluster32.proc0.subproc0/0.key\0\0\0\0\0\0\0\1\300\0\0\0\0\0\0\6\213"
and gets "\1\0\0\0\20" and
"\377\377\377\377\377\377\377\377\0\0\0\0\0\0\0\2" from the shadow, and
then writes "-3" to chirp, at which point chirp fails.
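
As a sanity check on the trace above, the octal-escaped bytes can be decoded by hand. The sketch below is only my reading of the strace output, not something taken from Condor documentation: it assumes the trailing fields in each message are big-endian signed 64-bit integers. The variable names are made up for illustration.

```python
import errno
import struct

# Bytes copied from the strace output above (strace prints octal escapes).
# ASSUMPTION: each field is a big-endian signed 64-bit integer.
mode_bytes = b"\0\0\0\0\0\0\1\300"   # field following the path sent by the starter
size_bytes = b"\0\0\0\0\0\0\6\213"   # last field sent by the starter
reply_bytes = b"\377\377\377\377\377\377\377\377\0\0\0\0\0\0\0\2"  # from the shadow

(mode,) = struct.unpack(">q", mode_bytes)
(size,) = struct.unpack(">q", size_bytes)
rval, err = struct.unpack(">qq", reply_bytes)

print("mode:", mode, oct(mode))
print("size:", size)
print("reply:", rval, err, errno.errorcode.get(err))
```

Under that assumption, the two values sent by the starter match the "448 1675" in the chirp putfile line (448 decimal is 0700 octal), and the shadow's reply decodes to -1 followed by 2. On Linux errno 2 is ENOENT, which would be consistent with chirp reporting "No such file or directory".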
In the shadow log I'm getting errors like:
ERROR "Error from slot2@xxxxxxxxxxxxxxxxxxxxx: File
var/spool/condor/astro/25/0/cluster25.proc0.subproc0/contact maps to url
1320272782, which I don't know how to open.
Stracing it shows that it tries to open "var/spool/...etc..." without a
leading forward slash and fails (not sure if this matters).
I've checked the obvious (to me) things, such as permissions on the spool
directory, and they look OK. Any help would be greatly appreciated.
Thanks,
William Strecker-Kellogg
_______________________________________________
Condor-users mailing list