[Condor-users] CCB Server - Client Communication (Condor 7.3.1)
- Date: Thu, 16 Jul 2009 17:15:32 -0500
- From: "Herzfeld, David" <david.herzfeld@xxxxxxxxxxxxx>
- Subject: [Condor-users] CCB Server - Client Communication (Condor 7.3.1)
Hello Condor Group,
We are having trouble running jobs using the new CCB feature in Condor. Our execute nodes run a master and startd behind a NAT (Condor 7.3.1). They connect to a Central Manager with a public address (Condor 7.3.0), which runs a Collector, Negotiator, etc. The machines appear to join the pool correctly: we can see them in condor_status and their status changes appropriately.
However, running a job on the machine only works intermittently. Most of the time we receive the following in the schedd log:
Match record (worker_EEFFCD67D127.domain.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx, 30.0) deleted
07/16 16:42:18 (pid:562) Sent ad to central manager for herzfeldd@xxxxxxxxxxxxxxx
07/16 16:42:18 (pid:562) Sent ad to 1 collectors for herzfeldd@xxxxxxxxxxxxxxx
07/16 16:42:37 (pid:562) Activity on stashed negotiator socket
07/16 16:42:37 (pid:562) Negotiating for owner: herzfeldd@xxxxxxxxxxxxxxx
07/16 16:42:37 (pid:562) Out of servers - 1 jobs matched, 9 jobs idle, 1 jobs rejected
07/16 16:42:37 (pid:562) Failed to send CCB_REQUEST to collector 192.168.10.18:9618:
07/16 16:42:37 (pid:562) CCBClient: no more CCB servers to try for requesting reversed connection to startd worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx; giving up.
07/16 16:42:37 (pid:562) Failed to send REQUEST_CLAIM to startd worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx failed
07/16 16:42:37 (pid:562) Match record (worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx, 30.0) deleted.
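For anyone reading the addresses in the log lines above, here is a minimal Python sketch that splits one of them into its parts. The layout `<host:port?CCBID=ccb_host:ccb_port#id>` is inferred only from these log lines, and the function and field names are hypothetical, not part of Condor.

```python
import re

# Assumed layout, inferred from the schedd log lines only:
#   <host:port?CCBID=ccb_host:ccb_port#id>
# Field names below are illustrative, not Condor terminology.
CCB_ADDR = re.compile(
    r"<(?P<host>[\d.]+):(?P<port>\d+)"
    r"\?CCBID=(?P<ccb_host>[\d.]+):(?P<ccb_port>\d+)#(?P<ccb_id>\d+)>"
)


def parse_ccb_address(text):
    """Return the first CCB-style address found in text as a dict, or None."""
    match = CCB_ADDR.search(text)
    if match is None:
        return None
    parts = match.groupdict()
    # Convert the numeric fields so they can be compared or sorted.
    for key in ("port", "ccb_port", "ccb_id"):
        parts[key] = int(parts[key])
    return parts


if __name__ == "__main__":
    sample = "<10.0.2.15:52446?CCBID=192.168.10.18:9618#217>"
    print(parse_ccb_address(sample))
```

Run over the schedd log, this makes it easier to see which address the startd advertises for itself and which collector address the schedd contacts for the reversed connection.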
The log files on the execute host show nothing unusual: no jobs are being rejected, and nothing indicates that any sort of communication failure has occurred. The ALLOW_DAEMON line on the execute host is set to *.
Sometimes a series of jobs is able to run (usually right after the execute node joins the pool). Any help with this would be greatly appreciated.
Many thanks,
David