
Re: [Condor-users] Getting mad trying to flocking in condor



Hi Jaime,

Thanks so much for your reply.

2012/7/6 Jaime Frey <jfrey@xxxxxxxxxxx>
On Jul 4, 2012, at 2:57 AM, Michell Guzman Cancimance wrote:

I'm going mad trying to flock a job from cluster A (master.cluster.org, 172.18.0.2) to cluster B (cl-master.mycluster.org, 178.12.100.2).
Each cluster has a master and two worker nodes; the nodes in cluster A have arch X86_64, and the nodes in cluster B have arch INTEL (32-bit).
I have configured the flocking section of condor_config on the master node of each cluster (master.cluster.org and cl-master.mycluster.org), following the steps in (http://research.cs.wisc.edu/condor/manual/v6.8/5_2Connecting_Condor.html). When I run a job in each cluster separately it works fine, but when I submit a job that requires arch INTEL to cluster A (whose nodes are X86_64), expecting it to flock to cluster B, it doesn't work.
I have tried a lot of things without success. I would appreciate any help in solving this problem.

Here are a couple things to try:

* Ensure you are setting should_transfer_files and when_to_transfer_output in your submit file, so that Condor isn't restricting the jobs to run only on machines that have the same shared filesystem.

Ok, checked.
 
* Run 'condor_status -submitters' on cluster B and see if there's an ad from the schedd on cluster A. If it's there, then the schedd from A is successfully flocking to B, though possibly not getting any matches.

I have run the command, and the job is successfully flocking to B, but it remains idle.

Name                  Machine      Running   IdleJobs   HeldJobs

vagrant@xxxxxxxxxx    master.cl          0          1          0

                      RunningJobs   IdleJobs   HeldJobs

vagrant@xxxxxxxxxx              0          1          0

             Total              0          1          0

* If A is flocking to B but not getting any matches, run 'condor_q -analyze -pool cl-master.mycluster.org' to see if any job or machine requirements are preventing a match.

I have run the command, and this is the result:



-- Submitter: master.cluster.org : <10.0.2.15:60867> : master.cluster.org
---
005.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      2 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 match but are currently offline
      0 are available to run your job
    No successful match recorded.
    Last failed match: Sat Jul  7 01:48:53 2012
    Reason for last match failure: no match found

The following attributes are missing from the job ClassAd:

CheckpointPlatform

 

* Check that you have FLOCK_TO set on A and FLOCK_FROM set on B.

Ok. Checked.


The job I'm running executes a program and writes its output to a text file. I have checked my SchedLog file, and I don't think this is a matching problem.

I hope you can help me figure out what's going on.

Best regards
Michell


SchedLog file

07/07/12 01:46:52 (pid:993) Match record (slot2@xxxxxxxxxxxxxxxxx <178.12.100.4:40086> for vagrant, 5.0) deleted
07/07/12 01:46:53 (pid:993) Activity on stashed negotiator socket: <172.18.1.2:51734>
07/07/12 01:46:53 (pid:993) Using negotiation protocol: NEGOTIATE
07/07/12 01:46:53 (pid:993) Negotiating for owner: vagrant@xxxxxxxxxxx
07/07/12 01:46:53 (pid:993) Finished negotiating for vagrant in local pool: 0 matched, 1 rejected
07/07/12 01:47:13 (pid:993) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
07/07/12 01:47:15 (pid:993) Failed to start non-blocking update to unknown.
07/07/12 01:47:15 (pid:993) Sent ad to central manager for vagrant@xxxxxxxxxxx
07/07/12 01:47:15 (pid:993) Sent ad to 1 collectors for vagrant@xxxxxxxxxxxxx
07/07/12 01:47:26 (pid:993) Failed to start non-blocking update to unknown.
07/07/12 01:47:52 (pid:993) Activity on stashed negotiator socket: <178.12.100.2:48483>
07/07/12 01:47:52 (pid:993) Using negotiation protocol: NEGOTIATE
07/07/12 01:47:52 (pid:993) Negotiating for owner: vagrant@xxxxxxxxxxx (flock level 2, pool 178.12.100.2)
07/07/12 01:47:52 (pid:993) Finished negotiating for vagrant in pool 178.12.100.2: 1 matched, 0 rejected
07/07/12 01:47:52 (pid:993) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot1@xxxxxxxxxxxxxxx <178.12.100.3:42322> for vagrant.
07/07/12 01:47:52 (pid:993) IO: Failed to read packet header
07/07/12 01:47:52 (pid:993) Response problem from startd when requesting claim slot1@xxxxxxxxxxxxxxxxx <178.12.100.3:42322> for vagrant 5.0.
07/07/12 01:47:52 (pid:993) Failed to send REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxx <178.12.100.3:42322> for vagrant: CEDAR:6004:failed reading from socket
07/07/12 01:47:52 (pid:993) Match record (slot1@xxxxxxxxxxxxxxxxx <178.12.100.3:42322> for vagrant, 5.0) deleted
07/07/12 01:47:53 (pid:993) Activity on stashed negotiator socket: <172.18.1.2:51734>
07/07/12 01:47:53 (pid:993) Using negotiation protocol: NEGOTIATE
07/07/12 01:47:53 (pid:993) Negotiating for owner: vagrant@xxxxxxxxxxx
07/07/12 01:47:53 (pid:993) Finished negotiating for vagrant in local pool: 0 matched, 1 rejected
07/07/12 01:48:52 (pid:993) Activity on stashed negotiator socket: <178.12.100.2:48483>
07/07/12 01:48:52 (pid:993) Using negotiation protocol: NEGOTIATE
07/07/12 01:48:52 (pid:993) Negotiating for owner: vagrant@xxxxxxxxxxx (flock level 2, pool 178.12.100.2)
07/07/12 01:48:52 (pid:993) Finished negotiating for vagrant in pool 178.12.100.2: 1 matched, 0 rejected
07/07/12 01:48:52 (pid:993) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot1@xxxxxxxxxxxxxxxxx <178.12.100.4:40086> for vagrant.
07/07/12 01:48:52 (pid:993) IO: Failed to read packet header
07/07/12 01:48:52 (pid:993) Response problem from startd when requesting claim slot1@xxxxxxxxxxxxxxxxx <178.12.100.4:40086> for vagrant 5.0.
07/07/12 01:48:52 (pid:993) Failed to send REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxx <178.12.100.4:40086> for vagrant: CEDAR:6004:failed reading from socket
07/07/12 01:48:52 (pid:993) Match record (slot1@xxxxxxxxxxxxxxxxx <178.12.100.4:40086> for vagrant, 5.0) deleted
07/07/12 01:48:53 (pid:993) Activity on stashed negotiator socket: <172.18.1.2:51734>
07/07/12 01:48:53 (pid:993) Using negotiation protocol: NEGOTIATE
07/07/12 01:48:53 (pid:993) Negotiating for owner: vagrant@xxxxxxxxxxx
07/07/12 01:48:53 (pid:993) Finished negotiating for vagrant in local pool: 0 matched, 1 rejected
07/07/12 01:49:52 (pid:993) Activity on stashed negotiator socket: <178.12.100.2:48483>
07/07/12 01:49:52 (pid:993) Using negotiation protocol: NEGOTIATE
07/07/12 01:49:52 (pid:993) Negotiating for owner: vagrant@xxxxxxxxxxx (flock level 2, pool 178.12.100.2)
07/07/12 01:49:52 (pid:993) Finished negotiating for vagrant in pool 178.12.100.2: 1 matched, 0 rejected
07/07/12 01:49:52 (pid:993) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot2@xxxxxxxxxxxxxxxxx <178.12.100.4:40086> for vagrant.
07/07/12 01:49:52 (pid:993) IO: Failed to read packet header
07/07/12 01:49:52 (pid:993) Response problem from startd when requesting claim slot2@xxxxxxxxxxxxxxxxx <178.12.100.4:40086> for vagrant 5.0.
07/07/12 01:49:52 (pid:993) Failed to send REQUEST_CLAIM to startd slot2@xxxxxxxxxxxxxxxxx <178.12.100.4:40086> for vagrant: CEDAR:6004:failed reading from socket
07/07/12 01:49:52 (pid:993) Match record (slot2@xxxxxxxxxxxxxxxxx <178.12.100.4:40086> for vagrant, 5.0) deleted
07/07/12 01:49:53 (pid:993) Activity on stashed negotiator socket: <172.18.1.2:51734>
07/07/12 01:49:53 (pid:993) Using negotiation protocol: NEGOTIATE
07/07/12 01:49:53 (pid:993) Negotiating for owner: vagrant@xxxxxxxxxxx
07/07/12 01:49:53 (pid:993) Finished negotiating for vagrant in local pool: 0 matched, 1 rejected
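
To make the pattern in the log above easier to see, here is a minimal Python sketch that counts the failed claim requests per startd address. The function name count_claim_failures and the regular expression are illustrative assumptions, not part of Condor; the pattern only assumes the "Failed to send REQUEST_CLAIM" line format shown in this excerpt.

```python
import re
from collections import Counter

# Assumed line format, taken from the SchedLog excerpt above:
#   ... Failed to send REQUEST_CLAIM to startd <slot> <ip:port> for <user>: ...
CLAIM_FAILURE = re.compile(
    r"Failed to send REQUEST_CLAIM to startd (?P<slot>\S+) <(?P<addr>[\d.]+:\d+)>"
)


def count_claim_failures(lines):
    """Return a Counter mapping each startd address (ip:port) to its failed claims."""
    failures = Counter()
    for line in lines:
        match = CLAIM_FAILURE.search(line)
        if match:
            failures[match.group("addr")] += 1
    return failures


# Two lines in the same format as the log above (slot host names are masked there).
sample = [
    "07/07/12 01:47:52 (pid:993) Failed to send REQUEST_CLAIM to startd "
    "slot1@host <178.12.100.3:42322> for vagrant: CEDAR:6004:failed reading from socket",
    "07/07/12 01:47:53 (pid:993) Finished negotiating for vagrant in local pool: "
    "0 matched, 1 rejected",
]
print(count_claim_failures(sample))
```

Run over the excerpt above, this would count one failed claim for 178.12.100.3:42322 and two for 178.12.100.4:40086. Each one follows a successful match in pool 178.12.100.2, which is why this looks like a problem at the claim step rather than at matchmaking.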

 
Thanks and regards,
Jaime Frey
UW-Madison Condor Team






--
"Nullius addictus jurare in verba magistri"