
Re: [HTCondor-users] issues with condor_q and imports from jobs



I was running condor_q on the wrong machine. When I run it on the
master it works.
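
For anyone who hits the same CEDAR:6001 error, here is a rough sketch
of a plain TCP check that can be run from the machine in question. It
only tests whether something accepts connections at the address
condor_q printed; it does not speak the HTCondor protocol. The
function name can_connect is my own, and Python 3 is assumed.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or no route.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With the address from the error message above, the call would be
can_connect("192.168.10.17", 1571). A False result from one host and
True from another suggests a network or firewall difference between
them rather than a schedd problem.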

As for the other issue, the sporadic import errors: yes, they are
happening on both execute machines.

On Thu, Nov 29, 2018 at 1:05 PM John M Knoeller
<johnkn@xxxxxxxxxxx> wrote:
>
> CEDAR:6001:Failed to connect to <192.168.10.17:1571>
>
> If the schedd is actually listening at that address and port, then "Failed to connect" is almost certainly because of a firewall or router and not because of HTCondor configuration.
>
> does
>
>     condor_status -schedd -af Name MyAddress
>
> show the address above?
>
> Does condor_q work when you run it on the machine that is running the Schedd?
>
> You can have a look at the SchedLog to see if it is actively refusing the connection, but I don't think you will see anything. If the problem were the HTCondor configuration causing the Schedd to refuse the command, I would expect a different error message from condor_q.
>
> You don't say what version of HTCondor is being used, but I'm assuming that this is an older version because starting with 8.6, the default is to use shared port, in which case the port would be 9618, and not 1571 above.
>
> As for the import failures, is there a shared file system? That could result in intermittent errors. Or perhaps a slow-motion disk failure? Do all of the failures happen on one of the execute machines, or do they happen on both?
>
> -tj
>
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Larry Martell
> Sent: Thursday, November 29, 2018 6:15 AM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] issues with condor_q and imports from jobs
>
> I have a condor deployment that I set up at a customer site 9 months
> ago. It has a master and 2 execute hosts. All was well and my customer
> was happily running jobs. Now they have called me and said many jobs
> are sporadically failing. I took a look and the first thing I noticed
> was that condor_q does not work. On the master I get:
>
> -- Failed to fetch ads from: <192.168.10.17:1571> : liszt
> CEDAR:6001:Failed to connect to <192.168.10.17:1571>
>
> condor_status works and systemctl status condor reports all is well. I
> tried restarting condor on all hosts but still get the same error.
> None of the configs appear to have been changed.
>
> Next, I looked at the job failures. The jobs they run are all the same
> program and they are invoked using the python interface. The jobs are
> python scripts. They run thousands of them every night. On any given
> night some will fail with an import error on a module. The module
> being imported does exist, and it clearly can be imported, as some
> jobs work and some do not and they are all the same code.
>
> Anyone have any thoughts as to what can be going on and/or how I can
> debug this more?
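
One way to gather more detail on the sporadic import failures
described above is to wrap the failing import so that each failure
records which host it ran on and whether every sys.path entry was
visible at that moment. This is only a sketch: the helper name
import_with_diagnostics is made up for illustration, Python 3 is
assumed, and the module name to pass in is whatever module the jobs
fail on.

```python
import importlib
import os
import socket
import sys
import traceback


def import_with_diagnostics(module_name, log=sys.stderr):
    """Import module_name; on failure, log host and path details, then re-raise."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        print("import of %r failed on host %s"
              % (module_name, socket.gethostname()), file=log)
        print("cwd: %s" % os.getcwd(), file=log)
        for entry in sys.path:
            # An empty entry means the current directory. An entry that does
            # not exist at import time hints at a mount or file-system problem
            # rather than a problem in the job's code.
            exists = os.path.exists(entry or os.curdir)
            print("sys.path entry %r exists=%s" % (entry, exists), file=log)
        traceback.print_exc(file=log)
        raise
```

Because the exception is re-raised, the job still fails as before; the
extra lines simply land in the job's stderr file, where failures on
different execute hosts can be compared.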