
Re: [HTCondor-users] [HTCondor-CE] Maximum number of established TCP connections



Hi Sang,

Two thoughts:

1.  One possibility is that this CE is very overloaded and unresponsive.  Given that 3.5k jobs is not an enormous number, I suspect this is not the case.  However, you should be able to rule this out quickly by looking at the host-level monitoring.

2.  Is there a firewall (or other state tracking, such as connection tables on a hypervisor) between the CE host and CERN?  The authentication error is a timeout -- often this can be caused by a piece of middleware between the hosts overflowing its max connection limit and, in response, randomly dropping connections.
   - That is, the limit you are hitting may be a firewall or state-table limit rather than a TCP limit per se.
   - You can probably again look at the host-level monitoring to see whether there's a large number of TCP connections (the CE itself doesn't necessarily use a lot, but perhaps other things are hitting the site limits when the cluster scales up?).  See the quick checks sketched below.
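
If it helps, here is a rough sketch of the host-level checks I have in mind for both points above.  The port number (9618) and the conntrack sysctls are assumptions on my part -- a standard shared-port setup and a netfilter/iptables-based firewall -- so adjust them to whatever your site actually runs:

# overall load and memory pressure on the CE host (point 1)
uptime
free -m

# summary of TCP socket counts (point 2)
ss -s

# established connections involving the shared port (9618 assumed)
ss -Htn state established '( sport = :9618 or dport = :9618 )' | wc -l

# same, grouped by peer address, to see who is holding the connections (IPv4-oriented)
ss -Htn state established '( sport = :9618 or dport = :9618 )' | awk '{print $4}' | cut -d: -f1 | sort | uniq -c | sort -rn | head

# conntrack table usage -- only meaningful if a netfilter firewall / conntrack is in the path
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# open-file limit of the running schedd (the pgrep pipeline is just an example; pick the right PID for your setup)
grep 'open files' /proc/$(pgrep -f condor_schedd | head -1)/limits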

Hope this helps, 

Brian

> On Apr 5, 2020, at 9:38 PM, Sang Un Ahn <sahn@xxxxxxxxxxx> wrote:
> 
> Hi Brian,
> 
> It seems that my previous posting was held because of the size of the log files. 
> 
> Anyway, I found an authentication failure in the SchedLog on the CE, as follows:
> <...>
> 04/06/20 09:56:00 (cid:3129983) (D_AUDIT) Command=SPOOL_JOB_FILES_WITH_PERMS, peer=<134.75.125.153:4174>
> 04/06/20 09:56:00 (cid:3129983) (D_AUDIT) Authentication Failed, MethodsTried=FS,GSI
> 04/06/20 09:56:00 (D_ALWAYS) DC_AUTHENTICATE: authentication of <134.75.125.153:4174> did not result in a valid mapped user name, which is required for this command (488 SPOOL_JOB_FILES_WITH_PERMS), so aborting.
> 04/06/20 09:56:00 (D_ALWAYS) DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1006:exceeded 1586134556 deadline during authentication|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXRxdom6)
> 04/06/20 09:56:00 (D_ALWAYS:2) QMGR Connection closed
> <...>
> 
> This also happens for remote submission (from CERN):
> <...>
> 04/06/20 09:56:54 (cid:3130255) (D_AUDIT) Authentication Failed, MethodsTried=FS,GSI
> 04/06/20 09:56:54 (D_ALWAYS) DC_AUTHENTICATE: authentication of <188.184.29.210:31988> did not result in a valid mapped user name, which is required for this command (488 SPOOL_JOB_FILES_WITH_PERMS), so aborting.
> 04/06/20 09:56:54 (D_ALWAYS) DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1006:exceeded 1586134614 deadline during authentication|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXoXQljI)
> 04/06/20 09:56:54 (D_ALWAYS:2) ForkWorker::Fork: New child of 2912963 = 485875
> <...>
> 
> These failures happen only occasionally; most requests authenticate fine, as shown below.
> <...>
> 04/06/20 09:54:16 (cid:3129743) (D_AUDIT) Command=SPOOL_JOB_FILES_WITH_PERMS, peer=<134.75.125.153:30377>
> 04/06/20 09:54:16 (cid:3129743) (D_AUDIT) AuthMethod=GSI, AuthId=/C=KR/O=KISTI/O=KISTI/CN=38191474 Sang Un Ahn,/alice/Role=lcgadmin/Capability=NULL,/alice/Role=NULL/Capability=NULL,/alice/lcg1/Role=NULL/Capability=NULL, CondorId=sgmalit1_01@xxxxxxxxx
> <...>
> 04/06/20 09:56:56 (cid:3130286) (D_AUDIT) Command=SPOOL_JOB_FILES_WITH_PERMS, peer=<137.138.62.91:30028>
> 04/06/20 09:56:56 (cid:3130286) (D_AUDIT) AuthMethod=GSI, AuthId=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=maarten/CN=410032/CN=Maarten Litmaath,/alice/Role=NULL/Capability=NULL,/alice/alarm/Role=NULL/Capability=NULL,/alice/lcg1/Role=NULL/Capability=NULL,/alice/team/Role=NULL/Capability=NULL, CondorId=ali1_01@xxxxxxxxx
> <...>
> 04/06/20 09:57:12 (cid:3130542) (D_AUDIT) Command=SPOOL_JOB_FILES_WITH_PERMS, peer=<188.184.29.210:48864>
> 04/06/20 09:57:12 (cid:3130542) (D_AUDIT) AuthMethod=GSI, AuthId=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=maarten/CN=410032/CN=Maarten Litmaath,/alice/Role=NULL/Capability=NULL,/alice/alarm/Role=NULL/Capability=NULL,/alice/lcg1/Role=NULL/Capability=NULL,/alice/team/Role=NULL/Capability=NULL, CondorId=ali1_01@xxxxxxxxx
> <...>
> 
> And I also found some failures in JobRouterLog:
> 
> [root@alice-t1-ce04 condor-ce]# grep -R claimed /var/log/condor-ce/*
> <...>
> /var/log/condor-ce/JobRouterLog:04/06/20 10:30:00 (D_ALWAYS:2) JobRouter failure (src=135485.0,route=Local_Condor): candidate job could not be claimed by JobRouter because it is already claimed by somebody else.
> /var/log/condor-ce/JobRouterLog:04/06/20 10:30:38 (D_ALWAYS:2) JobRouter failure (src=135529.0,route=Local_Condor): candidate job could not be claimed by JobRouter because it is already claimed by somebody else.
> /var/log/condor-ce/JobRouterLog:04/06/20 10:31:04 (D_ALWAYS:2) JobRouter failure (src=135478.0,route=Local_Condor): candidate job could not be claimed by JobRouter because it is already claimed by somebody else.
> /var/log/condor-ce/JobRouterLog:04/06/20 10:32:13 (D_ALWAYS:2) JobRouter failure (src=135496.0,route=Local_Condor): candidate job could not be claimed by JobRouter because it is already claimed by somebody else.
> /var/log/condor-ce/JobRouterLog:04/06/20 10:33:23 (D_ALWAYS:2) JobRouter failure (src=135489.0,route=Local_Condor): candidate job could not be claimed by JobRouter because it is already claimed by somebody else.
> /var/log/condor-ce/JobRouterLog:04/06/20 10:33:28 (D_ALWAYS:2) JobRouter failure (src=135500.0,route=Local_Condor): candidate job could not be claimed by JobRouter because it is already claimed by somebody else.
> /var/log/condor-ce/JobRouterLog:04/06/20 10:35:22 (D_ALWAYS) JobRouter failure (src=135442.0,route=Local_Condor): candidate job could not be claimed by JobRouter: Encountered problem reading current JobStatus for 135442.0
> /var/log/condor-ce/JobRouterLog:04/06/20 10:35:56 (D_ALWAYS:2) JobRouter failure (src=135511.0,route=Local_Condor): candidate job could not be claimed by JobRouter because it is already claimed by somebody else.
> <...>
> 
> For example, below is a trace of the source job ID "135442.0":
> 
> [root@alice-t1-ce04 condor-ce]# grep -R 135442.0 /var/log/condor-ce/*
> <AuditLog>
> /var/log/condor-ce/AuditLog:04/06/20 09:56:23 (cid:3130223) (D_AUDIT) Submitting new job 135442.0
> /var/log/condor-ce/AuditLog:04/06/20 09:57:12 (cid:3130542) (D_AUDIT) Transferring files for jobs 135442.0
> /var/log/condor-ce/AuditLog:04/06/20 09:57:14 (cid:3130542) (D_AUDIT) Received proxy for job 135442.0
> /var/log/condor-ce/AuditLog:04/06/20 10:09:06 (cid:3138980) (D_AUDIT) Received proxy for job 135442.0
> /var/log/condor-ce/AuditLog:04/06/20 10:09:06 (cid:3138980) (D_AUDIT) Refresh GSI cred for job 135442.0 succeeded
> /var/log/condor-ce/AuditLog:04/06/20 10:29:20 (cid:3143362) (D_AUDIT) Remove jobs 135442.0
> /var/log/condor-ce/AuditLog:04/06/20 10:29:20 (cid:3143362) (D_AUDIT) Finished Remove jobs 135442.0
> <JobRouterLog>
> /var/log/condor-ce/JobRouterLog:04/06/20 10:02:36 (D_ALWAYS:2) JobRouter: Found candidate job src=135442.0,route=Local_Condor
> /var/log/condor-ce/JobRouterLog:04/06/20 10:35:22 (D_ALWAYS) JobRouter failure (src=135442.0,route=Local_Condor): candidate job could not be claimed by JobRouter: Encountered problem reading current JobStatus for 135442.0
> /var/log/condor-ce/JobRouterLog:04/06/20 10:35:22 (D_ALWAYS:2) JobRouter (src=135442.0,route=Local_Condor): Cleaned up and removed routed job.
> /var/log/condor-ce/JobRouterLog:04/06/20 10:35:22 (D_ALWAYS:2) JobRouter (src=135442.0): job mirror synchronized; removing job from internal 'retirement' status
> <SchedLog>
> /var/log/condor-ce/SchedLog.old:04/06/20 09:56:23 (cid:3130223) (D_AUDIT) Submitting new job 135442.0
> /var/log/condor-ce/SchedLog.old:04/06/20 09:56:24 (D_ALWAYS:2) New job: 135442.0
> /var/log/condor-ce/SchedLog.old:04/06/20 09:56:24 (D_ALWAYS:2) New job: 135442.0, Duplicate Keys: 0, Total Keys: 95
> /var/log/condor-ce/SchedLog.old:04/06/20 09:56:27 (D_ALWAYS:2) Skipping periodic expressions for job 135442.0, because hold reason code is '16'
> /var/log/condor-ce/SchedLog.old:04/06/20 09:56:35 (D_ALWAYS:2) Job 135442.0 held for spooling. Checking how long...
> /var/log/condor-ce/SchedLog.old:04/06/20 09:56:35 (D_ALWAYS:2) Attribute StageInStart not set in 135442.0. Set it.
> /var/log/condor-ce/SchedLog.old:04/06/20 09:56:49 (D_ALWAYS:2) Job 135442.0 held for spooling. Checking how long...
> /var/log/condor-ce/SchedLog.old:04/06/20 09:56:49 (D_ALWAYS:2) Job 135442.0 on hold for 14 seconds.
> /var/log/condor-ce/SchedLog.old:04/06/20 09:56:55 (D_ALWAYS:2) Job 135442.0 held for spooling. Checking how long...
> /var/log/condor-ce/SchedLog.old:04/06/20 09:56:55 (D_ALWAYS:2) Job 135442.0 on hold for 20 seconds.
> /var/log/condor-ce/SchedLog.old:04/06/20 09:57:01 (D_ALWAYS:2) Job 135442.0 held for spooling. Checking how long...
> /var/log/condor-ce/SchedLog.old:04/06/20 09:57:01 (D_ALWAYS:2) Job 135442.0 on hold for 26 seconds.
> /var/log/condor-ce/SchedLog.old:04/06/20 09:57:06 (D_ALWAYS:2) Job 135442.0 held for spooling. Checking how long...
> /var/log/condor-ce/SchedLog.old:04/06/20 09:57:06 (D_ALWAYS:2) Job 135442.0 on hold for 31 seconds.
> /var/log/condor-ce/SchedLog.old:04/06/20 09:57:12 (cid:3130542) (D_AUDIT) Transferring files for jobs 135442.0
> /var/log/condor-ce/SchedLog.old:04/06/20 09:57:12 (D_ALWAYS:2) generalJobFilesWorkerThread(): transfer files for job 135442.0
> /var/log/condor-ce/SchedLog.old:GlobalJobId = "alice-t1-ce04.sdfarm.kr#135442.0#1586134583"
> /var/log/condor-ce/SchedLog.old:04/06/20 09:57:14 (cid:3130542) (D_AUDIT) Received proxy for job 135442.0
> /var/log/condor-ce/SchedLog.old:04/06/20 09:57:15 (D_ALWAYS:2) No HoldReasonSubCode found for job 135442.0
> /var/log/condor-ce/SchedLog.old:04/06/20 09:57:15 (D_ALWAYS) Job 135442.0 released from hold: Data files spooled
> /var/log/condor-ce/SchedLog.old:04/06/20 10:09:06 (cid:3138980) (D_AUDIT) Received proxy for job 135442.0
> /var/log/condor-ce/SchedLog.old:04/06/20 10:09:06 (cid:3138980) (D_AUDIT) Refresh GSI cred for job 135442.0 succeeded
> /var/log/condor-ce/SchedLog.old:04/06/20 10:28:01 (D_ALWAYS) Job 135442.0 put on hold: HTCondor-CE held job due to no matching routes, route job limit, or route failure threshold; see 'HTCondor-CE Troubleshooting Guide'
> /var/log/condor-ce/SchedLog.old:04/06/20 10:28:03 (D_ALWAYS:2) abort_job_myself: 135442.0 action:Hold log_hold:true
> /var/log/condor-ce/SchedLog.old:04/06/20 10:28:03 (D_ALWAYS:2) Cleared dirty attributes for job 135442.0
> /var/log/condor-ce/SchedLog.old:04/06/20 10:29:20 (cid:3143362) (D_AUDIT) Remove jobs 135442.0
> /var/log/condor-ce/SchedLog.old:04/06/20 10:29:20 (cid:3143362) (D_AUDIT) Finished Remove jobs 135442.0
> /var/log/condor-ce/SchedLog.old:04/06/20 10:29:22 (D_ALWAYS:2) abort_job_myself: 135442.0 action:Remove log_hold:true
> /var/log/condor-ce/SchedLog.old:04/06/20 10:29:22 (D_ALWAYS:2) Cleared dirty attributes for job 135442.0
> /var/log/condor-ce/SchedLog.old:04/06/20 10:29:22 (D_ALWAYS:2) Job 135442.0 is finished
> /var/log/condor-ce/SchedLog.old:04/06/20 10:29:24 (D_ALWAYS:2) Job cleanup for 135442.0 will not block, calling jobIsFinished() directly
> /var/log/condor-ce/SchedLog.old:04/06/20 10:29:24 (D_ALWAYS:2) jobIsFinished() completed, calling DestroyProc(135442.0)
> 
> Do you have any ideas about these?
> 
> Thank you.
> 
> Best regards,
> Sang-Un
> 
> 
>> On Apr 5, 2020, at 14:14, sahn <sahn@xxxxxxxxxxx> wrote:
>> 
>> Hi Brian,
>> 
>> Thank you for replying. Please find the log files attached; these are the CE-related logs. Please let me know what else I can provide for debugging.
>> 
>> Best regards,
>> Sang-Un
>> 
>> 
>> <CollectorLog><JobRouterLog><SchedLog><SharedPortLog>
>>> Date: Fri, 03 Apr 2020 02:58:51 +0000
>>> From: "Bockelman, Brian" <BBockelman@xxxxxxxxxxxxx>
>>> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
>>> Cc: Maarten Litmaath <Maarten.Litmaath@xxxxxxx>
>>> Subject: Re: [HTCondor-users] [HTCondor-CE] Maximum number of
>>> 	established TCP connections
>>> Message-ID: <D6F948D0-0E97-4CED-B7B4-49CE87AFCA6B@xxxxxxxxxxxxx>
>>> Content-Type: text/plain; charset=us-ascii
>>> 
>>> Hi Sang-Un,
>>> 
>>> The theoretical number of TCP connections is 64K for a given IP address pair (to a fixed port) -- so, 3.5k connections doesn't sound like it's stressing the system too much.  In practice, we've been able to establish hundreds of thousands of connections (although that level requires quite a bit of tuning!).
>>> 
>>> Could you provide a few log snippets?  That might allow us to look at the exact error messages and provide some advice and/or insights.
>>> 
>>> Happy to help,
>>> 
>>> Brian
>>> 
>>>> On Apr 2, 2020, at 7:36 PM, Sang Un Ahn <sahn@xxxxxxxxxxx> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> First of all, I hope you are all keeping safe in these troubled times.
>>>> 
>>>> I am running a small HTCondor pool plus an HTCondor-CE that takes jobs from the ALICE VO. The pool can offer around 3,500 concurrent job slots; however, I recently realised that when the number of established TCP connections (by condor_shadow) through port 9618 goes above 3,000, the number of concurrent jobs seems to hit a ceiling. More precisely, the JobRouter keeps additional jobs held in the CE queue even though there are no idle or held jobs in the local queue. As you might notice, the JobRouter, Schedd, Collector and Negotiator all run on the same HTCondor-CE host.
>>>> 
>>>> From what I found on Google, the theoretical number of TCP connections on a Linux system is 64k (65,535), and there are several guides on how to tune the system for network performance. I am also consulting the system administrators in my team; the numbers of connections and open files are limited by system parameters such as sysctl settings and/or limits.conf (ulimit). A network expert advises against raising the number of TCP connections above 4k and recommends adding another host instead, e.g. a second HTCondor-CE. 
>>>> 
>>>> So I was wondering whether there is any guidance on network-related performance tuning for HTCondor/HTCondor-CE, and whether there is a limit on the capacity of the JobRouter.
>>>> 
>>>> FYI, the HTCondor-CE is running on a virtual machine with 8 cores and 16 GB of memory. The versions are 3.4.0 and 8.8.7 for HTCondor-CE and HTCondor, respectively. Please just let me know if you need more information.
>>>> 
>>>> Thank you.
>>>> 
>>>> Best regards,
>>>> Sang-Un
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> HTCondor-users mailing list
>>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>>>> subject: Unsubscribe
>>>> You can also unsubscribe by visiting
>>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>> 
>>>> The archives can be found at:
>>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>> 
>> 
> 
> 
>