
Re: [HTCondor-users] increase time to transfer output files in startd before it thinks it's hung?



Thanks, I'll give it a try.  It wasn't the one I came up with, which was MAX_TRANSFER_QUEUE_AGE.  How does that fit in with the *NOT_RESPONDING_TIMEOUT knobs?  i.e. if I bump up the not-responding timeout, would the next thing I hit be the max queue age?

klint.


From: Cole Bollig <cabollig@xxxxxxxx>
Sent: Saturday, 15 February 2025 2:57 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Klint Gore <kgore4@xxxxxxxxxx>
Subject: Re: [HTCondor-users] increase time to transfer output files in startd before it thinks it's hung?
 
Hi Klint,

The configuration option Todd was referring to was STARTER_NOT_RESPONDING_TIMEOUT, which is a more specific version of NOT_RESPONDING_TIMEOUT. As for logging, you could set STARTER_DEBUG = $(STARTER_DEBUG), D_FULLDEBUG. This will increase the number of logging messages in the Starter logs but will give an in-depth look into what file transfer is doing. There is also a transfer history file that prints out some information regarding each transfer that has occurred. I'm not sure how useful it would be, given that it only tells you what has finished attempting to transfer, and there is only one history file per EP, so all starters will write to it. In case you want to look at it, you can do:

condor_history -file $(condor_config_val FILE_TRANSFER_STATS_LOG) -af:h TransferType ConnectionTimeSeconds TransferTotalBytes TransferFileName
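
For reference, the two settings mentioned above might look like this in the EP's local configuration. This is only a sketch: the 7200-second value is an illustrative assumption, not a recommendation, and should be sized to how long your largest output transfer actually takes.

```
# Illustrative values only (assumptions, not recommendations).

# Seconds without contact from a starter before it is treated as hung
# and killed; overrides the general NOT_RESPONDING_TIMEOUT for starters.
STARTER_NOT_RESPONDING_TIMEOUT = 7200

# Keep the existing starter debug flags and add full debug output,
# which includes more detail about file transfer.
STARTER_DEBUG = $(STARTER_DEBUG), D_FULLDEBUG
```

Running condor_config_val STARTER_NOT_RESPONDING_TIMEOUT on the EP afterwards should show whether the new value was picked up.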

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Klint Gore via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, February 13, 2025 11:54 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Klint Gore <kgore4@xxxxxxxxxx>
Subject: [HTCondor-users] increase time to transfer output files in startd before it thinks it's hung?
 
I've got a situation where the starter thinks that transferring output files has hung:

02/14/25 05:26:28 ERROR: Child pid 77106 appears hung! Killing it hard.
02/14/25 05:26:28 Starter pid 77106 died on signal 9 (signal 9 (Killed))
02/14/25 05:26:33 slot1: State change: starter exited

It looks like the same situation as https://www-auth.cs.wisc.edu/lists/htcondor-users/2019-April/msg00051.shtml.  In that message, Todd refers to increasing the "I think you are dead" timeout.

Is that still the case 6 years later?  
If so, what's the "I think you are dead" setting actually called so I can test it?
Is there a way to log the amount to be returned? i.e. transfer_history for output files as well as input files.

I'm seeing it in 23.0.12 on debian bookworm.  I've just updated that starter to 23.0.20 in case that helps.

klint.