Re: [Condor-users] Problems with final transfer of files
- Date: Wed, 6 Jul 2005 10:34:20 -0500
- From: Nick LeRoy <nleroy@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Problems with final transfer of files
On Wed July 6 2005 2:53 am, Joan J. Piles Contreras wrote:
> Hello,
> We are having trouble with some vanilla jobs that get an error _after_
> they have finished, and apparently after the final file transfer has
> taken place. This makes them start over from the beginning again and
> again. I have enabled full debug in both the starter and the shadow
> daemons, and yet I have found no clue about it.
What version of Condor are you running on what O/S?
> It must be said that this doesn't happen with all the jobs. The ones where
> it happens are arguably the longest ones and the ones that generate the
> biggest files, though all of those are still below 2G (there is one 1.4G
> results file).
>
> Here is the relevant part from ShadowLog:
>
> 7/5 09:11:10 (2.0) (5950): condor_read(): Socket closed when trying to
> read buffer
> 7/5 09:11:10 (2.0) (5950): ERROR "Can no longer talk to condor_starter
> on execute machine (aaa.bbb.ccc.ddd)" at line 63 in file NTreceivers.C
> 7/5 09:11:10 (2.0) (5950): FileLock::obtain(1) failed - errno 37 (No
> locks available)
This isn't the cause of the problems, but it concerns me. If I'm reading the
code correctly, this error means that the user/job log code couldn't lock the
log file to log the error. Do you have a low file lock limit set on your
system, or something similar? In general, you shouldn't see this, I think.
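
One rough way to check whether locking works at all on the filesystem that
holds the user log is a small probe like the sketch below. It assumes the log
is locked with POSIX fcntl-style locks and that errno 37 maps to ENOLCK, as it
does on Linux; neither is confirmed by the logs above. The names probe_lock
and describe are made up for this illustration and are not part of Condor.

```python
import errno
import fcntl
import os
import tempfile


def probe_lock(directory):
    """Try to take and release an exclusive lock on a scratch file.

    Returns None if the lock was obtained, otherwise the errno value
    reported by the failed lock attempt. (Hypothetical helper.)
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        try:
            # Non-blocking exclusive POSIX lock on the whole file.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            return exc.errno
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return None
    finally:
        os.close(fd)
        os.unlink(path)


def describe(result):
    """Turn a probe_lock() result into a short message."""
    if result is None:
        return "lock obtained"
    if result == errno.ENOLCK:
        return "ENOLCK: no locks available"
    return "lock failed: " + os.strerror(result)


if __name__ == "__main__":
    # Point this at the directory containing the job's user log.
    print(describe(probe_lock(tempfile.gettempdir())))
```

If the probe reports ENOLCK for the directory holding the user log (for
example, a network-mounted home directory) but succeeds on a local disk, that
would point at the filesystem's lock service rather than at Condor itself.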
> 7/5 09:11:11 PASSWD_CACHE_REFRESH is undefined, using default value of 300
>
> And from the equivalent StarterLog
>
> 7/5 09:46:46 DoUpload: send file ModHarp153630.sta
> 7/5 09:46:46 ReliSock: put_file: sent 8149 bytes
> 7/5 09:46:46 DoUpload: exiting at 1413
> 7/5 09:46:46 ERROR "Assertion ERROR on (filetrans->UploadFiles(true,
> final_transfer))" at line 336 in file jic_shadow.C
Now, this is where the actual error occurred. Knowing which version of Condor
you are running could help narrow down where it went wrong.
> (Yes, I have just realized that the clock on this machine isn't set to the
> right time. Anyway, the difference is less than 1h, and I think it
> shouldn't matter, as we have had problems with other machines in
> the pool as well).
I don't think that this should be an issue.
> Thanks in advance,
> Joan
Glad to help
-Nick
--
<<< There is no spoon. >>>
/`-_ Nicholas R. LeRoy The Condor Project
{ }/ http://www.cs.wisc.edu/~nleroy http://www.cs.wisc.edu/condor
\ / nleroy@xxxxxxxxxxx The University of Wisconsin
|_*_| 608-265-5761 Department of Computer Sciences