[Condor-users] Problems with final transfer of files
- Date: Wed, 06 Jul 2005 09:53:07 +0200
- From: "Joan J. Piles Contreras" <jpiles@xxxxxxxxx>
- Subject: [Condor-users] Problems with final transfer of files
Hello,
We are having trouble with some vanilla jobs that get an error _after_
they have finished, and apparently after the final file transfer has
taken place. This makes them restart from the beginning over and over
again. I have enabled full debug in both the starter and the shadow
daemons, and yet I have found no clue about the cause.
It must be said that this doesn't happen with all the jobs. The ones where
it happens are arguably the longest ones and the ones that generate the
biggest files, but all of those files are still below 2G (there is one
results file of 1.4G).
Here is the relevant part from ShadowLog:
7/5 09:11:10 (2.0) (5950): wrote 8149 bytes
7/5 09:11:10 (2.0) (5950): Entering BaseShadow::updateJobInQueue
7/5 09:11:10 (2.0) (5950): SHADOW_TIMEOUT_MULTIPLIER is undefined, using
default value of 0
7/5 09:11:10 (2.0) (5950): SEC_DEBUG_PRINT_KEYS is undefined, using
default value of False
7/5 09:11:10 (2.0) (5950): AUTHENTICATE_FS: used file /tmp/qmgr_Is2ATk,
status: 1
7/5 09:11:10 (2.0) (5950): Updating Job Queue: SetAttribute(BytesSent,
-4030657536.000000)
7/5 09:11:10 (2.0) (5950): Updating Job Queue: SetAttribute(BytesRecvd,
8546448.000000)
7/5 09:11:10 (2.0) (5950): condor_read(): Socket closed when trying to
read buffer
7/5 09:11:10 (2.0) (5950): ERROR "Can no longer talk to condor_starter
on execute machine (aaa.bbb.ccc.ddd)" at line 63 in file NTreceivers.C
7/5 09:11:10 (2.0) (5950): FileLock::obtain(1) failed - errno 37 (No
locks available)
7/5 09:11:11 PASSWD_CACHE_REFRESH is undefined, using default value of 300
And from the equivalent StarterLog:
7/5 09:46:46 DoUpload: send file ModHarp153630.sta
7/5 09:46:46 ReliSock: put_file: sent 8149 bytes
7/5 09:46:46 DoUpload: exiting at 1413
7/5 09:46:46 ERROR "Assertion ERROR on (filetrans->UploadFiles(true,
final_transfer))" at line 336 in file jic_shadow.C
7/5 09:46:46 ShutdownFast all jobs.
7/5 09:46:46 Got ShutdownFast when no jobs running.
7/5 09:46:51 PASSWD_CACHE_REFRESH is undefined, using default value of 300
(Yes, I have just realized that the clock on this machine isn't set to
the right time. Anyway, the difference is less than 1h, and I don't think
it should matter, as we have had problems with other machines in the pool
as well.)
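For anyone lining up the two logs: assuming the 8149-byte write in the
ShadowLog and the 8149-byte put_file in the StarterLog are the same
transfer (which is only a guess from the matching sizes), the clock offset
between the two machines can be estimated with a small Python sketch like
this one. The helper name seconds_of_day is made up for illustration and is
not part of Condor.

```python
def seconds_of_day(line: str) -> int:
    """Return the HH:MM:SS stamp of a Condor log line as seconds since midnight.

    Assumes the line starts with 'M/D HH:MM:SS', as in the excerpts above.
    """
    hh, mm, ss = line.split()[1].split(":")
    return int(hh) * 3600 + int(mm) * 60 + int(ss)

# The two lines that appear to describe the same 8149-byte transfer.
shadow_line = "7/5 09:11:10 (2.0) (5950): wrote 8149 bytes"
starter_line = "7/5 09:46:46 ReliSock: put_file: sent 8149 bytes"

# Positive result: the execute machine's clock is ahead of the submit machine's.
offset_seconds = seconds_of_day(starter_line) - seconds_of_day(shadow_line)
minutes, seconds = divmod(offset_seconds, 60)
print(f"starter clock is ahead by {minutes} min {seconds} s")
```

Subtracting that offset from the StarterLog timestamps puts the assertion
failure in the starter at roughly the same moment as the "Socket closed"
message in the shadow.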
Thanks in advance,
Joan