[Condor-users] error using checkpointing
- Date: Fri, 18 Feb 2011 15:25:43 +0100
- From: Roberto Nunnari <roberto.nunnari@xxxxxxxx>
- Subject: [Condor-users] error using checkpointing
Hello.
I'm new to Condor and to checkpointing, but we have a small cluster
here, and I'd like to introduce checkpointing.
As our queueing system we use SGE, and at present we don't plan to
change that.
So I'm testing Condor checkpointing, but whatever I do, I always get
errors and the .ckpt file never gets created, only the .ckpt.tmp file.
To test it, I use a simple program that prints a counter and then
calls nanosleep() for 1 second.
First, I get warnings during compilation:
$ condor_compile gcc -o blah4 blah.c
LINKING FOR CONDOR : /usr/bin/ld -L/opt/condor-7.4.4/lib -Bstatic
--eh-frame-hdr -m elf_x86_64 --hash-style=gnu -dynamic-linker
/lib64/ld-linux-x86-64.so.2 -o blah4 /opt/condor-7.4.4/lib/condor_rt0.o
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crti.o
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtbeginT.o
-L/opt/condor-7.4.4/lib -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
-L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
-L/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64
-L/lib/../lib64 -L/usr/lib/../lib64 /tmp/ccMbjxCs.o
/opt/condor-7.4.4/lib/libcondorsyscall.a
/opt/condor-7.4.4/lib/libcondor_z.a
/opt/condor-7.4.4/lib/libcomp_libstdc++.a
/opt/condor-7.4.4/lib/libcomp_libgcc.a
/opt/condor-7.4.4/lib/libcomp_libgcc_eh.a --as-needed --no-as-needed
-lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv
-lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv
-lcondor_c /opt/condor-7.4.4/lib/libcomp_libgcc.a
/opt/condor-7.4.4/lib/libcomp_libgcc_eh.a --as-needed --no-as-needed
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtend.o
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crtn.o
/opt/condor-7.4.4/lib/libcondorsyscall.a(condor_file_agent.o): In
function `CondorFileAgent::open(char const*, int, int)':
/opt/cluster/spool/condor/try01/condor-7.4.4/src/condor_ckpt/condor_file_agent.cpp:106:
warning: the use of `tmpnam' is dangerous, better use `mkstemp'
/opt/condor-7.4.4/lib/libcondorsyscall.a(special_stubs.o): In function
`condor_gethostbyaddr':
/opt/cluster/spool/condor/try01/condor-7.4.4/src/condor_syscall_lib/special_stubs.cpp:201:
warning: Using 'gethostbyaddr' in statically linked applications
requires at runtime the shared libraries from the glibc version used for
linking
/opt/condor-7.4.4/lib/libcondorsyscall.a(special_stubs.o): In function
`condor_gethostbyname':
/opt/cluster/spool/condor/try01/condor-7.4.4/src/condor_syscall_lib/special_stubs.cpp:194:
warning: Using 'gethostbyname' in statically linked applications
requires at runtime the shared libraries from the glibc version used for
linking
/opt/condor-7.4.4/lib/libcondorsyscall.a(sock.o): In function
`Sock::getportbyserv(char*)':
/opt/cluster/spool/condor/try01/condor-7.4.4/src/condor_io/sock.cpp:233:
warning: Using 'getservbyname' in statically linked applications
requires at runtime the shared libraries from the glibc version used for
linking
Then, here's the run session, interrupted with SIGTSTP:
$ ./blah3 -_condor_D_ALL
User Job - $CondorPlatform: X86_64-LINUX_RHEL5 $
User Job - $CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $
Condor: Notice: Will checkpoint to ./blah3.ckpt
Condor: Notice: Remote system calls disabled.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 Got SIGTSTP
Saved signal state.
About to save file state
CondorFileTable::checkpoint
OPEN FILE TABLE:
fd 0
logical name: default stdin
offset: 0
dups: 1
open flags: 0x0
not currently bound to a url.
fd 1
logical name: default stdout
offset: 134
dups: 1
open flags: 0x1
url: fd:1
size: 134
opens: 1
fd 2
logical name: default stderr
offset: 0
dups: 1
open flags: 0x1
not currently bound to a url.
working dir = /homea/nunnari/devel/provacondor
Done saving file state
About to update MyImage
Adding a DATA segment: start[0x76c000], end [0x10111000]
Image::AddSegment: name=[DATA], start=[76c000], end=[10111000],
length=[0xf9a5000], prot=[0x7fff00000000]
Adding a STACK segment: start[0x7fffb2603000], end [0x7fffb260dfff]
Image::AddSegment: name=[STACK], start=[7fffb2603000],
end=[7fffb260dfff], length=[0xafff], prot=[0x7fff00000000]
Pos: 261772320
Pos: 261817375
Size of ckpt image = 261817375 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ./blah3.ckpt
Checkpoint name is "./blah3.ckpt"
Tmp name is "./blah3.ckpt.tmp"
Wrote headers OK
Wrote all SegMaps OK
write(fd=3,core_loc=0x76c000,len=0xf9a5000)
I wrote 740320 bytes with write...
I wrote -1 bytes with write...
in SegMap::Write(): fd = 3, write_size=261030944
errno=14, core_loc=820be0
Write() Segment[0] of type DATA -> FAILED
errno = 14, nbytes = -1
Ckpt exit
Write failed with [-1]
Killed
I tried every type of install I could find, both binary and compiled
from source, but the result is always the same.
My environment:
$ uname -rms
Linux 2.6.18-194.el5 x86_64
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Thank you for your help!
Best regards.
Roberto Nunnari