Hi,

I have a somewhat strange problem. I linked my code with condor_compile and everything worked just fine, including checkpointing. Now it has stopped working; more precisely, the program segfaults at random times but otherwise runs fine. Only about 50% of the jobs seem to be affected. I have no clue which component in the system changed. The user log shows something like:
01 (1627.026.000) 02/24 19:08:41 Job executing on host: <144.92.180.55:33798>
...
005 (1627.026.000) 02/24 19:08:41 Job terminated.
    (0) Abnormal termination (signal 11)
    (0) No core file
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    2432  -  Run Bytes Sent By Job
    3003342  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
I tried running this job by hand on that machine and it works, with no segfaults. So I looked in more detail and tried to make it checkpoint by sending SIGTSTP, and voila, I get a segfault. When I look at the core dump, the stack always looks like this:
#0  0x08102788 in adler32 ()
#1  0x080fde76 in fill_window ()
#2  0x080fdc61 in deflate_slow ()
#3  0x080fcc87 in deflate ()
#4  0x080c704b in SegMap::Write ()
#5  0x080c682c in Image::Write ()
#6  0x080c6503 in Image::Write ()
#7  0x080c6382 in Image::Write ()
#8  0x080c7751 in Checkpoint ()
#9  <signal handler called>
It seems that 'adler32' is the last thing called. Searching the list archive, I found one message describing a similar problem, but no solution. Any help would be much appreciated.
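In case it helps anyone reproduce this, the manual test described above (start the binary outside Condor, send it SIGTSTP, see how it dies) could be scripted roughly as follows. This is only a sketch: the helper name run_and_signal, the one-second delay, and the ten-second timeout are my own placeholders, not anything provided by Condor, and it assumes the binary reacts to SIGTSTP by checkpointing and exiting.

```python
import signal
import subprocess
import sys
import time


def run_and_signal(cmd, delay=1.0, sig=signal.SIGTSTP, timeout=10.0):
    """Start cmd, send it sig after delay seconds, and describe how it ended.

    Hypothetical helper for illustration; delay and timeout are arbitrary.
    """
    proc = subprocess.Popen(cmd)
    time.sleep(delay)
    if proc.poll() is None:  # only signal the process if it is still running
        proc.send_signal(sig)
    try:
        rc = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # An ordinary program is merely stopped by SIGTSTP, so clean it up.
        proc.kill()
        proc.wait()
        return "still alive after signal (killed by this script)"
    if rc < 0:
        # A negative return code means the child was terminated by a signal.
        return "killed by " + signal.Signals(-rc).name
    return "exited with status %d" % rc


if __name__ == "__main__":
    # Usage: python this_script.py ./my_condor_linked_binary [args ...]
    print(run_and_signal(sys.argv[1:]))
```

With the failing binary I would expect this to report "killed by SIGSEGV" for the runs where the checkpoint crashes, which matches the "signal 11" in the user log.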
Thanks,
Patrick

--
Dr. Patrick Huber
Physics Department, University of Wisconsin
1150 University Avenue
Madison, WI 53706, USA
Tel.: +1 608 262 2886
http://pheno.physics.wisc.edu/~phuber