
Re: [HTCondor-users] ImageSize increase too big



One of our simulation programs had a lurking bug: it used uninitialized variables to dimension an array of 8-byte strings. If the memory backing those variables happened to contain just the wrong kind of values, the program's memory allocation would happily set out towards 36 terabytes of RAM or what have you, and keep going in that direction until the machine wedged (Red Hat 5) or the Out-of-Memory Killer terminated it (Red Hat 6).

The fact that HTCondor made this behavior easily visible was an essential part of tracking down and stomping this bug. You should attach GDB or strace to the process to see what it's doing while its memory is climbing.

	-Michael Pelletier.

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Henning Fehrmann
Sent: Monday, May 7, 2018 9:22 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [External] [HTCondor-users] ImageSize increase too big

Hi,

We observed an unexplained jump in the image size of a job.


-- Schedd: atlas2.atlas.local : <10.20.30.2:38705?... @ 05/05/18 12:52:07
 ID         OWNER            SUBMITTED     RUN_TIME ST PRI SIZE   CMD
2728869.0   XXXXX         4/28 11:29   7+00:14:54 R  0   9766.0 XXXXXX

But it never was using that much memory:

000 (2728869.000.000) 04/28 11:29:25 Job submitted from host: <10.20.30.2:38705?addrs=10.20.30.2-38705+[--1]-38705>
001 (2728869.000.000) 04/28 11:29:47 Job executing on host: <10.10.20.16:33435?addrs=10.10.20.16-33435+[--1]-33435>
006 (2728869.000.000) 04/28 11:29:56 Image size of job updated: 48676
006 (2728869.000.000) 04/28 11:34:56 Image size of job updated: 188768
006 (2728869.000.000) 04/28 11:39:56 Image size of job updated: 237552
006 (2728869.000.000) 04/28 11:44:57 Image size of job updated: 272380
006 (2728869.000.000) 04/28 12:19:59 Image size of job updated: 7411552
006 (2728869.000.000) 04/28 12:24:59 Image size of job updated: 7522440
...
006 (2728869.000.000) 05/05 02:16:14 Image size of job updated: 7522984
001 (2728869.000.000) 05/05 02:43:55 Job executing on host: <10.10.17.14:46639?addrs=10.10.17.14-46639+[--1]-46639>
001 (2728869.000.000) 05/05 04:36:51 Job executing on host: <10.10.23.1:46285?addrs=10.10.23.1-46285+[--1]-46285>
007 (2728869.000.000) 05/05 06:32:57 Shadow exception!
001 (2728869.000.000) 05/05 07:00:50 Job executing on host: <10.10.9.13:41637?addrs=10.10.9.13-41637+[--1]-41637>
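
For anyone comparing a log like the one above against the reported ImageSize, here is a minimal Python sketch that extracts the "Image size of job updated" events and reports the peak and the largest single jump. It assumes the one-line-per-event layout shown in this excerpt (a real job event log may carry additional detail lines), and the names `image_size_updates` and `largest_jump` are made up for illustration; they are not part of HTCondor.

```python
import re

# Matches event 006 lines in the one-line form shown in the excerpt above.
# Assumption: "006 (<job id>) MM/DD HH:MM:SS Image size of job updated: <n>".
IMAGE_SIZE_RE = re.compile(
    r"^006 \((?P<job>[\d.]+)\) (?P<stamp>\d\d/\d\d \d\d:\d\d:\d\d) "
    r"Image size of job updated: (?P<size>\d+)"
)


def image_size_updates(lines):
    """Return (timestamp, size) pairs for every image-size update line."""
    updates = []
    for line in lines:
        match = IMAGE_SIZE_RE.match(line.strip())
        if match:
            updates.append((match.group("stamp"), int(match.group("size"))))
    return updates


def largest_jump(updates):
    """Return (previous, current, delta) for the biggest consecutive increase."""
    best = None
    for prev, cur in zip(updates, updates[1:]):
        delta = cur[1] - prev[1]
        if best is None or delta > best[2]:
            best = (prev, cur, delta)
    return best


# Lines copied from the log excerpt in this message.
SAMPLE = """\
001 (2728869.000.000) 04/28 11:29:47 Job executing on host: <10.10.20.16:33435?addrs=10.10.20.16-33435+[--1]-33435>
006 (2728869.000.000) 04/28 11:29:56 Image size of job updated: 48676
006 (2728869.000.000) 04/28 11:34:56 Image size of job updated: 188768
006 (2728869.000.000) 04/28 11:39:56 Image size of job updated: 237552
006 (2728869.000.000) 04/28 11:44:57 Image size of job updated: 272380
006 (2728869.000.000) 04/28 12:19:59 Image size of job updated: 7411552
006 (2728869.000.000) 04/28 12:24:59 Image size of job updated: 7522440
006 (2728869.000.000) 05/05 02:16:14 Image size of job updated: 7522984
007 (2728869.000.000) 05/05 06:32:57 Shadow exception!
"""

if __name__ == "__main__":
    updates = image_size_updates(SAMPLE.splitlines())
    peak = max(size for _, size in updates)
    prev, cur, delta = largest_jump(updates)
    print(f"peak logged image size: {peak}")
    print(f"largest jump: {prev[1]} -> {cur[1]} (+{delta}) between {prev[0]} and {cur[0]}")
```

Run against this excerpt, the peak logged value stays below the ImageSize that condor_q reports, which is the discrepancy in question.
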

The job is still running on 10.10.9.13, within the expected memory usage.

The ImageSize, however, is:
condor_q 2728869 -l|grep "^Image"
ImageSize = 10000000
ImageSize_RAW = 7522980

The user has not manipulated this value.

Is this a known issue?

We are running condor 8.6.
Do you need more config or logs?


Cheers,
Henning