Hi again,

I think I managed to replicate the disappearance of the jobs' cgroups manually, at least partially. It appears to be due to an issue with systemd/a typo... [*]
~~> Condor is innocent

Cheers and sorry for the noise!
  Thomas

ps: unfortunately, I do not completely understand the behaviour and would appreciate any ideas from systemd experts ;) [**]

[*]
- we are distributing a systemd unit via puppet, which starts a Singularity container/runscript (that binds the root path internally):
  ExecStart=/usr/bin/singularity run --bind /:/rootfs:ro /path/to/container.d
- when the unit is distributed to or updated on a node, puppet triggers a systemctl daemon-reload
- and ensures that the service is active
- due to a bug (~> a forgotten variable), the unit's template might contain a condition dangling in the air, i.e.,
>>
[Unit]
Description=foofoobar
ConditionPathExists=/path/to/container.d
ConditionPathExists=

[Service]
ExecStart=/usr/bin/singularity run --bind /:/rootfs:ro /path/to/container.d
...
>>
- when this (apparently defective) unit got started (and ensured by puppet...), the existing job slices in the cpu and memory controllers got wiped out!? (the condor.service parent slices survived the unit start)
- with a fixed unit, the job slices survive (re)starts of the service!

[**]
- what I do not fully understand is why/how the processes lose their cgroup slices, or why/how systemd/the kernel does it?? The PIDs are unaffected, so I would have naively assumed that once a process is assigned to a cgroup, it stays there. But apparently the cgroups get removed(?) and the PIDs are moved to the next parent group(?)
- the slices only get wiped out when the exec is started through systemd. I have not been able to reproduce the behaviour by taking each step manually.
- what kind of namespace view does systemd have? I see systemd processes with PPID=1 as well as with PPID=0(!?)
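Regarding the dangling condition in [*]: per systemd.unit(5), assigning the empty string to a Condition*= setting resets the list of conditions completely, so the preceding ConditionPathExists=/path/to/container.d has no effect and the unit starts unconditionally. That alone does not explain the lost job cgroups, but it is cheap to catch before deployment. A minimal sketch, assuming Python 3.9+ and that the rendered unit file is on disk before puppet triggers the daemon-reload; check_unit.py and find_empty_conditions are hypothetical names, not part of puppet or systemd:

```python
import re
import sys

# Matches a Condition*= line whose value is empty (only whitespace after "=").
EMPTY_CONDITION = re.compile(r"^\s*(Condition[A-Za-z]+)\s*=\s*$")


def find_empty_conditions(unit_text: str) -> list[tuple[int, str]]:
    """Return (line_number, key) for every empty Condition*= assignment."""
    hits = []
    for lineno, line in enumerate(unit_text.splitlines(), start=1):
        match = EMPTY_CONDITION.match(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits


def main(paths: list[str]) -> int:
    """Print one warning per empty condition; return 1 if any were found."""
    status = 0
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for lineno, key in find_empty_conditions(handle.read()):
                print(f"{path}:{lineno}: empty {key}= resets all earlier conditions")
                status = 1
    return status


if __name__ == "__main__" and main(sys.argv[1:]):
    # Usage (hypothetical): python3 check_unit.py /path/to/rendered.service
    sys.exit(1)
```

Run against the template shown in [*], it would flag the second ConditionPathExists= line and exit non-zero, which a deployment step could treat as a failed render.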