
Re: [HTCondor-users] Startd crashes on HTCondor 25 nodes



Hi Arshad,


I’ve been looking into the startd crashes in HTCondor 25.0.7. 


Since the HTCondor team is based in a different time zone, their response might be delayed. I thought it might be helpful for you to check a few technical leads in the meantime to speed up the diagnosis.


Based on the stack trace and the source code of DockerAPI::getImageInfos() (called by imageCacheUsed), I wanted to share a specific lead that might help your investigation.


The image cache calculation logic parses the output of the docker images command using line.find(' '). (https://github.com/htcondor/htcondor/blob/main/src/condor_utils/docker-api.cpp#L1970-L1989)

----

size_t first_space = line.find(' ');

size_t second_space = line.find(' ', first_space + 1);

~~~

std::string sha       = line.substr(first_space + 1, second_space - first_space - 1);

std::string size      = line.substr(second_space + 1);

----


I suspect that the crash occurs when the output from your custom wrapper (condor-docker.py) contains unexpected formatting that breaks this parsing logic.


If you have a moment, could you verify the exact output of your wrapper on the affected nodes?

--------------

# Please check for any lines that do not strictly follow the "Repo:Tag SHA Size" format

/path/to/condor-docker.py images --format "{{.Repository}}:{{.Tag}} {{.ID}} {{.Size}}"

Normal output:
---
localhost/build_condorcecm:v20260311_1 535b671415f1 1.38 GB


=> Parsed as { "imageName": "local~~~", "sha": "535b671415f1", "size": "1.38 GB" }

----
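If it is easier than checking the output by eye, the same check can be automated. Below is a minimal sketch, not HTCondor code: find_malformed_lines is a hypothetical helper name of my own, and it assumes you have captured the wrapper's output into a string. It only reports the 1-based numbers of lines that contain fewer than two spaces, which are the lines the parsing code above cannot split into three fields.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical diagnostic helper (not part of HTCondor).
// Takes the captured output of the "images" command and returns the 1-based
// numbers of lines with fewer than two spaces, i.e. lines that cannot be
// split into "Repo:Tag", "SHA" and "Size".
std::vector<std::size_t> find_malformed_lines(const std::string& output) {
    std::vector<std::size_t> bad;
    std::istringstream in(output);
    std::string line;
    std::size_t lineno = 0;

    while (std::getline(in, line)) {
        ++lineno;
        const std::size_t first_space = line.find(' ');
        // Only search for a second space if a first one was found.
        const std::size_t second_space =
            (first_space == std::string::npos)
                ? std::string::npos
                : line.find(' ', first_space + 1);
        if (second_space == std::string::npos) {
            bad.push_back(lineno);  // empty line, warning text, etc.
        }
    }
    return bad;
}
```

An empty result would mean every line has at least two spaces. Any reported line number is worth a closer look on the affected nodes.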

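My suspicion is the following:


Missing spaces (critical): The C++ code expects at least two spaces in each line in order to extract the imageName and sha. If a line (such as a warning, an error message, or an empty line from the wrapper) has fewer than two spaces, second_space becomes std::string::npos. Because size_t is unsigned, the arithmetic passed to line.substr() (second_space - first_space - 1 and second_space + 1) then produces huge or wrapped-around values instead of a sensible offset and length. I suspect this is what ultimately leads to the heap corruption behind the free() error you're seeing.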
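To illustrate the kind of guard I have in mind, here is a minimal sketch under my own assumptions, not the actual HTCondor code or a proposed patch. ImageInfo and parse_image_line are hypothetical names, the imageName extraction via substr(0, first_space) is my guess at the omitted part of the snippet, and std::optional requires C++17. The idea is simply to test both find() results against npos before calling substr().

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical struct for illustration only.
struct ImageInfo {
    std::string imageName;
    std::string sha;
    std::string size;
};

// Hypothetical defensive parser for one "Repo:Tag SHA Size" line.
// Returns std::nullopt instead of doing offset arithmetic on npos.
std::optional<ImageInfo> parse_image_line(const std::string& line) {
    const std::size_t first_space = line.find(' ');
    if (first_space == std::string::npos) {
        return std::nullopt;  // no space at all: empty line, single word
    }

    const std::size_t second_space = line.find(' ', first_space + 1);
    if (second_space == std::string::npos) {
        return std::nullopt;  // only one space: cannot separate SHA and size
    }

    ImageInfo info;
    // Assumption: the image name is everything before the first space.
    info.imageName = line.substr(0, first_space);
    info.sha = line.substr(first_space + 1, second_space - first_space - 1);
    info.size = line.substr(second_space + 1);  // may contain a space, e.g. "1.38 GB"
    return info;
}
```

With this shape, a stray warning or blank line from the wrapper is skipped by the caller rather than parsed.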


I hope this information helps narrow down the cause. 


Please feel free to ignore this if your team is already investigating these specific areas!


Best regards,


-- Geonmo

------ Original Message ------

From: Arshad Ahmad via HTCondor-users <htcondor-users@xxxxxxxxxxx>

To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>

Cc: Arshad Ahmad <aahmad@xxxxxxxx>

Date: 2026-03-13 (Fri) 08:52:27

Subject: [HTCondor-users] Startd crashes on HTCondor 25 nodes


Hi Condor Team,
We are testing HTCondor 25 on a subset of our cluster for production evaluation and have noticed random startd crashes on the upgraded nodes. The upgrade is from 24.0.14 to 25.0.7.

Nodes running HTCondor 25 advertise DockerCachedImageSizeMb in the machine ClassAd, and the crashes appear to be related to this new attribute introduced in Condor 25. Relevant snippet from the StartLog:

Caught signal 6: si_code=4294967290, si_pid=2079, si_uid=0, si_addr=0x81F

Stack trace highlights: DockerAPI::imageCacheUsed()

condor_startd(MachAttributes::compute_for_update)

StartLog is attached for reference.
We are running Docker 28.5.1, which is consistent across the cluster. Jobs are launched using our custom condor-docker.py wrapper.
Please let us know if any additional information or logs would help diagnose and fix this issue.

Thank you,

Arshad

