[Opendnssec-user] Restarting 2.1.6 after long semi-idle...

Havard Eidnes he at uninett.no
Tue Feb 16 19:48:29 UTC 2021


Hi,

I'm finally gearing up to transition my local OpenDNSSEC +
SoftHSM over to "supported" versions (2.x all around).

Quite a while back I set up a test system to perform the
migration on, and left it running.  Apparently ods-signerd had
stopped or crashed and I didn't notice (the enforcer continued to
run, though), and now when I came around to re-activating the
test installation I find that ods-signerd all too often decides
to SEGV and abort, e.g. after about half a CPU-hour of running.

I have not found the actual root cause or robustness fix for this
problem.

What I did find was this:

The crash happens in the call to ixfr_del_rr() in this section of
code:

    for(i=0; i<nrrsigs; i++) {
        if(matchedsignatures[i].signature == NULL) {
            if (rrsigs[i] != NULL) {
                if (zone->db->is_initialized) {
                    pthread_mutex_lock(&zone->ixfr->ixfr_lock);
                    ixfr_del_rr(zone->ixfr, rrsigs[i]->rr);
                    pthread_mutex_unlock(&zone->ixfr->ixfr_lock);
                }
                while((signature = collection_iterator(rrset->rrsigs))) {
                    if(signature == rrsigs[i]) {
                        collection_del_cursor(rrset->rrsigs);
                    }
                }
            }
        } else
           ++reusedsigs;
    }

inside of rrset_sign() in signer/src/signer/rrset.c; the crash
actually happened inside the ldns library:

(gdb) where
#0  0x00007f7ff78484cc in ldns_rr_owner (rr=0x48333c66ffe51200) at ./rr.c:913
#1  0x00007f7ff78491d0 in ldns_rr_clone (rr=0x48333c66ffe51200) at ./rr.c:1404
#2  0x0000000000414ca8 in ixfr_del_rr (ixfr=0x7f7ff7ea7df0, 
    rr=0x48333c66ffe51200) at signer/ixfr.c:134
#3  0x0000000000419319 in rrset_sign (ctx=ctx@entry=0x7f7ff138a000, 
    rrset=rrset@entry=0x7f7fdb345f40, signtime=1613494513)
    at signer/rrset.c:758
#4  0x000000000040f42e in drudge (worker=0x7f7ff7e8b700)
    at daemon/signertasks.c:196
#5  0x000000000043b1c4 in runthread (data=0x7f7fee13fcd0) at janitor.c:318
#6  0x00007f7ff540c072 in ?? () from /usr/lib/libpthread.so.1
#7  0x00007f7ff5887bb0 in ?? () from /usr/lib/libc.so.12
#8  0x0000000000200000 in ?? ()
#9  0x0000000000000000 in ?? ()
(gdb) 

The 'rr' pointer in frames 0-2 is clearly bogus.
I looked at the rrsigs[x]->rr's in the debugger, and when it
crashed (I attached gdb to the process after starting it), all
the RR's pointed to un-mapped memory.

What I found was that in my /var/opendnssec/tmp there were a lot
of leftover and rather old *.xfrd-state files, and if I removed
those (or moved them elsewhere, which is what I did), ods-signerd
would thereafter not crash, and according to the logs would
continue to do useful work; where it would crash after 30 minutes
before, it has now consumed some 330-plus minutes of CPU time, and
is still going.

I am guessing that the stale contents of the *.xfrd-state files
violated some assumptions built into the code.  I could have wished
for some more sanity checking and robustness...

So ... even though the actual failure or a fix hasn't been found,
this may prove useful as a workaround to consider should you face
a similar situation (however unlikely it is...)

Regards,

- Håvard

