[Opendnssec-user] ods-signerd: robustness & resource demands?

Yuri Schaeffer yuri at nlnetlabs.nl
Mon Jul 13 14:38:57 UTC 2015

Hash: SHA1

Hi Håvard,

Thank you for your detailed rapport. Sorry to hear the aftermath of
this crash turned out to be so catastrophic. We'll look in to it.


On 10-07-15 16:12, Havard Eidnes wrote:
> Hi,
> we're running OpenDNSSEC 1.4.7.
> Today we experienced a "silent" exit or crash of ods-signerd in our
> installation (no trace in the log file).
> When trying to restart OpenDNSSEC, I came across the following 
> issues:
> 1) It appears that ods-signerd does not gracefully handle running 
> out of available file descriptors.  It becomes quite upset when it
> fails to read some of the zone-specific xml files.
> 2) Is there no concurrency limit to e.g. outgoing zone transfer 
> requests, so that ods-signerd will in peaks (e.g. right after 
> startup) try to use a multiple of <number of configured zones> file
> descriptors?  When I finally moved away the tmp/ files of 
> OpenDNSSEC, and remembered to raise the FD limit (more below), 
> ods-signerd logs "zone <zone> request axfr to <hidden-master>" at
> the exact same second for all the configured zones.  If it actually
> tries to execute all of them at that same instant, that is going to
> be *asking* for trouble.  And all connections go to the same master
> too, it's probably also not too happy about being flooded with TCP
> DNS connections.  The old concept "thundering herd" in new
> disguise?
> 3) At one time ods-signerd logged
> Jul 10 15:12:00 hugin ods-signerd: [parser] unable to parse file
> /var/opendnssec/signconf/46.39.128.in-addr.arpa.xml: xmlParseFile()
> failed
> There's no decoded errno here.  Did that fail because it could not
> open the file due to already having too many open files? It doesn't
> say, but it should, especially when it later logs the probably
> highly misleading sequence:
> Jul 10 15:12:00 hugin ods-signerd: [signconf] unable to update
> signconf: signconf
> /var/opendnssec/signconf/46.39.128.in-addr.arpa.xml has errors Jul
> 10 15:12:00 hugin ods-signerd: [zone] unable to load signconf for
> zone 46.39.128.in-addr.arpa: signconf
> /var/opendnssec/signconf/46.39.128.in-addr.arpa.xml Configuration
> error Jul 10 15:12:00 hugin ods-signerd: [tools] unable to load
> signconf for zone 46.39.128.in-addr.arpa: Configuration error Jul
> 10 15:12:00 hugin ods-signerd: [worker[4]] CRITICAL: failed to sign
> zone 46.39.128.in-addr.arpa: Configuration error Jul 10 15:12:00
> hugin ods-signerd: [worker[4]] backoff task [configure] for zone
> 46.39.128.in-addr.arpa with 60 seconds
> when the actual failure was probably that ods-signerd managed to 
> run itself out of available file descriptors, instead of me as an 
> operator having commited a "CRITICAL Configuration error".
> 4) When I tried restarting ods-signerd and left behind the tmp/ 
> directory, ods-sginerd complained about "bad ixfr journal: trailing
> RRs after final SOA" for a lot of the journal files. I've done
> nothing "special" -- I assume this should not happen under any
> circumstances?
> 5) When I raised the FD resource limit, it then came to checking 
> the zone expirations, and it seems there's a missing conversion
> from "relative time" (as in the SOA expire field) to absolute time,
> as it complained e.g.:
> Jul 10 15:08:52 hugin ods-signerd: [axfr] zone 39.128.in-addr.arpa
> expired at 3600000, and it is now 1436533732: not serving soa
> One is absolute (the last one), the other is a relative timestamp,
> and the two can't be directly compared, as appears to happen here.
> 6) ods-signerd had to be restarted (again), and that time it exited
> shortly thereafter with
> Jul 10 15:09:10 hugin ods-signerd: signer/ixfr.c:230: part_print:
> assertion part->soamin failed
> Not pretty, not robust.  At this time I concluded that ods-signerd
> would not be able to recover with the old data in its tmp/
> directory present, so ...
> 7) I moved OpenDNSSEC's tmp/ directory away, raised the FD resource
> limit, and restarted the process pair (enforcer and signer), and
> they're finally up and running again.
> 8) There appears to be a lack of "work flow control" internally in
> ods-signerd -- even after I successfully restarted the process
> pair, I see log entries such as this:
> Jul 10 15:12:34 hugin ods-signerd: [tools] unable to read zone
> 2.39.128.in-addr.arpa: adapter failed (Incoming zone transfer not
> ready) Jul 10 15:12:34 hugin ods-signerd: [worker[1]] backoff task
> [read] for zone 2.39.128.in-addr.arpa with 60 seconds
> It seems primitive to do "time-based poll waiting", can't the two 
> tasks simply synchronize their activities a bit better, so that the
> consumer doesn't even try to use the data before it's ready and
> this superfluous logging be removed?
> Is it just me, or does restarting ods-signerd after an abnormal
> exit appear to be more than just a little ... fragile?  I know I
> run OpenDNSSEC on an OS where the default resource limits are quite
> low, some would say "ridiculously low" compared to "modern"
> defaults -- the default FD limit is as low as 128.  However, I
> appear to be hitting this limit with only 39 configured zones, and
> if ods-sginerd's desired consumption of file descriptors can turn
> out to be proportional to the number of configured zones, that
> surely must spell Serious Trouble somewhere further down the line,
> and not just for me?
> My confidence in OpenDNSSEC's operational stability is somewhat 
> shaken by this experience...
> Regards,
> - Håvard _______________________________________________ 
> Opendnssec-user mailing list Opendnssec-user at lists.opendnssec.org 
> https://lists.opendnssec.org/mailman/listinfo/opendnssec-user
Version: GnuPG v2


More information about the Opendnssec-user mailing list