[Opendnssec-user] ods-signerd: robustness & resource demands?

Fri Jul 10 14:12:46 UTC 2015

Hi,

we're running OpenDNSSEC 1.4.7.

Today we experienced a "silent" exit or crash of ods-signerd in
our installation (no trace in the log file).

When trying to restart OpenDNSSEC, I came across the following
issues:

1) It appears that ods-signerd does not gracefully handle running
   out of available file descriptors.  It becomes quite upset
   when it fails to read some of the zone-specific xml files.

2) Is there no concurrency limit to e.g. outgoing zone transfer
   requests, so that ods-signerd will in peaks (e.g. right after
   startup) try to use a multiple of <number of configured zones>
   file descriptors?  When I finally moved away the tmp/ files of
   OpenDNSSEC, and remembered to raise the FD limit (more below),
   ods-signerd logs "zone <zone> request axfr to <hidden-master>"
   at the exact same second for all the configured zones.  If it
   actually tries to execute all of them at that same instant,
   that is going to be *asking* for trouble.  And since all the
   connections go to the same master, it's probably not too happy
   about being flooded with TCP DNS connections either.  The old
   "thundering herd" concept in a new disguise?
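
For illustration, the kind of concurrency cap asked about in 2) can
be sketched in a few lines of Python.  This is not OpenDNSSEC code:
run_limited(), FakeMaster and the limit of 4 are all hypothetical,
and the "transfer" is just a short sleep standing in for an AXFR.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_limited(zones, transfer, limit):
    """Call transfer(zone) for every zone, never more than `limit` at once."""
    with ThreadPoolExecutor(max_workers=limit) as pool:
        # All zones are queued, but only `limit` worker threads exist,
        # so at most `limit` transfers (and their sockets) are live.
        return list(pool.map(transfer, zones))


class FakeMaster:
    """Hypothetical stand-in for the hidden master; records the peak load."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def transfer(self, zone):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # pretend to do the zone transfer
        with self._lock:
            self.active -= 1
        return zone


if __name__ == "__main__":
    master = FakeMaster()
    zones = ["zone%d.example" % i for i in range(39)]
    run_limited(zones, master.transfer, limit=4)
    print("peak simultaneous transfers:", master.peak)
```

With a cap like this, the number of sockets open towards the master
is bounded by the limit rather than by the number of configured
zones.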

3) At one time ods-signerd logged

   Jul 10 15:12:00 hugin ods-signerd: [parser] unable to parse file /var/opendnssec/signconf/46.39.128.in-addr.arpa.xml: xmlParseFile() failed

   There's no decoded errno here.  Did that fail because it could
   not open the file due to already having too many open files?
   It doesn't say, but it should, especially when it later logs
   the probably highly misleading sequence:

   Jul 10 15:12:00 hugin ods-signerd: [signconf] unable to update signconf: signconf /var/opendnssec/signconf/46.39.128.in-addr.arpa.xml has errors
   Jul 10 15:12:00 hugin ods-signerd: [zone] unable to load signconf for zone 46.39.128.in-addr.arpa: signconf /var/opendnssec/signconf/46.39.128.in-addr.arpa.xml Configuration error
   Jul 10 15:12:00 hugin ods-signerd: [tools] unable to load signconf for zone 46.39.128.in-addr.arpa: Configuration error
   Jul 10 15:12:00 hugin ods-signerd: [worker[4]] CRITICAL: failed to sign zone 46.39.128.in-addr.arpa: Configuration error
   Jul 10 15:12:00 hugin ods-signerd: [worker[4]] backoff task [configure] for zone 46.39.128.in-addr.arpa with 60 seconds

   when the actual failure was probably that ods-signerd managed to
   run itself out of available file descriptors, instead of me as an
   operator having committed a "CRITICAL Configuration error".
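
As a sketch of the error reporting asked for in 3), the Python below
keeps the OS-level reason when a file cannot be opened, so that
"too many open files" is not reported as a parse or configuration
error.  It is not OpenDNSSEC code; read_signconf() and
describe_open_failure() are hypothetical names.

```python
import errno
import os


def describe_open_failure(path, exc):
    """Build a log message that includes the decoded errno."""
    if exc.errno is None:
        # Some OSError instances carry no errno; fall back to their text.
        return "unable to open %s: %s" % (path, exc)
    name = errno.errorcode.get(exc.errno, "unknown errno")
    return "unable to open %s: %s (%s)" % (path, os.strerror(exc.errno), name)


def read_signconf(path):
    """Return the raw file contents, or raise with the real reason attached."""
    try:
        with open(path, "rb") as fh:
            return fh.read()
    except OSError as exc:
        # Report the open failure as such, before any XML parsing happens.
        raise RuntimeError(describe_open_failure(path, exc)) from exc
```

Running out of descriptors would then show up with EMFILE in the
message, which points the operator at the resource limit instead of
at the signconf file.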

4) When I tried restarting ods-signerd and left behind the tmp/
   directory, ods-signerd complained about "bad ixfr journal:
   trailing RRs after final SOA" for a lot of the journal files.
   I've done nothing "special" -- I assume this should not happen
   under any circumstances?

5) Once I had raised the FD resource limit, ods-signerd got as far
   as checking the zone expirations, and there it seems a
   conversion from "relative time" (as in the SOA expire field)
   to absolute time is missing, as it complained e.g.:

   Jul 10 15:08:52 hugin ods-signerd: [axfr] zone 39.128.in-addr.arpa expired at 3600000, and it is now 1436533732: not serving soa

   The second value is an absolute timestamp, the first looks like
   a relative interval; the two can't be compared directly, yet
   that appears to be what happens here.
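
To make the arithmetic in 5) concrete, here is a small Python sketch.
It assumes, as suspected above, that 3600000 is the SOA expire
interval in seconds; the time the zone was last refreshed is not in
the log, so the value used for it below is made up, and the function
names are hypothetical.

```python
import time


def expiry_time(acquired_at, soa_expire):
    """Absolute expiry: time of last successful refresh plus SOA expire."""
    return acquired_at + soa_expire


def is_expired(acquired_at, soa_expire, now=None):
    """Compare two absolute timestamps, never an interval with a clock."""
    if now is None:
        now = time.time()
    return now >= expiry_time(acquired_at, soa_expire)


if __name__ == "__main__":
    now = 1436533732       # "it is now" from the log line
    soa_expire = 3600000   # assumed to be the SOA expire interval
    acquired_at = now - 60  # made-up: zone refreshed a minute earlier
    print("naive comparison says expired:", now >= soa_expire)
    print("with conversion says expired:",
          is_expired(acquired_at, soa_expire, now=now))
```

Comparing the raw interval against the clock makes every zone look
expired, since any current Unix time is far larger than 3600000.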

6) ods-signerd had to be restarted (again), and that time it
   exited shortly thereafter with

   Jul 10 15:09:10 hugin ods-signerd: signer/ixfr.c:230: part_print: assertion part->soamin failed

   Not pretty, not robust.  At this time I concluded that
   ods-signerd would not be able to recover with the old data in its
   tmp/ directory present, so ...

7) I moved OpenDNSSEC's tmp/ directory away, raised the FD
   resource limit, and restarted the process pair (enforcer and
   signer), and they're finally up and running again.

8) There appears to be a lack of "work flow control" internally
   in ods-signerd -- even after I successfully restarted the
   process pair, I see log entries such as this:

   Jul 10 15:12:34 hugin ods-signerd: [tools] unable to read zone 2.39.128.in-addr.arpa: adapter failed (Incoming zone transfer not ready)
   Jul 10 15:12:34 hugin ods-signerd: [worker[1]] backoff task [read] for zone 2.39.128.in-addr.arpa with 60 seconds

   It seems primitive to do "time-based poll waiting".  Can't the
   two tasks simply synchronize their activities a bit better, so
   that the consumer doesn't even try to use the data before it's
   ready, and this superfluous logging goes away?
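
The kind of synchronization meant in 8) can be sketched with a
condition variable: the reading task blocks until the transfer task
says the zone is there, instead of retrying every 60 seconds.  This
is a Python illustration only; ZoneInbox and its methods are
hypothetical and say nothing about how ods-signerd is structured
internally.

```python
import threading


class ZoneInbox:
    """Hands a transferred zone from the transfer task to the read task."""

    def __init__(self):
        self._cond = threading.Condition()
        self._data = None

    def deliver(self, data):
        """Called by the transfer task once the incoming zone is complete."""
        with self._cond:
            self._data = data
            self._cond.notify_all()

    def wait_for_zone(self, timeout=None):
        """Block until a zone has been delivered; None if the timeout hits."""
        with self._cond:
            ready = self._cond.wait_for(
                lambda: self._data is not None, timeout)
            return self._data if ready else None


if __name__ == "__main__":
    inbox = ZoneInbox()
    # Simulate the transfer finishing a little later, in another thread.
    threading.Timer(0.05, inbox.deliver, args=("zone data",)).start()
    print(inbox.wait_for_zone(timeout=5))
```

The consumer wakes up as soon as the data is ready, and there is
nothing to log in the meantime.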

Is it just me, or does restarting ods-signerd after an abnormal exit
appear to be more than just a little ... fragile?  I know I run
OpenDNSSEC on an OS where the default resource limits are quite low,
some would say "ridiculously low" compared to "modern" defaults --
the default FD limit is as low as 128.  However, I appear to be
hitting this limit with only 39 configured zones, and if
ods-signerd's desired consumption of file descriptors can turn out
to be proportional to the number of configured zones, that surely
must spell Serious Trouble somewhere further down the line, and not
just for me?
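
For reference, checking and raising the per-process FD limit looks
roughly like this in Python on a Unix system.  It only illustrates
the soft/hard limit mechanics; raise_fd_limit() is a hypothetical
helper, and the value 1024 is an arbitrary example, not a
recommendation for OpenDNSSEC.

```python
import resource


def raise_fd_limit(wanted):
    """Raise the soft RLIMIT_NOFILE towards `wanted`; never lower it."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit cannot exceed the hard limit without privileges.
    if hard == resource.RLIM_INFINITY:
        target = wanted
    else:
        target = min(wanted, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft, hard before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("soft after:", raise_fd_limit(1024))
```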

My confidence in OpenDNSSEC's operational stability is somewhat
shaken by this experience...

Regards,

- Håvard