[Opendnssec-user] ods-signerd: robustness & resource demands?

Mon Jul 13 14:38:57 UTC 2015

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Håvard,

Thank you for your detailed rapport. Sorry to hear the aftermath of
this crash turned out to be so catastrophic. We'll look in to it.

Regards,
Yuri

On 10-07-15 16:12, Havard Eidnes wrote:
> Hi,
> 
> we're running OpenDNSSEC 1.4.7.
> 
> Today we experienced a "silent" exit or crash of ods-signerd in our
> installation (no trace in the log file).
> 
> When trying to restart OpenDNSSEC, I came across the following 
> issues:
> 
> 1) It appears that ods-signerd does not gracefully handle running 
> out of available file descriptors.  It becomes quite upset when it
> fails to read some of the zone-specific xml files.
> 
> 2) Is there no concurrency limit to e.g. outgoing zone transfer 
> requests, so that ods-signerd will in peaks (e.g. right after 
> startup) try to use a multiple of <number of configured zones> file
> descriptors?  When I finally moved away the tmp/ files of 
> OpenDNSSEC, and remembered to raise the FD limit (more below), 
> ods-signerd logs "zone <zone> request axfr to <hidden-master>" at
> the exact same second for all the configured zones.  If it actually
> tries to execute all of them at that same instant, that is going to
> be *asking* for trouble.  And all connections go to the same master
> too, it's probably also not too happy about being flooded with TCP
> DNS connections.  The old concept "thundering herd" in new
> disguise?
> 
> 3) At one time ods-signerd logged
> 
> Jul 10 15:12:00 hugin ods-signerd: [parser] unable to parse file
> /var/opendnssec/signconf/46.39.128.in-addr.arpa.xml: xmlParseFile()
> failed
> 
> There's no decoded errno here.  Did that fail because it could not
> open the file due to already having too many open files? It doesn't
> say, but it should, especially when it later logs the probably
> highly misleading sequence:
> 
> Jul 10 15:12:00 hugin ods-signerd: [signconf] unable to update
> signconf: signconf
> /var/opendnssec/signconf/46.39.128.in-addr.arpa.xml has errors Jul
> 10 15:12:00 hugin ods-signerd: [zone] unable to load signconf for
> zone 46.39.128.in-addr.arpa: signconf
> /var/opendnssec/signconf/46.39.128.in-addr.arpa.xml Configuration
> error Jul 10 15:12:00 hugin ods-signerd: [tools] unable to load
> signconf for zone 46.39.128.in-addr.arpa: Configuration error Jul
> 10 15:12:00 hugin ods-signerd: [worker[4]] CRITICAL: failed to sign
> zone 46.39.128.in-addr.arpa: Configuration error Jul 10 15:12:00
> hugin ods-signerd: [worker[4]] backoff task [configure] for zone
> 46.39.128.in-addr.arpa with 60 seconds
> 
> when the actual failure was probably that ods-signerd managed to 
> run itself out of available file descriptors, instead of me as an 
> operator having commited a "CRITICAL Configuration error".
> 
> 4) When I tried restarting ods-signerd and left behind the tmp/ 
> directory, ods-sginerd complained about "bad ixfr journal: trailing
> RRs after final SOA" for a lot of the journal files. I've done
> nothing "special" -- I assume this should not happen under any
> circumstances?
> 
> 5) When I raised the FD resource limit, it then came to checking 
> the zone expirations, and it seems there's a missing conversion
> from "relative time" (as in the SOA expire field) to absolute time,
> as it complained e.g.:
> 
> Jul 10 15:08:52 hugin ods-signerd: [axfr] zone 39.128.in-addr.arpa
> expired at 3600000, and it is now 1436533732: not serving soa
> 
> One is absolute (the last one), the other is a relative timestamp,
> and the two can't be directly compared, as appears to happen here.
> 
> 6) ods-signerd had to be restarted (again), and that time it exited
> shortly thereafter with
> 
> Jul 10 15:09:10 hugin ods-signerd: signer/ixfr.c:230: part_print:
> assertion part->soamin failed
> 
> Not pretty, not robust.  At this time I concluded that ods-signerd
> would not be able to recover with the old data in its tmp/
> directory present, so ...
> 
> 7) I moved OpenDNSSEC's tmp/ directory away, raised the FD resource
> limit, and restarted the process pair (enforcer and signer), and
> they're finally up and running again.
> 
> 8) There appears to be a lack of "work flow control" internally in
> ods-signerd -- even after I successfully restarted the process
> pair, I see log entries such as this:
> 
> Jul 10 15:12:34 hugin ods-signerd: [tools] unable to read zone
> 2.39.128.in-addr.arpa: adapter failed (Incoming zone transfer not
> ready) Jul 10 15:12:34 hugin ods-signerd: [worker[1]] backoff task
> [read] for zone 2.39.128.in-addr.arpa with 60 seconds
> 
> It seems primitive to do "time-based poll waiting", can't the two 
> tasks simply synchronize their activities a bit better, so that the
> consumer doesn't even try to use the data before it's ready and
> this superfluous logging be removed?
> 
> Is it just me, or does restarting ods-signerd after an abnormal
> exit appear to be more than just a little ... fragile?  I know I
> run OpenDNSSEC on an OS where the default resource limits are quite
> low, some would say "ridiculously low" compared to "modern"
> defaults -- the default FD limit is as low as 128.  However, I
> appear to be hitting this limit with only 39 configured zones, and
> if ods-sginerd's desired consumption of file descriptors can turn
> out to be proportional to the number of configured zones, that
> surely must spell Serious Trouble somewhere further down the line,
> and not just for me?
> 
> My confidence in OpenDNSSEC's operational stability is somewhat 
> shaken by this experience...
> 
> Regards,
> 
> - Håvard _______________________________________________ 
> Opendnssec-user mailing list Opendnssec-user at lists.opendnssec.org 
> https://lists.opendnssec.org/mailman/listinfo/opendnssec-user
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iEYEARECAAYFAlWjzYEACgkQI3PTR4mhavgTvgCdF0WeXaaaqOD0D1mbVULyvMgI
Ra8An0O5iEvBdwJaVn8It6MmyCjS/Tvr
=Viv4
-----END PGP SIGNATURE-----