[Opendnssec-user] ods-signerd: robustness & resource demands?
Havard Eidnes
he at uninett.no
Fri Jul 10 14:12:46 UTC 2015
Hi,
we're running OpenDNSSEC 1.4.7.
Today we experienced a "silent" exit or crash of ods-signerd in
our installation (no trace in the log file).
When trying to restart OpenDNSSEC, I came across the following
issues:
1) It appears that ods-signerd does not gracefully handle running
out of available file descriptors. It becomes quite upset
when it fails to read some of the zone-specific xml files.
2) Is there no concurrency limit to e.g. outgoing zone transfer
requests, so that ods-signerd will in peaks (e.g. right after
startup) try to use a multiple of <number of configured zones>
file descriptors? When I finally moved away the tmp/ files of
OpenDNSSEC, and remembered to raise the FD limit (more below),
ods-signerd logs "zone <zone> request axfr to <hidden-master>"
at the exact same second for all the configured zones. If it
actually tries to execute all of them at that same instant,
that is going to be *asking* for trouble. And all connections
go to the same master too, it's probably also not too happy
about being flooded with TCP DNS connections. The old concept
"thundering herd" in new disguise?
3) At one time ods-signerd logged
Jul 10 15:12:00 hugin ods-signerd: [parser] unable to parse file /var/opendnssec/signconf/46.39.128.in-addr.arpa.xml: xmlParseFile() failed
There's no decoded errno here. Did that fail because it could
not open the file due to already having too many open files?
It doesn't say, but it should, especially when it later logs
the probably highly misleading sequence:
Jul 10 15:12:00 hugin ods-signerd: [signconf] unable to update signconf: signconf /var/opendnssec/signconf/46.39.128.in-addr.arpa.xml has errors
Jul 10 15:12:00 hugin ods-signerd: [zone] unable to load signconf for zone 46.39.128.in-addr.arpa: signconf /var/opendnssec/signconf/46.39.128.in-addr.arpa.xml Configuration error
Jul 10 15:12:00 hugin ods-signerd: [tools] unable to load signconf for zone 46.39.128.in-addr.arpa: Configuration error
Jul 10 15:12:00 hugin ods-signerd: [worker[4]] CRITICAL: failed to sign zone 46.39.128.in-addr.arpa: Configuration error
Jul 10 15:12:00 hugin ods-signerd: [worker[4]] backoff task [configure] for zone 46.39.128.in-addr.arpa with 60 seconds
when the actual failure was probably that ods-signerd managed to
run itself out of available file descriptors, instead of me as an
operator having commited a "CRITICAL Configuration error".
4) When I tried restarting ods-signerd and left behind the tmp/
directory, ods-sginerd complained about "bad ixfr journal:
trailing RRs after final SOA" for a lot of the journal files.
I've done nothing "special" -- I assume this should not happen
under any circumstances?
5) When I raised the FD resource limit, it then came to checking
the zone expirations, and it seems there's a missing
conversion from "relative time" (as in the SOA expire field)
to absolute time, as it complained e.g.:
Jul 10 15:08:52 hugin ods-signerd: [axfr] zone 39.128.in-addr.arpa expired at 3600000, and it is now 1436533732: not serving soa
One is absolute (the last one), the other is a relative
timestamp, and the two can't be directly compared, as appears
to happen here.
6) ods-signerd had to be restarted (again), and that time it
exited shortly thereafter with
Jul 10 15:09:10 hugin ods-signerd: signer/ixfr.c:230: part_print: assertion part->soamin failed
Not pretty, not robust. At this time I concluded that
ods-signerd would not be able to recover with the old data in its
tmp/ directory present, so ...
7) I moved OpenDNSSEC's tmp/ directory away, raised the FD
resource limit, and restarted the process pair (enforcer and
signer), and they're finally up and running again.
8) There appears to be a lack of "work flow control" internally
in ods-signerd -- even after I successfully restarted the
process pair, I see log entries such as this:
Jul 10 15:12:34 hugin ods-signerd: [tools] unable to read zone 2.39.128.in-addr.arpa: adapter failed (Incoming zone transfer not ready)
Jul 10 15:12:34 hugin ods-signerd: [worker[1]] backoff task [read] for zone 2.39.128.in-addr.arpa with 60 seconds
It seems primitive to do "time-based poll waiting", can't the two
tasks simply synchronize their activities a bit better, so that
the consumer doesn't even try to use the data before it's ready
and this superfluous logging be removed?
Is it just me, or does restarting ods-signerd after an abnormal exit
appear to be more than just a little ... fragile? I know I run
OpenDNSSEC on an OS where the default resource limits are quite low,
some would say "ridiculously low" compared to "modern" defaults --
the default FD limit is as low as 128. However, I appear to be
hitting this limit with only 39 configured zones, and if
ods-sginerd's desired consumption of file descriptors can turn out
to be proportional to the number of configured zones, that surely
must spell Serious Trouble somewhere further down the line, and not
just for me?
My confidence in OpenDNSSEC's operational stability is somewhat
shaken by this experience...
Regards,
- Håvard
More information about the Opendnssec-user
mailing list