[Opendnssec-develop] Transient HSM problem handling

Rick van Rein rick at openfortress.nl
Wed Sep 21 08:06:33 UTC 2011


Hello all,

I am researching a problem with our HSMs in high-availability
(replicated) mode: we are seeing split-brain problems when keys
are generated on one HSM but not always passed on to the other.
The same may also happen with the other write operation, namely
key removal.  The underlying problem may be caused by the
PKCS #11 networking code or by the network itself.

I wonder what OpenDNSSEC does when it runs into such transient
problems?  It appears that the signer recovers itself (although
it will complain if it fails to find keys, of course) and that
the enforcer crashes (when it gets CKR_DEVICE_ERROR).

Aside from the observed behaviour, what is the intended behaviour?
Would it be correct to state that transient errors mean waiting
a bit longer and trying again later?  Has this actually been
taken into account in the design?  It is definitely hairy, as
delayed HSMs could lead to zones running out of signatures, so
I can imagine that not making explicit design choices causes
various kinds of behaviour inside OpenDNSSEC.
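
To make the "wait a bit longer and try again later" option concrete,
here is a minimal sketch of a bounded retry with exponential backoff.
It is not OpenDNSSEC code and does not describe what the enforcer or
signer do today.  The names call_with_retry and make_flaky_hsm are
hypothetical, the HSM call is a stand-in stub, and treating only
CKR_DEVICE_ERROR as transient is an assumption made for illustration.

```python
import time

# PKCS #11 return values, with the numeric values used in the PKCS #11 headers.
CKR_OK = 0x00000000
CKR_DEVICE_ERROR = 0x00000030

# Assumption: which return values count as transient is a policy choice.
# Only CKR_DEVICE_ERROR is listed, since it is the one observed here.
TRANSIENT = {CKR_DEVICE_ERROR}


def call_with_retry(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it returns a non-transient value.

    Waits base_delay * 2**n seconds between attempts and gives up after
    max_attempts calls.  Returns (last return value, attempts made).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        rv = operation()
        if rv not in TRANSIENT:
            return rv, attempt
        if attempt < max_attempts:
            # Back off before the next try: 1x, 2x, 4x, ... base_delay.
            sleep(base_delay * 2 ** (attempt - 1))
    return rv, max_attempts


def make_flaky_hsm(failures):
    """Hypothetical stand-in for an HSM call: fails `failures` times, then succeeds."""
    remaining = [failures]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            return CKR_DEVICE_ERROR
        return CKR_OK

    return operation


if __name__ == "__main__":
    # Use a short base delay so the demonstration finishes quickly.
    rv, attempts = call_with_retry(make_flaky_hsm(2), base_delay=0.01)
    print("rv=0x%08x after %d attempt(s)" % (rv, attempts))
```

The bound on attempts matters because of the signature-expiry risk
mentioned above: a caller that retries forever hides the problem,
while one that gives up can raise an alarm before zones run out of
signatures.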

I will put this on the agenda for today's meeting, but wanted
to document the problems a bit beforehand.


Cheers,
 -Rick


