[Opendnssec-develop] Transient HSM problem handling

Wed Sep 21 08:29:50 UTC 2011

> I am researching a problem with our HSMs in high-availability
> (replicated) mode, as we are seeing split-brain problems when
> it generates keys on one HSM but does not always pass them over
> to the other.  The same may also happen with the other writing
> operation, namely key removal.  The underlying problem may be
> caused by the PKCS #11 networking code or the network itself.

We had an issue this weekend where the HSMs were not in sync. Both
HSMs were still available in the cluster. It was fixed by doing a
manual synchronization, but it would be good if this was done
automatically. Then again, we did not have this option set (and I am
not sure whether it would have fixed it):

sudo ./vtl haAdmin -autoRecovery -retry 250

I have also enabled logging for HA mode to get more information if it happens again.

> I wonder what OpenDNSSEC does when it runs into such transient
> problems?  It appears that the signer recovers itself (although
> it will complain if it fails to find keys, of course) and that
> the enforcer crashes (when it gets CKR_DEVICE_ERROR).

The only "problem" is that you get a lot of "[hsm] unable to get key:
key 2f6aced76ef88a49a902ccaccf11cbfa not found" in syslog.

Yes, the Enforcer should have more checks, so that it does not crash.
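
To illustrate the kind of check I mean, here is a rough sketch. It is
not OpenDNSSEC code: the classify() helper and the choice of which
return values count as transient are my own assumptions for
illustration. Only the CKR_* numeric values come from the PKCS #11
specification.

```python
# PKCS #11 return values (numeric values as defined in the specification).
CKR_OK = 0x00000000
CKR_DEVICE_ERROR = 0x00000030
CKR_DEVICE_REMOVED = 0x00000032
CKR_TOKEN_NOT_PRESENT = 0x000000E0

# Assumption: these codes are treated as possibly transient, so the
# caller should retry later instead of aborting. The real set would
# need to be decided per HSM and per operation.
TRANSIENT_CODES = {CKR_DEVICE_ERROR, CKR_DEVICE_REMOVED, CKR_TOKEN_NOT_PRESENT}


def classify(rv):
    """Map a PKCS #11 return value to 'ok', 'transient' or 'fatal'."""
    if rv == CKR_OK:
        return "ok"
    if rv in TRANSIENT_CODES:
        return "transient"
    return "fatal"
```

With something like this, a CKR_DEVICE_ERROR would be reported and
retried later rather than taking the whole daemon down.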

> Aside from the observed behaviour; what is the intended behaviour?
> Would it be correct to state that transient errors mean waiting
> a bit longer and trying again later?  Has this actually been
> considered as a design consideration?  It is definitely hairy, as
> delayed HSMs could lead to zones running out of signatures, so
> I can imagine that not making explicit design choices causes
> various kinds of behaviour inside OpenDNSSEC.

The intended behaviour is to back off and try again later.
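
As a minimal sketch of what "back off and try again later" could look
like (again, not the actual OpenDNSSEC implementation; the exception
class, function names, delays and attempt count are all hypothetical):

```python
import time


class TransientHsmError(Exception):
    """Hypothetical marker for an HSM error that may go away on retry."""


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=5):
    """Yield 'attempts' delays that grow by 'factor' up to 'cap' seconds."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def call_with_backoff(operation, delays, sleep=time.sleep):
    """Run operation(); on a transient error wait and try again.

    After the delays are used up, one final attempt is made and any
    error from it is passed on to the caller.
    """
    for delay in delays:
        try:
            return operation()
        except TransientHsmError:
            sleep(delay)
    return operation()
```

The cap matters here: as noted above, a zone can run out of signatures
if the HSM stays away too long, so the retries should end in a clear
error well before signatures expire rather than backing off forever.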

// Rickard