[Opendnssec-develop] How to replicate signer-stuck with SoftHSM
Rick van Rein (OpenFortress)
rick at openfortress.nl
Mon May 13 07:46:23 UTC 2013
Hi Rickard,
Good to hear from you.
> The Enforcer will never tell the Signer to use a key before it has been created with C_GenerateKeyPair. Could it be that your HSM returns from this function call before the key is available in the HSM (and synchronized within the cluster)?
That is one option that I'm contemplating. SafeNet is rather strict in their implementation of PKCS #11, although they are not flawless. But the signer should never do things that lead to deadlock. So it could go either way, and we're investigating which party to ask to fix the bug.
The two-out-of-four fault rate seen so far when adding multiple zones at once would match the one-out-of-two selection of a reading HSM from our replicated set.
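To make that arithmetic concrete, here is a toy model in Python. It is only a sketch of the hypothesis, not of SafeNet's actual replication: it assumes a fresh key exists on exactly one HSM of the set at the moment the signer looks for it, and that each read lands on an HSM chosen uniformly at random. The function name and parameters are made up for illustration.

```python
import random


def simulate_miss_rate(trials=10_000, replicas=2, seed=0):
    """Estimate how often a read misses a key that is not yet replicated.

    Assumptions (hypothetical, for illustration only):
    - the new key is present on exactly one replica,
    - the reader is routed to a replica chosen uniformly at random.
    """
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        writer = rng.randrange(replicas)  # replica holding the fresh key
        reader = rng.randrange(replicas)  # replica the signer happens to read
        if reader != writer:
            misses += 1                   # the signer does not see the key yet
    return misses / trials


if __name__ == "__main__":
    # With two replicas the estimate should come out near one half.
    print(simulate_miss_rate())
```

With two HSMs in the set, this model predicts a miss on roughly half of the reads, which is in line with two failures out of four zones. Four observations are of course far too few to confirm anything.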
A variation might be that PKCS #11 allows certain liberties that only surface when the command creating a key differs from the command using it. I seem to recall, but have not yet been able to find again, that one process does not always get to see updates made by another. If the signer reads the entire zone list, including zones it has not seen before, and only then reopens the HSM slot, things could go awry.
In general, however, the fault pattern seems to be caused by the signer reading the zone list while the Enforcer is still updating a zone that the signer does not know yet. The new zone list then includes zones that have no keys assigned yet, which could lead to exceptional behaviour. SoftHSM avoids this behaviour, probably because a global lock keeps access with the Enforcer until it is entirely done. Could you confirm that the SoftHSM lock is global?
We do see the signer report that it will try again on the extra zones that it finds too early in the zone list, but it does not actually do this and instead it locks down entirely.
With 1.3.14, we'll have a debug-locks command listing the locks of the signer, which should prove helpful.
Cheers,
-Rick