[Opendnssec-develop] ods-signer startup time issue
Rick van Rein
rick at openfortress.nl
Wed Feb 10 13:12:56 UTC 2016
We've experienced similar scalability problems with OpenDNSSEC at SURFnet. We don't have that many zones, but still need to wait 20-30 minutes for the Signer to start again.
> Each zone was loaded one at a time and checked against the softHSM. The time between each new zone appears to increase with O(N^2). [...] So we think that this might be an issue within the code itself.
Yes, the code is known to be troubled in this respect.
I have analysed libhsm, but that project was dropped because the problems were so abundant and severe that the only sensible advice was to drop libhsm altogether (!) or to start it again from scratch. The libhsm component is a remnant of the project's infancy and was not designed with issues like these in mind.
What appears to happen here is this:
1. The zone list is loaded, and the Signer iterates over it
2. For each zone, a key is searched in the HSM with a separate C_FindObjects() search
3. Some PKCS #11 implementations are known to implement this as a linear search through their object space
Step 2 runs once per zone, so N times, and each search in step 3 costs O(N); together they are O(N^2). Had this been done without libhsm's mutual isolation between PKCS #11 knowledge and Signer knowledge, then the following would have been prudent:
1. The zone list is loaded, and a list of all CKA_ID values from .signconf prepared
2. Iterate over PKCS #11 keys in *one* C_FindObjects() search
3. For each find, pass through the list to find the matching CKA_ID
Although step 3 is still up to O(N), it is now reachable code that can be optimised to O(log(N)) or better, unlike the current tediously slow PKCS #11 inner loop.
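To make the difference concrete, here is a rough sketch in Python. It is a simulation only: FakeToken is a hypothetical stand-in for a token whose search is a linear scan, not a PKCS #11 binding, and the function names are made up for illustration. It also assumes one key per zone, which real zones do not have. The counter shows how many token objects each approach has to visit.

```python
class FakeToken:
    """Hypothetical stand-in for a token that searches its objects linearly."""

    def __init__(self, cka_ids):
        self.objects = list(cka_ids)
        self.visited = 0  # number of objects touched so far

    def find_by_id(self, cka_id):
        # One separate search per call, scanning from the start each time.
        for obj in self.objects:
            self.visited += 1
            if obj == cka_id:
                return obj
        return None

    def iterate_all(self):
        # One search without a CKA_ID filter, yielding every object once.
        for obj in self.objects:
            self.visited += 1
            yield obj


def load_per_zone(token, zone_ids):
    """Current pattern: one token search per zone, O(N^2) in total."""
    return {zone: token.find_by_id(cka_id) for zone, cka_id in zone_ids.items()}


def load_single_pass(token, zone_ids):
    """Proposed pattern: one pass over the token, matching against a lookup table."""
    # Assumption: one key per zone, so CKA_ID -> zone is a plain dict.
    zone_by_id = {cka_id: zone for zone, cka_id in zone_ids.items()}
    found = {}
    for obj in token.iterate_all():
        zone = zone_by_id.get(obj)  # average O(1) instead of a list scan
        if zone is not None:
            found[zone] = obj
    return found


if __name__ == "__main__":
    zones = {f"zone{i}.example": f"{i:04x}" for i in range(100)}
    slow, fast = FakeToken(zones.values()), FakeToken(zones.values())
    load_per_zone(slow, zones)
    load_single_pass(fast, zones)
    print(slow.visited, fast.visited)
```

With 100 zones the per-zone variant visits 5050 objects and the single pass visits 100. The dict lookup is where the "O(log(N)) or better" comes from; a sorted list with binary search would do as well.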
I would advise the Signer team to consider p11-kit for doing this. It has a nice module management and object iteration interface that can even iterate over multiple HSMs. The way p11-kit does this is neat. And you still have access to C_SignInit() and such, without the bug-filled and efficiency-stifling interference of libhsm.
Supporting multiple HSMs is very useful during HSM rollovers; future Signers could support this through a model with PKCS #11 URIs as standardised in RFC 7512. Briefly put, that would mean that the CKA_ID in the current .signconf files is replaced with a pkcs11: URI that includes the CKA_ID but also token-identifying information.
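As an illustration of what such a reference could look like, here is a small Python sketch that turns a hex CKA_ID plus a token label into a pkcs11: URI and back. The function names and the token label are hypothetical, and adding type=private is my own assumption about what a Signer would want; only the "token", "id" and "type" attributes and the percent-encoding come from RFC 7512.

```python
from urllib.parse import quote, unquote, unquote_to_bytes


def cka_id_to_uri(cka_id_hex, token_label):
    """Build a pkcs11: URI from a hex CKA_ID and a token label."""
    raw = bytes.fromhex(cka_id_hex)
    # The id is binary data, so percent-encode every byte of it.
    id_part = "".join(f"%{byte:02X}" for byte in raw)
    token_part = quote(token_label, safe="")
    # Assumption: the Signer refers to the private key object.
    return f"pkcs11:token={token_part};id={id_part};type=private"


def uri_to_cka_id(uri):
    """Recover (hex CKA_ID, token label) from a pkcs11: URI."""
    if not uri.startswith("pkcs11:"):
        raise ValueError("not a pkcs11: URI")
    # Path attributes come before an optional '?' query part.
    path = uri[len("pkcs11:"):].split("?", 1)[0]
    attrs = dict(part.split("=", 1) for part in path.split(";") if part)
    return unquote_to_bytes(attrs["id"]).hex(), unquote(attrs["token"])


if __name__ == "__main__":
    uri = cka_id_to_uri("0a1b", "SoftHSM zone keys")
    print(uri)
    print(uri_to_cka_id(uri))
```

A .signconf carrying such a URI still contains the CKA_ID, so an old-style lookup remains possible while the token attribute tells a newer Signer which HSM to ask.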
Interestingly... this change can be a gradual process. There is no reason why the zone loader couldn't already use normal PKCS #11 calls (possibly with p11-kit support added) even if the continuous signing operations continue to be done with libhsm.
> Is this a known scaling issue or should we start looking for other software that is able to hold bigger volume of zones?
I have often asked for libhsm to be removed from the project, because it is quite simply a liability to its stability and even security. The idea that this move could be a gradual transition, starting with zone list bootstrapping, is new though, and it is fully in line with PKCS #11 intentions. So perhaps it is time to reconsider.
Thanks for reporting this, Markus. If OpenDNSSEC scales so badly it is in real need of some serious fixing, IMHO.