From rick at openfortress.nl Wed Feb 10 13:12:56 2016
From: rick at openfortress.nl (Rick van Rein)
Date: Wed, 10 Feb 2016 14:12:56 +0100
Subject: [Opendnssec-develop] ods-signer startup time issue
In-Reply-To: <553FDBC0-4056-4B44-9BC8-F002BF50BF5D@prolocation.net>
References: <553FDBC0-4056-4B44-9BC8-F002BF50BF5D@prolocation.net>
Message-ID: <56BB3758.2010204@openfortress.nl>

Hello,

We've experienced similar scalability problems with OpenDNSSEC at SURFnet. We don't have that many zones, but still need to wait 20-30 minutes for the Signer to start again.

> Each zone was loaded one at a time and checked against the softHSM. The time in between each new zone appears to increase with O(N^2). [...] So we think that this might be an issue within the code itself.

Yes, the code is known to be troubled in this respect.

I have analysed libhsm, but that project was dropped because the problems were so abundant and severe that the only sensible advice was to drop libhsm altogether (!) or start it from scratch. The libhsm component is a remnant from the project's infant phase and was not designed for issues like these.

What appears to happen here is this:

 1. The zone list is loaded, and the Signer iterates over it
 2. For each zone, a key is searched in the HSM with a separate C_FindObjects() search
 3. Some PKCS #11 implementations are known to implement this as a linear search through their object space

Step 3 is O(N) and step 2 is O(N), so together they are O(N^2). Had this been done without libhsm's mutual isolation between PKCS #11 knowledge and Signer knowledge, then the following would have been prudent:

 1. The zone list is loaded, and a list of all CKA_ID values from .signconf is prepared
 2. Iterate over PKCS #11 keys in *one* C_FindObjects() search
 3. For each find, pass through the list to find the matching CKA_ID

Although step 3 is still up to O(N), it is now reachable code that can be optimised to O(log(N)) or better, unlike the current tediously slow PKCS #11 inner loop.

I would advise the Signer team to consider p11-kit for doing this. It has a nice module management and object iteration interface that can even iterate over multiple HSMs [1]. The way p11-kit is doing this stuff is neat. And you have access to C_SignInit() and such, without the bug-filled and efficiency-stifling interference of libhsm.

[1] Supporting multiple HSMs is very useful during HSM rollovers; future Signers could support this through a model with PKCS #11 URIs as standardised in RFC 7512. Briefly put, that would mean that the CKA_ID in the current .signconf files is replaced with a pkcs11: URI that includes the CKA_ID but also token-identifying info.

Interestingly... this change can be a gradual process. There is no reason why the zone loader couldn't already use normal PKCS #11 calls (possibly with p11-kit support added) even if the continuous signing operations continue to be done with libhsm.

> Is this a known scaling issue or should we start looking for other software that is able to hold bigger volume of zones?

I have often asked for libhsm to be removed from the project, because it is quite simply a liability to its stability and even security. The idea that this move could be a gradual transition, starting with zone list bootstrapping, is new though, and it is fully in line with PKCS #11 intentions. So perhaps it is time to reconsider.

Thanks for reporting this, Markus. If OpenDNSSEC scales so badly, it is in real need of some serious fixing, IMHO.

Cheers,
 -Rick

From berry at nlnetlabs.nl Wed Feb 10 13:29:10 2016
From: berry at nlnetlabs.nl (Berry A.W.
 van Halderen)
Date: Wed, 10 Feb 2016 14:29:10 +0100
Subject: [Opendnssec-develop] ods-signer startup time issue
In-Reply-To: <56BB3758.2010204@openfortress.nl>
References: <553FDBC0-4056-4B44-9BC8-F002BF50BF5D@prolocation.net> <56BB3758.2010204@openfortress.nl>
Message-ID: <56BB3B26.7040303@nlnetlabs.nl>

On 02/10/2016 02:12 PM, Rick van Rein wrote:
> Hello,
>
> We've experienced similar scalability problems with OpenDNSSEC at SURFnet. We don't have that many zones, but still need to wait 20-30 minutes for the Signer to start again.
>
>> Each zone was loaded one at a time and checked against the softHSM. The time in between each new zone appears to increase with O(N^2). [...] So we think that this might be an issue within the code itself.
>
> Yes, the code is known to be troubled in this respect.
>
> I have analysed libhsm, but that project was dropped because the problems were so abundant and severe that the only sensible advice was to drop libhsm altogether (!) or start it from scratch. The libhsm component is a remnant from the project's infant phase and was not designed for issues like these.
>
> What appears to happen here is this:
>
>  1. The zone list is loaded, and the Signer iterates over it
>  2. For each zone, a key is searched in the HSM with a separate C_FindObjects() search
>  3. Some PKCS #11 implementations are known to implement this as a linear search through their object space
>
> Step 3 is O(N) and step 2 is O(N), so together they are O(N^2). Had this been done without libhsm's mutual isolation between PKCS #11 knowledge and Signer knowledge, then the following would have been prudent:
>
>  1. The zone list is loaded, and a list of all CKA_ID values from .signconf is prepared
>  2. Iterate over PKCS #11 keys in *one* C_FindObjects() search
>  3. For each find, pass through the list to find the matching CKA_ID
>
> Although step 3 is still up to O(N), it is now reachable code that can be optimised to O(log(N)) or better, unlike the current tediously slow PKCS #11 inner loop.
>
> I would advise the Signer team to consider p11-kit for doing this. It has a nice module management and object iteration interface that can even iterate over multiple HSMs [1]. The way p11-kit is doing this stuff is neat. And you have access to C_SignInit() and such, without the bug-filled and efficiency-stifling interference of libhsm.
>
> [1] Supporting multiple HSMs is very useful during HSM rollovers; future Signers could support this through a model with PKCS #11 URIs as standardised in RFC 7512. Briefly put, that would mean that the CKA_ID in the current .signconf files is replaced with a pkcs11: URI that includes the CKA_ID but also token-identifying info.
>
> Interestingly... this change can be a gradual process. There is no reason why the zone loader couldn't already use normal PKCS #11 calls (possibly with p11-kit support added) even if the continuous signing operations continue to be done with libhsm.
>
>> Is this a known scaling issue or should we start looking for other software that is able to hold bigger volume of zones?
>
> I have often asked for libhsm to be removed from the project, because it is quite simply a liability to its stability and even security. The idea that this move could be a gradual transition, starting with zone list bootstrapping, is new though, and it is fully in line with PKCS #11 intentions. So perhaps it is time to reconsider.
>
> Thanks for reporting this, Markus. If OpenDNSSEC scales so badly, it is in real need of some serious fixing, IMHO.
>

Thanks for this report and analysis. I've quickly made an issue for this (OPENDNSSEC-759) so that it won't slip our minds.

I was not aware that some PKCS#11 implementations implement C_FindObjects() as a linear search. That seems bad, but it is much better anyway to use a single find to locate all keys, if that is possible. In any case, this requires quite a bit of code reshuffling.

There are many parts that need improvement and I would not mind replacing libhsm, but I do need to look at things like p11-kit before doing this. Support for multiple HSMs is indeed a big requirement IMHO, so that should not be blocked.

With kind regards,
Berry van Halderen
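
To make the difference between the two lookup strategies discussed in this thread concrete, here is a minimal sketch in Python. It is only a model: the "token" is a plain list of key IDs, the linear scan stands in for the kind of C_FindObjects() implementation Rick describes, and nothing here calls PKCS #11, libhsm or p11-kit. The function names and the generated hex IDs are hypothetical and exist only for illustration.

```python
from bisect import bisect_left


def find_per_zone(token_ids, wanted_ids):
    """Model of the current behaviour: one search per zone,
    each search scanning the token's objects linearly."""
    comparisons = 0
    found = set()
    for wanted in wanted_ids:
        for obj_id in token_ids:
            comparisons += 1
            if obj_id == wanted:
                found.add(wanted)
                break
    return found, comparisons


def find_single_pass(token_ids, wanted_ids):
    """Model of the proposed behaviour: one pass over the token,
    matching each object against a sorted list of wanted IDs."""
    sorted_wanted = sorted(wanted_ids)
    found = set()
    objects_visited = 0
    for obj_id in token_ids:
        objects_visited += 1
        # Binary search: O(log M) per object instead of a linear pass.
        i = bisect_left(sorted_wanted, obj_id)
        if i < len(sorted_wanted) and sorted_wanted[i] == obj_id:
            found.add(obj_id)
    return found, objects_visited


# Hypothetical data: 2000 objects on the token, every second one
# referenced by a zone's signconf.
token_ids = [format(i, "032x") for i in range(2000)]
wanted_ids = token_ids[::2]

per_zone_found, per_zone_cost = find_per_zone(token_ids, wanted_ids)
single_found, single_cost = find_single_pass(token_ids, wanted_ids)

print("per-zone search, ID comparisons:", per_zone_cost)
print("single pass, objects visited:  ", single_cost)
```

Both functions return the same set of keys. In the per-zone model the token is rescanned for every zone, so the work grows with zones times objects; in the single-pass model each object is visited once and the matching step runs in Signer-side code, where it can use a sorted list (as here) or a hash table.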