[Opendnssec-user] Error converting from 1.4.14 to 2.1.8

Havard Eidnes he at uninett.no
Thu Mar 4 23:58:16 UTC 2021


Hi,

we have now made our second failed attempt at converting our
production OpenDNSSEC 1.4.14 installation to 2.1.8.

So far we have found one specific problem and one more vague
problem:


1) We're using <NSEC3> for denial-of-existence.  NSEC3 uses a
"salt" value as an input value to the process.  If we move away
the old /var/opendnssec/signconf/ directory and create it anew,
OpenDNSSEC will populate it with an xml file per zone.  However,
they all have this part:

    <Denial>
      <NSEC3>
        <Hash>
          <Algorithm>1</Algorithm>
          <Iterations>5</Iterations>
          <Salt>0</Salt>
        </Hash>
      </NSEC3>
    </Denial>

This mixes just *terribly* with OpenDNSSEC's code, because of

ods_status
nsec3params_create_salt(const char* salt_str, uint8_t* salt_len,
    uint8_t** salt)
{
...
    *salt_len = (uint8_t) strlen(salt_str);
    if (*salt_len == 1 && salt_str[0] == '-') {
        *salt_len = 0;
        *salt = NULL;
        return ODS_STATUS_OK;
    } else if (*salt_len % 2 != 0) {
        ods_log_error("[%s] invalid salt %s", nsec3_str, salt_str);
        *salt = NULL;
        return ODS_STATUS_ERR;
    }
...

and of course we get the "invalid salt 0" message logged,
together with this sequence:

Mar  4 22:23:02 tilfeldigvis ods-signerd: [file] openfile /var/opendnssec/signconf/4.0.6.6.7.7.7.4.nrenum.net.xml count 1
Mar  4 22:23:02 tilfeldigvis ods-signerd: [nsec3] invalid salt 0
Mar  4 22:23:02 tilfeldigvis ods-signerd: [nsec3] unable to create: create salt failed
Mar  4 22:23:02 tilfeldigvis ods-signerd: [signconf] unable to read signconf /var/opendnssec/signconf/4.0.6.6.7.7.7.4.nrenum.net.xml: nsec3params_create() failed
Mar  4 22:23:02 tilfeldigvis ods-signerd: [signconf] unable to update signconf: failed to read file /var/opendnssec/signconf/4.0.6.6.7.7.7.4.nrenum.net.xml (Memory allocation error)
Mar  4 22:23:02 tilfeldigvis ods-signerd: [zone] unable to load signconf for zone 4.0.6.6.7.7.7.4.nrenum.net: signconf /var/opendnssec/signconf/4.0.6.6.7.7.7.4.nrenum.net.xml Memory allocation error
Mar  4 22:23:02 tilfeldigvis ods-signerd: [tools] unable to load signconf for zone 4.0.6.6.7.7.7.4.nrenum.net: Memory allocation error
Mar  4 22:23:02 tilfeldigvis ods-signerd: WARNING: unable to sign zone 4.0.6.6.7.7.7.4.nrenum.net, signconf is not ready
Mar  4 22:23:02 tilfeldigvis ods-signerd: back-off task [configure] for zone 4.0.6.6.7.7.7.4.nrenum.net with 480 seconds
Mar  4 22:23:02 tilfeldigvis ods-signerd: [scheduler] schedule task [configure] for 4.0.6.6.7.7.7.4.nrenum.net

The "Memory allocation error" messages are just Terribly
Misleading, and might send an operator on a wild goose chase
about adjusting OpenDNSSEC's resource limits.  They're lies, I
tell you!

Now, the old signconf files typically *do* have a non-zero salt
value, e.g.

                <Denial>
                        <NSEC3>
                                <Hash>
                                        <Algorithm>1</Algorithm>
                                        <Iterations>5</Iterations>
                                        <Salt>cd79fa0214ff93b7</Salt>
                                </Hash>
                        </NSEC3>
                </Denial>

and I suspect this data is just a reflection of what's in the
kasp.db file.

So ... did the NSEC3 not get converted / transferred to the new
kasp.db file in the conversion?  Why does it end up as the
single-character 0 in OpenDNSSEC 2?  It ended up as "0" in the
policy table in the converted kasp.db file.  Should it instead
have been NULL or the empty string?

One of my colleagues who helped me doing the conversion used
sqlite3 to change the global config in kasp.db to use NSEC
instead of NSEC3, and that seemed to have brought the process
further along.


2) We seem to have issues related to SoftHSM2, which we're
converting to at the same time.  The problem we're having is that
the level of error messages related to SoftHSM2 is nearly binary:
either it works or it doesn't, and not a word about "why" when it
doesn't.  As an operator this leaves me stumped about what's
really going on and what possible correction can be made.

You might find this set of log messages in the log:

Mar  4 22:28:50 tilfeldigvis ods-signerd: [zone] unable to publish keys for zone 0.0.1.0.0.0.0.0.0.0.7.0.1.0.0.2.ip6.arpa: error creating libhsm context
Mar  4 22:28:50 tilfeldigvis ods-signerd: [tools] unable to read zone 0.0.1.0.0.0.0.0.0.0.7.0.1.0.0.2.ip6.arpa: failed to publish dnskeys (HSM error)
Mar  4 22:28:50 tilfeldigvis ods-signerd: CRITICAL: failed to sign zone 0.0.1.0.0.0.0.0.0.0.7.0.1.0.0.2.ip6.arpa: HSM error
Mar  4 22:28:50 tilfeldigvis ods-signerd: back-off task [read] for zone 0.0.1.0.0.0.0.0.0.0.7.0.1.0.0.2.ip6.arpa with 60 seconds

Of course, it doesn't get any better after 60 seconds.

If we follow the first "error creatign libhsm context", we get to

signer/src/signer/zone.c's zone_publish_dnskeys(), which does:

    /* hsm access */
    if (!skip_hsm_access) {
        ctx = hsm_create_context();
        if (ctx == NULL) {
            ods_log_error("[%s] unable to publish keys for zone %s: "
                "error creating libhsm context", zone_str, zone->name);
            return ODS_STATUS_HSM_ERR;
        }
    }

and if we nest on hsm_create_context() to
libhsm/src/lib/libhsm.c, which basically does hsm_ctx_clone()
based on that module's global _hsm_ctx within locking.

Now, hsm_ctx_clone() does hsm_session_clone(), and returns NULL
if that fails.  hsm_session_clone() in turn does
hsm_get_slot_id(), which may store nice error messages such as

         hsm_ctx_set_error(ctx, HSM_ERROR, "hsm_get_slot_id()",
                          "No slots found in HSM");
         hsm_ctx_set_error(ctx, HSM_ERROR, "hsm_get_slot_id()",
                          "Too many slots found in HSM");
        hsm_ctx_set_error(ctx, HSM_ERROR, "hsm_get_slot_id()",
                          "Could not allocate slot ID table");
        hsm_ctx_set_error(ctx, -1, "hsm_get_slot_id()",
            "could not find token with the name %s", token_name);

in the "ctx" parameter.  However, remember, that is the module-
global _hsm_ctx.  Apparently there's no logging done in libhsm.c
of any errors (it's a library, according to the name), and
_hsm_ctx isn't readily available in the file which calls
hsm_create_context(), so the opportunity of getting a proper
error message to explain *why* the HSM module is unhappy is lost.
I would call this an "interface design error".

Now, I'm not entirely certain how one might go about verifying
that the HSM module is happy with the SoftHSM2 database.  At
least the ods-migrate step succeeded, it said:

ods @ tilfeldigvis: {27} ods-migrate
Reading config file '/usr/pkg/etc/opendnssec/conf.xml'..
Connecting to HSM..
Connecting to database..
Computing keytags, this could take a while.
Added keytags for 1653 keys.
Finishing..
ods@ tilfeldigvis: {28} 

and took nearly 70 CPU minutes (yikes!)

Also, ods-enforcer is showing all the keys -- an example:

Keys:
Zone:                           Keytype: State:    Date of next transition: Size: Algorithm: CKA_ID:                          Repository: KeyTag:
0.0.7.5.5.3.7.7.4.nrenum.net    KSK      active    now                      2048  8          90ef217860f6f15368d3b4b3aedf71d8 SoftHSM     33904
0.9.4.8.4.6.7.4.nrenum.net      KSK      active    now                      2048  8          7f4796b2d79055a95c79fdad430bf2fa SoftHSM     40595
1.0.7.5.5.3.7.7.4.nrenum.net    KSK      active    now                      2048  8          de5f61ba82664928dbc6c26377595b1c SoftHSM     43282
1.9.4.8.4.6.7.4.nrenum.net      KSK      active    now                      2048  8          7af945a7ed5188d398e541a30f2eea7a SoftHSM     14144
2.5.4.2.2.7.4.nrenum.net        KSK      active    now                      2048  8          aafcc72f2b33f2cacec3216c37021b1e SoftHSM     56149
2.9.4.8.4.6.7.4.nrenum.net      KSK      active    now                      2048  8          77d95adf77729bcf6e89d2800206aade SoftHSM     19222
3.5.4.2.2.7.4.nrenum.net        KSK      active    now                      2048  8          b3929e9d4686f6d14350fd8efc3a07ef SoftHSM     33047
3.9.4.8.4.6.7.4.nrenum.net      KSK      active    now                      2048  8          c22698b4bafe25812adfe33218c395fb SoftHSM     60532

However, I'm not sure whether listing the keys actually causes
access to the SoftHSM.  Is there anything else I can/should do to
minimally verify that the SoftHSM2 module is working as intended?
So ... this leaves me without any answer as to what might be
wrong and what I ought to do about it.  Help!


Best regards,

- Håvard


More information about the Opendnssec-user mailing list