[Opendnssec-user] Error converting from 1.4.14 to 2.1.8

Havard Eidnes he at uninett.no
Fri Mar 5 13:44:11 UTC 2021


> On Fri, Mar 05, 2021 at 12:58:16AM +0100, Havard Eidnes via Opendnssec-user wrote:
>> 1) We're using <NSEC3> for denial-of-existence.  NSEC3 uses a
>> "salt" value as an input value to the process.  If we move away
>> the old /var/opendnssec/signconf/ directory and create it anew,
>> OpenDNSSEC will populate it with an xml file per zone.  However,
>> they all have this part:
>> 
>>     <Denial>
>>       <NSEC3>
>>         <Hash>
>>           <Algorithm>1</Algorithm>
>>           <Iterations>5</Iterations>
>>           <Salt>0</Salt>
>>         </Hash>
>>       </NSEC3>
>>     </Denial>
>
> Okay, that salt is clearly wrong; it should be a hex string.  I need
> to check where this comes from, which takes a bit of time.  Did you
> have an explicit salt specified in your 1.4 installation?
> So instead of
>    <Salt length="8"/>
> You had 
>    <Salt>cd79fa0214ff93b7</Salt>
> in your kasp.xml?

Nope, kasp.xml had, in accordance with

https://wiki.opendnssec.org/display/DOCS20/kasp.xml

    <Denial>
      <NSEC3>
        <Resalt>P100D</Resalt>
        <Hash>
          <Algorithm>1</Algorithm>
          <Iterations>5</Iterations>
          <Salt length="8"/>
        </Hash>
      </NSEC3>
    </Denial>
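To see why OpenDNSSEC 2 complains about "illegal salt 0", the check can be sketched in a few lines of Python. This is not OpenDNSSEC's own validation code. It only assumes, as suggested above, that a signconf <Salt> must be a hex string, and that length="8" in kasp.xml means 8 bytes (16 hex digits, like the cd79fa0214ff93b7 example). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# <Denial> block as written to the OpenDNSSEC 2 signconf files.
SIGNCONF_DENIAL = """
<Denial>
  <NSEC3>
    <Hash>
      <Algorithm>1</Algorithm>
      <Iterations>5</Iterations>
      <Salt>0</Salt>
    </Hash>
  </NSEC3>
</Denial>
"""

def salt_text(denial_xml: str) -> Optional[str]:
    """Text of <NSEC3>/<Hash>/<Salt>; None when the element is empty,
    as with <Salt length="8"/> in kasp.xml."""
    salt = ET.fromstring(denial_xml.strip()).find("./NSEC3/Hash/Salt")
    if salt is None:
        raise ValueError("no <NSEC3>/<Hash>/<Salt> element")
    return salt.text

def is_hex_salt(value: Optional[str], expected_bytes: Optional[int] = None) -> bool:
    """True if value is a non-empty hex string of whole bytes
    (and, if given, exactly expected_bytes bytes long)."""
    if not value:
        return False
    try:
        raw = bytes.fromhex(value)
    except ValueError:  # odd length or non-hex characters, e.g. "0"
        return False
    return expected_bytes is None or len(raw) == expected_bytes

print(salt_text(SIGNCONF_DENIAL), is_hex_salt(salt_text(SIGNCONF_DENIAL)))
```

A single "0" fails because it is not a whole number of hex-encoded bytes, while a 16-digit value such as the one in the 1.4 database passes.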

> Are the kasp.xml from the old and new installation essentially
> the same?

Yes.

Only with this modification:

kasp.xml:
  Add <MaxZoneTTL>P1D</MaxZoneTTL> in the <Signatures> stanza.

I think this was learned the hard way from an earlier attempt at
following the migration-from-1-4-to-2-1 document (and it doesn't
say you need to do this).
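For reference, <MaxZoneTTL>P1D</MaxZoneTTL> and <Resalt>P100D</Resalt> are ISO 8601 durations. A minimal Python sketch of the conversion to seconds, assuming only day, hour, minute and second components are used (this is not OpenDNSSEC's own duration parser):

```python
import re

# Days plus an optional time part. Years and months are rejected on purpose:
# their length in seconds depends on the calendar.
_DURATION = re.compile(
    r"^P(?:(?P<d>\d+)D)?(?:T(?:(?P<h>\d+)H)?(?:(?P<m>\d+)M)?(?:(?P<s>\d+)S)?)?$"
)

def duration_seconds(text: str) -> int:
    """Convert e.g. "P1D" or "PT3600S" to a number of seconds."""
    match = _DURATION.match(text)
    # "P", "PT" and "P1DT" match the pattern but carry no complete value.
    if match is None or text.endswith(("P", "T")):
        raise ValueError(f"unsupported duration: {text!r}")
    days, hours, minutes, seconds = (int(match.group(k) or 0) for k in "dhms")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

print(duration_seconds("P1D"), duration_seconds("P100D"))
```

So the added MaxZoneTTL of P1D corresponds to 86400 seconds.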

BTW, https://www.opendnssec.org/migration-from-1-4-to-2-1/ says
in one of the points:

   The <MySQL><Host> child element is now mandatory.

That's ... misleading.  It certainly is only mandatory if you are
using MySQL as a backend (which we are not), but that's not
actually what the document says.
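What the document presumably means is a conditional rule: <Host> is required inside <MySQL>, and irrelevant for an <SQLite> datastore. A sketch of that rule as a check on conf.xml, assuming the <Datastore> element (under <Enforcer>) holds either <SQLite> or <MySQL>; the function name is hypothetical:

```python
import xml.etree.ElementTree as ET
from typing import List

def datastore_problems(conf_xml: str) -> List[str]:
    """Apply the <Host> requirement only when <Datastore> uses <MySQL>."""
    root = ET.fromstring(conf_xml.strip())
    store = root.find(".//Datastore")  # assumed location: under <Enforcer>
    if store is None:
        return ["no <Datastore> element"]
    mysql = store.find("MySQL")
    if mysql is None:
        return []  # e.g. an <SQLite> datastore: no <Host> is needed
    if not (mysql.findtext("Host") or "").strip():
        return ["<MySQL> datastore without a non-empty <Host> element"]
    return []
```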

>> So ... did the NSEC3 not get converted / transferred to the new
>> kasp.db file in the conversion?  Why does it end up as the
>> single-character 0 in OpenDNSSEC 2?  It ended up as "0" in the
>> policy table in the converted kasp.db file.  Should it instead
>> have been NULL or the empty string?
>
> It should not have ended up as a single character 0 in the database.
> Did you query this in the database table?

No, I waited until the signconf file was created.  But ...
my colleague says that he looked in the database.  The schema
says this is a text field, so the database entry consisted of a
single-character string "0".

I've done the conversion from 1.4.14 to an earlier 2.1.x version
on another host (which worked at the time), but have come to the
realization that the test installation needs to be re-done in an
attempt to re-create the issues we ran across last night.  For
some reason I do not fathom, the test installation has gone to
<NSEC> (according to the signconf xml files) despite kasp.xml
saying <NSEC3>, but the migration was done with a script from
that earlier OpenDNSSEC 2.x version.

>> One of my colleagues who helped me doing the conversion used
>> sqlite3 to change the global config in kasp.db to use NSEC
>> instead of NSEC3, and that seemed to have brought the process
>> further along.
>
> Ehm... I'm confused, you migrated from NSEC3 to NSEC using a
> database change?  That seems dangerous.

Yes, I do realize that's possibly dangerous, but without it we
got stuck at this particular issue and could make no further
progress in finding any subsequent problems.  Changing kasp.xml
to say NSEC apparently had no effect.

> Was this in order to prevent a problem or a deliberate change?

A deliberate change, in a desperate attempt to get past the
"illegal salt 0" error, which otherwise blocked all further
progress and kept us from uncovering any other problems.

> I'm not sure which side effects this could have.  And was this
> done before or after the migration?

It was done after the migration, after we saw the "illegal salt 0"
log messages and the other associated messages quoted earlier.

Looking at the sqlite_convert.sql file, it seems that everything
to do with the salt concerns the global policy rather than the
per-zone data (the relevant block starts with "UPDATE policy").
Hmm, the comment above that block says

-- clumsy salt update. salt is optional in 1.4 but required in 2.0
-- sqlite is limited in what it can do in an update. I hope there is a
-- better way for this?

"Required in 2.0"?  Where?  In kasp.xml?  The migration document
is silent on that, and the documentation doesn't say it's
required in kasp.xml either.

Hmm, it seems that in the old OpenDNSSEC 1 installation all the
signconf xml files have the same value for <Salt>.  However, as
noted above, we have not manually specified a salt value, and we
leave it up to OpenDNSSEC to create any required salt on its own.
Besides, we've directed it to pick a new salt every 100 days, so
it doesn't seem "natural" to manually set a global salt value in
the kasp.xml file.  And ... the OpenDNSSEC 2.x documentation
doesn't say you must do that either, so we have not.
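What we expect OpenDNSSEC to do on its own here amounts to roughly the following, sketched in Python under two assumptions: that length="8" means 8 random bytes (16 hex digits), and that the salt_stamp format is the one seen in the 1.4 policies table (e.g. 2021-03-05 07:32:39). This illustrates the policy only, not OpenDNSSEC's actual code; all names are hypothetical.

```python
import secrets
from datetime import datetime, timedelta

def new_salt(length_bytes: int = 8) -> str:
    """Random salt as hex: length="8" gives 8 bytes, i.e. 16 hex digits."""
    return secrets.token_hex(length_bytes)

def resalt_due(salt_stamp: str, resalt: timedelta, now: datetime) -> bool:
    """True once `resalt` has elapsed since the salt was generated.
    Timestamps are treated as naive values in the same time zone."""
    generated = datetime.strptime(salt_stamp, "%Y-%m-%d %H:%M:%S")
    return now >= generated + resalt

# <Resalt>P100D</Resalt> corresponds to timedelta(days=100).
print(new_salt(), resalt_due("2021-03-05 07:32:39", timedelta(days=100), datetime.now()))
```

With the stamp 2021-03-05 07:32:39 and P100D, the next salt would be due at 2021-06-13 07:32:39, which is why a single hand-written global salt in kasp.xml fits poorly with a resalt interval.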

The old sqlite3 database appears to have the salt value in the
global policy.  So ... why wasn't it copied over to the new
database?

ods @ xxxxxxxxxxx: {13} sqlite3 /var/db/opendnssec/kasp.db
SQLite version 3.26.0 2018-12-01 12:34:55
Enter ".help" for usage hints.
sqlite> .table
KEYALLOC_VIEW        dbadmin              policies           
KEYDATA_VIEW         dnsseckeys           securitymodules    
PARAMETER_LIST       keypairs             serialmodes        
PARAMETER_VIEW       parameters           zones              
categories           parameters_policies
sqlite> .schema policies
CREATE TABLE policies (
  id            integer primary key autoincrement,    -- id
  name          varchar(30) not null,  -- name of the policy
  description   varchar(255), -- description of the
  salt          varchar(512), -- value of the salt
  salt_stamp    varchar(64),  -- when the salt was generated
  audit         text, -- contents of <Audit>

  unique(name)
);
sqlite> SELECT * from policies;
1|default|The UNINETT key and signing policy for OpenDNSSEC|ccae9067625332c1|2021-03-05 07:32:39|NULL
sqlite> ^D
ods @ xxxxxxxxxxx: {14} 

We'll check that after re-doing the migration on the test host;
the OpenDNSSEC 2.x kasp.db file is not available anymore.
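The same lookup can be scripted with Python's sqlite3 module. The sketch below recreates the 1.4 policies table from the .schema output above in memory, so it runs anywhere; against the real file one would open it read-only, e.g. sqlite3.connect("file:/var/db/opendnssec/kasp.db?mode=ro", uri=True). The 2.x kasp.db uses a different schema, which is not shown here, so no query for it is attempted.

```python
import sqlite3

# 1.4 "policies" table, as printed by ".schema policies" (comments dropped).
SCHEMA_14 = """
CREATE TABLE policies (
  id          integer primary key autoincrement,
  name        varchar(30) not null,
  description varchar(255),
  salt        varchar(512),
  salt_stamp  varchar(64),
  audit       text,
  unique(name)
);
"""

def policy_salts(conn: sqlite3.Connection) -> dict:
    """Map policy name -> (salt, salt_stamp) from a 1.4 kasp.db."""
    rows = conn.execute("SELECT name, salt, salt_stamp FROM policies")
    return {name: (salt, stamp) for name, salt, stamp in rows}

# In-memory copy holding the row shown above.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA_14)
conn.execute(
    "INSERT INTO policies (name, description, salt, salt_stamp) VALUES (?, ?, ?, ?)",
    ("default", "The UNINETT key and signing policy for OpenDNSSEC",
     "ccae9067625332c1", "2021-03-05 07:32:39"),
)
print(policy_salts(conn))
```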

>> 2) We seem to have issues related to SoftHSM2, which we're
>> converting to at the same time.  The problem we're having is that
>> the level of error messages related to SoftHSM2 is nearly binary:
>> either it works or it doesn't, and not a word about "why" when it
>> doesn't.  As an operator this leaves me stumped about what's
>> really going on and what possible correction can be made.
>> 
>> Mar  4 22:28:50 tilfeldigvis ods-signerd: [zone] unable to publish keys for zone 0.0.1.0.0.0.0.0.0.0.7.0.1.0.0.2.ip6.arpa: error creating libhsm context
>> Mar  4 22:28:50 tilfeldigvis ods-signerd: [tools] unable to read zone 0.0.1.0.0.0.0.0.0.0.7.0.1.0.0.2.ip6.arpa: failed to publish dnskeys (HSM error)
>
> It cannot access or find the keys.

Looking at the various functions in libhsm.c there are apparently
many different conditions which might cause either HSM_ERROR to
be set or NULL to be returned.

However, all you get back from hsm_create_context() is "No!" with
no further explanation.

So ... I'd say it's actually premature to conclude with "It
cannot access or find the keys" as the root cause for the problem.
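The interface complaint can be illustrated with a small Python sketch in which context creation hands back the reason for a failure instead of a bare "No!". Everything here (ContextResult, create_context, the repository callables) is hypothetical and is not libhsm's API; it only shows the shape of an interface where the caller, e.g. the signer, has something concrete to log.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class ContextResult:
    """Either an open context or the reasons why it could not be created."""
    context: Optional[Dict[str, object]] = None
    errors: List[str] = field(default_factory=list)

def create_context(repositories: Dict[str, Callable[[], object]]) -> ContextResult:
    """Open every repository; keep each failure reason instead of returning None."""
    sessions: Dict[str, object] = {}
    errors: List[str] = []
    for name, open_repo in repositories.items():
        try:
            sessions[name] = open_repo()
        except Exception as exc:
            errors.append(f"repository {name!r}: {exc}")
    if errors:
        return ContextResult(errors=errors)
    return ContextResult(context=sessions)

def failing_open() -> object:
    # Stand-in for a repository that cannot be opened.
    raise OSError("cannot open token database")

result = create_context({"SoftHSM": failing_open})
if result.context is None:
    for reason in result.errors:
        print("error creating libhsm context:", reason)
```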

> If using SoftHSM it means the keys aren't there at all, or are
> not readable.

Well, the sqlite3.db file certainly is there and was pointed to
by softhsm2.conf:

# cat /usr/pkg/etc/softhsm2.conf 
# SoftHSM v2 configuration file

directories.tokendir = /var/db/softhsm
objectstore.backend = db

# ERROR, WARNING, INFO, DEBUG
log.level = ERROR

# If CKF_REMOVABLE_DEVICE flag should be set
slots.removable = false
# 
# ls -lR /var/db/softhsm
total 4
drwx------  2 ods  ods  512 Mar  4 23:25 65dbf669-87dc-520a-fe8c-ea8aa8db326c

/var/db/softhsm-2/65dbf669-87dc-520a-fe8c-ea8aa8db326c:
total 3296
-rw-------  1 ods  ods  3317760 Mar  4 23:25 sqlite3.db
#
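Both the configuration and the permissions can be double-checked with a short Python sketch that reads softhsm2.conf (simple "key = value" lines with # comments, as above) and walks the token directory. It should be run as the account that runs OpenDNSSEC (ods here), since as root os.access() reports almost everything as accessible. It only checks that directories can be listed and files read and written; whether that is everything SoftHSM2 needs is an assumption.

```python
import os
from typing import Dict, List

def parse_softhsm2_conf(text: str) -> Dict[str, str]:
    """Parse "key = value" lines, skipping blank lines and # comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

def inaccessible_paths(tokendir: str) -> List[str]:
    """Paths under tokendir that the *current* user cannot use."""
    if not os.path.isdir(tokendir):
        return [f"{tokendir}: not a directory"]
    problems: List[str] = []
    for dirpath, _dirnames, filenames in os.walk(tokendir):
        if not os.access(dirpath, os.R_OK | os.X_OK):  # list + traverse
            problems.append(dirpath)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.access(path, os.R_OK | os.W_OK):  # read + write
                problems.append(path)
    return problems

if __name__ == "__main__":
    with open("/usr/pkg/etc/softhsm2.conf") as f:
        tokendir = parse_softhsm2_conf(f.read())["directories.tokendir"]
    print(tokendir, inaccessible_paths(tokendir))
```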

We converted the old softhsm version 1 database using

softhsm2-util --init-token --slot 0 --label OpenDNSSEC --pin xxxx --so-pin xxxx
softhsm2-migrate --db /var/db/softhsm-1/slot0.db --pin xxxx --token OpenDNSSEC

This latter produced output similar to this:

Found slot 685453932 with matching token label.
Object 27351 has been migrated
Object 27415 has been migrated
Object 27416 has been migrated
Object 27417 has been migrated
Object 27418 has been migrated
Object 27419 has been migrated
...
Object 34485 has been migrated
Object 34486 has been migrated
Object 34487 has been migrated
Object 34488 has been migrated
The database has been migrated to the new HSM
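The softhsm2-migrate output can also be checked mechanically, e.g. to compare the number of migrated objects with the key inventory (ods-migrate later reported keytags for 1653 keys). A sketch that assumes the exact message wording shown above; the function names are hypothetical:

```python
import re
from typing import List

_MIGRATED = re.compile(r"^Object (\d+) has been migrated$")

def migrated_object_ids(output: str) -> List[int]:
    """Object numbers reported as migrated in softhsm2-migrate output."""
    ids: List[int] = []
    for line in output.splitlines():
        match = _MIGRATED.match(line.strip())
        if match:
            ids.append(int(match.group(1)))
    return ids

def migration_completed(output: str) -> bool:
    return "The database has been migrated to the new HSM" in output
```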

> [..]
>> of any errors (it's a library, according to the name), and
>> _hsm_ctx isn't readily available in the file which calls
>> hsm_create_context(), so the opportunity of getting a proper
>> error message to explain *why* the HSM module is unhappy is lost.
>> I would call this an "interface design error".
> 
> Error reporting isn't the best, but on the other hand the errors
> retrieved through the PKCS#11 interface won't help.  SoftHSM will
> also perform error reporting, which can often be useful now.  This
> will happen especially during startup, less during actual access.

When I earlier backed out of the previous failed migration and
restarted OpenDNSSEC 1, I made a silly mistake, and in the process
noticed that in OpenDNSSEC 1.x the enforcer refuses to start if
there is a problem with accessing the HSM.

If there were a fundamental problem with the HSM (SoftHSM2),
wouldn't the enforcer of OpenDNSSEC 2.x also refuse to start, or
was that check ripped out?  It certainly *did* start.

>> Now, I'm not entirely certain how one might go about verifying
>> that the HSM module is happy with the SoftHSM2 database.
>
> Running "ods-hsmutil list" is the easiest way to verify whether
> OpenDNSSEC can access the HSM.

OK, noted.

On my old OpenDNSSEC 1.4.14 this immediately produces a long list
of keys.

On my test OpenDNSSEC 2.1.8 system, this produced

Listing keys in all repositories.

and a process apparently stuck in a CPU loop; it had used 7
minutes of CPU time so far, with no output to show for it.  ?!?!?

OK, it *finally* produced the list after what looked like 8
minutes of CPU time, which seems ... excessive?  Why on Earth would
it use so much CPU for this presumably fairly mundane task?
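To report numbers like these, it helps to separate wall-clock time from CPU time. A minimal Python sketch (Unix only, using the standard resource module) that runs a command such as ods-hsmutil list and reports both; the script name in the usage comment is hypothetical:

```python
import resource
import subprocess
import sys
import time
from typing import List, Tuple

def run_and_time(argv: List[str]) -> Tuple[int, float, float]:
    """Run argv to completion; return (exit status, wall seconds, child CPU seconds)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    completed = subprocess.run(argv)  # output goes to the terminal as usual
    wall = time.monotonic() - start
    # RUSAGE_CHILDREN covers children that have terminated and been waited for.
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return completed.returncode, wall, cpu

if __name__ == "__main__":
    # Usage: python3 timecmd.py ods-hsmutil list
    status, wall, cpu = run_and_time(sys.argv[1:] or [sys.executable, "-c", "pass"])
    print(f"exit={status} wall={wall:.1f}s cpu={cpu:.1f}s", file=sys.stderr)
```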

>> ods @ tilfeldigvis: {27} ods-migrate
>> Reading config file '/usr/pkg/etc/opendnssec/conf.xml'..
>> Connecting to HSM..
>> Connecting to database..
>> Computing keytags, this could take a while.
>> Added keytags for 1653 keys.
>> Finishing..
>> ods@ tilfeldigvis: {28} 
>>
>> and took nearly 70 CPU minutes (yikes!)
>
> That seems excessive, but it is an expensive operation.  Since this
> is a migration step I'll keep it at that.

Yes, I also thought it was quite a long time, but we have 550 zones.
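For scale, a back-of-the-envelope figure from the numbers above, taking 70 CPU minutes as an upper bound:

```latex
\frac{70 \times 60\ \text{s}}{1653\ \text{keys}} \approx 2.5\ \text{s of CPU per key},
\qquad
\frac{1653\ \text{keys}}{550\ \text{zones}} \approx 3\ \text{keys per zone}
```

So the keytag computation cost at most about two and a half seconds of CPU per key, with roughly three keys per zone.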

>> Also, ods-enforcer is showing all the keys -- an example:
> [..]
>> However, I'm not sure whether listing the keys actually causes
>> access to the SoftHSM.
>
> No, the enforcer only accesses an HSM in generating keys, never afterwards.

OK.  But ... is that entirely accurate?  I see it's searching for
potential keys to remove; surely it would have to access the HSM
to do that as well?  And what about when you export a key, e.g.
as a DS record -- doesn't that also require access to the HSM?

All I see in the signconf files is a "Locator" for the keys, so
at least the public part of the key isn't stored there.  I'm
assuming the Locator is a key for looking up the actual key data
... in the HSM?
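Collecting the locators from a signconf file is straightforward. Presumably each one is the identifier (the PKCS#11 CKA_ID) that libhsm uses to find the key object in the HSM, so the list could be cross-checked against ods-hsmutil list. A Python sketch, assuming the signconf has <Key> elements with <Flags>, <Algorithm> and <Locator> children; the sample document and its values are made up:

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

def key_locators(signconf_xml: str) -> List[Tuple[str, str, str]]:
    """(flags, algorithm, locator) for every <Key> in a signconf document."""
    root = ET.fromstring(signconf_xml.strip())
    return [
        (
            key.findtext("Flags", default=""),
            key.findtext("Algorithm", default=""),
            key.findtext("Locator", default=""),
        )
        for key in root.iter("Key")
    ]

# Hypothetical example; real signconf files contain more elements.
SAMPLE = """
<SignerConfiguration>
  <Zone name="example.org">
    <Keys>
      <TTL>PT3600S</TTL>
      <Key>
        <Flags>257</Flags>
        <Algorithm>8</Algorithm>
        <Locator>0123456789abcdef0123456789abcdef</Locator>
        <KSK/>
        <Publish/>
      </Key>
    </Keys>
  </Zone>
</SignerConfiguration>
"""

print(key_locators(SAMPLE))
```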

>> Is there anything else I can/should do to
>> minimally verify that the SoftHSM2 module is working as intended?
>> So ... this leaves me without any answer as to what might be
>> wrong and what I ought to do about it.  Help!
>
> A common problem is related to file access permissions.  If the migration
> happened for instance as a different user than the one used to run
> OpenDNSSEC, then files may no longer be accessible due to permission
> problems.

That should not be the problem, ref. above.

With that, I'll save + clobber my test installation, and re-do
the exact migration steps performed on our production system
yesterday, and see if this re-creates the issues we found.

Regards,

- Håvard

