[Opendnssec-user] ods-enforcerd in error loop required manual ods-ksmutil hacking to get unstuck :(

Paul Wouters paul at nohats.ca
Mon Sep 22 14:44:09 UTC 2014


Over the weekend, a few zones including libreswan.org failed
validation. The logs showed a stream of:

Sep 21 14:11:56 ns0 ods-enforcerd: No change to: /var/opendnssec/signconf/libreswan.org.xml
Sep 21 14:33:10 ns0 ods-signerd: [zone] unable to publish dnskeys for zone libreswan.org: error creating dnskey
Sep 21 14:33:10 ns0 ods-signerd: [tools] unable to read zone libreswan.org: failed to publish dnskeys (General error)
Sep 21 14:33:10 ns0 ods-signerd: [worker[4]] CRITICAL: failed to sign zone libreswan.org: General error
Sep 21 14:33:10 ns0 ods-signerd: [worker[4]] backoff task [read] for zone libreswan.org with 3600 seconds
Sep 21 15:11:57 ns0 ods-enforcerd: Zone libreswan.org found.
Sep 21 15:11:57 ns0 ods-enforcerd: Policy for libreswan.org set to default.
Sep 21 15:11:57 ns0 ods-enforcerd: Config will be output to /var/opendnssec/signconf/libreswan.org.xml.
Sep 21 15:11:57 ns0 ods-enforcerd: No change to: /var/opendnssec/signconf/libreswan.org.xml

A restart did not seem to help:

Sep 22 09:44:25 ns0 ods-signerd: [hsm] sign init: CKR_KEY_HANDLE_INVALID
Sep 22 09:44:25 ns0 ods-signerd: [hsm] error signing rrset with libhsm
Sep 22 09:44:25 ns0 ods-signerd: [rrset] unable to sign RRset[6]: lhsm_sign() failed
Sep 22 09:44:25 ns0 ods-signerd: [worker[3]] sign zone httpca.org failed: 1 RRsets failed
Sep 22 09:44:25 ns0 ods-signerd: [worker[3]] CRITICAL: failed to sign zone httpca.org: General error
Sep 22 09:44:25 ns0 ods-signerd: [worker[3]] backoff task [sign] for zone httpca.org with 3600 seconds

This one seems due to a service restart leading to two ods-signerd's
running. One not referenced in the pid file. I used a hammer to restart.

Unfortunately, it remained broken:

Sep 22 10:25:34 ns0 ods-signerd: [worker[1]] CRITICAL: failed to sign zone libreswan.org: General error
Sep 22 10:25:34 ns0 ods-signerd: [worker[1]] backoff task [read] for zone libreswan.org with 60 seconds
Sep 22 10:25:34 ns0 ods-signerd: [tools] unable to read zone libreswan.com: failed to publish dnskeys (General error)
Sep 22 10:25:34 ns0 ods-signerd: [worker[4]] CRITICAL: failed to sign zone libreswan.com: General error
Sep 22 10:25:34 ns0 ods-signerd: [worker[4]] backoff task [read] for zone libreswan.com with 60 seconds
Sep 22 10:25:34 ns0 ods-signerd: [hsm] unable to get key: key f11737ffedb1723031545fe488c89bad not found
Sep 22 10:25:34 ns0 ods-signerd: [zone] unable to publish dnskeys for zone httpca.org: error creating dnskey
Sep 22 10:25:34 ns0 ods-signerd: [tools] unable to read zone httpca.org: failed to publish dnskeys (General error)
Sep 22 10:25:34 ns0 ods-signerd: [worker[2]] CRITICAL: failed to sign zone httpca.org: General error
Sep 22 10:25:34 ns0 ods-signerd: [worker[2]] backoff task [read] for zone httpca.org with 60 seconds

Looking around I saw:

[root at ns0 log]# ods-ksmutil key list --verbose |grep  "NOT IN"
SQLite database set to: /var/opendnssec/kasp.db
libreswan.org                   ZSK           active    2014-09-25 05:44:55 (retire)   2048    8           e9fa25e8214d2920c9489069bdfe61e6 SoftHSM NOT IN repository
libreswan.net                   ZSK           active    2014-09-25 05:44:55 (retire)   2048    8           635cc83526dfafc2c76605ae7db96ea3 SoftHSM NOT IN repository
libreswan.com                   ZSK           active    2014-09-25 05:44:55 (retire)   2048    8           cd09e7a69e890ee7ad41a30e9a323b73 SoftHSM NOT IN repository
httpca.org                      ZSK           active    2014-09-23 03:44:15 (retire)   2048    8           f11737ffedb1723031545fe488c89bad SoftHSM NOT IN repository

It is possible that testing with softhsm-2 and then reverting to
softhsm-1 caused these to happen, if these keys were generated during
the 2 days of running softhsm-v2.

Running ods-ksmutil key delete --cka_id <id> --force seemed to have
killed it and forced it to generate new ones

Sep 22 10:37:00 ns0 ods-enforcerd: INFO: Promoting ZSK from publish to active as this is the first pass for the zone
Sep 22 10:37:00 ns0 ods-enforcerd: WARNING: Making non-backed up ZSK active, PLEASE make sure that you know the potential problems of using keys which are not recoverable
Sep 22 10:37:00 ns0 ods-enforcerd: Called signer engine: /usr/sbin/ods-signer update httpca.org
Sep 22 10:37:00 ns0 ods-enforcerd: Disconnecting from Database...
Sep 22 10:37:00 ns0 ods-enforcerd: Sleeping for 3600 seconds.
Sep 22 10:37:00 ns0 ods-signerd: [signconf] zone httpca.org signconf: RESIGN[PT7200S] REFRESH[PT604800S] VALIDITY[PT1209600S] DENIAL[PT1209600S] JITTER[PT43200S] OFFSET[PT3600S] NSEC[50] DNSKEYTTL[PT3600S] SOATTL[PT3600S] MINIMUM[PT3600S] SERIAL[datecounter]

etc.

I guess I really need to look at shipping a mechanism to ensure CRITICAL
errors are mailed to root to prevent zone outages.

Paul



More information about the Opendnssec-user mailing list