[Opendnssec-user] ods-enforcerd in error loop required manual ods-ksmutil hacking to get unstuck :(
Paul Wouters
paul at nohats.ca
Mon Sep 22 14:44:09 UTC 2014
Over the weekend, a few zones including libreswan.org failed
validation. The logs showed a stream of:
Sep 21 14:11:56 ns0 ods-enforcerd: No change to: /var/opendnssec/signconf/libreswan.org.xml
Sep 21 14:33:10 ns0 ods-signerd: [zone] unable to publish dnskeys for zone libreswan.org: error creating dnskey
Sep 21 14:33:10 ns0 ods-signerd: [tools] unable to read zone libreswan.org: failed to publish dnskeys (General error)
Sep 21 14:33:10 ns0 ods-signerd: [worker[4]] CRITICAL: failed to sign zone libreswan.org: General error
Sep 21 14:33:10 ns0 ods-signerd: [worker[4]] backoff task [read] for zone libreswan.org with 3600 seconds
Sep 21 15:11:57 ns0 ods-enforcerd: Zone libreswan.org found.
Sep 21 15:11:57 ns0 ods-enforcerd: Policy for libreswan.org set to default.
Sep 21 15:11:57 ns0 ods-enforcerd: Config will be output to /var/opendnssec/signconf/libreswan.org.xml.
Sep 21 15:11:57 ns0 ods-enforcerd: No change to: /var/opendnssec/signconf/libreswan.org.xml
A restart did not seem to help:
Sep 22 09:44:25 ns0 ods-signerd: [hsm] sign init: CKR_KEY_HANDLE_INVALID
Sep 22 09:44:25 ns0 ods-signerd: [hsm] error signing rrset with libhsm
Sep 22 09:44:25 ns0 ods-signerd: [rrset] unable to sign RRset[6]: lhsm_sign() failed
Sep 22 09:44:25 ns0 ods-signerd: [worker[3]] sign zone httpca.org failed: 1 RRsets failed
Sep 22 09:44:25 ns0 ods-signerd: [worker[3]] CRITICAL: failed to sign zone httpca.org: General error
Sep 22 09:44:25 ns0 ods-signerd: [worker[3]] backoff task [sign] for zone httpca.org with 3600 seconds
This one seems due to a service restart leading to two ods-signerd's
running. One not referenced in the pid file. I used a hammer to restart.
Unfortunately, it remained broken:
Sep 22 10:25:34 ns0 ods-signerd: [worker[1]] CRITICAL: failed to sign zone libreswan.org: General error
Sep 22 10:25:34 ns0 ods-signerd: [worker[1]] backoff task [read] for zone libreswan.org with 60 seconds
Sep 22 10:25:34 ns0 ods-signerd: [tools] unable to read zone libreswan.com: failed to publish dnskeys (General error)
Sep 22 10:25:34 ns0 ods-signerd: [worker[4]] CRITICAL: failed to sign zone libreswan.com: General error
Sep 22 10:25:34 ns0 ods-signerd: [worker[4]] backoff task [read] for zone libreswan.com with 60 seconds
Sep 22 10:25:34 ns0 ods-signerd: [hsm] unable to get key: key f11737ffedb1723031545fe488c89bad not found
Sep 22 10:25:34 ns0 ods-signerd: [zone] unable to publish dnskeys for zone httpca.org: error creating dnskey
Sep 22 10:25:34 ns0 ods-signerd: [tools] unable to read zone httpca.org: failed to publish dnskeys (General error)
Sep 22 10:25:34 ns0 ods-signerd: [worker[2]] CRITICAL: failed to sign zone httpca.org: General error
Sep 22 10:25:34 ns0 ods-signerd: [worker[2]] backoff task [read] for zone httpca.org with 60 seconds
Looking around I saw:
[root at ns0 log]# ods-ksmutil key list --verbose |grep "NOT IN"
SQLite database set to: /var/opendnssec/kasp.db
libreswan.org ZSK active 2014-09-25 05:44:55 (retire) 2048 8 e9fa25e8214d2920c9489069bdfe61e6 SoftHSM NOT IN repository
libreswan.net ZSK active 2014-09-25 05:44:55 (retire) 2048 8 635cc83526dfafc2c76605ae7db96ea3 SoftHSM NOT IN repository
libreswan.com ZSK active 2014-09-25 05:44:55 (retire) 2048 8 cd09e7a69e890ee7ad41a30e9a323b73 SoftHSM NOT IN repository
httpca.org ZSK active 2014-09-23 03:44:15 (retire) 2048 8 f11737ffedb1723031545fe488c89bad SoftHSM NOT IN repository
It is possible that testing with softhsm-2 and then reverting to
softhsm-1 caused these to happen, if these keys were generated during
the 2 days of running softhsm-v2.
Running ods-ksmutil key delete --cka_id <id> --force seemed to have
killed it and forced it to generate new ones
Sep 22 10:37:00 ns0 ods-enforcerd: INFO: Promoting ZSK from publish to active as this is the first pass for the zone
Sep 22 10:37:00 ns0 ods-enforcerd: WARNING: Making non-backed up ZSK active, PLEASE make sure that you know the potential problems of using keys which are not recoverable
Sep 22 10:37:00 ns0 ods-enforcerd: Called signer engine: /usr/sbin/ods-signer update httpca.org
Sep 22 10:37:00 ns0 ods-enforcerd: Disconnecting from Database...
Sep 22 10:37:00 ns0 ods-enforcerd: Sleeping for 3600 seconds.
Sep 22 10:37:00 ns0 ods-signerd: [signconf] zone httpca.org signconf: RESIGN[PT7200S] REFRESH[PT604800S] VALIDITY[PT1209600S] DENIAL[PT1209600S] JITTER[PT43200S] OFFSET[PT3600S] NSEC[50] DNSKEYTTL[PT3600S] SOATTL[PT3600S] MINIMUM[PT3600S] SERIAL[datecounter]
etc.
I guess I really need to look at shipping a mechanism to ensure CRITICAL
errors are mailed to root to prevent zone outages.
Paul
More information about the Opendnssec-user
mailing list