[Opendnssec-user] OpenDNSSEC in ISP environment (lots of small zones)?

Fri Jan 28 10:19:15 UTC 2011

Hello,

We are envisioning the deployment of OpenDNSSEC in an ISP environment in 
order to provide DNSSEC services to clients. As an ISP, our typical use 
case is thus many thousands of small zones.

We've been looking at OpenDNSSEC for some time now, and have read and 
(hopefully) understood most of the documentation and the very useful 
hints and messages on the mailing-list. Initial tests with prior 
versions on a couple of zones looked very promising, so we upgraded 
to OpenDNSSEC 1.2.0 and attempted to put a bit of load on the service. 
Wanting to avoid spending money on a HSM during testing, we are using 
SoftHSM, also version 1.2.0.

I realize what follows may sound like a rant, which it isn't supposed to 
be; it is more of a cry for help, coupled with the question of whether 
OpenDNSSEC is the right tool for our job. :)

We are basically using a default configuration as provided by the 
project. As mentioned, the first couple of zones work like a charm, and 
I was delighted to see that the round-trip-time of a dynamic update to 
BIND, the zone transfer to OpenDNSSEC, it signing the zone and providing 
it to an NSD server could be completed in just a couple of seconds! 
Lovely. ;-)

For testing, we added 10,000 synthetic zones, each with 610 RRs, all 
configured to use a single default policy. From that point onwards, it 
all becomes a bit blurry; the following observations are based on "look 
and feel".
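A minimal sketch of how synthetic zones of this size might be generated. 
The function name make_zone, the record layout (one SOA, one NS, one A 
record for the name server, the rest A records in the 192.0.2.0/24 
documentation range) and the output file name are illustrative 
assumptions, not the actual test setup:

```shell
# Hypothetical generator: print a synthetic zone with exactly RR_COUNT
# resource records (RR_COUNT must be at least 3).
# Usage: make_zone ZONE RR_COUNT
make_zone() {
    zone=$1
    rr_count=$2
    printf '$ORIGIN %s.\n' "$zone"
    printf '$TTL 3600\n'
    # Record 1: SOA with serial 1; the timer values are arbitrary.
    printf '@ IN SOA ns1.%s. hostmaster.%s. 1 3600 600 604800 300\n' \
        "$zone" "$zone"
    # Record 2: a single NS record.
    printf '@ IN NS ns1.%s.\n' "$zone"
    # Record 3: address of the name server itself.
    printf 'ns1 IN A 192.0.2.1\n'
    # Remaining records: one A record per synthetic host.
    i=1
    while [ "$i" -le $((rr_count - 3)) ]; do
        printf 'host%d IN A 192.0.2.%d\n' "$i" $(( (i % 254) + 1 ))
        i=$((i + 1))
    done
}
```

Something like `make_zone c1767.aa 610 > c1767.aa.zone` would then write 
one such zone; looping over 10,000 names gives the full test set.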

For example, after about 2,000 key pairs were created, we noticed that 
concurrency seems to be a problem. While the enforcer is running, the 
KASP database is locked, so I can't even look at a key, an operation 
which is surely read-only?

    ods-ksmutil key list -z c1767.aa
    SQLite database set to: /usr/local/stow/opendnssec-1.2.0/var/opendnssec/kasp.db
    /usr/local/stow/opendnssec-1.2.0/var/opendnssec/kasp.db.our_lock already locked, sleep
    ...

Our test system has 6GB of RAM. While the enforcer and signer were 
running, it locked up (swapping), so we had to pull the plug on it. 
After the restart, we noticed that starting OpenDNSSEC with 
`ods-control start' doesn't start the enforcer (only the signerd is 
started). It appears that files left over in /var/run make the enforcer 
think it is still running.
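One way to confirm a suspicion like this is to check whether a leftover 
PID file still names a live process. The sketch below is a generic shell 
check, not part of OpenDNSSEC; the function name pid_file_is_stale is 
made up for illustration, and the location of the enforcer's PID file 
under /var/run is an assumption that has to be filled in from the local 
configuration:

```shell
# Return 0 (true) if the PID file exists but names no running process.
# Run as root or as the daemon's user: for another user's process,
# kill -0 fails with a permission error and would look "stale" here.
pid_file_is_stale() {
    pid_file=$1
    [ -f "$pid_file" ] || return 1        # no file, so nothing is stale
    pid=$(cat "$pid_file" 2>/dev/null)
    case $pid in
        ''|*[!0-9]*) return 0 ;;          # empty or non-numeric content
    esac
    # kill -0 delivers no signal; it only tests that the process exists.
    if kill -0 "$pid" 2>/dev/null; then
        return 1                          # process is alive
    fi
    return 0
}
```

Called as `pid_file_is_stale /path/to/enforcer.pid && echo stale`, it 
reports a file left behind by a crash. It cannot tell whether a live PID 
has since been reused by an unrelated process.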

Just before the reboot, about 2,000 key pairs had been created. An 
`ods-ksmutil key list' then took an inordinate amount of time to complete:

	time ods-ksmutil  key list  > x.01
	SQLite database set to: /usr/local/stow/opendnssec-1.2.0/var/opendnssec/kasp.db

	real	5m28.749s
	user	4m44.685s
	sys	0m43.702s

The first 10,000 key pairs took over 4 hours to generate. During that 
time the signer was blocked (kasp.db.our_lock exists). After the four 
hours, there was no activity: no signing, nothing at all. Two signer 
processes had apparently hung. I killed off one of them, and the 
enforcer continued working.
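To put the key generation time in perspective, the figures above work 
out to an average of at least 1.44 seconds per key pair (the run took 
"over" four hours, so this is a lower bound). The arithmetic, as a small 
shell sketch with a made-up helper name:

```shell
# Average milliseconds per item, given elapsed whole hours and a count.
ms_per_item() {
    hours=$1
    count=$2
    echo $(( hours * 3600 * 1000 / count ))
}

ms_per_item 4 10000   # prints 1440
```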

    Jan 26 19:53:15 sign1 ods-signerd: zone fetcher transferred zone c1111.aa serial 1 successfully
    Jan 26 19:53:15 sign1 ods-signerd: daemon/cmdhandler.c:209: cmdhandler_handle_cmd_sign: assertion cmdc->engine->tasklist failed
    Jan 26 19:53:15 sign1 ods-signerd: zone fetcher transferred zone c1112.aa serial 1 successfully

Killing off the processes and restarting didn't help. An `ods-ksmutil 
update all' seems to have "fixed" the issue (I was able to launch 
ods-control), but the question remains as to what happened.

What we then did was to completely disable the auditor in the 
configuration and in the zone policy (all zones have the same policy), 
hoping to strongly decrease the load on the system. After an `update 
all' and a restart of the OpenDNSSEC daemons, we once again found that 
the enforcer starts and the signers appear to wait on something 
(a01.aa is the first zone in zonelist.xml):

	1955 ?        Rs   193:45 /usr/local/stow/opendnssec-1.2.0/sbin/ods-enforcerd
	1959 ?        Ss     0:00 /usr/local/stow/opendnssec-1.2.0/sbin/ods-signerd -vvv
	1967 ?        S      0:00 sh -c /usr/local/stow/opendnssec-1.2.0/sbin/ods-signer sign a01.aa > /dev/null 2>&1
	1968 ?        S      0:00 /usr/local/stow/opendnssec-1.2.0/sbin/ods-signer sign a01.aa
(This has been the case for a while now; again, note the times:
  -rw-r--r-- 1 opendnssec opendnssec 5223424 Jan 28 11:12 kasp.db
  -rw-r--r-- 1 opendnssec opendnssec       0 Jan 28 07:52 kasp.db.our_lock
)
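The timestamps above show the lock file untouched for over three hours 
while kasp.db itself was still being written. A small sketch for 
measuring how long such a lock file has been in place; the helper name 
file_age_seconds is made up, and `stat -c %Y' assumes GNU coreutils:

```shell
# Print the age in seconds of the file given as $1.
# Assumes GNU stat (-c %Y prints the modification time as epoch seconds).
file_age_seconds() {
    now=$(date +%s)
    mtime=$(stat -c %Y "$1") || return 1
    echo $(( now - mtime ))
}

# Example (path taken from the listing above):
# file_age_seconds /usr/local/stow/opendnssec-1.2.0/var/opendnssec/kasp.db.our_lock
```

This only reports the age of the file; it says nothing about which 
process, if any, currently holds the lock.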

I understand OpenDNSSEC is used mainly in TLD environments, which have 
few but large zones. Is OpenDNSSEC theoretically suited to be used in 
production in a lots-of-small-zones environment?

Is what we are attempting to do realistically feasible with OpenDNSSEC?

Thank you & regards,

	-JP