[Opendnssec-user] Problems adding largish # of zones

Thu Dec 3 22:04:09 UTC 2015

> It looks like the only recourse I have is to restart OpenDNSSEC, and
> then I'll once again get the problem that it falls over due to the
> contents of the tmp/ files it has created itself.  Double sigh!

And ... of course when I actually came to restarting it (after
removing the tmp/ files), it transferred 50 of the configured
zones and claims "transfer in progress" for the rest.  Aargh!!!

In a somewhat desperate attempt to gather more information and
remove some of the internal stumbling-blocks in OpenDNSSEC, I
took a hatchet to the code.  I started around that obnoxious
"transfer in progress" message and first created this patch:


Add more debugging code to zero in on "transfer already in progress" errors.
Also, don't skip sending notify messages to xfrd.

--- signer/src/wire/query.c.orig        2014-12-04 15:17:40.000000000 +0000
+++ signer/src/wire/query.c
@@ -371,8 +371,13 @@ query_process_notify(query_type* q, ldns
             } else {
                 if (addr2ip(q->addr, address, sizeof(address))) {
                     ods_log_info("[%s] ignore notify from %s: zone %s "
-                        "transfer in progress", query_str, address,
-                        q->zone->name);
+                        "transfer in progress, "
+                       "serial_notify %u serial_disk %u",
+                       query_str, address,
+                        q->zone->name,
+                       q->zone->xfrd->serial_notify,
+                       q->zone->xfrd->serial_disk
+                       );
                 } else {
                     ods_log_info("[%s] ignore notify: zone %s transfer in "
                         "progress", query_str, q->zone->name);
@@ -381,8 +386,8 @@ query_process_notify(query_type* q, ldns
                 q->zone->xfrd->serial_notify = serial;
                 q->zone->xfrd->serial_notify_acquired = time_now();
             }
-            lock_basic_unlock(&q->zone->xfrd->serial_lock);
-            goto send_notify_ok;
+            /* lock_basic_unlock(&q->zone->xfrd->serial_lock); */
+            /* goto send_notify_ok; */
         }
         q->zone->xfrd->serial_notify = serial;
         q->zone->xfrd->serial_notify_acquired = time_now();

Basically, get a little more info (not very useful, as it turns
out), and notify the xfrd irrespective of whether this piece of
code thinks it's already been notified or not.

After some more debugging and a new patch:


Debug the number of TCP connections.

--- signer/src/wire/xfrd.c.orig 2014-12-04 15:17:40.000000000 +0000
+++ signer/src/wire/xfrd.c
@@ -1431,6 +1431,8 @@ xfrd_tcp_obtain(xfrd_type* xfrd, tcp_set
     if (set->tcp_count < TCPSET_MAX) {
         ods_log_assert(!set->tcp_waiting_first);
         set->tcp_count ++;
+       ods_log_info("[%s] allocating new TCP connection: %d",
+               xfrd_str, set->tcp_count);
         /* find a free tcp_buffer */
         for (i=0; i < TCPSET_MAX; i++) {
             if (set->tcp_conn[i]->fd == -1) {
@@ -1451,8 +1453,8 @@ xfrd_tcp_obtain(xfrd_type* xfrd, tcp_set
         return;
     }
     /* wait, at end of line */
-    ods_log_verbose("[%s] max number of tcp connections (%d) reached",
-        xfrd_str, TCPSET_MAX);
+    ods_log_info("[%s] max number of TCP connections (%d) reached: %d",
+        xfrd_str, TCPSET_MAX, set->tcp_count);
     xfrd->tcp_waiting = 1;
     xfrd_unset_timer(xfrd);
     return;
@@ -1596,6 +1598,8 @@ xfrd_tcp_release(xfrd_type* xfrd, tcp_se
     }
     set->tcp_conn[conn]->fd = -1;
     set->tcp_count --;
+    ods_log_info("[%s] releasing TCP connection, active: %d",
+               xfrd_str, set->tcp_count);
     return;
 }
 
Egrep'ing for "tcp|TCP" in the log after a restart gives the
following result:

The connections rise up to the limit (50):

Dec  3 21:32:24 ods-host ods-signerd: [xfrd] allocating new TCP connection: 48
Dec  3 21:32:24 ods-host ods-signerd: [xfrd] allocating new TCP connection: 49
Dec  3 21:32:24 ods-host ods-signerd: [xfrd] allocating new TCP connection: 50
Dec  3 21:32:24 ods-host ods-signerd: [xfrd] max number of TCP connections (50) reached: 50
Dec  3 21:32:24 ods-host ods-signerd: [xfrd] max number of TCP connections (50) reached: 50
Dec  3 21:32:25 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 49
Dec  3 21:32:25 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 48
Dec  3 21:32:25 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 47
Dec  3 21:32:25 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 46

connections are then released down to 0:

Dec  3 21:32:25 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 6
Dec  3 21:34:12 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 5
Dec  3 21:34:12 ods-host ods-signerd: [xfrd] allocating new TCP connection: 6
Dec  3 21:34:12 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 5
Dec  3 21:34:13 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 4
Dec  3 21:34:13 ods-host ods-signerd: [xfrd] allocating new TCP connection: 5
Dec  3 21:34:13 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 4
Dec  3 21:34:15 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 3
Dec  3 21:34:15 ods-host ods-signerd: [xfrd] allocating new TCP connection: 4
Dec  3 21:34:15 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 3
Dec  3 21:34:16 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 2
Dec  3 21:34:16 ods-host ods-signerd: [xfrd] allocating new TCP connection: 3
Dec  3 21:34:16 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 2
Dec  3 21:34:17 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 1
Dec  3 21:34:17 ods-host ods-signerd: [xfrd] allocating new TCP connection: 2
Dec  3 21:34:17 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 1
Dec  3 21:34:19 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 0
Dec  3 21:34:19 ods-host ods-signerd: [xfrd] allocating new TCP connection: 1
Dec  3 21:34:19 ods-host ods-signerd: [xfrd] releasing TCP connection, active: 0

but on a subsequent notify you can (will?) get:

Dec  3 21:42:26 ods-host ods-signerd: [xfrd] zone 2.39.128.in-addr.arpa skips retry: tcp connections full

What?!??  Who overwrote xfrd->tcp_waiting?!?

That field is only supposed to be set when we reach 50 active TCP
sessions, and is cleared on every release action!

In the source, tcp_waiting is only mentioned in xfrd.c, and none
of the places where it's set look suspicious.

Very mysterious!

Additional patching (intermediate step not shown) reveals that
the tcp_waiting variable is set even though there are 0 active
TCP sessions!

Dec  3 22:11:20 ods-host ods-signerd: [xfrd] zone <zone> timeout
Dec  3 22:11:20 ods-host ods-signerd: [xfrd] zone <zone> skips retry: tcp connections full (0)

More patching of signer/src/wire/xfrd.c:

@@ -2025,10 +2029,21 @@ xfrd_handle_zone(netio_type* ATTR_UNUSED
         xfrd_udp_release(xfrd);
     }
     if (xfrd->tcp_waiting) {
-        ods_log_deeebug("[%s] zone %s skips retry: tcp connections full",
-            xfrd_str, zone->name);
-        xfrd_unset_timer(xfrd);
-        return;
+       int count = ((xfrhandler_type*)xfrd->xfrhandler)->tcp_set->tcp_count;
+        ods_log_deeebug("[%s] zone %s skips retry: "
+           "tcp connections full (%d), waiting %d",
+            xfrd_str, zone->name, count, xfrd->tcp_waiting);
+       if (count >= TCPSET_MAX) {
+               xfrd_unset_timer(xfrd);
+               return;
+       } else {
+               /* Hogwash!  Someone trampled on tcp_waiting... */
+               ods_log_deeebug("[%s] ... but tcp connections is NOT full!  Resetting!", xfrd_str);
+               xfrd->tcp_waiting = 0;
+               /* make a new request */
+               xfrd_make_request(xfrd);
+               return;
+       }
     }
     if (xfrd->udp_waiting) {
         ods_log_deeebug("[%s] zone %s skips retry: udp connections full",


Restarting with this and sending notifies appears to trigger this
new log message every time:

Dec  3 22:33:16 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:38:01 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:47:02 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:47:13 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:47:20 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:47:29 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:47:36 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:47:42 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:47:48 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:47:57 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:48:08 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:48:19 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:48:28 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:48:39 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:48:46 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!
Dec  3 22:53:24 ods-host ods-signerd: [xfrd] ... but tcp connections is NOT full!  Resetting!


The next problem I'm facing is that some small zones appear to
fail to transfer because OpenDNSSEC is apparently unhappy about
some of the UDP IXFR packets our BIND 9.9.7-P2 emits:

Dec  3 22:43:17 ods-host ods-signerd: [xfrd] zone <zonename> request udp/ixfr=2015091803 to <hidden-master>
Dec  3 22:43:17 ods-host ods-signerd: [xfrd] send 192 bytes over udp to <hidden-master>
Dec  3 22:43:17 ods-host ods-signerd: [xfrhandler] netio dispatch
Dec  3 22:43:17 ods-host ods-signerd: [xfrd] zone <zonename> event udp read
Dec  3 22:43:17 ods-host ods-signerd: [xfrd] zone <zonename> sets timer timeout now
Dec  3 22:43:17 ods-host ods-signerd: [xfrd] zone <zonename> read data from udp
Dec  3 22:43:17 ods-host ods-signerd: [xfrd] zone <zonename> got update indicating current serial 2015091803 from <hidden-master>
Dec  3 22:43:17 ods-host ods-signerd: [xfrd] zone <zonename> try next master
Dec  3 22:43:17 ods-host ods-signerd: [xfrd] zone <zonename> xfr packet parsed (res 0)
Dec  3 22:43:17 ods-host ods-signerd: [xfrd] bad ixfr packet from <hidden-master>
Dec  3 22:43:17 ods-host ods-signerd: [xfrd] zone <zonename> sets timer timeout retry 3600
Dec  3 22:43:17 ods-host ods-signerd: [xfrd] zone <zonename> make request wait retry
Dec  3 22:43:17 ods-host ods-signerd: [xfrhandler] netio dispatch

I'm suspecting it'll be equally unhappy in an hour, on the next
retry(?)

I can't find anything in the documentation about whether to use
IXFR or AXFR -- I suppose that's supposed to be automatic, except
it isn't(?)  Well, more on that tomorrow, I guess.


The end of all this is that out of the zones which remained in
the "I will [read] zone" state in ods-signer's "queue" output, I
could now send DNS notify messages from the master and actually
trigger a zone transfer where I earlier could not, which from an
operational point of view is the only sensible thing, IMHO.


So I'm now finally in a position where "queue" in ods-signer says
"I will [sign] zone" for all 366 zones.  Whew!


Regards,

- Håvard