[Opendnssec-develop] [OpenDNSSEC] #262: Possible race condition causing CPU-bound loop in signerd?

OpenDNSSEC owner-dnssec-trac at kirei.se
Mon Sep 26 12:46:17 UTC 2011

#262: Possible race condition causing CPU-bound loop in signerd?
Reporter:  goeran@…            |       Owner:  matthijs      
    Type:  defect              |      Status:  new           
Priority:  major               |   Component:  Signer        
 Version:  1.3.0               |    Keywords:  CPU-bound loop
 I have a problem with ODS 1.3.2 (or rather the 1.3 branch as of rev. 5653,
 but I've seen it in all 1.3 versions) running on a RHEL 5.7 system.

 Now and then (every second week or so) ods-signerd gets stuck in a
 CPU-bound loop. Most of the threads use CPU. The problem appears to be
 related to a mutex lock (a futex, to be more specific).
 (The pthread_cond_timedwait call in ods_thread_wait in

 While stuck in the loop, it also keeps a lock on the kasp database.
 I have some other processes (backups of the key database, for example) that
 wait for the same database lock. Whether it is a race condition when accessing
 the kasp database or something internal to ods-signerd is unclear.
 It may even be a RHEL bug. Futex locks appear to have had some
 problems in earlier Linux versions (and programming with them is
 tricky in any version).

 I attach some gdb and other output from the process and its threads.  I
 planned to leave ODS in the "loop" state to allow further information to be
 collected, but after some strace commands the hang resolved itself. However,
 the next signing run got stuck in a waitpid call, and the only way to resolve
 that was to stop ODS (ods-control stop, plus killing some processes). Maybe
 some internal state got confused by the long time spent in the CPU-bound loop.

 Known problem? Linux problem? Some resource problem? Other?

                         / Göran Bengtson
                           Chalmers Univ. of Technology

Ticket URL: <http://trac.opendnssec.org/ticket/262>
OpenDNSSEC <http://www.opendnssec.org/>

