[Opendnssec-develop] Sessions with network HSM:s

Tue Nov 16 09:41:10 UTC 2010

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 16-11-10 09:07, Rick van Rein wrote:
> 
> What is the problem in reopening the connection if it died?  

According to the SafeNet documentation, I think this is the way to go.
It not only solves the case where the connection simply timed out, but
can also recover from network glitches.
The network glitches don't need to be between the enforcer and the HSM.
We use HSMs in a pool, and when one of the connections between the
HSMs is lost, the pool also returns an error to the enforcer, refusing
to accept the command, and then restores the connection between the
HSMs using autorecovery. Having the enforcer reissue the command a little
bit later is better than simply dying:

- --- ORIGINAL DOCUMENTATION: ---
Ch11 - HA

Implementing HA

If you use the Luna SA HA feature then the calls to the Luna SAs are
load-balanced. The session handle that the application receives when it
opens a session is a virtual one and is managed by the HA code in the
library. The actual sessions with the HSM are established by the HA code
in the library and hidden from the application and will come and go as
necessary to fulfill application level requests.

Before the introduction of HA AutoRecovery, bringing a failed/lost group
member back into the group (recovery) was a manual procedure.

The Administration & Maintenance section contains a general description
of how the HA AutoRecovery function works, in practice.

For every PKCS11 call, the HA recover logic will check to see if we need
to perform auto recovery to a disconnected appliance. If there is a
disconnected appliance then it will try to reconnect to that appliance
before it proceeds with the current PKCS11 call.

The HA recovery logic is designed in such a way that it will only try to
reconnect to an appliance every X secs and at most N times, where X
and N are configurable via the "VTL" utility. The following is the
pseudo code of the HA logic:

if (disconnected_member > 0 and recover_attempt_count < N and
    time_now - last_recover_attempt > X) then
   perform auto recovery
   set last_recover_attempt equal to time_now
   if (recovery failed) then
      increment recover_attempt_count by 1
   else
      decrement disconnected_member by 1
      reset recover_attempt_count to 0
   end if
end if

The HA auto recovery design runs within a pkcs11 call. The
responsiveness of recovering a disconnected member is greatly influenced
by the frequency of PKCS11 calls from the user application. Although the
logic shows that it will attempt to recover a disconnected client in X
secs, in reality, it will not run until the user application makes the
next PKCS11 call.
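
The pseudo code above can be modelled in a few lines of Python. This is
only a sketch for illustration, not SafeNet's implementation: the class
name, the `before_call` hook, and the `reconnect` callable are
hypothetical, and the clock is injectable so the timing can be simulated.
It shows that a recovery attempt happens only when the application makes
a call, however much time has passed.

```python
import time


class HARecovery:
    """Toy model of the quoted pseudo code (all names are hypothetical)."""

    def __init__(self, retry_interval, max_attempts, reconnect,
                 clock=time.monotonic):
        self.retry_interval = retry_interval   # "X" in the pseudo code
        self.max_attempts = max_attempts       # "N" in the pseudo code
        self.reconnect = reconnect             # callable, True on success
        self.clock = clock
        self.disconnected_members = 0
        self.recover_attempt_count = 0
        self.last_recover_attempt = float("-inf")

    def before_call(self):
        """Run at the start of every (simulated) PKCS#11 call."""
        now = self.clock()
        if (self.disconnected_members > 0
                and self.recover_attempt_count < self.max_attempts
                and now - self.last_recover_attempt > self.retry_interval):
            self.last_recover_attempt = now
            if self.reconnect():
                self.disconnected_members -= 1
                self.recover_attempt_count = 0
            else:
                self.recover_attempt_count += 1


# Simulated run: one member is down, X = 60 secs, N = 3.
fake_now = [0.0]
attempts = []


def fake_reconnect():
    attempts.append(fake_now[0])
    return len(attempts) >= 2      # first attempt fails, second succeeds


ha = HARecovery(60, 3, fake_reconnect, clock=lambda: fake_now[0])
ha.disconnected_members = 1

ha.before_call()       # t = 0: first attempt, fails
fake_now[0] = 30.0
ha.before_call()       # t = 30: less than X since last attempt, skipped
fake_now[0] = 61.0
ha.before_call()       # t = 61: second attempt, succeeds
print(attempts, ha.disconnected_members)
```

If the application made no call between t = 0 and t = 61, no recovery
would be attempted in that window at all, which is the point the
paragraph above makes.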

How Does Your Software Know That a Member Has Failed?

When an HA Group member first fails, the HA status for the group shows
"device error" for the failed member. All subsequent calls return "token
not present", until the member (HSM Partition or PKI token) is returned
to service.

Here is an example of two such calls using CKDemo:

Enter your choice : 52

Slots available:
   slot#1 - LunaNet Slot
   slot#2 - LunaNet Slot
   slot#3 - HA Virtual Card Slot

Select a slot: 3

HA group 1599447001 status:

   HSM 599447001      - CKR_DEVICE_ERROR
   HSM 78665001       - CKR_OK
Status: Doing great, no errors (CKR_OK)

<SNAP>

Enter your choice : 52

Slots available:
   slot#1 - LunaNet Slot
   slot#2 - LunaNet Slot
   slot#3 - HA Virtual Card Slot

Select a slot: 3

HA group 1599447001 status:

   HSM 599447001      - CKR_TOKEN_NOT_PRESENT
   HSM 78665001       - CKR_OK
Status: Doing great, no errors (CKR_OK)
- --- End of ORIGINAL DOCUMENTATION ---
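
For monitoring, the per-member lines in the CKDemo output quoted above
could be picked apart with a small script. This is a rough sketch that
assumes the status text has exactly the layout shown in the sample; the
function name `failed_members` is made up for illustration, and an
application would normally look at the return values of its own PKCS #11
calls rather than parse console output.

```python
import re

# Matches lines such as "   HSM 599447001      - CKR_DEVICE_ERROR".
STATUS_LINE = re.compile(r"^\s*HSM\s+(\d+)\s+-\s+(CKR_[A-Z_]+)\s*$")


def failed_members(status_text):
    """Return {serial: return code name} for members not reporting CKR_OK."""
    failed = {}
    for line in status_text.splitlines():
        match = STATUS_LINE.match(line)
        if match and match.group(2) != "CKR_OK":
            failed[match.group(1)] = match.group(2)
    return failed


sample = """\
HA group 1599447001 status:

   HSM 599447001      - CKR_TOKEN_NOT_PRESENT
   HSM 78665001       - CKR_OK
Status: Doing great, no errors (CKR_OK)
"""

print(failed_members(sample))
```

Note that the group-level "Status:" line still reports CKR_OK in both
samples, so only the per-member lines reveal the failed member.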

> Why not just reconnect a lost connection?  That solves such HSM problems and,
> at the same time, network disruptions.  We have a redundancy layer underneath
> PKCS #11 doing this for us, so I hadn't noticed this problem on our SafeNet
> HSMs.

The reason you probably don't see this often is that you maintain
more signatures. We only have one KSK, which changes only once every 5
years or on surprise rollovers, and one ZSK, which changes every 3 months.
Roland told me he had seen the CKR_TOKEN_NOT_PRESENT error once;
that was probably also due to a lost connection caused by a network
glitch between HSMs in a pool.
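
To make the "reissue the command a little bit later" idea from earlier
in this message concrete, here is a minimal sketch in Python. It is not
how the enforcer is written: `call_with_retry` and its parameters are
hypothetical, and treating CKR_DEVICE_ERROR and CKR_TOKEN_NOT_PRESENT as
the retryable codes is an assumption based on the two errors seen above.
The numeric values are the ones defined by the PKCS #11 specification.

```python
import time

# Return code values as defined by the PKCS #11 specification.
CKR_OK = 0x00000000
CKR_DEVICE_ERROR = 0x00000030
CKR_TOKEN_NOT_PRESENT = 0x000000E0

# Assumption: these are the codes worth retrying after a lost connection.
TRANSIENT = {CKR_DEVICE_ERROR, CKR_TOKEN_NOT_PRESENT}


def call_with_retry(operation, attempts=5, delay=2.0, sleep=time.sleep):
    """Call operation() again while it returns a transient error.

    operation is a hypothetical zero-argument callable that wraps one
    PKCS #11 call and returns its return code as an int.
    """
    rv = operation()
    for _ in range(attempts - 1):
        if rv not in TRANSIENT:
            break                  # success, or an error not worth retrying
        sleep(delay)               # give autorecovery some time
        rv = operation()
    return rv


# Simulated HSM pool: fails twice, then recovers.
codes = iter([CKR_DEVICE_ERROR, CKR_TOKEN_NOT_PRESENT, CKR_OK])
rv = call_with_retry(lambda: next(codes), sleep=lambda seconds: None)
print(hex(rv))
```

Because autorecovery only runs inside a PKCS #11 call, each retry also
gives the HA code in the library a chance to restore the lost member.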

- -- 
Antoin Verschuren

Technical Policy Advisor SIDN
Utrechtseweg 310, PO Box 5022, 6802 EA Arnhem, The Netherlands

P: +31 26 3525500  F: +31 26 3525505  M: +31 6 23368970
mailto:antoin.verschuren at sidn.nl  xmpp:antoin at jabber.sidn.nl
http://www.sidn.nl/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)

iQEcBAEBAgAGBQJM4lGvAAoJEDqHrM883Agn0dQH/RRAy7Wttpot2Ca1DX1Oqc+2
1vXtIR0DQRYcJjPAOIk3EtCTtPArOYh1LXw1G7FtgE4crEPQpk5MmBhsBf63a9BJ
AqVN7lUm6m7nHmk61O8DdoAIKZCLzLLDLVd0P6vumbT5c8CAWJpg6GW1LVpx7Wgu
vrFK7EC1bHMopqL/nZbabFY/4H9e/wg075AdBmqyX4XOXfnufUffhWURoF3KAijz
MwIv0rhc6lHAze/YdCsLwJxAvfNcGIMq4kDhIaJMWSeBLKW3nHgqYA2XFGczljsG
nM8gAAIEYw5fXTkFwdGysffoeITsOJIKKJMKssnc+lHucjolS3TtliIvMVJIgqU=
=8vqH
-----END PGP SIGNATURE-----