[Opendnssec-develop] HSMs use UTF-8 characters

Tue May 20 14:20:50 UTC 2014

Hi Matthijs and Sion,

I am auditing the libhsm code, and one thing I keep running into is character sets.  PKCS #11 uses RFC 2279 strings (the older UTF-8 definition), whereas the rest of the code assumes ASCII.

There are two ways out of this:
 - only support ASCII — thus constraining token labels and PIN codes
 - pass UTF-8 codes to the libhsm-user as wide characters

If we only wish to support ASCII, we should reject other content, or strip bytes >= 0x80, because we do not interpret them along the lines of RFC 2279.
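As a rough sketch of the ASCII-only option: the check below rejects any byte outside 7-bit ASCII.  The name is_plain_ascii is hypothetical and not part of libhsm.  It takes an explicit length because PKCS #11 token labels are fixed-width and blank-padded rather than NUL-terminated.  Rejecting an embedded NUL is an extra assumption on my part.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not an existing libhsm function.
 * Returns true only if every one of the len bytes is 7-bit ASCII.
 * Assumption: an embedded NUL is also rejected, since it would
 * silently truncate the value once it is handled as a C string. */
bool is_plain_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x00 || buf[i] >= 0x80) {
            return false;
        }
    }
    return true;
}
```

Rejecting seems safer to me than stripping: silently removing bytes from a PIN or label changes its value without the user noticing.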

If we decide to support RFC 2279, we should use the facilities in C to represent strings in Unicode, using wchar_t.  This type is supported by many standard library functions, including printf ("%ls", my_wide_string).  Its width and encoding are implementation-defined, but it must be able to represent every character of all locales that the implementation supports.
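To make the wchar_t route concrete, here is a minimal sketch of the conversion step.  The name to_wide is hypothetical, not an existing libhsm call.  It relies on the standard mbstowcs(), which interprets the input according to the current LC_CTYPE locale, so UTF-8 input is only decoded correctly if the program has selected a UTF-8 locale (for instance with setlocale (LC_CTYPE, "")).

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not an existing libhsm function.
 * Converts a NUL-terminated multibyte string, interpreted according to
 * the current LC_CTYPE locale, into a newly allocated wide string.
 * Returns NULL on an invalid sequence or allocation failure.
 * The caller must free() the result. */
wchar_t *to_wide(const char *mbs)
{
    /* A wide string never has more characters than the input has bytes. */
    size_t cap = strlen(mbs) + 1;
    wchar_t *w = malloc(cap * sizeof *w);
    if (w == NULL) {
        return NULL;
    }

    size_t n = mbstowcs(w, mbs, cap);
    if (n == (size_t)-1 || n >= cap) {
        free(w);            /* invalid sequence, or no room for terminator */
        return NULL;
    }
    w[n] = L'\0';
    return w;
}
```

The result can then be printed with printf ("%ls\n", w).  Note that printf converts the wide characters back to multibyte form using the current locale, so the call fails if a character cannot be represented there.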

We cannot ignore UTF-8 like we have to date.  There are a few openings for potential abuse, possibly in token labels or entered PINs:
 * Encode the '\0' character as a UTF-8 sequence of more than one byte (for example 0xC0 0x80), none of which is 0x00, and cause confusion elsewhere
 * Place a more-bytes-to-follow code just before the '\0' (ASCII NUL) that ends a C string, so that a (bad but imaginable) UTF-8 interpreter swallows the terminator as a continuation byte
 * Strings may be provided under RFC 2279 and interpreted under RFC 3629 or ASCII (both of which are stricter and accept only a subset of RFC 2279)
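One way to close all three openings would be to validate incoming strings strictly before doing anything else with them.  The sketch below checks a buffer against the RFC 3629 rules; the name is_strict_utf8 is hypothetical and this is an illustration, not existing libhsm code.  It works on an explicit length, so a lead byte at the end of the buffer is reported as truncated instead of reading past it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an existing libhsm function.
 * Returns true only if the len bytes at s are well-formed UTF-8 in the
 * RFC 3629 sense: no overlong forms, no 5- or 6-byte sequences, no
 * surrogates, nothing above U+10FFFF, no truncated sequences.
 * A literal 0x00 byte counts as valid (U+0000); callers that handle
 * C strings may want to reject it separately. */
bool is_strict_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        size_t need;        /* continuation bytes expected */
        uint32_t cp, min;   /* code point and smallest legal value */

        if (b < 0x80) {
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return false;   /* stray continuation or RFC 2279-only lead */
        }

        if (len - i - 1 < need) {
            return false;   /* lead byte promises more than is there */
        }
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                return false;
            }
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;   /* overlong, out of range, or surrogate */
        }
        i += need + 1;
    }
    return true;
}
```

With this check, 0xC0 0x80 is rejected as an overlong form of NUL, and a string ending in a lone lead byte such as 0xC3 is rejected as truncated.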

I think we should continue to accept the UTF-8 coding of PKCS #11, but then communicate with the programs that use libhsm through wchar_t instead of char, and change the routines that print these strings to use %ls instead of %s.  A few other changes may be needed to integrate with the locale.  Does this sound like the right choice?

Cheers,
 -Rick