precis, idna and icu: internationalization mess
how are we going to handle PRECIS?
In most cases, authoritative entities such as servers are responsible for enforcement, whereas entities such as clients are responsible only for preparation.
which means I'll be covering preparation only (for now), but I should keep in mind that the current design must support future enforcement extensions.
- PRECIS IdentifierClass
- PRECIS FreeformClass
- IDNA2008
icu has a somewhat awkward way to initialise itself, so make sure you follow the steps for initialising it safely.
Great care needs to be exercised when using u_cleanup() and should only be implemented by those who know what they are doing.
oopsie doopsie! that's bad news, not to mention that u_cleanup() is thread-unsafe and will most likely cause problems. find a way to deal with this later.
UErrorCode is of type enum, use the U_SUCCESS and U_FAILURE macros.
okay, they seem to have an interesting feature for error handling extensibility, such as providing your own conversion callback functions when an error occurs.
icu handles strings internally in UTF-16, so we'll have to do some conversion operations because we rely on UTF-8.
make sure to revisit the strings tab because it explains how icu represents and handles them: either terminated with \0 or with an explicit capacity/length, and the latter makes it fine for the string to contain multiple \0.
for input strings, typically two parameters are passed (UChar *s, int32_t len). if the length argument is -1, then the string must be NUL-terminated.
for output strings, the arguments are the same as for input (a buffer and its capacity) and the function returns the length of the output as int32_t as well. if space was available, the output will be NUL-terminated, and the returned length never includes the NUL character. use UErrorCode to check for errors, you may get:
- U_BUFFER_OVERFLOW_ERROR when the output did not fit in the buffer
- U_STRING_NOT_TERMINATED_WARNING when the output fit exactly but there was no room for the NUL; this is only a warning, so U_SUCCESS() still passes
preflighting: the returned length is always the full output length, even if the output buffer is too small. pass (NULL, 0) to determine the necessary output buffer size. add one to make the output NUL-terminated.
most string/stdlib functions are the same as the standard c ones, just prefixed with u_.
you also have access to macros like U16_NEXT(s, i, length, c) and U8_APPEND(s, i, capacity, c, isError).
unicode string storage model methods:
- stored inside the UnicodeString object for short strings.
- Allocate UChar * for longer strings.
- UnicodeString can be constructed to alias a read-only buffer (which comes with its own issues).
- UnicodeString can be constructed to alias a writeable buffer.
in short, it's a mess. re-read about the above later.
UTF-8 as default charset by #define U_CHARSET_IS_UTF8 1 in or before unicode/utypes.h, which changes most of the implementation code to use dedicated UTF-8 code paths.
let's try to sum up what we need to cover with icu
- JID's domainpart is an IDNA-aware domain name slot
- JID's localpart is an instance of the UsernameCaseMapped profile of the PRECIS IdentifierClass
- JID's resourcepart is an instance of the OpaqueString profile of the PRECIS FreeformClass
so yeah, IDNA is implemented under uidna.h and should be used as it is, since it implements IDNA2008 and UTS#46. string comparison in uidna.h is IDNA2003-only, so it should probably be implemented by the user of the API; in theory it is just an octet-by-octet comparison according to the rfc.
I'll cover precis classes later I suppose.