precis, idna and icu: internationalization mess

10-20-2019

how are we going to handle PRECIS?

In most cases, authoritative entities such as servers are responsible for enforcement, whereas entities such as clients are responsible only for preparation.

which means I'll be covering preparation only (for now), but I should keep in mind that the current design must support future enforcement extensions.

PRECIS IdentifierClass
PRECIS FreeformClass
IDNA2008

icu has a somewhat quirky way of initialising itself, so make sure to follow its documented steps for safe initialisation.

Great care needs to be exercised when using u_cleanup() and should only be implemented by those who know what they are doing.

oops, that's a problem: on top of that warning, u_cleanup() is thread-unsafe and will most likely cause trouble. need to find a way to deal with this later.

UErrorCode is an enum; test it with the U_SUCCESS() and U_FAILURE() macros rather than comparing values directly. icu also seems to have an interesting extension point for error handling: you can provide your own conversion callback functions that get called when an error occurs.

icu handles strings internally in UTF-16, so we'll have to do some conversion operations because we rely on UTF-8.

make sure to revisit the strings tab because it explains how icu represents and handles strings: it supports both \0-terminated strings and strings with an explicit capacity/length, and the latter makes it fine for a string to contain multiple \0.

for input strings, typically two parameters are passed (UChar *s, int32_t len). if the length argument is -1, then the string must be NUL-terminated.

for output strings, the arguments mirror the input ones (a UChar * buffer plus an int32_t capacity) and the function returns the output length as int32_t. if there is room, the output is NUL-terminated; the returned length never includes the terminating NUL. use UErrorCode to check for errors, you may get

U_BUFFER_OVERFLOW_ERROR
U_STRING_NOT_TERMINATED_WARNING, which is only a warning: U_SUCCESS() is still true for it, so check for it explicitly

preflighting: the returned length is always the full output length, even if the output buffer is too small. pass (NULL, 0) to determine the necessary output buffer size, then add one so the output can be NUL-terminated.

most string/stdlib functions are the same as standard c ones, just prefixed with u_.

you also have access to macros like U16_NEXT(s, i, length, c) and U8_APPEND(s, i, length, c, isError).

unicode string storage model methods:

UnicodeString object for short strings.
Allocate UChar * for longer strings.
UnicodeString can be constructed to alias a read-only buffer (which comes with its own issues).
UnicodeString can be constructed to alias a writeable buffer.

in short, it's a mess. re-read about the above later.

UTF-8 can be made the default charset with #define U_CHARSET_IS_UTF8 1 in or before unicode/utypes.h, which switches most of the implementation code to dedicated UTF-8 code paths.

let's try to sum up what we need to cover with icu

JID's domainpart is an IDNA-aware domain name slot
JID's localpart is an instance of the UsernameCaseMapped profile of the PRECIS IdentifierClass
JID's resourcepart is an instance of the OpaqueString profile of the PRECIS FreeformClass
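
before any of those three can be prepared, the JID has to be cut into its parts. a plain c sketch of that, assuming the usual splitting rule (the first '/' starts the resourcepart, and the first '@' before it ends the localpart). struct jid_parts and jid_split are hypothetical names, and no PRECIS or IDNA processing happens here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* views into the original string; a missing part is NULL with length 0 */
struct jid_parts {
    const char *local;    size_t local_len;
    const char *domain;   size_t domain_len;
    const char *resource; size_t resource_len;
};

/* split "local@domain/resource" without copying.
   returns 0 on success, -1 if a part that is present is empty. */
int jid_split(const char *jid, struct jid_parts *out)
{
    size_t len, bare_len;
    const char *slash, *at;

    assert(jid != NULL && out != NULL);
    memset(out, 0, sizeof *out);

    len = strlen(jid);
    /* everything after the first '/' is the resourcepart */
    slash = memchr(jid, '/', len);
    bare_len = slash ? (size_t)(slash - jid) : len;
    if (slash) {
        out->resource = slash + 1;
        out->resource_len = len - bare_len - 1;
        if (out->resource_len == 0)
            return -1;
    }

    /* within the bare JID, the first '@' ends the localpart */
    at = memchr(jid, '@', bare_len);
    if (at) {
        out->local = jid;
        out->local_len = (size_t)(at - jid);
        if (out->local_len == 0)
            return -1;
        out->domain = at + 1;
        out->domain_len = bare_len - out->local_len - 1;
    } else {
        out->domain = jid;
        out->domain_len = bare_len;
    }
    return out->domain_len ? 0 : -1;
}
```

each part could then be handed to its own preparation routine: domain to IDNA, local and resource to their PRECIS profiles.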

so yeah, IDNA is implemented in uidna.h and should be used as it is, since it implements IDNA2008 and UTS#46. the string comparison functions there are IDNA2003-only, so comparison should probably be implemented by the user of the API; in theory that is just an octet-by-octet comparison according to the rfc.

I'll cover precis classes later I suppose.