precis, idna and icu: internationalization mess cont.
oh yeah we're trying to handle PRECIS, aren't we?
In most cases, authoritative entities such as servers are responsible for enforcement, whereas entities such as clients are responsible only for preparation.
which means I'll be covering preparation only (for now), but I should keep in mind that the current design must support future enforcement extensions.
- PRECIS IdentifierClass
- PRECIS FreeformClass
well we've covered IDNA in the previous post, now we have to handle the actual PRECIS stuff.
okay so let's start with RFC 8264 section 8, because it's kind of the core of PRECIS if you think about it.
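to make the section 8 ordering concrete, here's a minimal sketch in plain C of the derived property calculation. it assumes the category membership of a code point has already been worked out and packed into a bitmask. the CAT_* flags, the precis_property and precis_class enums and both function names are hypothetical, made up for illustration; none of them come from icu or the RFC. the value for Exceptions (F) comes from a table in the RFC, so here the caller just passes it in.

```c
#include <stdint.h>

/* Hypothetical membership flags, one per category letter used in the RFCs. */
enum {
    CAT_EXCEPTIONS          = 1 << 0,  /* (F) */
    CAT_UNASSIGNED          = 1 << 1,  /* (J) */
    CAT_ASCII7              = 1 << 2,  /* (K) */
    CAT_JOIN_CONTROL        = 1 << 3,  /* (H) */
    CAT_OLD_HANGUL_JAMO     = 1 << 4,  /* (I) */
    CAT_PRECIS_IGNORABLE    = 1 << 5,  /* (M) */
    CAT_CONTROLS            = 1 << 6,  /* (L) */
    CAT_HAS_COMPAT          = 1 << 7,  /* (Q) */
    CAT_LETTER_DIGITS       = 1 << 8,  /* (A) */
    CAT_OTHER_LETTER_DIGITS = 1 << 9,  /* (R) */
    CAT_SPACES              = 1 << 10, /* (N) */
    CAT_SYMBOLS             = 1 << 11, /* (O) */
    CAT_PUNCTUATION         = 1 << 12  /* (P) */
};

/* Derived property values from RFC 8264 (names are my own spelling). */
typedef enum {
    PRECIS_PVALID,
    PRECIS_CONTEXTJ,
    PRECIS_CONTEXTO,
    PRECIS_DISALLOWED,
    PRECIS_UNASSIGNED,
    PRECIS_ID_DIS_OR_FREE_PVAL
} precis_property;

typedef enum { PRECIS_IDENTIFIER_CLASS, PRECIS_FREEFORM_CLASS } precis_class;

/* Walk the categories in the order RFC 8264 section 8 lists them.
 * The first match wins, so the order of the checks matters. */
precis_property precis_derive(uint32_t cats, precis_property exception_value)
{
    if (cats & CAT_EXCEPTIONS)       return exception_value;
    /* BackwardCompatible (G) is an empty set, so there is nothing to test. */
    if (cats & CAT_UNASSIGNED)       return PRECIS_UNASSIGNED;
    if (cats & CAT_ASCII7)           return PRECIS_PVALID;
    if (cats & CAT_JOIN_CONTROL)     return PRECIS_CONTEXTJ;
    if (cats & CAT_OLD_HANGUL_JAMO)  return PRECIS_DISALLOWED;
    if (cats & CAT_PRECIS_IGNORABLE) return PRECIS_DISALLOWED;
    if (cats & CAT_CONTROLS)         return PRECIS_DISALLOWED;
    if (cats & CAT_HAS_COMPAT)       return PRECIS_ID_DIS_OR_FREE_PVAL;
    if (cats & CAT_LETTER_DIGITS)    return PRECIS_PVALID;
    /* R, N, O and P are checked one after another and all give the same value. */
    if (cats & (CAT_OTHER_LETTER_DIGITS | CAT_SPACES |
                CAT_SYMBOLS | CAT_PUNCTUATION))
        return PRECIS_ID_DIS_OR_FREE_PVAL;
    return PRECIS_DISALLOWED;
}

/* "ID_DIS or FREE_PVAL" means DISALLOWED for IdentifierClass
 * and PVALID for FreeformClass. */
precis_property precis_resolve(precis_property p, precis_class cls)
{
    if (p == PRECIS_ID_DIS_OR_FREE_PVAL)
        return cls == PRECIS_FREEFORM_CLASS ? PRECIS_PVALID : PRECIS_DISALLOWED;
    return p;
}
```

for example, a space would only have CAT_SPACES set, so precis_derive() gives PRECIS_ID_DIS_OR_FREE_PVAL, which precis_resolve() turns into DISALLOWED for IdentifierClass and PVALID for FreeformClass.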
let's start with the PRECIS-specific unicode code point categories; cp refers to the code point
- (K) ASCII7: cp in {0021..007E}
- (L) Controls: u_iscntrl()
- (M) PrecisIgnorableProperties: UCHAR_NONCHARACTER_CODE_POINT or UCHAR_DEFAULT_IGNORABLE_CODE_POINT
- (N) Spaces: U_GC_ZS_MASK
- (O) Symbols: U_GC_SM_MASK, U_GC_SC_MASK, U_GC_SK_MASK and U_GC_SO_MASK in that order
- (P) Punctuation: U_GC_PC_MASK, U_GC_PD_MASK, U_GC_PS_MASK, U_GC_PE_MASK, U_GC_PI_MASK, U_GC_PF_MASK and U_GC_PO_MASK in that order
- (Q) HasCompat: get unorm2_getNFKCInstance() and use it to check if NFKC(cp) != cp
- (R) OtherLetterDigits: U_GC_LT_MASK, U_GC_NL_MASK, U_GC_NO_MASK, U_GC_ME_MASK in that order
now back to the IDNA ones. the thing with these is that icu already offers an IDNA API, so defining them myself means redoing what icu has already done for me.
- (A) LetterDigits: U_GC_LL_MASK, U_GC_LU_MASK, U_GC_LO_MASK, U_GC_ND_MASK, U_GC_LM_MASK, U_GC_MN_MASK and U_GC_MC_MASK in that order
- (F) Exceptions
- (G) BackwardCompatible is an empty set
- (H) JoinControl: UCHAR_JOIN_CONTROL
- (I) OldHangulJamo: UBLOCK_HANGUL_JAMO in addition to extended A and B maybe? re-visit this
- (J) Unassigned: U_GC_CN_MASK and cp in UCHAR_NONCHARACTER_CODE_POINT is false
okay, I've realized that icu4c is actually written in C++, which is disappointing because I can't reuse the IDNA parts mentioned above to cover PRECIS. of course I'm not as experienced as the icu folks, so I can't promise the same quality of treatment, but I have to do it anyway. not to mention that while reading their code I found it surprisingly messy for a library that so much major software depends on; even OSs like Windows fully integrate it. I feel like it could've been designed better than this.
remind me (to myself I guess) to edit color and/or font of quote and code elements.