utf16Width and unpaired surrogate codepoint
Added by Michael Beckerle about 9 years ago
The spec says: "When utf16Width is 'variable', then on parsing an un-paired surrogate codepoint causes a decode error, which can be controlled via dfdl:encodingErrorPolicy described below."
Is this what the ICU library does?
We have generally tried to make it possible to use ICU when implementing DFDL, but does ICU in fact work this way? The alternative is that unpaired surrogates are passed through and treated as ordinary 16-bit characters. Similarly, on unparsing, a string containing an unpaired surrogate codepoint could simply be written out as-is.
This behavior needs to be clarified for parsing and unparsing, and for both the C and Java ICU libraries, if we are to have confidence that we're not specifying something very hard to implement.
Replies (4)
RE: utf16Width and unpaired surrogate codepoint - Added by Steve Hanson about 9 years ago
Needs some investigation
Action 236 RE: utf16Width and unpaired surrogate codepoint - Added by Michael Beckerle almost 9 years ago
Added action 236 to subject.
Resolved - RE: utf16Width and unpaired surrogate codepoint - Added by Michael Beckerle almost 9 years ago
Verified: when ICU encounters an unpaired surrogate, it invokes its malformed-input callback, and one can control how the error is then handled.
This means the spec, as worded, is implementable using ICU.
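For anyone who wants to see the general pattern, here is a minimal sketch. Note the assumptions: it uses the JDK's own java.nio.charset decoder rather than ICU (the ICU4J and ICU4C callback APIs are not shown and may differ in detail), and the class and method names are made up for illustration. It decodes a UTF-16BE buffer containing a lone low surrogate, once with the malformed-input action set to report an error and once set to substitute a replacement character, which is roughly the kind of choice dfdl:encodingErrorPolicy is meant to control.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; uses the JDK decoder as a stand-in for ICU.
public class UnpairedSurrogateDemo {

    // Decode UTF-16BE bytes, applying the given action to malformed input.
    static String decode(byte[] bytes, CodingErrorAction onMalformed)
            throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_16BE.newDecoder()
                .onMalformedInput(onMalformed)
                .onUnmappableCharacter(onMalformed);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Render a string as a list of hex code units for easy inspection.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // 'A', then a lone low surrogate U+DC00, then 'B'
        byte[] data = {0x00, 0x41, (byte) 0xDC, 0x00, 0x00, 0x42};

        // Policy 1: treat the unpaired surrogate as a decode error.
        try {
            String s = decode(data, CodingErrorAction.REPORT);
            System.out.println("REPORT: decoded as " + hex(s));
        } catch (CharacterCodingException e) {
            System.out.println("REPORT: decode error: " + e);
        }

        // Policy 2: substitute the decoder's replacement character.
        try {
            String s = decode(data, CodingErrorAction.REPLACE);
            System.out.println("REPLACE: decoded as " + hex(s));
        } catch (CharacterCodingException e) {
            System.out.println("REPLACE: decode error: " + e);
        }
    }
}
```

The point of the sketch is only that the decision is made at the decoder's error hook rather than in the caller's own scanning code, which is why the spec wording does not look hard to implement on top of a converter library.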
Closed: utf16Width and unpaired surrogate codepoint - Added by Steve Hanson over 8 years ago
No update to experience documents or specification needed.