This is a static archive of the previous Open Grid Forum Redmine content management system saved from host redmine.ogf.org file /boards/15/topics/49?r=271 at Thu, 03 Nov 2022 15:43:28 GMT

Public Comments Archive - Open Grid Forum

utf16Width and unpaired surrogate codepoint

Added by Michael Beckerle about 9 years ago

The spec says: "When utf16Width is 'variable', then on parsing an un-paired surrogate codepoint causes a decode error, which can be controlled via dfdl:encodingErrorPolicy described below."

Is this what the ICU library does?

We have generally tried to make it possible to use ICU when implementing DFDL, but does ICU in fact work this way? The alternative is that unpaired surrogates are passed through and treated as ordinary 16-bit characters. Similarly, on unparsing, a string containing an unpaired surrogate codepoint could be written out as-is.

This behavior needs to be clarified for parsing and unparsing, and for both C and Java ICU libraries if we are to have confidence we're not specifying something very hard to implement.
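
To make the two alternatives concrete, here is a minimal sketch using Python's built-in UTF-16 codec rather than ICU (so it illustrates the choice, not ICU's actual behavior). The function names and the "passthrough" policy name are hypothetical; "error" and "replace" are assumed here to correspond to dfdl:encodingErrorPolicy settings.

```python
# Sketch only: Python's built-in codecs, not ICU.
# Maps a (partly hypothetical) policy name to a Python error handler.
HANDLERS = {
    "error": "strict",               # unpaired surrogate is a decode error
    "replace": "replace",            # substitute a replacement character
    "passthrough": "surrogatepass",  # hypothetical: keep it as a 16-bit unit
}

# 0xD800 (a high surrogate) followed by 'A' instead of a low surrogate.
LONE_HIGH = b"\x00\xd8\x41\x00"


def parse_utf16le(data: bytes, policy: str = "error") -> str:
    """Decode UTF-16LE bytes under the chosen policy."""
    return data.decode("utf-16-le", errors=HANDLERS[policy])


def unparse_utf16le(text: str, policy: str = "error") -> bytes:
    """Encode a string to UTF-16LE under the chosen policy."""
    return text.encode("utf-16-le", errors=HANDLERS[policy])
```

Under "error", both parse_utf16le(LONE_HIGH) and unparse_utf16le("\ud800A") raise; under "passthrough", the lone surrogate survives a round trip unchanged.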


Replies (4)

Resolved - RE: utf16Width and unpaired surrogate codepoint - Added by Michael Beckerle almost 9 years ago

Verified: for an unpaired surrogate, ICU invokes a malformed-input callback, and one can control how the error is then handled.

This means the spec, as worded, is implementable using ICU.
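
As a rough analogy for the callback mechanism described above, the sketch below registers a custom error handler with Python's codecs module. This is not ICU's API; the handler name "dfdl_sketch_replace" and the function names are made up for illustration. The handler is called once per malformed sequence and decides what to substitute and where decoding resumes.

```python
import codecs

# Malformed byte sequences seen during the most recent parse.
_seen = []


def _record_and_replace(err):
    """Called by the codec for each malformed input sequence."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    # Record the offending bytes, then substitute U+FFFD and
    # resume decoding just past them.
    _seen.append(err.object[err.start:err.end])
    return ("\ufffd", err.end)


# Hypothetical handler name, chosen for this sketch only.
codecs.register_error("dfdl_sketch_replace", _record_and_replace)


def parse_with_callback(data: bytes):
    """Decode UTF-16LE, returning the text and the malformed sequences."""
    _seen.clear()
    text = data.decode("utf-16-le", errors="dfdl_sketch_replace")
    return text, list(_seen)
```

A handler that re-raises the error instead of returning a replacement would give the "decode error" behavior the spec describes.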

Closed: utf16Width and unpaired surrogate codepoint - Added by Steve Hanson over 8 years ago

No update to experience documents or specification needed.
