utf16Width and unpaired surrogate codepoint

Added by Michael Beckerle about 9 years ago

The spec says: "When utf16Width is 'variable', then on parsing an un-paired surrogate codepoint causes a decode error, which can be controlled via dfdl:encodingErrorPolicy described below."

Is this what the ICU library does?

We have generally tried to make it possible to use ICU when implementing DFDL, but does ICU in fact work this way? The alternative is that unpaired surrogates are passed through and treated as ordinary 16-bit characters. Similarly, on unparsing, a string containing an unpaired surrogate codepoint could be written out without error.
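To make the "passed through" alternative concrete, here is a minimal sketch in Java. It does not use ICU or any charset decoder at all; it simply reinterprets big-endian bytes as 16-bit code units, so no surrogate pairing check takes place. The class name, method name, and sample bytes are hypothetical and chosen only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class SurrogatePassThroughSketch {
    // "A", a lone low surrogate (0xDC00), "B" as big-endian 16-bit units.
    static final byte[] SAMPLE = {0x00, 0x41, (byte) 0xDC, 0x00, 0x00, 0x42};

    // Reinterpret each pair of bytes as one Java char, with no validation.
    // An unpaired surrogate therefore survives as an ordinary 16-bit value.
    public static String passThrough(byte[] utf16be) {
        return ByteBuffer.wrap(utf16be)
                .order(ByteOrder.BIG_ENDIAN)
                .asCharBuffer()
                .toString();
    }

    public static void main(String[] args) {
        String s = passThrough(SAMPLE);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            System.out.printf("U+%04X surrogate=%b%n", (int) c, Character.isSurrogate(c));
        }
    }
}
```

Under this approach the string ends up holding the unpaired surrogate, which is the outcome the quoted spec wording rules out for utf16Width 'variable'.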

This behavior needs to be clarified for parsing and unparsing, and for both the C and Java ICU libraries, if we are to be confident that we are not specifying something very hard to implement.

Replies (4)

RE: utf16Width and unpaired surrogate codepoint - Added by Steve Hanson about 9 years ago

Needs some investigation.

Action 236 RE: utf16Width and unpaired surrogate codepoint - Added by Michael Beckerle almost 9 years ago

Added action 236 to subject.

Resolved - RE: utf16Width and unpaired surrogate codepoint - Added by Michael Beckerle almost 9 years ago

Verified: for an unpaired surrogate, ICU invokes its malformed-input callback, and one can control how the error is then handled.

This means the spec, as worded, is implementable using ICU.
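As an illustration of the kind of control described above, here is a minimal sketch in Java. It is not the test that was run for this action. It uses the JDK's built-in UTF-16BE decoder as a stand-in rather than ICU itself, on the assumption that ICU4J converters obtained through its charset provider are driven through the same java.nio.charset API. The class name, sample bytes, and the pairing of REPORT and REPLACE with the dfdl:encodingErrorPolicy values 'error' and 'replace' are assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class UnpairedSurrogateDecodeSketch {
    // "A", a lone low surrogate (0xDC00), "B" as big-endian UTF-16 bytes.
    static final byte[] SAMPLE = {0x00, 0x41, (byte) 0xDC, 0x00, 0x00, 0x42};

    // Decode with the given action for malformed input.
    public static String decode(byte[] data, CodingErrorAction onMalformed)
            throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_16BE.newDecoder()
                .onMalformedInput(onMalformed)
                .onUnmappableCharacter(onMalformed);
        return decoder.decode(ByteBuffer.wrap(data)).toString();
    }

    // Roughly encodingErrorPolicy 'error': malformed input surfaces as an exception.
    public static boolean raisesDecodeError(byte[] data) {
        try {
            decode(data, CodingErrorAction.REPORT);
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    // Roughly encodingErrorPolicy 'replace': malformed input becomes U+FFFD.
    public static String decodeReplacing(byte[] data) {
        try {
            return decode(data, CodingErrorAction.REPLACE);
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("error policy raises: " + raisesDecodeError(SAMPLE));
        String replaced = decodeReplacing(SAMPLE);
        for (int i = 0; i < replaced.length(); i++) {
            System.out.printf("U+%04X%n", (int) replaced.charAt(i));
        }
    }
}
```

In ICU4C the analogous hook is the to-Unicode callback set with ucnv_setToUCallBack, for example UCNV_TO_U_CALLBACK_STOP or UCNV_TO_U_CALLBACK_SUBSTITUTE; that side is not exercised by this sketch.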

Closed: utf16Width and unpaired surrogate codepoint - Added by Steve Hanson over 8 years ago

No update to experience documents or specification needed.
