document #319

Action 242 part 4 - length units 'characters' and encoding error detection - dfdl:valueLength and dfdl:contentLength

Added by Michael Beckerle over 2 years ago. Updated about 1 year ago.

Status: submitted
Start date: 08/02/2016
Priority: Normal
Due date:
Assignee: -
% Done: 0%

Category:-
Target version:DFDL v1.0
Document Type:Proposed Recommendation

Description

The functions dfdl:valueLength and dfdl:contentLength take a second argument giving the units.

When measured in units 'characters', these functions are allowed to compute the length without checking for decode errors when (a) the encoding is fixed width, so a character unit is just an alias for some number of bytes, and (b) dfdl:encodingErrorPolicy is 'error'.
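As a sketch of why fixed-width encodings permit this, length in characters can be derived purely from a byte count. The table and function names below are illustrative only, not part of any DFDL implementation or API:

```python
# Bytes per code unit for encodings a processor may treat as fixed width.
# (Table and function names are illustrative, not any DFDL API.)
FIXED_WIDTHS = {"iso-8859-1": 1, "utf-16be": 2, "utf-32be": 4}

def char_length(data: bytes, encoding: str) -> int:
    width = FIXED_WIDTHS.get(encoding.lower())
    if width is not None:
        # Fast path: 'characters' is just an alias for bytes / width.
        # Nothing is decoded, so decode errors cannot surface here.
        return len(data) // width
    # Variable-width encodings (e.g. utf-8) must actually decode,
    # which is exactly where a decode error would be raised.
    return len(data.decode(encoding))

# Two unpaired high surrogates: malformed UTF-16, but never examined.
print(char_length(b"\xd8\x00\xd8\x00", "utf-16be"))  # 2
```

The fast path never touches the byte values, which is why malformed data can pass through unnoticed even under encodingErrorPolicy='error'.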

Just because dfdl:encodingErrorPolicy is 'error' does not mean that all data in its scope will be scanned to verify there are no decode errors.

Regex pattern asserts raise a similar issue. There the regex matches against text, and the match may or may not encounter a decode error, but the entire region of data in scope is NOT going to be converted just to rule out any chance of a decode error.

DFDL implementations can optimize performance by avoiding character decoding when possible. This means that some character decode errors may not be detected even though dfdl:encodingErrorPolicy is 'error'.

The Spec should be clarified to say that only character decoding that results in a character being placed into the DFDL Infoset, or that is necessary to identify a delimiter, is guaranteed to cause an error should that character not be decodable.

When unparsing, an encoding error is guaranteed to occur only if an unmappable character from the Infoset is actually unparsed to the output stream.

DFDL implementations are free to exploit fixed character width, jumping around the bytes without decoding or encoding anything whenever they can, because we all expect, and think our users will expect, this level of performance.

Because utf-8 encoding is required by DFDL and is not fixed width, one may also get cases where switching a data format from utf-16 to utf-8 changes the behavior of the processor: with utf-16 treated as fixed width, some decoding errors go undetected, whereas with utf-8 the length in characters must be measured by decoding, and so the error will be detected.
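A minimal illustration of that behavior change (plain Python, not DFDL): the same "count the characters" operation skips decoding under a fixed-width treatment of utf-16 but must decode, and therefore fails, under utf-8:

```python
# Malformed utf-16be data: ends with an unpaired high surrogate.
bad_utf16 = b"\x00h\x00i\xd8\x00"
# Fixed-width shortcut: 6 bytes / 2 = 3 "characters"; the bad
# surrogate is never looked at, so no error is reported.
print(len(bad_utf16) // 2)  # 3

# Malformed utf-8 data: 0xFF can never appear in valid UTF-8.
bad_utf8 = b"hi\xff"
try:
    # Variable width: counting characters requires decoding.
    len(bad_utf8.decode("utf-8"))
except UnicodeDecodeError:
    print("decode error detected")  # this branch runs
```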

The objective here is to establish a consistent position on a messy area: the schema contains a complex type element that is a mixture of character encodings and binary data, and the corresponding data stream may contain decode errors (or, for unparsing, the Infoset may contain characters that have no mapping into the representation of the encoding).

Given this mess, there are ways a schema can, perhaps foolishly, treat the data as characters. Asserts with test patterns are one. Specified length with units of 'characters' is another, and the dfdl:contentLength and dfdl:valueLength functions are yet another, since one can pass 'characters' as their second argument.

The consistent position is that taking the dfdl:contentLength or dfdl:valueLength of an element with units 'characters' does NOT necessarily imply that those characters will be decoded/encoded, if the length can be determined without such decoding.

History

Updated by Michael Beckerle about 1 year ago

  • Target version set to DFDL v1.0
