lengthKind 'pattern' behaviour on codepage error doesn't allow greedy patterns

Section 12.3.5 dfdl:lengthKind 'pattern'

The latest proposal claims that "If the pattern matching of the regular expression reads data that cannot be decoded into characters of the current encoding, then the behavior is controlled by the dfdl:encodingErrorPolicy property".

This doesn't seem right to me. The pattern matching engine is going to want to scan the input data stream until the pattern is satisfied, and it's up to the pattern matching engine to decide whether it wants more data or not. With a greedy pattern (something like .* or [a-z]{1,10}), it is possible for a parser to reach a point in the input data where the next 'character' cannot be decoded, but where the data already decoded would satisfy the pattern. Consider the following scenario:

codepage: US ASCII
pattern: [a-z]{1,10}
data: abcdef{data that cannot be decoded}

How can a DFDL parser tell the difference between:
- "I found 'abcdef' and that matches your pattern, therefore I'm done and your element is 6 characters long".
- "I found 'abcdef' and then I hit a malformed character, so I'm going to follow the encodingErrorPolicy".

With the specification as proposed, the behaviour of the parser when parsing an element with a greedy pattern now depends on whatever follows the matched characters in the input data, even though those bytes are not part of the element.
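
To make the scenario concrete, here is a minimal Python sketch of the two outcomes. It is not taken from the specification or from any DFDL processor: the function name parse_length, the trailing byte values \xff\xfe, and the decision to decode the whole buffer before matching are all assumptions made for illustration. Only the "error" and "replace" values are modelled.

```python
import re

PATTERN = re.compile(r"[a-z]{1,10}")

# "abcdef" followed by two bytes that are not valid US-ASCII.
# The byte values are illustrative; the original scenario does not name them.
DATA = b"abcdef\xff\xfe"


def parse_length(data: bytes, policy: str):
    """Return the matched length in characters, or None if nothing matches.

    policy is "error" or "replace", loosely mirroring
    dfdl:encodingErrorPolicy. With "error" a decode failure propagates
    as UnicodeDecodeError.
    """
    errors = "strict" if policy == "error" else "replace"
    # Simplification: decode everything up front. A streaming parser
    # would decode only as far as the matcher asks.
    text = data.decode("ascii", errors=errors)
    m = PATTERN.match(text)
    return len(m.group()) if m else None
```

In this sketch parse_length(DATA, "replace") returns 6, which is the first interpretation above, while parse_length(DATA, "error") raises UnicodeDecodeError, which is the second.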

I can see sense in catching the following error scenario:
- The pattern match failed
- The pattern had insufficient data to match - this is typically the "hitEnd()" function on a matching engine that says the matcher tried to go beyond the last character
- The parser knows that the next character in the input cannot be decoded

But for the case where the pattern match passes and the next character cannot be decoded, is that really an error? The pattern match passed.

Replies (6)

RE: lengthKind 'pattern' behaviour on codepage error doesn't allow greedy patterns - Added by Michael Beckerle about 9 years ago

The way to do what you are suggesting here is to use dfdl:encodingErrorPolicy='replace', and then exclude the Unicode replacement character (\xFFFD) from your match. Your original pattern [a-z]{1,10} works for this, as it does not allow the Unicode replacement character in the match.

As you point out, patterns like ".*" are problematic. With encodingErrorPolicy='replace' this will happily match right up to the end of the data stream. This is expected behavior. The spec suggests using [^\xFFFD]* instead - match anything but the replacement character. In addition, I'd say from a 'best practice' perspective, * and + are to be avoided and bounded quantifiers (like in your example {1,10} ) are preferred.

I believe the behavior even with dfdl:encodingErrorPolicy='error' is not problematic. It will even be portable across implementations. Greedy is greedy, and any conforming regex capability of a DFDL processor will have to attempt to decode characters if the pattern requires it, and will hit the same decode error on the same data.
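
The "decode only if the pattern requires it" point can be sketched with a toy lazy decoder. Everything here is hypothetical: LazyAsciiDecoder and match_lower_run are invented names, the matcher handles only the shape [a-z]{lo,hi}, and no real DFDL processor is claimed to work this way. The intent is only to show that the same pattern hits a decode error on some inputs and not on others, deterministically.

```python
class LazyAsciiDecoder:
    """Hypothetical decoder: decodes one byte only when asked for it."""

    def __init__(self, data: bytes):
        self.data = data

    def char_at(self, i: int):
        """Return the character at index i, or None past the end of data.

        Strict decoding: raises UnicodeDecodeError for bytes >= 0x80.
        """
        if i >= len(self.data):
            return None
        return self.data[i:i + 1].decode("ascii")


def match_lower_run(decoder: LazyAsciiDecoder, lo: int, hi: int):
    """Greedy match of [a-z]{lo,hi}; return matched length or None."""
    n = 0
    while n < hi:
        ch = decoder.char_at(n)  # may raise on undecodable data
        if ch is None or not ("a" <= ch <= "z"):
            break
        n += 1
    return n if n >= lo else None
```

With b"abcdef\xff" and bounds 1 and 10, the matcher has to look at the seventh byte to find out whether the run continues, so the decode error is raised. With b"abcdefghij\xff" the upper bound is reached after ten characters, the eleventh byte is never decoded, and the result is 10.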

RE: lengthKind 'pattern' behaviour on codepage error doesn't allow greedy patterns - Added by Michael Beckerle about 9 years ago

Correction to my prior reply: the regex notation for the Unicode replacement character is \uFFFD, not \xFFFD.
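
Using the \uFFFD notation, the replace-and-exclude approach might look like the following Python sketch. The helper name decode_with_replacement and the sample bytes are assumptions for illustration; Python's errors="replace" stands in for dfdl:encodingErrorPolicy='replace' and is not claimed to be identical to it.

```python
import re


def decode_with_replacement(data: bytes) -> str:
    """Decode as US-ASCII, substituting U+FFFD for undecodable bytes."""
    return data.decode("ascii", errors="replace")


# One undecodable byte in the middle of otherwise valid text (illustrative).
TEXT = decode_with_replacement(b"abcdef\xffxyz")

# ".*" treats the replacement character like any other character.
greedy_any = re.match(r".*", TEXT)

# Excluding U+FFFD stops the match at the first undecodable position.
stop_at_replacement = re.match(r"[^\uFFFD]*", TEXT)
```

Here greedy_any.end() is 10, so the match runs through the replacement character to the end of the text, while stop_at_replacement.end() is 6, so the match stops just before it.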

RE: lengthKind 'pattern' behaviour on codepage error doesn't allow greedy patterns - Added by Andy Edwards about 9 years ago

Okay - I see what you're saying. This gives us one story for reading in data to get to a text representation, and then a second subsequent story for pattern matching against that text. Conceptually, that's nice and neat.

Having gone back through all of the explanations for the 'encodingErrorPolicy' though, I have one question. Replacing data that cannot be decoded with 0xFFFD opens up the possibility of regular expressions being able to include non-text parts of the input stream, and also to see beyond them in the input. This could be used in conjunction with look-ahead non-capturing regular expressions, so that the characters that cannot be decoded can be included in a regular expression, but not necessarily in the element with a length kind of pattern. Currently we say that the number of Unicode Replacement Characters that are used to represent the non-decoded characters is implementation specific. Are we happy with this? It would mean that look-ahead patterns that incorporate \uFFFD may now have different behaviour in different implementations. The pattern must then incorporate a bounded length to avoid this.
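
The portability concern can be sketched as follows. The two strings are hypothetical decoded views of the same input, written out by hand rather than produced by a real decoder: one assumes an implementation that emits a single replacement character for a run of three undecodable bytes, the other assumes one replacement character per byte. The trailing "XYZ" and the bound {1,4} are likewise illustrative.

```python
import re

# Same input as seen by two hypothetical implementations.
ONE_PER_RUN = "abcdef" + "\uFFFD" + "XYZ"
ONE_PER_BYTE = "abcdef" + "\uFFFD" * 3 + "XYZ"

# Look-ahead that assumes exactly one replacement character.
FIXED = re.compile(r"[a-z]{1,10}(?=\uFFFDXYZ)")

# Look-ahead that tolerates a bounded number of replacement characters.
BOUNDED = re.compile(r"[a-z]{1,10}(?=\uFFFD{1,4}XYZ)")


def matched_length(pattern, text):
    """Return the length of the match at the start of text, or None."""
    m = pattern.match(text)
    return m.end() if m else None
```

FIXED gives a length of 6 on ONE_PER_RUN but no match on ONE_PER_BYTE, so the element's length would differ between the two implementations. BOUNDED gives 6 on both, and in each case the replacement characters stay outside the element because they are only inspected by the look-ahead.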

Resolved: lengthKind 'pattern' behaviour on codepage error doesn't allow greedy patterns - Added by Michael Beckerle about 9 years ago

Current specification ok.

Resolved: lengthKind 'pattern' behaviour on codepage error doesn't allow greedy patterns - Added by Steve Hanson over 8 years ago

No update to experience documents or specification needed.

Closed - RE: lengthKind 'pattern' behaviour on codepage error doesn't allow greedy patterns - Added by Michael Beckerle over 8 years ago

Marking closed as no change needed.
