[m-rev.] [m-dev.] io.read_named_file_as_*

Peter Wang novalazy at gmail.com
Fri Apr 21 15:46:01 AEST 2023
Previous message: [m-rev.] [m-dev.] io.read_named_file_as_*
Next message: [m-rev.] [m-dev.] io.read_named_file_as_*
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Thu, 20 Apr 2023 20:18:35 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> 
> 2023-04-11 16:00 GMT+10:00 "Peter Wang" <novalazy at gmail.com>: 
> >> I note that the read_line predicate *does* check for non-well-formed chars
> >> (in the C foreign_proc for read_char_code_2), but that this code is not linked
> >> to other code that does the same check, such as unsafe_index_next_repl
> >> in string.m.
> > 
> > Yes, a Mercury char can't represent a code unit in an non-well-formed
> > UTF-8 sequence so the predicates returning chars must return U+FFFD
> > or report an error, and the behaviour should be documented.
> 
> The attached diff does an end-run around these concerns by
> adding predicates that do a check at the end whether the string being read
> is well formed. For your review.
> 
> >> We can either
> >> 
> >> - change mercury_term_parser.m to *always* report an error when
> >>   it finds a non-well-formed character, or
> >> 
> >> - change it to report an error *only* if some flag says it should.
> >> 
> >> The first breaks backwards compatibility (though only in a hopefully-rare
> >> situation), while the second would require either an additional argument
> >> on read_term and its variants (breaking compatibility for everyone though
> >> in an easily-visible and easy-to-fix way), or additional variants of existing
> >> predicates.
> >> 
> > 
> > I think the first option is acceptable.
>  
> Thanks. But before implementing it, I looked into what the lexer currently does
> for non-well-formed input.  (The parser never looks at raw characters; the lexer does.) 
> I found a mess. The attached lexer_utf file documents a 3x2 matrix: target languages
> C, C# and Java, and reading data from a file or from a string. I can tell what is happening
> in five of those combos, but the sixth left me stumped, because (what I take to be)
> the official Microsoft documentation is silent on what the relevant C# method does
> when given non-well-formed input. Can anyone fill in this slot in the matrix?

> diff --git a/library/string.m b/library/string.m
> index dd5721dcc..a180978e1 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -284,33 +284,68 @@
...
> -    % Fails if Index is out of range (negative, or greater than or equal to
> -    % the length of String).
> +    % - On platforms the encode strings using UTF-8 (i.e. when targeting C)
> +    %   Char will be set to U+FFFD (the Unicode replacement character).
> +    %
> +    % - On platforms the encode strings using UTF-16 (i.e. when targeting
> +    %   C# or Java), Char will be set to the unpaired surrogate code point
> +    %   at Index. (For more details, see the comment just below.)
>      %
>  :- pred index_next(string::in, int::in, int::out, char::uo) is semidet.

s/the/that

>      % index_next_repl(String, Index, NextIndex, Char, MaybeReplaced):
>      %
> -    % Like index_next/4 but also returns MaybeReplaced on success.
> -    % When Char is not U+FFFD, then MaybeReplaced is always `not_replaced'.
> -    % When Char is U+FFFD (the Unicode replacement character), then there are
> -    % two cases:
> +    % Does the same job as index_next/4 but on success, it also returns
> +    % MaybeReplaced, which will specify whether Char is the result
> +    % of the replacement of a non-well-formed UTF-8 character with U+FFFD.
>      %
> -    % - If there is a U+FFFD code point encoded in String at
> -    %   [Index, NextIndex) then MaybeReplaced is `not_replaced'.
> +    % On platforms that encode strings using UTF-8 (i.e. when targeting C),
> +    % there are three cases.
>      %
> -    % - Otherwise, MaybeReplaced is `replaced_code_unit(CodeUnit)' where
> -    %   CodeUnit is the code unit in String at Index.
> +    % - If Char is not U+FFFD, then MaybeReplaced will be `not_replaced'.
> +    %
> +    % - If char is U+FFFD because there is a well-formed code point encoded
> +    %   in String starting at Index, and that code point is U+FFFD, then
> +    %   MaybeReplaced will also be `not_replaced'.
> +    %
> +    % - If char is U+FFFD but there is *no* well formed code point encoded
> +    %   in String starting at Index, then MaybeReplaced will be
> +    %   `replaced_code_unit(CodeUnit)', where CodeUnit is the code unit
> +    %   at offset Index in String.

s/char/Char

> +    %
> +    % On platforms that encode strings using UTF-16 (i.e. when targeting C#
> +    % or Java), MaybeReplaced will always be bound to `not_replaced'.
> +    % The only ways that a UTF-16 string may be non-well-formed are
> +    %
> +    % - by having a high surrogate code unit (between 0xD800 and 0xDBFF)
> +    %   that is not immediately followed by a low surrogate code unit
> +    %   (between 0xDC00 and 0xDFFF), or
> +    %
> +    % - by having a low surrogate code unit that is not immediately preceded
> +    %   by a high surrogate code unit.
> +    %
> +    % In both cases, index_next_repl will return the unpaired surrogate
> +    % unchanged as Char. There is no replacement required, because
> +    %
> +    % - surrogate code units are all the range 0xDC00 to 0xDFFF, and

0xD800 to 0xDFFF

> +    % - the Unicode standard deliberately does not assign any characters
> +    %   to the code points in this range.
> +    %
> +    % This means that if Char is in this range, then it must be an unpaired
> +    % surrogate, but since Char actually appears in String, it won't be
> +    % a *replacement* of another character.
>      %
>  :- pred index_next_repl(string::in, int::in, int::out, char::uo,
>      maybe_replaced::out) is semidet.

> @@ -3168,6 +3219,7 @@ is_empty("").
>  
>  %---------------------%
>  
> +/*
>  :- pragma foreign_proc("C",
>      is_well_formed(S::in),
>      [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail,
> @@ -3217,6 +3269,75 @@ is_empty("").
>          }
>      }
>  ").
> +*/

(You can delete it now)

> +
> +is_well_formed(String) :-
> +    find_first_ill_formed_pos(String, FirstIllFormedPos),
> +    FirstIllFormedPos < 0.
> +
> +check_well_formedness(String, Result) :-
> +    find_first_ill_formed_pos(String, FirstIllFormedPos),
> +    ( if FirstIllFormedPos < 0 then
> +        Result = well_formed
> +    else
> +        Result = ill_formed(FirstIllFormedPos)
> +    ).

The diff looks good.

I'll have a look at the C# thing.

Peter
Previous message: [m-rev.] [m-dev.] io.read_named_file_as_*
Next message: [m-rev.] [m-dev.] io.read_named_file_as_*
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the reviews mailing list