[m-rev.] for post-commit review: improve string.m

Peter Wang novalazy at gmail.com
Tue Mar 26 11:20:19 AEDT 2024

Previous message: [m-rev.] for post-commit review: improve string.m
Next message: [m-rev.] for post-commit review: improve string.m
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, 25 Mar 2024 17:56:19 +1100 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> Check for surrogate chars in conversions to utf8.
> 
> library/char.m:
>     Export a new predicate, char_int_is_surrogate, the duplicates
>     the job of the MR_is_surrogate macro in the runtime, for use by
>     the new check in string.m. Add a comment about the code duplication.
>     The new predicate is not documented for users.
> 
> runtime/mercury_string.h:
>     Add a comment about the code duplication.
> 
> library/string.m:
>     Use the new predicate in char.m to check for surrogates when converting
>     a code unit list to an utf8 string.

> diff --git a/library/char.m b/library/char.m
> index 66703e77c..075f8c15c 100644
> --- a/library/char.m
> +++ b/library/char.m
> @@ -403,6 +403,17 @@
>  
>  :- implementation.
>  
> +:- interface.
> +
> +    % A version of is_surrogate that takes the character in its integer form.
> +    % Exported for use by string.m.
> +    %
> +:- pred char_int_is_surrogate(int::in) is semidet.
> +
> +%---------------------------------------------------------------------------%
> +
> +:- implementation.
> +
>  :- import_module int.
>  :- import_module require.
>  :- import_module uint.
> @@ -1068,7 +1079,7 @@ to_utf8_code_units(Char, NumCodeUnits, A, B, C, D) :-
>          D = 0u8,
>          NumCodeUnits = 2
>      else if Int =< 0xffff then
> -        not is_surrogate(Char),
> +        not char_int_is_surrogate(Int),
>          A = uint8.cast_from_int(0xe0 \/ ((Int >> 12) /\ 0x0f)),
>          B = uint8.cast_from_int(0x80 \/ ((Int >>  6) /\ 0x3f)),
>          C = uint8.cast_from_int(0x80 \/  (Int        /\ 0x3f)),
> @@ -1139,8 +1150,7 @@ to_utf16_code_units(Char, NumCodeUnits, A, B) :-
>  
>  is_surrogate(Char) :-
>      Int = char.to_int(Char),
> -    Int >= 0xd800,
> -    Int =< 0xdfff.
> +    char_int_is_surrogate(Int).
>  
>  is_leading_surrogate(Char) :-
>      Int = char.to_int(Char),
> @@ -1249,6 +1259,14 @@ hash(C) = H :-
>  hash(C, H) :-
>      H = hash(C).
>  
> +%---------------------------------------------------------------------------%
> +
> +char_int_is_surrogate(Int) :-
> +    % This code is sort-of duplicated, in C, in runtime/mercury_string.h,
> +    % in the macro MR_is_surrogate.
> +    Int >= 0xd800,
> +    Int =< 0xdfff.

I don't think the comment is required.

> diff --git a/library/string.m b/library/string.m
> index bbf08e1c2..7d0882bad 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -2438,7 +2438,8 @@ acc_rev_chars_from_utf8_code_units([A | FollowA], !RevChars) :-
>          CodePointInt = (A /\ 0x0f) << 12
>                      \/ (B /\ 0x3f) << 6
>                      \/ (C /\ 0x3f),
> -        CodePointInt >= 0x800
> +        CodePointInt >= 0x800,
> +        not char.char_int_is_surrogate(CodePointInt)
>      else if A =< 0xf4 then  % 4-byte sequence
>          FollowA = [B, C, D | Rest],
>          utf8_is_trail_byte(B),

Also update the XXX in from_utf8_code_unit_list.

> diff --git a/runtime/mercury_string.h b/runtime/mercury_string.h
> index ef512e069..b4fd58385 100644
> --- a/runtime/mercury_string.h
> +++ b/runtime/mercury_string.h
> @@ -426,6 +426,8 @@ extern MR_bool MR_escape_string_quote(MR_String *ptr, const char * string);
>  #define MR_is_ascii(c)              ((unsigned) (c) <= 0x7f)
>  
>  // True if c is a Unicode surrogate code point, i.e. U+D800..U+DFFF.
> +// This code is sort-of duplicated, in mercury, in char.char_int_is_surrogate
> +// in the Mercury standard library.
>  
>  #define MR_is_surrogate(c)          (((unsigned) (c) & 0x1FF800) == 0xD800)
>  

There are plenty of other instances where Unicode-related tests are
implemented in both C and Mercury. There's no need to call out this
particular one. It's not like the definition of surrogates is going to
change.

That's fine, otherwise.

Peter

Previous message: [m-rev.] for post-commit review: improve string.m
Next message: [m-rev.] for post-commit review: improve string.m
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the reviews mailing list