[m-rev.] for post-commit review: improve string.m

Peter Wang novalazy at gmail.com
Tue Mar 26 11:20:19 AEDT 2024


On Mon, 25 Mar 2024 17:56:19 +1100 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> Check for surrogate chars in conversions to utf8.
> 
> library/char.m:
>     Export a new predicate, char_int_is_surrogate, the duplicates
>     the job of the MR_is_surrogate macro in the runtime, for use by
>     the new check in string.m. Add a comment about the code duplication.
>     The new predicate is not documented for users.
> 
> runtime/mercury_string.h:
>     Add a comment about the code duplication.
> 
> library/string.m:
>     Use the new predicate in char.m to check for surrogates when converting
>     a code unit list to an utf8 string.

> diff --git a/library/char.m b/library/char.m
> index 66703e77c..075f8c15c 100644
> --- a/library/char.m
> +++ b/library/char.m
> @@ -403,6 +403,17 @@
>  
>  :- implementation.
>  
> +:- interface.
> +
> +    % A version of is_surrogate that takes the character in its integer form.
> +    % Exported for use by string.m.
> +    %
> +:- pred char_int_is_surrogate(int::in) is semidet.
> +
> +%---------------------------------------------------------------------------%
> +
> +:- implementation.
> +
>  :- import_module int.
>  :- import_module require.
>  :- import_module uint.
> @@ -1068,7 +1079,7 @@ to_utf8_code_units(Char, NumCodeUnits, A, B, C, D) :-
>          D = 0u8,
>          NumCodeUnits = 2
>      else if Int =< 0xffff then
> -        not is_surrogate(Char),
> +        not char_int_is_surrogate(Int),
>          A = uint8.cast_from_int(0xe0 \/ ((Int >> 12) /\ 0x0f)),
>          B = uint8.cast_from_int(0x80 \/ ((Int >>  6) /\ 0x3f)),
>          C = uint8.cast_from_int(0x80 \/  (Int        /\ 0x3f)),
> @@ -1139,8 +1150,7 @@ to_utf16_code_units(Char, NumCodeUnits, A, B) :-
>  
>  is_surrogate(Char) :-
>      Int = char.to_int(Char),
> -    Int >= 0xd800,
> -    Int =< 0xdfff.
> +    char_int_is_surrogate(Int).
>  
>  is_leading_surrogate(Char) :-
>      Int = char.to_int(Char),
> @@ -1249,6 +1259,14 @@ hash(C) = H :-
>  hash(C, H) :-
>      H = hash(C).
>  
> +%---------------------------------------------------------------------------%
> +
> +char_int_is_surrogate(Int) :-
> +    % This code is sort-of duplicated, in C, in runtime/mercury_string.h,
> +    % in the macro MR_is_surrogate.
> +    Int >= 0xd800,
> +    Int =< 0xdfff.

I don't think the comment is required.

> diff --git a/library/string.m b/library/string.m
> index bbf08e1c2..7d0882bad 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -2438,7 +2438,8 @@ acc_rev_chars_from_utf8_code_units([A | FollowA], !RevChars) :-
>          CodePointInt = (A /\ 0x0f) << 12
>                      \/ (B /\ 0x3f) << 6
>                      \/ (C /\ 0x3f),
> -        CodePointInt >= 0x800
> +        CodePointInt >= 0x800,
> +        not char.char_int_is_surrogate(CodePointInt)
>      else if A =< 0xf4 then  % 4-byte sequence
>          FollowA = [B, C, D | Rest],
>          utf8_is_trail_byte(B),

Also update the XXX in from_utf8_code_unit_list.

> diff --git a/runtime/mercury_string.h b/runtime/mercury_string.h
> index ef512e069..b4fd58385 100644
> --- a/runtime/mercury_string.h
> +++ b/runtime/mercury_string.h
> @@ -426,6 +426,8 @@ extern MR_bool MR_escape_string_quote(MR_String *ptr, const char * string);
>  #define MR_is_ascii(c)              ((unsigned) (c) <= 0x7f)
>  
>  // True if c is a Unicode surrogate code point, i.e. U+D800..U+DFFF.
> +// This code is sort-of duplicated, in mercury, in char.char_int_is_surrogate
> +// in the Mercury standard library.
>  
>  #define MR_is_surrogate(c)          (((unsigned) (c) & 0x1FF800) == 0xD800)
>  

There are plenty of other instances where Unicode-related tests are
implemented in both C and Mercury. There's no need to call out this
particular one. It's not like the definition of surrogates is going to
change.

That's fine, otherwise.

Peter


More information about the reviews mailing list