[m-rev.] for post-commit review: improve string.m

Peter Wang novalazy at gmail.com
Mon Mar 25 12:35:20 AEDT 2024


On Sun, 24 Mar 2024 19:18:10 +1100 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> Improve several aspects of string.m.
> 

> diff --git a/library/string.m b/library/string.m
> index e1b78cc2a..41f56bb09 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -2390,11 +2398,13 @@ from_code_unit_list_allow_ill_formed(CodeList, Str) :-
>  
>  %---------------------%
>  
> -from_utf8_code_unit_list(CodeList, String) :-
> +from_utf8_code_unit_list(CodeUnits, String) :-
>      ( if internal_encoding_is_utf8 then
> -        from_code_unit_list(CodeList, String)
> +        from_code_unit_list(CodeUnits, String)
>      else
> -        decode_utf8(CodeList, [], RevChars),
> +        decode_utf8(CodeUnits, [], RevChars),
> +        % XXX This checks whether RevChars represents a well-formed string.
> +        % Why? The call to decode_utf8 should have ensured that already.
>          semidet_from_rev_char_list(RevChars, String)
>      ).

decode_utf8 doesn't reject code points in the Unicode surrogate range
(U+D800 to U+DFFF, strictly speaking not valid to encode in UTF-8),
but semidet_from_rev_char_list DOES reject them, so the check is not
redundant. The change that made semidet_from_rev_char_list reject
surrogates was made later.
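
To illustrate the gap, here is a minimal sketch in Python (not Mercury, and
not the library code). The names lax_decode_utf8 and checked_string are
hypothetical stand-ins for a decoder that skips the surrogate check and for
a later well-formedness check, respectively.

```python
def lax_decode_utf8(code_units):
    """Decode UTF-8 code units into code points WITHOUT rejecting surrogates.

    Hypothetical stand-in for a decoder with the gap described above.
    Returns None if the input is structurally malformed.
    """
    code_points = []
    i = 0
    n = len(code_units)
    while i < n:
        b0 = code_units[i]
        # Pick the sequence length, the payload bits of the lead byte, and
        # the smallest code point allowed for that length (overlong check).
        if b0 < 0x80:
            length, cp, minimum = 1, b0, 0x0
        elif 0xC0 <= b0 < 0xE0:
            length, cp, minimum = 2, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 < 0xF0:
            length, cp, minimum = 3, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 < 0xF8:
            length, cp, minimum = 4, b0 & 0x07, 0x10000
        else:
            return None
        if i + length > n:
            return None
        for b in code_units[i + 1:i + length]:
            if b & 0xC0 != 0x80:  # every trailing byte must be 10xxxxxx
                return None
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum or cp > 0x10FFFF:
            return None
        # Note: no test for 0xD800..0xDFFF here. That is the gap.
        code_points.append(cp)
        i += length
    return code_points


def checked_string(code_points):
    """Build a string, failing (None) if any code point is a surrogate."""
    if any(0xD800 <= cp <= 0xDFFF for cp in code_points):
        return None
    return "".join(chr(cp) for cp in code_points)
```

The code units ED A0 80 get through the lax decoder as the single code
point D800, and only the second step rejects them.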

If decode_utf8 (acc_rev_chars_from_utf8_code_units) detected surrogates,
then there could be a variant of semidet_from_rev_char_list that assumes
validity. I think it would not be worthwhile.
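
For comparison, a sketch of the alternative, again in Python rather than
Mercury: a decoder that rejects surrogates itself, so that no later check
is needed. Python's built-in strict UTF-8 codec happens to behave this way.
The wrapper name strict_decode_utf8 is hypothetical.

```python
def strict_decode_utf8(code_units):
    """Decode UTF-8 code units, failing (None) on any ill-formed input.

    Python's strict UTF-8 codec refuses encoded surrogates, so a result
    that comes back from here needs no further well-formedness check.
    """
    try:
        return bytes(code_units).decode("utf-8")
    except UnicodeDecodeError:
        return None
```

Here ED A0 80 is refused by the decoder itself, at the cost of doing the
surrogate test inside the decoding loop.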

Peter

