[m-rev.] [m-dev.] io.read_named_file_as_*

Peter Wang novalazy at gmail.com
Sun Apr 9 12:02:50 AEST 2023


On Sat, 08 Apr 2023 11:28:14 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> >> The relevant documentation warns about the parser
> >> that operates on the whole file as a string not being expected
> >> to do the right thing given a malformed UTF-{8,16} string.
> >
> > I don't think I can find the warning you are referring to.
> 
> The warnings on read_named_file_as_string and read_named_file_as_lines.

Oh, I thought you meant the Mercury term parser.

read_named_file_as_string and read_named_file_as_lines
do the right thing, for some definition of "right thing".

> > The parser should be able to report a malformed code unit sequence as a
> > syntax error. Currently, each code unit in a malformed sequence will be
> > silently treated as the Unicode replacement character; we can change
> > that.
> 
> I presume you mean "change it to report an error".
> 

Yes.
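
To make the difference concrete, here is a small sketch. It is in Python
purely for illustration and is not the Mercury parser; decode_or_report is
a made-up name. Lenient decoding silently substitutes U+FFFD, while strict
decoding fails and gives the byte offset of the first bad code unit.

```python
def decode_or_report(data: bytes):
    """Return (text, None) if data is well-formed UTF-8, else (None, message).

    Hypothetical helper, for illustration only.
    """
    try:
        # The default error mode is "strict": malformed input raises.
        return data.decode("utf-8"), None
    except UnicodeDecodeError as e:
        # e.start is the byte offset of the first offending code unit.
        return None, f"malformed UTF-8 at byte offset {e.start}"


bad = b"foo(\xff)."  # 0xFF can never appear in well-formed UTF-8

# Silent replacement: the bad byte becomes U+FFFD and no error is raised.
lenient = bad.decode("utf-8", errors="replace")

# Reporting: no text is returned, only an error message with an offset.
text, error = decode_or_report(bad)
```

The second form is the "report an error" behaviour: the caller gets a
location it can turn into a diagnostic instead of a quietly altered string.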

> > If a malformed sequence occurs within a comment, I think it's
> > preferable to ignore it instead of reporting an error.
> 
> Analysis files cannot contain comments, so your proposal seems
> isomorphic to mine.
>  

Right, I was thinking about the Mercury term parser in general.

> > If we do those things, there is no need to call is_well_formed.
> > 
> > ** Note that UTF-16 strings cannot represent malformed UTF-8 code unit
> > sequences, so by the time the input string gets to the parser, the
> > malformed sequences will already have been replaced by the Unicode
> > replacement character, making the string well formed. I assume it's
> > possible to make the C# and Java backends report malformed UTF-8 when
> > reading from a file, if that's what we wanted (I'm not sure that's
> > preferable in general).
> 
> I don't know nearly enough about what real code that really cares
> about i18n would want to do about malformed UTF-N for any N,
> but for me, a policy of "any non-well-formed character in any file
> processed by the Mercury compiler is an error that we report
> together with its location" seems both simpler, and just as good,
> as any other policy. Specifically, I cannot see a situation in which
> people would purposefully want to put non-well-formed chars
> into e.g. a .m file. Can you?

Not deliberately, no.

I'm mainly pointing out that checking for malformed sequences can be
done during parsing instead of in a preceding step. I do think it's worth
doing, but later.
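
A rough sketch of the idea, again in Python and not a description of how
the Mercury parser is structured. The function scan is hypothetical, and it
assumes that '%' always starts a comment running to the end of the line
(it ignores quoted strings and /* */ comments). Well-formedness is checked
as each line is scanned, malformed bytes inside comments are ignored, and
errors carry a line number and a byte-based column.

```python
def scan(data: bytes):
    """Toy single-pass scan returning a list of (line, column, message).

    Assumption: '%' always begins a comment that runs to end of line.
    Only the first malformed sequence on each line is reported.
    """
    errors = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        # Split off the comment; its bytes are never validated.
        code, _, _comment = line.partition(b"%")
        try:
            code.decode("utf-8")  # strict decoding of the non-comment part
        except UnicodeDecodeError as e:
            # Columns are 1-based byte positions within the line.
            errors.append((lineno, e.start + 1, "malformed UTF-8 sequence"))
    return errors


sample = b"a :- b.\nfoo(\xff). % ok\n% caf\xe9 in a comment\n"
errors = scan(sample)  # only the 0xFF on line 2 is reported
```

No separate well-formedness pass over the whole file is needed here; the
check happens where the bytes are consumed.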

The changes look fine.

Peter

