[m-rev.] for review: speed up find_closest_seq

Julien Fischer jfischer at opturion.com
Fri Jan 26 14:51:43 AEDT 2024


On Fri, 26 Jan 2024, Zoltan Somogyi wrote:

> On 2024-01-26 13:36 +11:00 AEDT, "Julien Fischer" <jfischer at opturion.com> wrote:
>> That's fine.
>
> Thanks. Would you mind looking at this diff as well?
> It is the last in this series.

...

> Add find_best_close_enough_seqs.
> 
> library/edit_distance.m:
>     Add the above predicate, and its specialized version
>     find_best_close_enough_strings, which allow their callers to specify
>     a edit distance beyond which the results would not be useful.
> 
> compiler/error_spec.m:
>     Use the new functionality to restrict the set of "close enough"
>     candidates to those that replace at most half the characters in the name
>     for which we are generating a "did you mean" message.
> 
> diff --git a/compiler/error_spec.m b/compiler/error_spec.m
> index 77572a68f..429a94552 100644
> --- a/compiler/error_spec.m
> +++ b/compiler/error_spec.m
> @@ -777,15 +777,39 @@ maybe_construct_did_you_mean_pieces(BaseName, CandidateNames,
>      % Note: name_is_close_enough below depends on all costs here
>      % except for case changes being 2u.
>      Params = edit_params(2u, 2u, case_sensitive_replacement_cost, 2u),
> -    find_closest_strings(Params, BaseName, CandidateNames,
> -        Cost, HeadBestName, TailBestNames),
> -    BestNames = [HeadBestName | TailBestNames],
> +    string.count_code_points(BaseName, BaseNameLen),
> +    BaseNameLenU = uint.cast_from_int(BaseNameLen),
> +    % The algorithm we use here to set MaxCost has two purposes.
> +    %
> +    % One is to speed up the process of finding close enough candidates,
> +    % by allowing candidates with too-large edit distances to be rejected
> +    % without having to finish the computation of those edit distances.
> +    %
> +    % The other is to require edits to replace at most half of any base name
> +    % that exceeds one character. Note that the heuristic for "close enough"
> +    % that name_is_close_enough uses, which is originally from gcc, does *not*
> +    % impose that requirement.
> +    ( if BaseNameLenU < 2u then
> +        % If BaseName consists of a single character, allow a suggestion
> +        % to replace that character.
> +        MaxCost = 2u
> +    else
> +        % If BaseName consists of two or more characters, allow suggestions
> +        % to replace at most half of those characters. 2u is the replacement
> +        % cost.
> +        % XXX Should we round up? That would allow a suggestion to replace
> +        % e.g. three characters out of a five char name.
> +        MaxCost = (BaseNameLenU / 2u) * 2u

I am inclined to say yes, we should round up.

That looks fine otherwise.

Julien.


More information about the reviews mailing list