Split resolveMessage() into format() & formatToParts() #22

eemeli · 2023-07-22T18:32:27Z

This is a pretty deep rewrite of the proposed API. I started by drafting this as a sequence of smaller steps, but this is the smallest first step that I could separate out; further PRs will add support for markup expressions, bidi isolation, and additional default formatters. Much of this has been informed by offline conversations with @littledan in particular, as well as recent progress and conversations in the Unicode MFWG.

Fundamentally, this set of changes reconsiders how the API deals with objects. These need to be address by Intl.MessageFormat in three different places:

On input, we need to support values that contain e.g. a number and a currency or unit specifier. Matching the style of Intl.NumberFormat, this means a value combined with a basket of options.
During processing, the inputs and output of a custom message function need to be defined.
On output, the formatted-part representation of a message needs to support non-string values.

Previously with the resolveMessage() approach, a single definition was shared for all three use cases. Technically, this works, but not without at least three significant drawbacks:

It forces every user to consider values that are resolved but not yet formatted. They're not getting a ready output, but something that may provide them with an output. The mental load here is pretty heavy.
It's different from the current behaviour of other Intl formatters.
The relatively common use case of formatting a message to a string requires the construction of a bunch of intermediate JS objects that are then immediately dismissed, which is pretty inefficient.

The alternative approach proposed here matches existing Intl formatters much more closely, with format() providing a string and formatToParts() an array of static message parts, rather than values with functions attached to them. The match isn't perfect, but it's pretty close. This change does sacrifice some of the expressibility and re-use that resolveMessage() enabled, but it should be easier to use and reason about.

Compared to the previous, this is how object values work here:

On input, the definition and requirements are much looser, and separately defined for each function.
For custom message functions, the updated MessageValue is very similar to its previous incarnation, but in most use cases, these objects won't ever need to be reified.
On output, the formatted parts do not contain any methods, only property values. Their structure is still a bit more varied and complex than the { type: string, value: string } used by other Intl formatters. This is unfortunately unavoidable, as not all the values formatted by Intl.MessageFormat are representable as strings, and some (such as numbers) may contain sub-parts, rather than a single string value.

Rather than looking at the diff, this PR may be easier to review by looking at its resulting shape on the updates branch. I've asked for reviews from a number of people for this, as I figured you might be interested parties in this significant a change. I don't really expect everyone to reply.

stasm

I like the direction of this change. I recall being a bit confused by the resolveMessage() API. (In retrospect, I should have filed an issue about it.) This PR addresses most of the reasons for my confusion and makes the API similar to what I imagined in my implementation experiment for MessageFormat 2 — but more complete with errors and resolution fallback. Some more early feedback below.

README.md

stasm · 2023-07-24T14:05:39Z

README.md

+it resolves instead to the result of calling the `string` function
+with the value as input and no options.
+
+Otherwise, un-annotated values resolve to the following shape:


I'd like to suggest that format and formatToParts take a dictionary whose values are either unknown or instances of MessageValue. This allows two interesting use-cases:

Pre-set certain formatting options on input values.

Pass instances of custom types which extend MessageValue.

Then, you'd first check if an un-annotated value is an instance of MessageValue. If so, carry on. If not, resolve to MessageUnknownValue. This way, you don't have to try to resolve every expression to a number and string first.

I'd like to suggest that format and formatToParts take a dictionary whose values are either unknown or instances of MessageValue. This allows two interesting use-cases:

Pre-set certain formatting options on input values.

This is possible without making MessageValue a fully-fledged class. To see what I mean, take a look at how the proposed behaviour of the :number formatter constructs options.

Pass instances of custom types which extend MessageValue.

What would be the benefit of allowing that to happen without a custom annotation on the expression? As proposed, it's possible to return custom MessageValue extensions from a user-defined function, but it does require an annotation on the MF2 expression.

Then, you'd first check if an un-annotated value is an instance of MessageValue. If so, carry on. If not, resolve to MessageUnknownValue.

How would you propose that we "check if an un-annotated value is an instance of MessageValue"? As currently proposed, that's just an interface description for plain objects with some expected properties.

This way, you don't have to try to resolve every expression to a number and string first.

The number/string/unknown resolution that I think you're referring to here is what's applied with an unannotated expression, like {$foo}. It's not applied when resolving e.g. {$foo :custom}, which instead calls the custom function that's defined in the constructor options, which is expected to return a MessageValue.

What would be the benefit of allowing that to happen without a custom annotation on the expression? As proposed, it's possible to return custom MessageValue extensions from a user-defined function, but it does require an annotation on the MF2 expression.

"Require" isn't the best word here, because there's nothing we can do on the syntax or data model validation level to prevent such uses. So we're left with runtime errors or MessageUnknownValues. Instead, I suggest we allow passing instances of MessageValues as input variables, so that:

they produce known MessageParts when used in a placeholder without an annotation,

they can be used as arguments and options to custom functions.

For example, it may be useful to be able to pass as input variables data that's not supposed to be formatted at all and instead be positioned in the correct spot inside the translation: see the WrappedValue example from my implementation.

How would you propose that we "check if an un-annotated value is an instance of MessageValue"? As currently proposed, that's just an interface description for plain objects with some expected properties.

It should be a base class instead, and custom function authors should be able to subclass it for the purposes of their custom logic.

Filed #24 to continue the discussion about whether MessageValue should be subclassable, and if it should be allowed to be passed as input variables.

README.md

Co-authored-by: Stanisław Małolepszy <[email protected]>

README.md

catamorphism · 2023-07-26T07:43:14Z

README.md

+#### Default Functions
+
+Two commonly used message functions `number` and `string` are provided as a starting point,
+and as handlers for placeholders without an annotation,


Suggested change

and as handlers for placeholders without an annotation,

and as handlers for expressions without an annotation,

No, this is deliberate. Selectors without an annotation do not get a default handler.

README.md

Co-authored-by: Tim Chevalier <[email protected]>

dminor · 2023-07-26T13:47:11Z

The reviews from Tim and Stas are quite insightful and I don't think there's much more I can add here.

ryzokuken

Overall LGTM, thanks. I know its not from your PR but the onError method would certainly make things tricky by making users employ two different error handling mechanisms. Could this behavior be replaced using regular JS error handling instead? Allowing people to catch a subclassed "MessageFormatError" instead?

README.md

ryzokuken · 2023-07-31T16:03:55Z

README.md

 type: 'fallback';
- value: undefined;
+ locale: 'und';


is it better to return something like null that makes it obvious that it's not a useful value? Otherwise someone might write code that extracts this and passes this elsewhere without checking for "und" (I suppose they could still do that with null but atleast the error would be more obvious then).

I actually think we ought to make und into a working reference for the root locale, but that's rather from this proposal. I need to organise my thoughts on this some more and raise an issue about this in TG2.

README.md

ryzokuken · 2023-07-31T16:27:42Z

README.md

+and as MF2 considers all literals to be strings,
+the JSON string representation should be supported for numerical and boolean values
+when used as an input or as an option value.
+Each function will need to parse these separately from their string representations.


We could probably link to the grammar in ECMA-404 for this

README.md

eemeli · 2023-07-31T17:29:23Z

From @ryzokuken:
[...] I know its not from your PR but the onError method would certainly make things tricky by making users employ two different error handling mechanisms. Could this behavior be replaced using regular JS error handling instead? Allowing people to catch a subclassed "MessageFormatError" instead?

This is actually mandated by the MF2 spec, which currently includes in its Error Handling section:

In all cases, when encountering an error, a message formatter MUST provide some representation of the message. An informative error or errors MUST also be separately provided.

To be fair, I did write that part of the MF2 spec, but did so rather intentionally and with the support of the MFWG.

Intl.MessageFormat is different from the existing Intl formatters in that it relies on data (a message) provided by a user, and for any given message, the data is often not directly controlled by a developer, but provided by a translator. In other words, this interface ends up relying on inputs which will fail more often than existing formatters, and will do so without being caught in automated tests.

From experience, it turns out to be really important for localization not to fail catastrophically, because that will lead to broken user experiences because not all formatting calls will be wrapped in a try...catch. Hence the default behaviour of warning on error, rather than throwing.

stasm

Like I said in my previous review, the direction of this PR is good, but I have reservations about certain design choices here still. If you prefer to file new issues about these, I'll be happy to discuss them further outside this PR. (They will definitely be easier to discuss against this PR's spec than the current main.)

A summary of my feedback:

There's more work needed around the MessageValue type and its use, both internal and potentially external; I think it should be a subclassable type available to users and authors of custom functions.
The wording of the spec would benefit from defining abstract operations that convert unknown values to known interfaces.
MessageValue should not have a locale property. Instead, the formatting methods should take locale as argument.

stasm · 2023-08-11T08:20:23Z

README.md

+it resolves instead to the result of calling the `string` function
+with the value as input and no options.
+
+Otherwise, un-annotated values resolve to the following shape:


What would be the benefit of allowing that to happen without a custom annotation on the expression? As proposed, it's possible to return custom MessageValue extensions from a user-defined function, but it does require an annotation on the MF2 expression.

"Require" isn't the best word here, because there's nothing we can do on the syntax or data model validation level to prevent such uses. So we're left with runtime errors or MessageUnknownValues. Instead, I suggest we allow passing instances of MessageValues as input variables, so that:

they produce known MessageParts when used in a placeholder without an annotation,

they can be used as arguments and options to custom functions.

For example, it may be useful to be able to pass as input variables data that's not supposed to be formatted at all and instead be positioned in the correct spot inside the translation: see the WrappedValue example from my implementation.

How would you propose that we "check if an un-annotated value is an instance of MessageValue"? As currently proposed, that's just an interface description for plain objects with some expected properties.

It should be a base class instead, and custom function authors should be able to subclass it for the purposes of their custom logic.

README.md

stasm · 2023-08-11T08:52:07Z

README.md

- localeContext?: LocaleContext;
- source?: string;
- meta?: Record<string, string>;
+ locale: string;


The MessageValue should not have a locale property. That's what Parts are for.

Consider {$count :number}. The number stored in $count may be decorated with formatting options like minimumFractionDigits etc., but it's not in any particular locale just yet. It's the locale of the message (or the placeholder if we allow overriding it) that decides how $count will be formatted.

There's nothing inherently language-specific about the mathematical value of 5 :) Same goes for many string literals, like user names or Wi-Fi network names.

I understand that we need the locale somewhere for toString and toParts to work. I think we should just pass it as argument: toString(locale). Perhaps also call it toLocaleString() for consistency with other JS APIs.

Alternatively, we can consider passing in a larger structure: toString(fmtctx). Such "formatting context" may be useful if we want to allow implementations to cache formatters.

(A formatting context should be passed into formatting methods rather than be referenced by a property of MessageValue to help avoid memory leaks and facilitate IPC use-cases.)

We need the locale here to support messages which embed parts that are not in the message's locale. Consider for example:

The number 1,234.5 is formatted in French as "1 234,5".

When formatting that to parts, the last two parts end up as:

[ ..., { type: 'number', locale: 'fr', parts: [{ type: 'integer', value: '1' }, ...] }, { type: 'literal', value: '".' } ]

To get there, the fr locale needs to be available in the corresponding MessageValue. It does not need to be available in $count, because :number defines its own expected input shape, and for that the locale is optional.

Fundamentally, MessageValue is a resolved value that contains everything needed to format it, or to do something else with it like feed it into some custom function. That includes its locale.

Filed #25 to continue this, because I still think locale is special and should not be part of MessageValue.

README.md

Co-authored-by: Ujjwal Sharma <[email protected]>

eemeli · 2023-08-12T15:59:31Z

@stasm I applied some changes in response to your comments, in particular wrt. the MessageFunction types. What I had before was not really workable, and misleading.

As it's currently structured, MessageValue is not intended as an input data type, but as a primarily internal "resolved value" shape. It is a superset of the interfaces required by value+options input, so that e.g. the resolved value of a :number expression can be used as input for another :number expression, but most of the time it doesn't show up in the user's JS code at all.

I think the value gained by making it a real class is marginal, and not worth the cost, but I'm open to being persuaded otherwise. Would it be possible for you to open a separate issue discussing the possibility of doing so as a further step, so that we don't block this PR on it?

…ession parts

stasm · 2023-08-18T09:29:28Z

Thanks for the updates. I filed the following issues to continue the discussion without blocking this PR:

It's going to be easier to discuss them against this PR rather than the current state of main, so I'm in favor of merging it.

Jack-Works · 2023-08-18T11:26:11Z

README.md

 const mf = new Intl.MessageFormat(source, ['en']);
-const notifications = mf.resolveMessage({ count: 1 });
-notifications.toString(); // 'You have one new notification'
+const notifications = mf.format({ count: 1 });
+// 'You have 1 new notification'


The relatively common use case of formatting a message to a string requires the construction of a bunch of intermediate JS objects that are then immediately dismissed, which is pretty inefficient.

Looks like this PR still needs a lot of intermediate JS objects.

I thought it would be something like this:

const mf = new Intl.MessageFormat('en'); const notifications = mf.format(source, { count: 1 }); const another_string = mf.format(source2, { interop: 1 });

That approach would leave the greatest part of the work to the format() calls, though, and that's quite different from how existing Intl formatters work.

Consider for instance how unit formatting works:

const nf = new Intl.NumberFormat('en-GB', { style: 'unit', unit: 'kilometer', unitDisplay: 'long' }); nf.format(42); // "42 kilometres"

There, everything except for the final runtime values is included in the constructor. Now, let's say that we wanted that distance to be embedded in a longer message. With the currently proposed API that would look like this:

const src = ` let $dist = {$dist :number style=unit unit=kilometer unitDisplay=long} {Your destination is in {$dist}.}`; const mf = new Intl.MessageFormat(src, 'en-GB'); mf.format({ dist: 42 }); // "Your destination is in 42 kilometres."

In other words, when getting either a NumberFormat or MessageFormat instance, we should be able to tell what sort of a thing it's formatting. If we leave the source to be a format() argument for MessageFormat, we won't be able to really say anything at all about the MessageFormat instance, or whether the message source is even syntactically valid.

@Jack-Works As moving the source argument to the format() and formatToParts() methods could be considered as a next step from this PR, please open a separate issue to consider that. I would be happy to continue there.

ok, I agree, but does that mean we need to create a new object for every message we have?

Yeah, that's what it means.

Split resolveMessage() into format() & formatToParts()

69a4f57

eemeli requested review from ljharb, littledan, stasm, dminor, catamorphism, zbraniecki, sffc and ryzokuken July 22, 2023 18:32

stasm reviewed Jul 24, 2023

View reviewed changes

eemeli commented Jul 24, 2023

View reviewed changes

README.md Outdated Show resolved Hide resolved

Apply suggestions from code review

1d747fc

Co-authored-by: Stanisław Małolepszy <[email protected]>

catamorphism reviewed Jul 26, 2023

View reviewed changes

eemeli and others added 2 commits July 26, 2023 11:04

Apply suggestions from code review

e0eb309

Co-authored-by: Tim Chevalier <[email protected]>

Apply suggestions from code review

37de6cd

dminor removed their request for review July 26, 2023 13:45

ryzokuken approved these changes Jul 31, 2023

View reviewed changes

stasm reviewed Aug 11, 2023

View reviewed changes

eemeli commented Aug 12, 2023

View reviewed changes

README.md Outdated Show resolved Hide resolved

eemeli commented Aug 12, 2023

View reviewed changes

README.md Outdated Show resolved Hide resolved

Apply suggestions from code review

881fcc6

Co-authored-by: Ujjwal Sharma <[email protected]>

Add missing MessagePart definition; always include the source in expr…

a11cf6c

…ession parts

stasm approved these changes Aug 18, 2023

View reviewed changes

Jack-Works reviewed Aug 18, 2023

View reviewed changes

eemeli merged commit 089a87c into main Aug 18, 2023

eemeli deleted the updates branch August 18, 2023 13:44

eemeli mentioned this pull request Sep 17, 2023

Apply recent Intl.MessageFormat spec updates messageformat/messageformat#410

Merged

eemeli mentioned this pull request May 14, 2024

Please remove onError parameter from format/formatToParts #58

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split resolveMessage() into format() & formatToParts() #22

Split resolveMessage() into format() & formatToParts() #22

eemeli commented Jul 22, 2023

stasm left a comment

stasm Jul 24, 2023

eemeli Jul 24, 2023

stasm Aug 11, 2023

stasm Aug 18, 2023

catamorphism Jul 26, 2023

eemeli Jul 26, 2023

dminor commented Jul 26, 2023

ryzokuken left a comment

ryzokuken Jul 31, 2023

eemeli Jul 31, 2023

ryzokuken Jul 31, 2023

eemeli commented Jul 31, 2023

stasm left a comment

stasm Aug 11, 2023

stasm Aug 11, 2023

eemeli Aug 12, 2023

stasm Aug 18, 2023

eemeli commented Aug 12, 2023 •

edited

Loading

stasm commented Aug 18, 2023

Jack-Works Aug 18, 2023

eemeli Aug 18, 2023

eemeli Aug 18, 2023

Jack-Works Aug 18, 2023

eemeli Aug 18, 2023

	and as handlers for placeholders without an annotation,
	and as handlers for expressions without an annotation,

Split resolveMessage() into format() & formatToParts() #22

Split resolveMessage() into format() & formatToParts() #22

Conversation

eemeli commented Jul 22, 2023

stasm left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dminor commented Jul 26, 2023

ryzokuken left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eemeli commented Jul 31, 2023

stasm left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eemeli commented Aug 12, 2023 • edited Loading

stasm commented Aug 18, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eemeli commented Aug 12, 2023 •

edited

Loading