Improved UTF-8 support #528

Open
benruijl opened this issue May 26, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@benruijl
Collaborator

benruijl commented May 26, 2024

To allow for easier communication with other programs, FORM should support UTF-8 variable and expression names. Currently, it does (likely by accident) if one wraps the variable names in []:

S [xሴ];

L [Fɱ] = [xሴ]^2+2; * test π^2

Print +s;

.end

This program outputs:

FORM 4.3.1 (Apr 11 2023, v4.3.1) 64-bits         Run: Mon May 27 08:32:48 2024
    S [xሴ];
    
    L [Fɱ] = [xሴ]^2+2; * test π^2
    
    Print +s;
    
    .end

Time =       0.00 sec    Generated terms =          2
           [Fɱ]         Terms in output =          2
                         Bytes used      =         52

   [Fɱ] =
       + 2
       + [xሴ]^2
      ;

  0.00 sec out of 0.00 sec

As you can see, the message is not properly aligned, because the size in bytes does not equal the number of UTF-8 characters. The relevant code, in message.c around line 704, is:

else if ( *s == 's' ) {
	u = va_arg(ap,char *);
	i = 0;
	while ( *u ) { i++; u++; }
	if ( i > x ) i = x;
	while ( x > i ) { *t++ = ' '; x--; }
	t += x;
	while ( --i >= 0 ) { *--t = *--u; }
	t += x;
}

where x is the desired width. One part of the fix is to make the counter i skip UTF-8 continuation bytes (while keeping track of the full byte count in a separate variable or as a pointer difference):

while ( *u ) { if  ((*u & 0xC0) != 0x80) i++; u++; }

but extra care has to be taken when the input name is longer than x: the truncation must happen at the start of a Unicode character (a position satisfying (*u & 0xC0) != 0x80). I haven't had time to fix that, so that's why I am posting the issue here.
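To make the two points above concrete, here is a rough, standalone sketch of a character-aware version of the padding and truncation logic. It is not the FORM code: the names utf8_is_start, utf8_count and utf8_right_align are made up for illustration, the input is assumed to be valid UTF-8, and every code point is assumed to take one terminal column (which is not true for combining or double-width characters). Like the original snippet, it keeps the tail of a name that is too long.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A byte starts a UTF-8 character unless it is a continuation byte (10xxxxxx). */
static int utf8_is_start(unsigned char b) { return (b & 0xC0) != 0x80; }

/* Count characters (code points) instead of bytes. Assumes valid UTF-8. */
static size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s; s++) if (utf8_is_start((unsigned char)*s)) n++;
    return n;
}

/*
 * Hypothetical helper: write s right-aligned into t, in a field that is
 * `width` characters wide. If s has more than `width` characters, only its
 * last `width` characters are kept, cutting at a character boundary.
 * Returns a pointer just past the last byte written (no terminator is added).
 * The caller must provide room for at least width + strlen(s) bytes.
 */
static char *utf8_right_align(char *t, const char *s, size_t width)
{
    assert(t != NULL && s != NULL);

    size_t chars = utf8_count(s);
    const char *end = s + strlen(s);
    const char *start = s;

    if (chars > width) {
        /* Walk back from the end until `width` character starts were passed. */
        size_t keep = width;
        start = end;
        while (keep > 0) {
            start--;
            if (utf8_is_start((unsigned char)*start)) keep--;
        }
        chars = width;
    }

    /* Pad by the number of missing characters, not missing bytes. */
    while (width > chars) { *t++ = ' '; width--; }

    size_t nbytes = (size_t)(end - start);
    memcpy(t, start, nbytes);
    return t + nbytes;
}
```

With this, [Fɱ] (5 bytes, 4 characters) in a field of width 6 gets two leading spaces rather than one, and truncation never splits a multi-byte character.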

In the comments you can report other places where UTF-8 support has to be added.

@vermaseren
Owner

vermaseren commented May 26, 2024 via email

tueda added the enhancement label May 26, 2024
@benruijl
Collaborator Author

The compilers are agnostic to this, as for UTF-8 you can still use the regular char type. You just need to change some functions that involve string lengths.
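As a small illustration of that point (a sketch only, with helper names that are not from FORM): byte-oriented functions such as strcmp keep working on UTF-8 stored in plain char arrays, and only code that treats strlen as a column count is off, by exactly the number of continuation bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: names are equal when their bytes are equal,
 * so no UTF-8 awareness is needed for comparison. */
static int same_name(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Hypothetical helper: number of continuation bytes (10xxxxxx) in s.
 * This is how far strlen(s) overshoots the character count,
 * assuming s is valid UTF-8. */
static size_t utf8_extra_bytes(const char *s)
{
    size_t extra = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) == 0x80) extra++;
    }
    return extra;
}
```

For [xሴ], strlen gives 6 while utf8_extra_bytes gives 2, so the name occupies 4 characters. For [Fɱ] the difference is 1, which matches the one-column shift in the statistics output above.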

Your terminal should render UTF-8 properly (you can try it with the example, but your own text editor may not properly render it).

@vermaseren
Owner

vermaseren commented May 27, 2024 via email

@tueda
Collaborator

tueda commented May 27, 2024

I guess math & physics people may prefer using Greek letters for variables, just like when they write equations by hand. This is possible in languages such as Python, Julia and Mathematica.

@vermaseren
Owner

vermaseren commented May 27, 2024 via email

@tueda
Collaborator

tueda commented May 27, 2024

Usually, they are typed using a software keyboard... And the dictionaries may fail to handle Unicode characters, I guess?

@vermaseren
Owner

vermaseren commented May 27, 2024 via email

@benruijl
Collaborator Author

The FG.cTable lookup is skipped for all names that start with [, so these issues are avoided. I think one of the few changes needed is the string length used for the alignment.

@vermaseren
Owner

vermaseren commented May 27, 2024 via email

@benruijl
Collaborator Author

I am thinking of a cross-tool unified format that can be used to import and export symbol definitions and expressions between FORM/Mathematica/Symbolica. For this format it would be great if Unicode symbols can be understood by all tools. So for example, if you use π in Mathematica, it can be converted to [π] in FORM.
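A minimal sketch of what the π to [π] step could look like on the export side. The function names has_non_ascii and form_name are hypothetical, and a real converter would need more checks (for instance names that already contain a ] or characters FORM does not accept inside brackets).

```c
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if the name contains any byte outside 7-bit ASCII. */
static int has_non_ascii(const char *s)
{
    for (; *s; s++) {
        if ((unsigned char)*s & 0x80) return 1;
    }
    return 0;
}

/*
 * Hypothetical export helper: write the FORM spelling of `name` into `out`.
 * Names with non-ASCII bytes are wrapped in [] unless they already start
 * with [. Returns 0 on success, -1 if `out` is too small.
 */
static int form_name(char *out, size_t outsz, const char *name)
{
    int n;
    if (has_non_ascii(name) && name[0] != '[') {
        n = snprintf(out, outsz, "[%s]", name);
    } else {
        n = snprintf(out, outsz, "%s", name);
    }
    return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}
```

Plain ASCII names such as x pass through unchanged, so existing programs would not be affected by such a convention.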

@vermaseren
Owner

vermaseren commented May 27, 2024 via email
