Skip to content
This repository has been archived by the owner on Aug 2, 2023. It is now read-only.

New UTF8String APIs #1751

Closed
A-And opened this issue Sep 11, 2017 · 45 comments
Closed

New UTF8String APIs #1751

A-And opened this issue Sep 11, 2017 · 45 comments

Comments

@A-And
Copy link
Contributor

A-And commented Sep 11, 2017

Rationale

As it stands UTF8String lacks full feature parity with System.String. To help alleviate this and taking Issue #358 as a guideline, below are a number of proposed methods to be added to UTF8String.

API Proposal

public partial ref struct Utf8String {

    public static Utf8String operator+(Utf8String s1, Utf8String s2);

    public bool IsNullOrWhiteSpace();

    public Utf8String Remove(int startIndex);
    public Utf8String Remove(int startIndex, int count);

    public Utf8String Replace(Utf8String oldValue, Utf8String newValue);

}

partial static class Utf8Helper {

    public static bool IsDigit(uint codePoint);

}

Discussion

In the original proposal @krwq mentioned that + operator as well as Remove and Replace methods may cause issues with views of strings. However, currently functionality referring to substrings create new Utf8String instances instead of referring to the original. It's entirely possible I may have overlooked something, but as it stands the methods add a lot of convenience without an obvious clear detriment.

@A-And
Copy link
Contributor Author

A-And commented Sep 14, 2017

@KrzysztofCwalina
Copy link
Member

KrzysztofCwalina commented Sep 14, 2017

Thanks. Some questions/comments:

  1. System.String is immutable. Do we want Utf8String to be immutable too?
  2. System.String can be put on the heap. Utf8String currently cannot. Should Utf8String be heapable?
    • BTW, I think concatenation only makes sense for heapable strings.
  3. Do we have APIs to do ToUpper/ToLower?
  4. We should consider adding something like String.Split. It would be great if the API was non allocating, i.e. it would return struct collection like type.
  5. Do you think Utf8Helper would have more members? If not, it might not be worth having a type with one member.
  6. Would IsDigit be culture-sensitive, or invariant? I think most of the existing Utf8String APIs are invariant. If yes, how will we add support for culture-dependent operations?
  7. The proposal above shows Utf8String as a class. Is that intentional?

@migueldeicaza, @jaredpar, @VSadov

@VSadov
Copy link
Member

VSadov commented Sep 14, 2017

I think Utf8String should be as close as possible to String. Strings are units of data - want to store/compose them, use as keys in hashtables and so on.

Also it would be useful to have Utf8Extensions that work with ReadOnlySpan and Span, if In-place mutating like Replace.

Maybe add similar Utf16Extensions over spans of Char for completeness.

@A-And
Copy link
Contributor Author

A-And commented Sep 14, 2017

@KrzysztofCwalina Answers below:

  1. It probably should be. I think deviating behavior between Utf8String and System.String is unnecessary and confusing. To that end it would be worth it to change Remove and Replace to fit with the System.String API, i.e.
    public Utf8String Remove(int startIndex);
    public Utf8String Remove(int startIndex, int count);

    public Utf8String  Replace(Utf8String oldValue, Utf8String newValue);
  1. Heap allocation for Utf8String would definitely fit its performance focus, but might be outside of the scope of the issue. With the risk of sounding naïve - what would be the risks of concatenation on stack allocated Utf8Strings?

  2. Not yet, but I think they would be a good addition to this issue.

  3. Useful, but that would quite possibly either make Remove and Replace functionality unfeasible or requiring a redesign to fit within memory constraints. Working off of @VSadov 's suggestion - implementations of Remove and Replace could be added in Utf8Helper that work in-place on Span<T> and ReadOnlySpan<T>. I'm a bit unsure about maintaining separate implementations, particularly because of degrading usability - the two would need to be clearly distinguishable.

  4. IsDigit would be invariant. Culture-sensitivity could be implemented similarly to System.Char, i.e. use Unicode Categories for classification of characters. I don't think culture-sensitive implementation would necessarily make sense in the Utf8String, but Utf8Helper could be a good place.

  5. & 7. Both are oversights and now fixed above. IsDigit should be correctly marked as static and Utf8String as a partial struct.


The suggestions in 4. and 7. could bloat `Utf8Helper` a bit. Currently it handles code point en- and decoding, but also has a `IsWhiteSpace` method. Would it make sense to spin off methods dealing with operations on an existing `Utf8String` into a separate class?

@migueldeicaza
Copy link

Utf8String needs to be usable on the heap, otherwise its usefulness greatly decreases

I wrote my own version precisely to work around that current limitation:

https://github.com/migueldeicaza/NStack/blob/master/NStack/strings/ustring.cs

The above implementation (ustring) is to a large extent a port of the Go string, and contains some useful features. Docs here:

https://migueldeicaza.github.io/NStack/api/NStack/NStack.ustring.html

  • Use the naming "rune" instead of codePoint, as I think the Go guys make a compelling case for this.
  • Supports Utf8 strings from C# strings, blobs of memory, byte arrays
  • Has nice Range API that returns a subset of the string
  • Does not enforce that the contents of Utf8 string are valid - key for dealing later with Unix filenames. This permeates the surface of the API.
  • Rune-ification of a utf8string

The IsDigit implementation in Go and NStack are Unicode aware, and they are not limited to the ascii values.

@ahsonkhan
Copy link
Member

Use the naming "rune" instead of codePoint, as I think the Go guys make a compelling case for this.

From: https://blog.golang.org/strings

"Code point" is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as "code point", with one interesting addition.
The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point. Moreover, what you might think of as a character constant is called a rune constant in Go. The type and value of the expression
'⌘'
is rune with integer value 0x2318.

That is an interesting term for code point. cc @tarekgh

@blowdart
Copy link

Code point is the phrase in the standards. Just because someone else made something up is not a good reason to use it. Especially as there is an actual runic block.

@migueldeicaza
Copy link

Some of us lack taste, but we try to learn from the masters.

The same people that came up with "rune" came up with the encoding that revolutionized the Unicode world by introducing the UTF encoding. This is a great -and completely unrelated- read:

http:https://doc.cat-v.org/bell_labs/utf-8_history

@Drawaes
Copy link
Contributor

Drawaes commented Sep 16, 2017

I had a somewhat similar argument about bigendian network encoding when almost everything is litte endian. The only good reason seems to be "because a million years ago someone said it would be so". And now we have to flip everything that comes off the network......

@KrzysztofCwalina
Copy link
Member

So, what do people think about the following:

// Ideally, the representation would be just like System.String,
// but this cannot be done without runtime support, and so would not work
// on existing runtimes.
// This representation is close. The main difference is that it can box.
public struct Utf8String : IEquatable<Utf8String>
{
    byte[] _codingPoints;

    public Utf8Span Slice(...) ...
}
public ref struct Utf8Span {
    ReadOnlySpan<byte> _codingPoints;
}

// Existing type.
public class String {
    public Utf16Span Slice(...)
}

public ref struct Utf16Span {
    ReadOnlySpan<char> _codingPoints;
} 

@migueldeicaza
Copy link

If I follow the proposal, is the idea that we would have two data types, one heap-friendly, one stack-only? I like that plan.

I would like to steal the idea that was used in Span recently where a naive implementation was provided for older runtimes, but modern runtimes were made aware of the type and a better implementation was available.

So the idea would be to make Utf8String a class and runtime optimize it so:

  • Modern runtimes could create Utf8Strings that have the string data inlined in the body of the object instance, like System.String.

  • We could provide alternative storage systems (current byte buffer, no-copy byte buffer, wrap an existing unmanaged blob).

@KrzysztofCwalina
Copy link
Member

KrzysztofCwalina commented Sep 16, 2017

If I follow the proposal, is the idea that we would have two data types, one heap-friendly, one stack-only?

Yes.

a naive implementation was provided for older runtimes, but modern runtimes were made aware of the type and a better implementation was available.

What would be the representation of the naive representation? An array inside a class? It would mean two GC objects per naive string.

@migueldeicaza
Copy link

Yes, for the native representation it would cause an extra allocation.

@KrzysztofCwalina
Copy link
Member

The difference between fast and slow span is very small. The difference between these two string representations would be very significant. I worry that people would just not want to use it.

@benaadams
Copy link
Member

Is mapping an array inside a struct wrapper to a class too much of a stretch?

e.g.
Older

public struct Utf8String : IEquatable<Utf8String>
{
    byte[] _codingPoints;

    public Utf8Span Slice(...) ...
}

Newer

public class Utf8String : IEquatable<Utf8String>
{
    byte _firstCodingPoint;

    public Utf8Span Slice(...) ...
}

@KrzysztofCwalina
Copy link
Member

@benaadams, I am not sure I follow.

@benaadams
Copy link
Member

Type forwarding from struct to class; whether it would make things too weird

@migueldeicaza
Copy link

@KrzysztofCwalina we have a high-performance version, it just does not go into the heap ;-)

In a moment of insanity, but mostly because I am going to sleep now and what better end of the day than throwing a grenade and leaving, what if we had "interface IUtf8String" and have "struct SUtf8String : IUtf8String" and "Utf8String : IUtf8String"

/me runs from @benaadams

@benaadams
Copy link
Member

Need a [CanOnlyBeUsedAsGenericConstraint] attribute for the interface

@VSadov
Copy link
Member

VSadov commented Sep 16, 2017

I would prefer Utf8String to be a class if at all possible.
Experience with ImmutableArray being struct shows that while perf is good, there is a number of unfortunate scenarios around boxing/interfaces/IntrlockedExchange/nullable and so on.

ImmutableArray could not be no-indirection class without VM support since elements may contain references that must be reported to GC. Utf8String does not have this problem so, perhaps with some unsafe trickery it can be done without direct VM help.

I am thinking of something like:

[StructLayout(LayoutKind.Sequential)]
public sealed class Utf8String : IEquatable<Utf8String> //, other interfaces
{
    // same field layout as in String
    private int m_stringLength;
    private byte m_firstChar;

    public static readonly Utf8String Empty = new Utf8String();

    // never call publicly
    private Utf8String() { }

    // factory allocates System.String instance with the right data, 
    // but then unsafely casts it to Utf8String and patches its VTable.
    public static Utf8String(byte[] data)
    {
        string stringInst;
        fixed (char* charPtr = &data[0])
        {
            stringInst = new String(charPtr, 0, data.Length / 2);
        }

        Utf8String inst = Unsafe.Cast<Utf8String>(stringInst);
        inst.m_stringLength = data.Length;

        //unsafely replace VTable of inst with one from Utf8String.Empty
        fixed (void** p1 = &inst.m_stringLength)
        {
            fixed (void** p2 = &Empty.m_stringLength)
            {
                // no idea if this is the right offset
                p1[-1] = p2[-1];
            }
        }

        return inst;
    }

    //... the rest of implementation assumes that "chars" follow m_firstChar contiguously.
}

@jkotas - can something like the above work?

@jkotas
Copy link
Member

jkotas commented Sep 16, 2017

This is what @migueldeicaza suggested above - modern runtimes could create Utf8Strings that have the string data inlined in the body of the object. It can be done, but it requires runtime support. I do not think we would implement it by patching the vtable - it is dirty and it would not actually work as written above.

The difference between these two string representations would be very significant.

I think it would be about the same as difference between slow span and fast span overall. Depends on microbenchmarks you pick to make the point.

@jkotas
Copy link
Member

jkotas commented Sep 16, 2017

E.g. The difference on .NET Core x64 is: The fast version allocates AlignUp(21 + length, 8) bytes on GC heap and the slow version allocates AlignUp(49 + length, 8) on GC heap.

@VSadov
Copy link
Member

VSadov commented Sep 16, 2017

Of course the new runtimes can mplement it natively.

The suggestion above is for "slow" Utf8String on legacy runtimes.
I wonder why VTable patching would not work. Other that it is "dirty" on which i agree.

@KrzysztofCwalina
Copy link
Member

See SixLabors/ImageSharp#327 for an example how some people dislike small differences in performance

@jkotas
Copy link
Member

jkotas commented Sep 16, 2017

I wonder why VTable patching would not work

The GC needs to be able to compute size of an object. Variable sized objects have to have special vtable that make them variable sized.

See SixLabors/ImageSharp#327 for an example how some people dislike small differences in performance

Real-world portable libraries tend to have workarounds for both functional and performance issues in older runtimes. There is nothing to like about those, but it is the reality.

@KrzysztofCwalina
Copy link
Member

KrzysztofCwalina commented Sep 16, 2017

Yeah, I am not saying that two object string is a no-go, just pointing out that some people are very sensitive about it.

@KrzysztofCwalina
Copy link
Member

With the risk of sounding naïve - what would be the risks of concatenation on stack allocated Utf8Strings?

Not naive at all, I probably worded it too strong. I was thinking that if people are willing to live with the limitations of a stack only string, they probably would not be happy with concat allocating like crazy.

@VSadov
Copy link
Member

VSadov commented Sep 16, 2017

I did not know that variable-sized objects have special vtable to indicate size. I thought GC knows the size and only needs info about layout to know what objects are reachable from current.
Oh well...

@KrzysztofCwalina
Copy link
Member

Another option I thought about for slow string: abstract class with multiple implementations like follows:

public abstract Utf8String { 
    int _length;

    protected abstract Span<byte> Bytes;

    public static Utf8String Create(bytes[] utf8){
        if(utf8.Length == 0) return Utf8String.Empty;
        if(utf8.Length == 1) return new Utf8String1(utf8[0]);
        if(utf8.Length == 2) return new Utf8String2(utf8[0], utf8[1]);
        ...
        else return new Utf8StringArray(utf8.Clone()); 
    }
}
internal class Utf8String1 : Utf8String {
    byte _b0;
    
    protected override Span<byte> Bytes => new Span(ref _b0, _length);
}
internal class Utf8String2 : Utf8String {
    byte _b0;
    byte _b1;

    protected override Span<byte> Bytes => new Span(ref _b0, _length);
}
internal class Utf8String4 : Utf8String {
    byte _b0;
    byte _b1;
    byte _b2;
    byte _b3;

    protected override Span<byte> Bytes => new Span(ref _b0, _length);
}
///...
internal class Utf8StringArray : Utf8String {
    byte[] _bytes;

    protected override Span<byte> Bytes => _bytes;
}

But it will make calls on the fast string virtual

@KrzysztofCwalina
Copy link
Member

Anyway, I think we all agree that we want Utf8String to be heapable.

@A-And, if you'd like to keep running with this project, please:

  1. Rename Utf8String to Utf8Span
  2. Add a class called Utf8String that for now just stores an array of bytes (UTF8 coding units). We can optimize it later
  3. I think the class class should have roughly the same APIs like the Utf8Span type, in fact, the Utf8Span's implementation (for most members) could be reused.

@A-And
Copy link
Contributor Author

A-And commented Sep 18, 2017

@KrzysztofCwalina Great! I'll get on it. Also glad this caused so much discussion.

@tannergooding
Copy link
Member

As a reference, I have a proposal to support declaring UTF8String constants using data declarations here: dotnet/csharplang#909

I believe the only requirement would be that Utf8String be allowed to map directly to a some format that contains a length and a byte*

@migueldeicaza
Copy link

I am so happy to hear about the move for Utf8String to become heapable @KrzysztofCwalina!

@KrzysztofCwalina
Copy link
Member

I am so happy you are pushing for UTF8 string in general @migueldeicaza! :-)

@jaredpar
Copy link
Member

jaredpar commented Oct 2, 2017

Why are we using byte[] as the storage here instead of Memory<byte>.

@KrzysztofCwalina
Copy link
Member

I agree. We should consider Memory.

@jkotas
Copy link
Member

jkotas commented Oct 2, 2017

Memory would add 8 more bytes to the instance footprint and it would make primitive operations like indexing significantly slower...

@KrzysztofCwalina
Copy link
Member

KrzysztofCwalina commented Oct 2, 2017

We need to decide if we copy the data in the ctor or not. If we don't copy, Utf8String will not be immutable. If we do copy, we can take Memory<byte>, or even Span<byte>. If we don't copy, I agree with @jkotas that the overhead would be too high.

@benaadams
Copy link
Member

We need to decide if we copy the data in the ctor or not.

Copying leads to user behavior side effects. For example with string, creating N size and mutating in place with fixed at creation 😔

Shady non-(additionally)-allocating approach (data written 1 time):

  1. Fast allocate empty string size N (new string('\0', N)) - Pre 0-init
  2. Fix string
  3. (Write) Write N chars

Alternative non-(additionally)-allocating approach (data including zeroinit written 3 times, read 2 times):

  1. stackalloc N size char*
  2. (Write) zeroinit N
  3. (Write) write N chars
  4. (Read) string .ctor - scan char* to determine length (WszMultiByteToWideChar)
  5. string .ctor - Fast allocate empty string size N - Pre 0-init
  6. (Read+Write) string .ctor - copy char* input (WszMultiByteToWideChar)

Copy in char* string .ctor - isn't a great sell; especially if you have to go unsafe anyway to call the method (safe stackalloc -> Span<char> overload + ReadOnlySpan<char> -> string.ctor may reduce this a little)

@KrzysztofCwalina
Copy link
Member

KrzysztofCwalina commented Oct 2, 2017

we could solve the problem with something like: https://github.com/dotnet/corefx/issues/21472

var text = Utf8String.Create(20, (utf8Buffer)=> {
    for(int i=0; i<data.Length; i++) utf8Buffer[i] = 65+i;
});

@KrzysztofCwalina
Copy link
Member

If we only had function pointers, i.e. allocation free delegates :-)

@tannergooding
Copy link
Member

If we only had function pointers, i.e. allocation free delegates :-)

@KrzysztofCwalina, there is a championed proposal that would add such functionality: dotnet/csharplang#302

I just don't know where it is in the priority queue for language features

@migueldeicaza
Copy link

My take is that we should not copy, this is what I did for ustring. A convenience method could be used by those that desire a copy of the data to take place.

That design decision has the side effect of making Utf8String a proxy for interpreting data encoded in Utf8 format. This also means that like NStack.ustring it should also allow for invalid UTF8 data to be specified, and provide the APIs necessary to cope with it, this allows the UTF8String to be used as a transport correctly, even when the underlying encoding might be wrong (see, every Unix file system in the world, and also one of the reasons that Go supports this).

@tannergooding
Copy link
Member

My take is that we should not copy

I think forced copying would also provide a perf hit for UTF8 string literals (dotnet/csharplang#909), if they (or some variation) get championed/approved in the future.

I will say, no matter which is the default (copying or not), there should be an API that explicitly allows the other to be possible.

@ahsonkhan
Copy link
Member

cc @GrabYourPitchforks, fyi.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests