Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NamedArrays with AbstractString names #140

Closed
orenbenkiki opened this issue Jun 10, 2024 · 3 comments
Closed

NamedArrays with AbstractString names #140

orenbenkiki opened this issue Jun 10, 2024 · 3 comments

Comments

@orenbenkiki
Copy link
Contributor

If one creates a named_array = NamedArray(values_vector; names = (names_vector,)) then the keytype(named_array.dicts[1]) == eltype(names_vector). This seems reasonable on its face, except when the eltype(names_vector) <: AbstractString.

Consider for example names_vector = split(one_name_per_line, "\n"). In this case the keytype is Substring. Trying to index into the named array with a regular String will fail because a String isn't a Substring. This also fails the other way around: if eltype(names_vector) == String, you can't index into the named array with a Substring.

A better behavior would be to test whether the eltype of the names vector is <: AbstractString, and if so, force the keytype to be AbstractString.

@davidavdav
Copy link
Owner

Yes, this is a problem. I see that the regular Dict doesn't have this problem (anymore? I recall I've run into this issue with DataFrames or CSV, which reads files and splits strings such that the column types are SubString).

I don't know how they have solved it for Dict, I've tried to interpret the sources, traced it down to ht_keyindex().

One solution for now is to use names_vector = String.(split(one_name_per_line, "\n")).

The strange thing is that it does work for Int8 vs Int64

@orenbenkiki
Copy link
Contributor Author

Sure, I can (and do) work around it by making a copy of the names array, which is a pity when this is a memory-mapped file I just split on a line break. I have ~40K names in there. It isn't a major performance issue but it isn't nice.

Instead of fixing it in OrderedDict, you could check if eltype(names) <: AbstractString and if so, explicitly create OrderedDict{AbstractString,Int}(...). This would ensure that you could query it for any string type without a problem, and shouldn't break any existing code.

@hsseung
Copy link

hsseung commented Aug 1, 2024

FYI I just ran into the same problem when using fixed-width string types from InlineStrings.jl.
These are common when reading DataFrames from CSV files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants