Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in props parsing #3193

Closed
ed2050 opened this issue Jun 9, 2024 · 7 comments
Closed

Bug in props parsing #3193

ed2050 opened this issue Jun 9, 2024 · 7 comments
Labels
bug Something isn't working
Milestone

Comments

@ed2050
Copy link

ed2050 commented Jun 9, 2024

Description

There's a bug in props parsing. Looks like a flaw in PROPS_PATTERN regex of element.py.

Equal signs are parsed incorrectly. Equals in the value portion of "key=value" are split and key is thrown away. This can be triggered by passing a filename with equal sign in the value.

I suggest using shlex instead of a regex, as explained below:

def props (add) :
    for token in shlex.split (add) :
        key, value = token.split ('=', 1)
        ... now do whatever props does with key and value

Example

imgfile = '007_eXfm=YuqZ.jpg'
ui.element ('img').props (f'src={ imgfile }')
# produces img with no src attribute : 
# <img 007_eXfm="YuqZ.jpg">

# regex in props() drops the "src" part :
import nicegui.element as e
e.PROPS_PATTERN.search ('src=foo=bar.jpg').groups ()
> ('foo', None, None, 'bar.jpg')

The src attribute is thrown away and the filename is incorrectly split into a key=value pair, which is then added to the <img> tag.

This behavior is incorrect because the docs specify that props takes "whitespace-delimited list of either boolean values or key=value pair to add". But values are not just whitespace delimited in the code. They are also split on equal sign, even when = is part of the value.

Suggestions

  1. Fix the code to split on whitespace as described in docs
  2. Don't throw away key when value contains =
  3. Update the docs to say that props takes a "whitespace-delimited string of ...". Saying it takes a list suggests a python list, which is misleading.

Regex is probably the wrong approach for this task. You're better off using string split so you can do key, value = text.split ('=', 1) after splitting the arg on whitespace. But that gets complicated by first detecting matching quotes for values that contain whitespace.

A better approach is to use shlex to split the first stage. Shlex respects and removes quotes. For instance:

>>> shlex.split ('src=foo=bar.jpg')
['src=foo=bar.jpg']

>>> shlex.split ('src="foo bar.jpg"')
['src=foo bar.jpg']

>>> shlex.split ("src='foo bar.jpg'")
['src=foo bar.jpg']

In each case shlex removes the quotes, making second step trivial.

def props (add) :
    for token in shlex.split (add) :
        key, value = token.split ('=', 1)
        ... now do whatever props does with key and value
@ed2050
Copy link
Author

ed2050 commented Jun 9, 2024

Sorry I accidentaly hit post before text was finished. Please see modified text above.

@falkoschindler
Copy link
Contributor

Hi @ed2050, thanks for reporting this issue!

Actually, we originally used shlex for parsing props, but replaced it with regex for performance reasons: #341. Since the .props() method is potentially called thousands of times, performance does matter quite a lot. So even though I'd like to fix the bug with equal signs in prop values, it might be better to improve the regex rather than switching to shlex. But maybe there's also a way to reach similar performance with shlex. I'm totally open for suggestions.

Just for the record: A trivial workaround would be to use quotes around the value, like .props(f'src="{ imgfile }"').

@falkoschindler
Copy link
Contributor

falkoschindler commented Jun 10, 2024

I just experimented with this simple implementation (see "props" branch):

@staticmethod
def _parse_props(text: Optional[str]) -> Dict[str, Any]:
    dictionary: dict[str, Any] = {}
    for token in shlex.split(text or ''):
        words = token.split('=', 1)
        dictionary[words[0]] = True if len(words) == 1 else words[1]
    return dictionary

This is around 50% slower for long strings like 'dark color=red label="First Name" hint="Your \"given\" name" input-style="{ color: #ff0000 }"', and even 80% slower for short strings like 'dark':

       short  long
shlex  0.030  0.260
regex  0.006  0.130

@falkoschindler falkoschindler added help wanted Extra attention is needed bug Something isn't working labels Jun 10, 2024
@falkoschindler falkoschindler removed the help wanted Extra attention is needed label Jun 10, 2024
@falkoschindler falkoschindler added this to the 1.4.27 milestone Jun 10, 2024
@falkoschindler
Copy link
Contributor

I think I fixed the problem by adding "=" to the list of allowed characters in an unquoted props string.

@ed2050
Copy link
Author

ed2050 commented Jun 10, 2024

This post contains a modest improvement if you want even more speedup to your regex solution (2x on simple strings). _parse_props code at end.

Timings

I did timing tests for shlex vs nicegui regex. Looks like a 25x speed difference on my machine:

> python3 -m timeit -s 'import shlex ; text = """src=foo=bar.jpg href="foo bar.jpg" rab=oof"""' -c 'shlex.split (text)'
5000 loops, best of 5: 49 usec per loop

> python3 -m timeit -s 'import nicegui.element as e; regex = e.PROPS_PATTERN ; text = """src=foo=bar.jpg href="foo bar.jpg" rab=oof""" ' -c 'regex.search (text).groups ()'
200000 loops, best of 5: 2 usec per loop

Interesting that your tests with current regex version have closed that gap. Only about 2:1 difference for long strings and 5:1 for short. But you prob measured complete function not just splitting the string.

Improvement

Speeding up shlex doesn't appear to be possible. At least without lib hacking.

If you really want a speedup, you can handle simple cases separately. Checking a string for a quote char is very fast: 25 ns for a 100 char string on my machine. It's much faster to check for quote and use str.split () if none present. Otherwise fall back to regex as desired.

if '"' not in text :
    words = text.split ()
else :
    words = [ m.groups () for m in PROPS_PATTERN.finditer (text) ]

This speeds up splitting simple cases by 10x:

# has quotes, parse with regex
> python3 -m timeit -s 'import nicegui.element as e; text = """src=foo=bar.jpg href="foo bar.jpg" rab=oof""" ; t2 = "foo=bar rab=oof" ; regex = e.PROPS_PATTERN ; quote = """_"_""" [1]' -c 'quote in text and regex.search (text).groups () or text.split ()'
200000 loops, best of 5: 1.98 usec per loop

# no quotes, parse with str.split
> python3 -m timeit -s 'import nicegui.element as e; text = "foo=bar rab=oof" ; regex = e.PROPS_PATTERN ; quote = """_"_""" [1]' -c 'quote in text and regex.search (text).groups () or text.split ()'
2000000 loops, best of 5: 186 nsec per loop

Is it worth it? Well... maybe. By the time you add all the other code in _parse_props, the difference is only 2x on simple strings (1.7 us per call vs 3.3 us). Significant, but _parse_props is probably not a bottleneck in the overall app. But you were concerned about a 2x difference between shlex and regex, so maybe.

Implementation

Here's an implemention of _parse_props with this speedup. It's 2x faster on simple strings (no quotes) by skipping the regex. On strings with quotes, speed is unchanged.

def _parse_props (text) :

    props = {}
    # first split text into list of pairs
    if '"' in text :
        # get match groups from regex and keep values that aren't None
        pairs = [ [ x for x in w.groups () if x ] for w in e.PROPS_PATTERN.finditer (text) ]
    else :
        pairs = [ t.split ('=', 1) for t in text.split () ]

    # now transform pairs into props dict
    for key, *value in pairs :
        
        value = value and value [0] or ''    # handle empty values
        if value and value.startswith ('"') and value.endswith ('"') :
            value = json.loads(value)
        props [key] = value or True
        
    return props

@ed2050
Copy link
Author

ed2050 commented Jun 10, 2024

@falkoschindler

Actually, we originally used shlex for parsing props, but replaced it with regex for performance reasons: #341.

That's unfortunate. I hate when the clear, simple, obvious solution isn't fast enough.
Regexes are tricky. I've used and abused them a lot. Very easy for bugs to sneak in.

@falkoschindler
Copy link
Contributor

Thanks for your detailed investigation, @ed2050!
We'll leave the implementation as it is for now, since it is working fine and the performance is good enough. Using a simple .split() for strings without quotes is an interesting idea though. 👍🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants