Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor improvements to regex.list #18

Closed
wants to merge 2 commits into from
Closed

Minor improvements to regex.list #18

wants to merge 2 commits into from

Conversation

niklasea
Copy link

I would also have suggested blocking subdomains of IDNs (i.e. subdomains of xn--), but I see you had issues with Google Play. What kind of issues did you have and with what domains? The regex would be much better if it at least blocked the www subdomain as well to catch domains like www.ɡooɡle.com (in punycode: www.xn--oole-z7bc.com).

@mmotti
Copy link
Owner

mmotti commented Jan 13, 2020

Hi,

What are the reasons for removing underscores and lazy evaluation?

I believe, although I could be wrong, underscores are valid characters especially for subdomains? And the lazy evaluation can save rather a lot of steps in some cases which may save precious processing time.

I didn't personally have issues but I had issues reported presumably from international users where they were experiencing problems with Google play; I can't recall whether they specified if it was general access or downloading apps. I can certainly add the optional www subdomain to these though :-)

@niklasea
Copy link
Author

niklasea commented Jan 13, 2020

As I wrote in the commit message, lazy evaluation is not possible with POSIX extended regex. I don't know how it gets parsed exactly, but it certainly does not do what you think it does.

With regards to underscores, I may be wrong on that front. I did some more reading and although underscore is valid in DNS records, AFAIU it is not valid in hostnames. I can't think of a situation where these regexes would be used to block DNS records that do not point at a web host. However, in case we do want to include underscores, I wonder why they were not included in every regex?

EDIT: So I actually did the legwork and I can see that, indeed, I do have names with underscores on my blocklist and some of them do resolve to web servers. So I'll concede that they should remain, but I wonder if they should be applied to every regex or if there was a reason why you used them only where you did?

@mmotti
Copy link
Owner

mmotti commented Jan 13, 2020 via email

@niklasea
Copy link
Author

niklasea commented Jan 13, 2020

I see you replied via email. I edited my comment after I tested and confirmed that underscores can in fact be used and should be left in.

For reference, this seems to be the line that parses regexes in Pi-hole.

Pi-hole uses 'grep -E' so does not support lazy evaluation
Block simple punycode homograph attacks,
hopefully without too many false positives
@mmotti
Copy link
Owner

mmotti commented Jan 13, 2020

@niklasea take a look through 32d97e8. I think it should cover everything in this pull request, in addition to moving the - within the character classes to the end to avoid syntax errors relating to selecting an invalid range.

@niklasea
Copy link
Author

I think it looks good.

@mmotti
Copy link
Owner

mmotti commented Jan 14, 2020

Resolved in 32d97e8

Thanks @niklasea

@mmotti mmotti closed this Jan 14, 2020
mmotti referenced this pull request Jan 16, 2020
- Remove non-greedy matching as potentially not compatible with Pi-hole
- Add www., www1. etc subdomain support for Punycode domains
- Make use of underscores more consistent
- Move hyphen to the end of the character se to avoid range issues.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants