Invalid host names #129

Closed
Odyseus opened this issue Jan 15, 2018 · 11 comments

@Odyseus

Odyseus commented Jan 15, 2018

Hello, everybody.

I have found what seems to be invalid host names.

247?realmedia.com
adblade.com,popup
adtrackone.eu$document
durl=px.moatads.com
free‐celebrity‐tube.com
javakiba.org*
mailto:[email protected]
moherland.pl,
public‐sluts.net
px.moatads.compx.moatads.com
px.moatads.com,z.moatads.com
telemetry.appex.bing.net:443
websitealive[0-9].com
www.free‐celebrity‐tube.com
www.just-anchor.com?
www.public‐sluts.net

Note that I said "seems to be" because I'm not entirely sure if all of the listed host names are invalid. It's for the experts to decide.

Thanks for your work, @mitchellkrogza and contributors. 👍

@funilrys
Member

Well thanks to you @Odyseus for reporting !! 👍🌟

@mitchellkrogza I think that we both (Py||Funceble && Ultimate) have to review the way we handle those typos or invalid domains !

Thanks again @Odyseus 👍🌟

P.S: I added this to my workflow for future commit or update of Py||Funceble.

@Odyseus
Author

Odyseus commented Jan 16, 2018

Hello, @funilrys.

I'm just glad that I could be of some help. 👍

I found these invalid host names while developing my own Hosts Manager application (a CLI app written in Python). I investigated a little more about valid host names and I found some more invalid host names.

The following table lists the host names as they appear in this repository's hosts file, the possible error(s), and the possibly correct host name.

| Host name | Possible error | Possibly correct name |
|---|---|---|
| 1-1-1-.ib.adnxs.com | Labels can't start/end with a hyphen. ("1-1-1-") | - |
| 247?realmedia.com | Invalid character. ("?") | - |
| adblade.com,popup | Invalid character. (",") | adblade.com |
| adtrackone.eu$document | Invalid character. ("$") | adtrackone.eu |
| allosexe-.myblox.fr | Labels can't start/end with a hyphen. ("allosexe-") | - |
| alyssa-.ifrance.com | Labels can't start/end with a hyphen. ("alyssa-") | - |
| durl=px.moatads.com | Invalid character. ("=") | px.moatads.com |
| film-porno-.ze.cx | Labels can't start/end with a hyphen. ("film-porno-") | - |
| free‐celebrity‐tube.com | Invalid character. ("‐") (1) | - |
| goreanharbourreysa-.4ya.nl | Labels can't start/end with a hyphen. ("goreanharbourreysa-") | - |
| i-52b.-xxx.ut.bench.utorrent.com | Labels can't start/end with a hyphen. ("-xxx") | - |
| javakiba.org* | Invalid character. ("*") | - |
| mailto:[email protected] | Invalid characters. (":", "@") | - |
| mangas-porno-.has.it | Labels can't start/end with a hyphen. ("mangas-porno-") | - |
| moherland.pl, | Invalid character. (",") | moherland.pl |
| paris-.blogspot.ca | Labels can't start/end with a hyphen. ("paris-") | - |
| paris-.blogspot.ch | Same as above. | - |
| paris-.blogspot.co.id | Same as above. | - |
| paris-.blogspot.com | Same as above. | - |
| paris-.blogspot.com.ar | Same as above. | - |
| paris-.blogspot.com.br | Same as above. | - |
| paris-.blogspot.com.es | Same as above. | - |
| paris-.blogspot.com.tr | Same as above. | - |
| paris-.blogspot.co.uk | Same as above. | - |
| paris-.blogspot.de | Same as above. | - |
| paris-.blogspot.gr | Same as above. | - |
| paris-.blogspot.it | Same as above. | - |
| paris-.blogspot.mx | Same as above. | - |
| paris-.blogspot.no | Same as above. | - |
| paris-.blogspot.pt | Same as above. | - |
| paris-.blogspot.sk | Same as above. | - |
| preview-.stripchat.com | Labels can't start/end with a hyphen. ("preview-") | - |
| public‐sluts.net | Invalid character. ("‐") (1) | - |
| px.moatads.com,z.moatads.com | Invalid character. (",") | px.moatads.com |
| sexchat-.startspin.nl | Labels can't start/end with a hyphen. ("sexchat-") | - |
| telemetry.appex.bing.net:443 | Invalid character. (":") | telemetry.appex.bing.net |
| websitealive[0-9].com | Invalid characters. ("[", "]") | - |
| www.free‐celebrity‐tube.com | Invalid character. ("‐") (1) | - |
| www.just-anchor.com? | Invalid character. ("?") | - |
| www.public‐sluts.net | Invalid character. ("‐") (1) | - |
| xmr.-eu1.nanopool.org | Labels can't start/end with a hyphen. ("-eu1") | - |

(1): This is a character that looks like the minus sign. The following is a table with info from the GNOME Character Map application about the possibly invalid character.

| Representations | "‐" (U+2010 HYPHEN) (The invalid character) | "-" (U+002D HYPHEN-MINUS) (The minus sign) |
|---|---|---|
| UTF-8 | 0xE2 0x80 0x90 | 0x2D |
| UTF-16 | 0x2010 | 0x002D |
| C octal escaped UTF-8 | \342\200\220 | \055 |
| XML decimal entity | &#8208; | &#45; |
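
The byte values in the table can be reproduced with Python's standard library, and the same idea gives a quick way to spot look-alike hyphens in a host name. This is only a sketch: `find_lookalike_hyphens` and `LOOKALIKE_HYPHENS` are hypothetical names, and the set of code points is an illustrative assumption, not an exhaustive list.

```python
import unicodedata

# Dash-like code points that are easy to mistake for "-" (U+002D).
# Assumption: this short set is illustrative, not exhaustive.
LOOKALIKE_HYPHENS = {"\u2010", "\u2011", "\u2012", "\u2013", "\u2014", "\u2212"}


def find_lookalike_hyphens(host):
    """Return (index, code point, Unicode name) for each look-alike hyphen."""
    return [
        (i, "U+{:04X}".format(ord(ch)), unicodedata.name(ch))
        for i, ch in enumerate(host)
        if ch in LOOKALIKE_HYPHENS
    ]


if __name__ == "__main__":
    print("\u2010".encode("utf-8"))  # UTF-8 bytes of U+2010 HYPHEN
    print("-".encode("utf-8"))       # UTF-8 bytes of U+002D HYPHEN-MINUS
    print(find_lookalike_hyphens("free\u2010celebrity\u2010tube.com"))
```

Reporting the position and Unicode name makes it easier to see why a host name that looks fine on screen is rejected.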

In case it is useful, here is the Python 3 function that I use to check for valid host names.

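#!/usr/bin/python3

import re

# One host name label: 1 to 63 characters from [\w-], not starting or
# ending with "-". Compiled once at module level and reused on every call.
HOSTNAME_REGEX = re.compile(r"(?!-)[\w-]{1,63}(?<!-)$")

def is_valid_host(host):
    """IDN compatible domain validation.
    """
    # rstrip(".") removes every trailing dot, not only a single one.
    host = host.rstrip(".")

    # Overall length check plus one regex match per dot-separated label.
    return all([len(host) > 1, len(host) < 253] + [HOSTNAME_REGEX.match(x) for x in host.split(".")])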

This function is based on several functions that I found in this StackOverflow question. I just mainly translated it into a one-liner.

The function basically does the following:

  • It strips any trailing dots from the host name before checking it.
  • It checks that its length is greater than 1 character.
  • It checks that its length is less than 253 characters.
  • It checks each host name label with a regular expression. The regular expression checks that:
    • The label doesn't start with a minus sign ((?!-)).
    • The label contains only word characters and minus signs ([\w-]). In Python 3, \w also matches the underscore and non-ASCII letters and digits, which is what makes the check IDN compatible.
    • The label is between 1 and 63 characters long ({1,63}).
    • The label doesn't end with a minus sign ((?<!-)).
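
The checks listed above only give a yes/no answer. A variant that says which label failed and why, similar to the "Possible error" column of the table, could look like the sketch below. `explain_invalid_host` and `LABEL_REGEX` are hypothetical names introduced here for illustration; the pattern and the length limits are copied from the function above.

```python
import re

# Same label pattern as HOSTNAME_REGEX in the function above.
LABEL_REGEX = re.compile(r"(?!-)[\w-]{1,63}(?<!-)$")


def explain_invalid_host(host):
    """Return a list of human-readable problems; empty if none were found."""
    host = host.rstrip(".")
    problems = []
    # Same overall length limits as the original function.
    if not 1 < len(host) < 253:
        problems.append("bad overall length: {}".format(len(host)))
    for label in host.split("."):
        if LABEL_REGEX.match(label):
            continue  # this label passes the regex
        if label.startswith("-") or label.endswith("-"):
            problems.append("label starts/ends with a hyphen: {!r}".format(label))
        elif not 1 <= len(label) <= 63:
            problems.append("bad label length: {!r}".format(label))
        else:
            problems.append("invalid character in label: {!r}".format(label))
    return problems


if __name__ == "__main__":
    for name in ("example.com", "paris-.blogspot.ca", "247?realmedia.com"):
        print(name, explain_invalid_host(name))
```

A host name is considered valid by this sketch exactly when the returned list is empty.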

Side note 1: The explanation above is based on my own understanding. Since I'm not a professional developer of any kind, I could be horribly wrong. LOL

Side note 2: The regular expression is outside the function for performance reasons.

  • Outside the function: ~80000 (eighty thousand) hosts per second.
  • Inside the function: ~70000 (seventy thousand) hosts per second.
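
A rough way to reproduce this kind of comparison is the standard `timeit` module. The sketch below is not the benchmark used for the numbers above; the function names and the call count are assumptions made for illustration, and absolute timings will differ from machine to machine.

```python
import re
import timeit

PATTERN = r"(?!-)[\w-]{1,63}(?<!-)$"
COMPILED = re.compile(PATTERN)  # compiled once, at import time


def match_precompiled(label):
    """Match using the module-level compiled pattern."""
    return COMPILED.match(label) is not None


def match_compile_each_call(label):
    """Match while calling re.compile() on every invocation."""
    # re.compile() consults an internal cache, so this is a lookup rather
    # than a full recompile, but the lookup still costs time on every call.
    return re.compile(PATTERN).match(label) is not None


if __name__ == "__main__":
    for func in (match_precompiled, match_compile_each_call):
        seconds = timeit.timeit(lambda: func("example"), number=200000)
        print("{}: {:.3f} s".format(func.__name__, seconds))
```

Both functions return the same results; only the per-call overhead differs.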

@mitchellkrogza
Member

Thanks @Odyseus for reporting this. I will make some tweaks to the cleaning functions to deal with these errors. Thanks for your very detailed information, it helps a lot. @funilrys yes indeed, this needs a good look; a lot of the input sources seem to make typos on a frequent basis.

@xxcriticxx

this would be a great addition to any tool i think

@mitchellkrogza
Member

@xxcriticxx and @Odyseus I will get this in the works early next week

@xxcriticxx

@mitchellkrogza you slacking lately :(

@mitchellkrogza
Member

@xxcriticxx been a rough start to the year and had a loss in the family so I've been out of town for a while but back in action. Have no fear all issues will be addressed.

@smed79
Contributor

smed79 commented Jan 27, 2018

@mitchellkrogza We are sorry to hear that :(
Deepest condolences.

@xxcriticxx

@mitchellkrogza sorry my friend

@funilrys
Member

funilrys commented Feb 9, 2018

Hello @Odyseus @mitchellkrogza ,

Thanks to this issue I discovered an issue in PyFunceble. Please note that I reinforced the way we check for an inactive domain with the previously referenced patch.

That way, @mitchellkrogza, we can work efficiently when we do further structural development.

@Odyseus I did not use your snippets, but as you indirectly contributed to that patch, I would like, if you accept, to add you to the list of PyFunceble's contributors.

Thanks again.

Cheers,
Nissar


Before the patch

[Screenshot from 2018-02-09 12-38-07]

After the patch

[Screenshot from 2018-02-09 12-30-10]

@Odyseus
Author

Odyseus commented Feb 9, 2018

Hello, everybody.

@funilrys: Thanks for the thought, but don't feel obligated to do so. I'm just happy to contribute however I can to any open source initiative. 👍

I looked at the code in your patch and, in case it is useful for you to know, I also use a function to validate IP addresses. It uses the ipaddress module from Python's standard library. It checks for both IP types (IPv4 and IPv6), but it would be easy to check for IPv4 only.

from ipaddress import ip_address

def is_valid_ip(address):
    """Validate IP address (IPv4 or IPv6).

    Parameters
    ----------
    address : str
        The IP address to validate.

    Returns
    -------
    bool
        If it is a valid IP address or not.
    """
    try:
        ip_address(address)
    except ValueError:
        return False

    return True

I'm pretty sure that I got this function from StackOverflow, but the exact source got lost in my browsing history.
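
For the IPv4-only case mentioned above, one possible sketch swaps `ip_address` for the `IPv4Address` class from the same standard library module. The name `is_valid_ipv4` is hypothetical and only used here for illustration.

```python
from ipaddress import IPv4Address


def is_valid_ipv4(address):
    """Return True only if `address` is a valid IPv4 address string."""
    try:
        # IPv4Address raises AddressValueError (a ValueError subclass)
        # for anything that is not a well-formed IPv4 address.
        IPv4Address(address)
    except ValueError:
        return False

    return True


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "::1", "telemetry.appex.bing.net"):
        print(candidate, is_valid_ipv4(candidate))
```

With this variant, IPv6 addresses such as `::1` are rejected along with host names and malformed input.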
