Improved cleaning of domains.list and domains-dotted-format.list #93
Comments
A few extra commits and tweaks and the lists are looking almost perfect. @maravento, have a look later and let me know what you think. |
give me the link |
domains-dotted-format.list:
sed -r 's:(^.?(www|ftp)[^.]?.|^..?)::gi' domains-dotted-format.list | awk '{print "."$1}' | sort -u > newlist
newlist:
PD: the "sed" command removes everything from www or ftp up to the first "dot". It does not matter whether it is uppercase or lowercase (::gi means global and ignore case). |
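The command above appears to have lost some regex characters (likely backslashes and asterisks) when the page was rendered, so it is hard to run as shown. As a minimal sketch of the behaviour described in the comment, assuming one domain per line and a POSIX shell with `sed -E`, something like the following could be used. The function name `to_dotted` is hypothetical, and the exact pattern is an interpretation rather than the command that was actually posted.

```shell
# Hypothetical helper: normalise one-domain-per-line input to dotted format.
to_dotted() {
  # 1. Drop an optional leading dot plus a www/ftp label (including variants
  #    such as "www99") up to and including the first dot, in any letter case.
  # 2. Drop any leading dots that remain.
  # 3. Prefix each non-empty name with a single dot, then sort and de-duplicate.
  sed -E 's/^\.?([Ww][Ww][Ww]|[Ff][Tt][Pp])[^.]*\.//; s/^\.+//' \
    | awk 'NF {print "." $1}' \
    | LC_ALL=C sort -u
}

# Example: printf 'www.example.com\n.plain.io\n' | to_dotted
```

One caveat with this kind of pattern: it also matches names that merely start with `www` or `ftp`, so `ftpserver.com` would be reduced to `com`. That may be related to the over-stripping reported later in this thread, but this is only a guess.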
Thanks @maravento, I will double check in the morning and we will get this perfect. I had a problem with one of your seds, which was taking the domains at the top of the file which are like 0—�—�—�-0—�—�—�-.com and stripping the 0 off. It could have been something I did wrong, but I'll test again tomorrow and get it perfect. Thanks again. |
@maravento please download those same two files again now and see if you see a difference. Discovered a problem with the way I update the raw files before the commit has actually occurred. Let me know. |
To admit only domains that contain letters, numbers and the dash - (not the underscore _, so sed '/_/d' is not necessary):
domains-dotted-format.list
sed -r '/[^a-zA-Z0-9.-]/d' domains-dotted-format.list
test.com
Maybe the complete command is:
sed -r '/[^a-zA-Z0-9.-]/d' domains-dotted-format.list | sed -r 's:(^.?(www|ftp)[^.]?.|^..?)::gi' | awk '{print "."$1}' | sed -r '/^.\W+/d' | sort -u > newlist |
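The character filter in this comment can be tried on its own. Here is a small sketch, assuming one entry per line. The function name `only_valid_chars` is hypothetical, and `LC_ALL=C` is an added assumption to keep the `a-z` ranges from being affected by the locale.

```shell
# Hypothetical helper: keep only lines made of letters, digits, dots and hyphens.
only_valid_chars() {
  # Delete every line that contains at least one character outside the
  # allowed set. Underscores and slashes fall outside it, so no separate
  # sed '/_/d' pass is needed.
  LC_ALL=C sed -E '/[^a-zA-Z0-9.-]/d'
}

# Example: printf 'test.com\nbad_name.com\n' | only_valid_chars
```

Note that this removes the whole entry rather than repairing it, so a line such as `foo.com/path` is dropped instead of being trimmed to `foo.com`.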
Sorry @maravento I typed that message on my iPhone and it really messed things up. Can you believe the iPhone keyboard doesn't have a backtick, either that or I am blind. Let me try again. Domains like
Were being stripped like this
which has been resolved in my latest lists but I still need to address removing the www. www99. ftp. etc which I will test with your latest recommendation. |
sed '/0--/d' |
Thanks @maravento I will try that, let me expand a bit as this is getting complex for me. Input list
Output should be
|
@maravento I tried this again and this seems very close to what we need But still a domain like
should be left as
But it's being stripped completely from the list. similarly a domain like |
sed '/0--/d' input | sed -r 's:(^.?(www|ftp|xxx|wvw)[^.]?.|^..?)::gi' | awk '{print "."$1}' | sed -r '/^.\W+/d' | sort -u > output
.0-3.us
PD: you can replace sed '/0--/d' with sed '/0-/d' to delete .0-3.us as well, but that is a risk. |
Thanks @maravento tried that but still |
input: |
@maravento I think you have nailed it. I did this in this order:
|
Soooo close, but we are still ending up with these in the list (just a few examples, there are many, many more):
|
oldlist:
sed '/0--/d' oldlist | sed -r '/[^a-zA-Z0-9.-]/d' | sed -r 's:(^.?(www|ftp|xxx|wvw)[^.]?.|^..?)::gi' | awk '{print "."$1}' | sed -r '/^.\W+/d' | sort -u > newlist
newlist:
second run:
PD: You can also delete any domain that contains uppercase letters, since they are also invalid. |
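Putting the steps from this comment together, a combined pass might look like the sketch below. It assumes one domain per line and a POSIX shell with `sed -E`. The name `clean_list` is hypothetical, and the regular expressions are an interpretation of the rendered commands, not a verified copy of them. In particular, the step that drops names starting with a non-alphanumeric character is only an approximation of the `sed -r '/^.\W+/d'` stage.

```shell
# Hypothetical combined cleaner: filter, strip prefixes, emit dotted format.
clean_list() {
  LC_ALL=C sed -E \
      -e '/0--/d' \
      -e '/[^a-z0-9.-]/d' \
      -e 's/^\.?(www|ftp|xxx|wvw)[^.]*\.//' \
      -e 's/^\.+//' \
      -e '/^[^a-z0-9]/d' \
      -e '/^$/d' \
    | awk '{print "." $1}' \
    | LC_ALL=C sort -u
}
# Step by step: drop entries containing "0--"; drop entries with any character
# outside a-z, 0-9, dot and hyphen (this also removes uppercase entries, as
# suggested in the PD); strip a leading www/ftp/xxx/wvw label; strip leftover
# leading dots; drop names that do not start with a letter or digit; drop
# empty lines; finally prefix a dot, sort and de-duplicate.

# Example: clean_list < oldlist > newlist
```

Because uppercase entries are deleted before the prefix is stripped, the prefix pattern does not need to be case-insensitive in this variant.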
Thanks @maravento I'll try that in the morning .... my head is all sed'd and awk'd out for one day 😁 🤣 |
Hi @mitchellkrogza . In previous conversations, you explained to me that you validated the domains (urls) of your list. I think you should try these scripts, they can be useful. https://raw.githubusercontent.com/maravento/blackweb/master/tools/httpstatus.sh |
Closing for now feel free to reopen once you tested the new system @mitchellkrogza 👍 |
Thanks so much to @maravento for all his continuous help with this project.
Over a period of weeks he has helped me make numerous improvements to cleaning the lists of invalid characters, and today we have finally reached that point. Most of this was discussed in #72 but I decided to start this as a new topic instead.
This morning, after a few hours of tweaking all the sed and awk commands that @maravento has contributed to the project, I have finally perfected the creation of 2 very clean lists for use in Pi-hole and Dnsmasq systems.
The two files domains.list and domains-dotted-format.list are now void of the following:
Any domain that has a string attached, i.e. domain.com/something/blog/index.php is now just domain.com or .domain.com depending on which list we refer to.
Any beginning of a domain name that contains any underscore _ characters. The underscore character is illegal and invalid in hostnames in any DNS system anywhere in the world. So domains like 10000_gratis_sexfoto.jouwpagina.nl will now just be jouwpagina.nl or .jouwpagina.nl; similarly, a domain like ___id___.c.mystat-in.net now becomes c.mystat-in.net or .c.mystat-in.net.
Any domains beginning with www., ftp., ww., zzz. are now also stripped out. HOWEVER, for the hosts file they will not be stripped out, because a hosts file cannot do wildcarding.
No doubt we will be tweaking and perfecting this even more over time, but we are getting ever closer to producing the cleanest hosts file anywhere out there. We are also able to show how many input sources out there are VERY BADLY maintained: they simply add anything to their hosts files without ever checking them, nor do they ever clean their lists. Anyone who knows anything should at least know that underscore characters are illegal and invalid in DNS hostnames, but clearly the people producing hosts blocker files have no idea about simple networking principles and fill their lists with utter junk.
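The first two clean-ups described above (dropping an attached path and dropping leading labels that contain underscores) could be sketched as follows. This is an illustration under assumptions, not the project's actual script: the function name `strip_junk` is hypothetical, input is assumed to be one entry per line with no `http://` scheme, and the loop uses sed's standard label and `t` branch commands.

```shell
# Hypothetical helper: trim paths and leading underscore labels.
strip_junk() {
  # 1. Remove everything from the first slash onwards (the attached path).
  # 2. Repeatedly remove the first label while it contains an underscore;
  #    "ta" jumps back to label "a" after each successful substitution.
  sed -E \
      -e 's|/.*$||' \
      -e ':a' \
      -e 's/^[^.]*_[^.]*\.//' \
      -e 'ta'
}

# Example: printf '___id___.c.mystat-in.net\n' | strip_junk
```

Only leading labels are removed here. An underscore in a later label (for example `a.b_c.com`) is left in place and would need a separate filter to drop the entry.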
Forwards and Upwards we move !!!