
Improved cleaning of domains.list and domains-dotted-format.list #93

Closed
mitchellkrogza opened this issue Oct 30, 2017 · 20 comments

@mitchellkrogza (Member)

Thanks so much to @maravento for all his continuous help with this project.

Over a period of weeks he has helped me make numerous improvements to cleaning invalid characters out of the lists, and today we have finally reached that point. Most of this was discussed in #72, but I decided to start this as a new topic instead.

This morning, after a few hours of tweaking all the sed and awk commands that @maravento has contributed to the project, I have finally perfected the creation of two very clean lists for use in Pi-hole and Dnsmasq systems.

The two files domains.list and domains-dotted-format.list are now void of the following (a rough sketch of the matching commands appears after the list):

  • Any domain that has a path attached, i.e. domain.com/something/blog/index.php, is now just domain.com or .domain.com, depending on which list we refer to.

  • Any leading portion of a domain name that contains underscore (_) characters. The underscore character is invalid in hostnames in any DNS system anywhere in the world. So a domain like 10000_gratis_sexfoto.jouwpagina.nl will now just be jouwpagina.nl or .jouwpagina.nl; similarly, a domain like ___id___.c.mystat-in.net now becomes c.mystat-in.net or .c.mystat-in.net.

  • Any domains beginning with www., ftp., ww., or zzz. now have those prefixes stripped out as well. HOWEVER, they are not stripped from the hosts file, since a hosts file cannot do wildcarding.
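
Roughly, those rules amount to something like the following (a simplified sketch, not the exact commands used in the repo; assumes GNU sed and a hypothetical input file domains-in.txt):

sed -E 's:/.*$::' domains-in.txt |      # drop anything attached after the domain (paths, pages)
  sed -E 's/^([^.]*_[^.]*\.)+//' |      # drop leading labels that contain an underscore
  sed -E 's/^(www|ftp|ww|zzz)\.//I' |   # drop a leading www. / ftp. / ww. / zzz. prefix (case-insensitive)
  sort -u > domains.list                # sort and de-duplicate
# for domains-dotted-format.list, additionally pipe through: awk '{print "."$0}'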

No doubt we will keep tweaking and perfecting this over time, but we are getting ever closer to producing the cleanest hosts file out there. It also shows how many input sources are VERY BADLY maintained: they simply add anything to their hosts files without ever checking or cleaning them. Anyone producing a hosts blocker file should at least know that underscore characters are invalid in DNS hostnames, yet many of these lists are filled with utter junk.

Forwards and upwards we move!!!

@mitchellkrogza (Member Author)

A few extra commits and tweaks and the lists are looking almost perfect. @maravento, have a look later and let me know what you think.

@maravento

Give me the link.

@mitchellkrogza (Member Author)

@maravento

maravento commented Oct 30, 2017

domains-dotted-format.list:
.www.klimexs.com
.www8.paypopup.com
.www89.chunguang100.com
.www8-ssl.effectivemeasure.net
.www99.zapto.org
.wwwa024.infonegocio.com
.www99*+-_.mytest.com # This is a domain invented by me, only for testing purposes
.WWW01*+-_22.mytest2.com # This is a domain invented by me, only for testing purposes

sed -r 's:(^.?(www|ftp)[^.]?.|^..?)::gi' domains-dotted-format.list | awk '{print "."$1}' | sort -u > newlist

newlist:
.chunguang100.com
.effectivemeasure.net
.infonegocio.com
.klimexs.com
.mytest2.com
.mytest.com
.paypopup.com
.zapto.org

PS: the sed command removes everything from www or ftp up to the first dot. It does not matter if it is uppercase or lowercase (the gi flags: global and ignore-case).

@mitchellkrogza (Member Author)

Thanks @maravento, I will double-check in the morning and we will get this perfect. Had a problem with one of your sed commands, which was taking the domains at the top of the file that look like 0—�—�—�-0—�—�—�-.com and stripping the 0 off. Could have been something I did wrong, but I'll test again tomorrow and get it perfect. Thanks again.

@mitchellkrogza (Member Author)

@maravento, please download those same two files again now and see if you notice a difference. I discovered a problem with the way I was updating the raw files before the commit had actually occurred. Let me know.

@maravento

maravento commented Oct 30, 2017

To admit only domains that contain letters, numbers and dashes (-), and not underscores (_), so sed '/_/d' is not necessary:

domains-dotted-format.list
.www99*+-_.mytest.com
.WWW01*+-_22.mytest2.com
.0—�—�—�-0—�—�—�-.com
.test.com

sed -r '/[^a-zA-Z0-9.-]/d' domains-dotted-format.list

test.com

Maybe the complete command is:

sed -r '/[^a-zA-Z0-9.-]/d' domains-dotted-format.list | sed -r 's:(^.?(www|ftp)[^.]?.|^..?)::gi' | awk '{print "."$1}' | sed -r '/^.\W+/d' | sort -u > newlist
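
Stage by stage, roughly (the annotations are approximate; edge cases may behave differently):

sed -r '/[^a-zA-Z0-9.-]/d' domains-dotted-format.list |   # drop lines containing anything other than letters, digits, dot, dash
  sed -r 's:(^.?(www|ftp)[^.]?.|^..?)::gi' |              # strip a leading www/ftp style prefix; lines without one lose their first one or two characters via ^..?
  awk '{print "."$1}' |                                   # prepend the wildcard dot to the first field
  sed -r '/^.\W+/d' |                                     # drop lines whose second character is not a letter, digit or underscore
  sort -u > newlist                                       # sort and de-duplicate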

@mitchellkrogza (Member Author)

Sorry @maravento, I typed that message on my iPhone and it really messed things up. Can you believe the iPhone keyboard doesn't have a backtick? Either that or I am blind.

Let me try again.

Domains like

0----0.0----0.1596.hk
0--0.0--0.hnun.net
0------------0-------------0.0n-line.info

Were being stripped like this

----0.0----0.1596.hk
--0.0--0.hnun.net
------------0-------------0.0n-line.info

which has been resolved in my latest lists, but I still need to address removing the www., www99., ftp., etc. prefixes, which I will test with your latest recommendation.

@maravento

sed '/0--/d'

@mitchellkrogza (Member Author)

Thanks @maravento, I will try that. Let me expand a bit, as this is getting complex for me.

Input list

0------------0-------------0.0n-line.info
0-0--0-000.com
0-3.us
aw.dermalmask.com
idolstudio.free.fr
idolstudio2.free.fr
something.blogspot.com
anything.blogspot.com
xxx.blogspot.ca
www.hola.org
www10.a8.net
www11.alsto.com
www148.myquicksearch.com
ftp.thaitattoo.nl
ftp01.pornocrawler.ws
ftp04.pornocrawler.ws
g.blogads.com
wvw.tielecreidito-pe.com

Output should be

0n-line.info
0-0--0-000.com
0-3.us
dermalmask.com
idolstudio.free.fr
idolstudio2.free.fr
something.blogspot.com
anything.blogspot.com
xxx.blogspot.ca
hola.org
a8.net
alsto.com
myquicksearch.com
thaitattoo.nl
pornocrawler.ws
pornocrawler.ws
blogads.com
tielecreidito-pe.com

@mitchellkrogza (Member Author)

mitchellkrogza commented Oct 31, 2017

@maravento I tried this again and this seems very close to what we need:

sed -r '/[^a-zA-Z0-9.-]/d' input.txt | sed -r 's:(^.?(www|ftp)[^.]?.|^..?)::gi' | awk '{print "."$1}' | sed -r '/^.\W+/d' | sort -u > output.txt

But still a domain like

0------------0-------------0.0n-line.info

should be left as

.0n-line.info

But it's being stripped completely from the list.

Similarly, a domain like idontgiveafuck.com is being completely stripped out, gone.

@maravento

maravento commented Oct 31, 2017

sed '/0--/d' input | sed -r 's:(^.?(www|ftp|xxx|wvw)[^.]?.|^..?)::gi' | awk '{print "."$1}' | sed -r '/^.\W+/d' | sort -u > output

.0-3.us
.a8.net
.alsto.com
.anything.blogspot.com
.aw.dermalmask.com
.blogspot.ca
.g.blogads.com
.hola.org
.idolstudio2.free.fr
.idolstudio.free.fr
.myquicksearch.com
.pornocrawler.ws
.something.blogspot.com
.thaitattoo.nl
.tielecreidito-pe.com

PS: you can change sed '/0--/d' to sed '/0-/d' to also delete .0-3.us, but that is risky.

@mitchellkrogza (Member Author)

mitchellkrogza commented Oct 31, 2017

Thanks @maravento, I tried that, but 0------------0-------------0.0n-line.info and domains like idontgiveafuck.com are still being completely stripped out. 🙄 This is getting ever more complex. I swear sed & awk attack me in my nightmares at night 😁

@maravento

input:
0------------0-------------0.0n-line.info
test.com

sed '/0--/d' input > output

output:
test.com

@mitchellkrogza (Member Author)

@maravento I think you have nailed it. I did this in this order:

sed '/0--/d' domains-in.txt  > domains-out.txt
sed -r 's:(^.?(www|ftp|xxx|wvw)[^.]?.|^..?)::gi' domains-out.txt | awk '{print "."$1}' | sed -r '/^.\W+/d' | sort -u > domains-dotted.txt
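
(The two steps could also be chained in a single pass, skipping the intermediate domains-out.txt; these are the same commands, just piped.)

sed '/0--/d' domains-in.txt | sed -r 's:(^.?(www|ftp|xxx|wvw)[^.]?.|^..?)::gi' | awk '{print "."$1}' | sed -r '/^.\W+/d' | sort -u > domains-dotted.txt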

@mitchellkrogza (Member Author)

Soooo close. We are still ending up with these in the list (just a few examples; there are many, many more):

.www13.glam.com
.www146.lewwwz.com
.www1.scat-sex.org
.www.zootravel.com
.www.ftp.thaitattoo.nl
.ftp.track4.com

@maravento

maravento commented Oct 31, 2017

oldlist:
.www13.glam.com
.www146.lewwwz.com
.www1.scat-sex.org
.www.zootravel.com
.www.ftp.thaitattoo.nl
.ftp.track4.com
0------------0-------------0.0n-line.info # This is an invalid domain, and should be removed

sed '/0--/d' oldlist | sed -r '/[^a-zA-Z0-9.-]/d' | sed -r 's:(^.?(www|ftp|xxx|wvw)[^.]?.|^..?)::gi' | awk '{print "."$1}' | sed -r '/^.\W+/d' | sort -u > newlist

newlist:
.ftp.thaitattoo.nl # the first pass only stripped the leading .www., so you can delete this domain manually or run the command again
.glam.com
.lewwwz.com
.scat-sex.org
.track4.com
.zootravel.com

second run:
sed '/0--/d' newlist | sed -r '/[^a-zA-Z0-9.-]/d' | sed -r 's:(^.?(www|ftp|xxx|wvw)[^.]?.|^..?)::gi' | awk '{print "."$1}' | sed -r '/^.\W+/d' | sort -u > newlist2
newlist2:
.glam.com
.lewwwz.com
.scat-sex.org
.thaitattoo.nl
.track4.com
.zootravel.com

PS: You can also delete any domain that contains uppercase letters, since entries in these lists are expected to be lowercase:
sed '/[A-Z]/d'
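
Alternatively, since DNS names are case-insensitive, those entries could be lowercased instead of deleted (just an option, not part of the pipeline above; newlist3 is an arbitrary output name):

awk '{print tolower($0)}' newlist2 > newlist3   # lowercase every line instead of dropping it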

@mitchellkrogza (Member Author)

Thanks @maravento I'll try that in the morning .... my head is all sed'd and awk'd out for one day 😁 🤣

@maravento

maravento commented Nov 28, 2017

Hi @mitchellkrogza. In previous conversations you explained to me that you validate the domains (URLs) in your list. I think you should try these scripts; they can be useful:

https://raw.githubusercontent.com/maravento/blackweb/master/tools/httpstatus.sh
or
https://gist.github.com/felipepodesta/28434fed6e92ac3d011090deb87cfd17
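
For reference, a minimal sketch of one way to check which listed domains still respond over HTTP (my own loop, not taken from either script; assumes curl is installed and domains.list holds one bare domain per line):

# report the HTTP status code each domain returns (000 means no response within the timeout)
while read -r domain; do
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "http://$domain")
    echo "$domain $status"
done < domains.list > httpstatus-report.txt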

@funilrys (Member)

funilrys commented Mar 13, 2018

Closing for now; feel free to reopen once you have tested the new system, @mitchellkrogza 👍
