Ever Growing ips.list file #72

Closed
mitchellkrogza opened this issue Oct 9, 2017 · 20 comments

@mitchellkrogza
Member

mitchellkrogza commented Oct 9, 2017

I discovered an issue with fetching fresh IP blocklist data from blocklist.de: the script was appending to the file each time the files were generated, causing the blocklist.de input source to grow at an alarming rate. This script error has been fixed in 3427158, and you will notice a dramatic decline in the size of the ips.list file.
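
The failure mode described here (appending on every run instead of replacing) can be sketched roughly as follows. The function names and the stand-in for the download are hypothetical and are not taken from the project's scripts; the actual fix is the one in commit 3427158.

```shell
# Hypothetical stand-in for downloading the current blocklist.de data.
fetch_source() { printf '%s\n' 192.0.2.1 192.0.2.2; }

# Buggy pattern: '>>' appends, so the file grows on every run and
# never forgets entries the upstream source has since removed.
update_appending() { fetch_source >> "$1"; }

# Safer pattern: write to a temporary file, then replace the target
# only if the fetch succeeded.
update_replacing() {
  _tmp="$(mktemp)" || return 1
  fetch_source > "$_tmp" && mv "$_tmp" "$1"
}
```

Running `update_appending` twice doubles the file, while `update_replacing` always leaves exactly one copy of the current data.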

@maravento

@mitchellkrogza
Member Author

@maravento I kept the original blocklist.de list which was being appended to, in case you need it for removal purposes or for comparison against the current list.

Current list > https://github.com/mitchellkrogza/Ultimate.Hosts.Blacklist/blob/master/.input_sources/_www.blocklist.de/ips.txt

List that was being appended to > https://github.com/mitchellkrogza/Ultimate.Hosts.Blacklist/blob/master/.input_sources/_www.blocklist.de/all.txt

mitchellkrogza reopened this Oct 9, 2017

@maravento

@mitchellkrogza Hi. I see that the list now contains only plain IP addresses (no CIDR ranges). Is this correct? (If so, better, since debugging is faster.)

@mitchellkrogza
Member Author

Hi @maravento, yes indeed. Because I was (stupidly) appending to the file when collecting data from blocklist.de, it was also accumulating any debugging / removals they were doing on their own side. The ips.list now looks a lot cleaner and, as you can see, has shrunk by roughly 300,000 entries.

I also used your pipe and sort for the ips.list file

sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n -k 5,5n -k 6,6n -k 7,7n -k 8,8n -k 9,9n $_input2 | sed 's/^[ \s]//;s/[ \s]$//' | sed "/:/d" | uniq > $_input4 && mv $_input4 $_input2
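
For readers who want to try the same idea in isolation, here is a minimal sketch of an IPv4 clean-and-sort step. The function name `clean_ips` is hypothetical, and the sketch trims whitespace before sorting, which differs from the order used in the pipe above.

```shell
clean_ips() {
  # Trim surrounding whitespace, drop lines containing ':' (IPv6) and
  # blank lines, then sort numerically on each of the four octets.
  # Identical lines end up adjacent, so uniq removes the duplicates.
  sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/:/d' -e '/^$/d' "$1" |
    LC_ALL=C sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n |
    uniq
}
```

Each `-k N,Nn` key restricts the numeric comparison to a single dot-separated field, so 9.0.0.1 sorts before 10.0.0.2.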

@mitchellkrogza
Member Author

@maravento have you got a good pipe to clean the domains.list file of any unwanted characters? Currently I am only running a simple pipe, sed '/\./!d', to remove any lines not containing a dot, but I tested the latest domains.list just now and found about 64 spaces, which will lead to dupes.
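
One possible shape for such a cleanup, as a sketch only: trim each line, then keep only lines that contain a dot and no remaining whitespace. The function name `clean_domains` is hypothetical.

```shell
clean_domains() {
  # Remove carriage returns, trim leading and trailing whitespace,
  # keep lines with at least one dot and no embedded whitespace,
  # then sort and de-duplicate.
  tr -d '\r' < "$1" |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |
    grep -E '^[^[:space:]]*\.[^[:space:]]*$' |
    LC_ALL=C sort -u
}
```

Trimming before `sort -u` is what removes the duplicates caused by stray spaces: "example.com " and "example.com" collapse into one entry.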

@maravento

maravento commented Oct 9, 2017

I think it is better to "extract domains" since there will always be unwanted characters, because if you remove one, another appears:
regexdomains='([a-zA-Z0-9][a-zA-Z0-9-]{1,61}.){1,}(.?[a-zA-Z]{2,}){1,}' (this "regex" needs evaluation. You may not get the desired results)
egrep -oi "$regexdomains" domains.list > new

Also (if you want) you can put "dot" at the beginning:
egrep -oi "$regexdomains" domains.list | awk '{print "."$1}' > new

And also remove www, ftp ... (which are not part of the domain):
egrep -oi "$regexdomains" domains.list | awk '{print "."$1}' | sed 's:(www[[:alnum:]].|WWW[[:alnum:]].|ftp.|...|/.*)::g' | sort -u > new

everything depends on what you want to "extract" from the list (what you want to "accept" in your list).
classic format: letters:https://letters-and-numbers.letters-and-numbers.letters
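
The "extract rather than remove" approach can be sketched like this. The pattern below is an assumption written for illustration, not the regex quoted above: it accepts one or more labels of letters, digits and inner hyphens, each followed by a dot, and then an alphabetic final label of at least two characters. The function name `extract_domains` is hypothetical.

```shell
extract_domains() {
  # -o prints only the matching part of each line, -i ignores case,
  # -E enables extended regular expressions (the modern spelling of egrep).
  LC_ALL=C grep -oiE '([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}' "$1" |
    LC_ALL=C sort -u
}
```

Because this extracts any substring that looks like a domain, a host containing an invalid character can still yield a valid-looking fragment, so the output is worth reviewing before use.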

@mitchellkrogza
Member Author

@maravento I created a new, separate domain list file for testing using your pipes and regex as above. Have a look at it and let me know your thoughts > https://hosts.ubuntu101.co.za/domains-dotted-format.list

I think it might need some tweaking. For some domains like zzz.domain.com, we should probably strip out the zzz.

Let me know your feedback.

@maravento

maravento commented Oct 25, 2017

hi @mitchellkrogza. what is the difference between:
https://hosts.ubuntu101.co.za/domains.list
vs
https://hosts.ubuntu101.co.za/domains-dotted-format.list

PS: the list contains some www. entries (.www.com-nx30.net, .www.com-om88.net, etc.) which are not part of the domain.
Suggested regex to delete www. (and ftp and WWW):
sed 's:(www[[:alnum:]].|WWW[[:alnum:]].|ftp.|...|/.*)::g' domains-dotted-format.list > new-domains-dotted-format.list

@mitchellkrogza
Member Author

mitchellkrogza commented Oct 25, 2017

@maravento the second is as per your earlier regex example, stripping out all www, WWW, ftp and starting each line with a . ... I did use this sed 's:(www[[:alnum:]].|WWW[[:alnum:]].|ftp.|...|/.*)::g' domains-dotted-format.list > new-domains-dotted-format.list but let me check again.

@mitchellkrogza
Member Author

@maravento I changed the way the file is processed; I was using the wrong pipe. Check the latest list now. I can spot 2 errors in it which are easy to fix: ..doubleclick.com and ..doubleclick.net

@maravento

maravento commented Oct 25, 2017

For the previous list, simply running the following command solves everything:
sed 's:(www[[:alnum:]].|WWW[[:alnum:]].|ftp.|...|/.*)::g' domains-dotted-format.list | sort -u > new-domains-dotted-format.list
The result no longer contains .. or .www entries, etc.

@mitchellkrogza
Member Author

Thanks @maravento I will update it tomorrow and notify you when done so you can double check I got it right 👍

@maravento

maravento commented Oct 25, 2017

@mitchellkrogza. Debug the TLDs: there are some bare suffixes on the list (.com.ai, .com.br, .com.cn, .com.es, etc.)
Example:
grep -Pri '^\.com\.es$' domains-dotted-format.list
.com.es

You can use IANA and Mozilla lists:
https://publicsuffix.org/list/public_suffix_list.dat
https://data.iana.org/TLD/tlds-alpha-by-domain.txt
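
A rough sketch of how a suffix list could be used to drop such bare entries from a dotted list. It assumes a suffix file in the Public Suffix List layout (one suffix per line, `//` comment lines, blank lines) and does not interpret that list's wildcard (`*.`) or exception (`!`) rules. The function name `drop_bare_suffixes` is hypothetical.

```shell
drop_bare_suffixes() {
  # $1 = dotted domain list (lines like ".example.com")
  # $2 = suffix file (lines like "com.es")
  awk '
    NR == FNR {                      # first file read: the suffix list
      if ($0 != "" && $0 !~ /^\/\//) suffix["." $0] = 1
      next
    }
    !($0 in suffix)                  # domain list: keep non-suffix lines
  ' "$2" "$1"
}
```

Only exact matches are removed, so .com.es is dropped while .example.com.es is kept.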

@mitchellkrogza
Member Author

Hi @maravento I am currently using this which I checked and looks exactly the same as the sed you suggested.

cat $TRAVIS_BUILD_DIR/domains.list | sed 's:(www[[:alnum:]].|WWW[[:alnum:]].|ftp.|...|/.*)::g' | awk '{print "."$1}' | sort -u > $TRAVIS_BUILD_DIR/domains-dotted-format.tmp && mv $TRAVIS_BUILD_DIR/domains-dotted-format.tmp $TRAVIS_BUILD_DIR/domains-dotted-format.list

How do I deal with it adding an extra dot onto ..doubleclick.com and ..doubleclick.net ?

@maravento

maravento commented Oct 26, 2017

give me an hour to review it and give you a definitive solution

@mitchellkrogza
Member Author

Thanks @maravento I made that change and did a re-gen, can you check the latest output file it created as I am away from my desk now. Let me know and thanks for your help.

@maravento

maravento commented Oct 26, 2017

list:
www.example1.com
wwwABC88.example3.com
.wwwWx09.example2.com
.example4.com
..example5.com
example5.com

sed -r 's:(^.?(www|ftp)[[:alnum:]]?.|^..?)::gi' list | awk '{print "."$1}' | sort -u > newlist

newlist:
.example1.com
.example2.com
.example3.com
.example4.com
.example5.com
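
The same transformation can be written as a small function that avoids GNU-only sed flags, as a sketch. It is not the command quoted above: it spells out the upper- and lower-case letters instead of using a case-insensitive flag, and it removes leading dots in a separate step. The function name `normalize_dotted` is hypothetical.

```shell
normalize_dotted() {
  # 1. Drop any leading dots.
  # 2. Drop one leading label that starts with www or ftp (any case),
  #    including trailing letters or digits such as "wwwABC88.".
  # 3. Prefix a single dot, then sort and de-duplicate.
  sed -E -e 's/^\.+//' \
         -e 's/^([Ww][Ww][Ww]|[Ff][Tt][Pp])[[:alnum:]]*\.//' "$1" |
    awk 'NF { print "." $1 }' |
    LC_ALL=C sort -u
}
```

Applied to the six-line sample list above, this yields the five-line newlist shown. One caveat: a registered name that itself begins with www or ftp (for example wwwsomething.com) would be reduced to its suffix, so a production version would need a guard for that case.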

@mitchellkrogza
Member Author

Thanks @maravento, I’ll give it another shot on Saturday as I am out all day tomorrow. Thanks for all the advice 👍

@mitchellkrogza
Member Author

@maravento I tried that but am now getting a lot of .. output. I also found another issue: some domains being pulled in contain _ underscore characters, like ___id___.c.mystat-in.net and gay_sexo__anal.midaxcom.com and galerie_femme_mure_poilue.paris-transsexuelle.info, which all need to be stripped out as underscores are not valid in hostnames.

@maravento

maravento commented Oct 28, 2017

list:
.--example1.com
.-example2.com
.*example3.com
.example4.com
.___id___.c.mystat-in.net
.gay_sexo__anal.midaxcom.com

sed '/_/d' list | sed -r '/^.\W+/d' > newlist

newlist:
.example4.com
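
An alternative sketch that uses an allow-list instead of two deletions: keep only lines made of a leading dot, a letter or digit, and then nothing but letters, digits, dots and hyphens. This avoids `\W`, which is a GNU extension. The function name `keep_valid_dotted` is hypothetical.

```shell
keep_valid_dotted() {
  # A line survives only if every character after the leading dot is a
  # letter, digit, dot or hyphen, and the first one is a letter or digit.
  # Underscores, asterisks and leading hyphens all fail this test.
  LC_ALL=C grep -E '^\.[[:alnum:]][[:alnum:].-]*$' "$1"
}
```

An allow-list also rejects unwanted characters nobody has listed yet, which fits the earlier remark in this thread that removing one bad character tends to reveal another.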

@mitchellkrogza
Member Author

mitchellkrogza commented Oct 29, 2017

@maravento Houston ... the Eagle has landed ..... I have it sorted now using

sed -r 's:(^.?(www|ftp)[[:alnum:]]?.|^..?)::gi' $TRAVIS_BUILD_DIR/domains.list | awk '{print "."$1}' | sed '/_/d' | sed 's/\.\././g' | sort -u > $TRAVIS_BUILD_DIR/domains-dotted-format.list

👍

Now to clean the actual domains input list which generates the hosts file, so it also removes underscores. They are not valid in hostnames, and this will significantly reduce the list size.

It shows how few people producing hosts lists pay any attention to what they are blacklisting; they are not even testing or cleaning their lists, as @funilrys will attest 😁

Thanks again for all your constant help and input to the project. We will only go from strength to strength.

The dotted format list will become very useful for people using dnsmasq which allows wildcarding with a dotted format list.
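
As an illustration of one way the dotted list could feed dnsmasq, the sketch below converts each entry into an `address=/domain/ip` line, which dnsmasq applies to the domain and all of its subdomains. The thread does not say how the list is actually consumed, so the directive, the 0.0.0.0 sinkhole address and the function name `to_dnsmasq` are assumptions.

```shell
to_dnsmasq() {
  # ".example.com" -> "address=/example.com/0.0.0.0"
  # Strip the leading dot, skip empty lines, wrap what remains.
  sed -e 's/^\.//' -e '/^$/d' -e 's|.*|address=/&/0.0.0.0|' "$1"
}
```

The output could be written to a file that dnsmasq loads as part of its configuration.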
