Ever Growing ips.list file #72
Comments
@maravento I kept the original blocklist.de list that was being appended to, in case you need it for removals or for comparison against the current list. Current list > https://github.com/mitchellkrogza/Ultimate.Hosts.Blacklist/blob/master/.input_sources/_www.blocklist.de/ips.txt List that was being appended to > https://github.com/mitchellkrogza/Ultimate.Hosts.Blacklist/blob/master/.input_sources/_www.blocklist.de/all.txt |
@mitchellkrogza Hi. I see that the list now contains only IP addresses (no CIDR ranges). Is this correct? (If so, all the better, since debugging is faster.) |
Hi @maravento, yes indeed. Because I was (stupidly) appending to the file when collecting data from blocklist.de, the file kept every entry that blocklist.de had since debugged or removed on their own side. The ips.list now looks a lot cleaner and, as you can see, has shrunk by roughly 300,000 entries. I also used your pipe and sort for the ips.list file
|
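The append-versus-replace problem described above can be sketched roughly as follows. This is only an illustration, not the project's actual script: the function name `refresh_ips` and both file arguments are hypothetical, and the IPv4 pattern is a loose shape check (it does not validate octet ranges).

```shell
# Hypothetical helper: rebuild the IP list from a freshly fetched source file.
# $1 = freshly downloaded source (for example a local copy of the blocklist.de list)
# $2 = destination list, fully replaced on every run
refresh_ips() {
    local src="$1" dest="$2" tmp
    tmp="$(mktemp)"
    # Keep only lines shaped like a bare IPv4 address (no CIDR suffix, no comments),
    # sorted and de-duplicated.
    grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$' "$src" | LC_ALL=C sort -u > "$tmp"
    # Replace the old list instead of appending to it ('>>' is what made it grow).
    mv "$tmp" "$dest"
}
```

Running `refresh_ips` twice on the same source leaves the destination the same size, because entries that have disappeared upstream are no longer carried over from the previous run.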
@maravento have you got a good pipe to clean the domains.list file of any unwanted characters? Currently I am only running a simple pipe |
I think it is better to "extract domains", because there will always be unwanted characters; if you remove one, another appears: Also (if you want) you can put a dot at the beginning: And you can also remove www, ftp ... (which are not part of the domain): Everything depends on what you want to "extract" from the list (what you want to "accept" in your list). |
@maravento I created a new separate domain list file for testing using your pipes and regex as above. Have a look at it and let me know your thoughts > https://hosts.ubuntu101.co.za/domains-dotted-format.list I think it might need some tweaking: for some domains, like zzz.domain.com, we should probably strip out the zzz. Let me know your feedback. |
Hi @mitchellkrogza. What is the difference between: PS: the list contains some www. entries (.www.com-nx30.net, .www.com-om88.net, etc.), and www is not part of the domain |
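The commands from this comment did not survive in the text above, so the following is only a sketch of the "extract instead of clean" idea, not the pipe that was actually posted. The function name `extract_domains` and the domain pattern are assumptions; the pattern is deliberately simple (ASCII labels, alphabetic TLD of two or more letters).

```shell
# Hypothetical helper: read raw lines on stdin, print a dotted, de-duplicated domain list.
extract_domains() {
    # -o prints only the part of each line that matches, so IP columns, comments
    # and stray characters are never copied into the output.
    grep -oiE '([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}' \
        | tr 'A-Z' 'a-z' \
        | sed -E 's/^(www|ftp)[0-9]*\.//' \
        | sed 's/^/./' \
        | LC_ALL=C sort -u
}
```

For example, `printf '0.0.0.0 WWW.Example.com\n' | extract_domains` prints `.example.com`: the IP column is skipped, the name is lower-cased, the `www.` label is dropped, and a leading dot is added.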
@maravento the second is as per your earlier regex example, stripping out all www, WWW, ftp and starting each line with a |
@maravento I changed the way the file is processed; I was using the wrong pipe. Check the latest list now. I can spot 2 errors in it which are easy to fix |
The previous list simply runs the following command and solves everything: |
Thanks @maravento I will update it tomorrow and notify you when done so you can double check I got it right 👍 |
@mitchellkrogza. Debug the TLDs: there are some on the list (.com.ai, .com.br, .com.cn, .com.es, etc.). You can use the IANA and Mozilla lists: |
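One way to read the TLD suggestion is to drop any list entry that is itself a public suffix. The sketch below assumes that reading and assumes a local suffix file with one suffix per line and no leading dot (for example, prepared from the IANA or Mozilla lists); the function name and both file arguments are hypothetical.

```shell
# Hypothetical helper.
# $1 = dotted domain list (entries such as .example.com)
# $2 = suffix file (entries such as com.br), prepared separately
drop_bare_suffixes() {
    local list="$1" suffixes="$2"
    # Give each suffix a leading dot so it has the same shape as the list entries,
    # then drop list lines equal to a suffix (-x whole line, -F fixed string, -v invert).
    sed 's/^/./' "$suffixes" | grep -vxFf - "$list"
}
```

With this, `.com.br` is removed but `.evil.com.br` is kept, because only whole-line matches are dropped.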
Hi @maravento I am currently using this which I checked and looks exactly the same as the sed you suggested.
How do I deal with it adding an extra dot onto |
give me an hour to review it and give you a definitive solution |
Thanks @maravento, I made that change and did a re-gen. Can you check the latest output file it created, as I am away from my desk now? Let me know, and thanks for your help. |
list: sed -r 's:(^.?(www|ftp)[[:alnum:]]?.|^..?)::gi' list | awk '{print "."$1}' | sort -u > newlist newlist: |
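As printed above, the dots in the sed expression are unescaped, so `^..?` would strip the first one or two characters of every line. A likely explanation, offered here as an assumption, is that the command was posted outside a code block and Markdown consumed the backslashes before the dots. The sketch below restores `\.` on that assumption; the function name `strip_prefixes` is hypothetical, and the `i` flag requires GNU sed.

```shell
# Hypothetical helper: read domains on stdin, strip www/ftp prefixes and stray
# leading dots, then emit one leading dot per entry.
strip_prefixes() {
    # First alternative: optional leading dot, then www or ftp (any case),
    # an optional extra alphanumeric (www2, ftp1), then a literal dot.
    # Second alternative: one or two leading dots.
    sed -E 's:(^\.?(www|ftp)[[:alnum:]]?\.|^\.\.?)::gi' \
        | awk '{print "."$1}' \
        | LC_ALL=C sort -u
}
```

Under this reading, `.www.com-nx30.net` becomes `.com-nx30.net`, and `example.com` and `WWW2.example.com` both collapse to `.example.com`.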
Thanks @maravento, I’ll give it another shot on Saturday as I am out all day tomorrow. Thanks for all the advice 👍 |
@maravento I tried that but now getting a lot of |
list: sed '/_/d' list | sed -r '/^.\W+/d' > newlist newlist: |
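Read literally, `^.\W+` matches any first character followed by a non-word character, which would also delete short entries such as `a.com`. The sketch below assumes the intended pattern was `^\.\W+` (a leading dot followed by a non-word character), with the backslash before the dot lost in the same way as in the earlier command. The function name `drop_invalid` is hypothetical, and `\W` is a GNU sed extension.

```shell
# Hypothetical helper: read a dotted domain list on stdin and drop invalid entries.
drop_invalid() {
    # 1) delete every line containing an underscore
    # 2) delete lines whose leading dot is followed by a non-word character,
    #    for example "..double.com" or ".-dash.com"
    sed '/_/d' | sed -E '/^\.\W+/d'
}
```

For example, given `.good.com`, `.bad_host.com` and `..double.com`, only `.good.com` is printed.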
@maravento Houston ... the Eagle has landed ..... I have it sorted now using
👍 Now to clean the actual domains input list which generates the hosts file, so that it also removes entries with underscores, since underscores are not valid in hostnames; this will significantly reduce the list size. It shows how few people producing hosts lists pay any attention to what they are blacklisting; they are not even testing or cleaning their lists, as @funilrys will attest 😁 Thanks again for all your constant help and input to the project. We will only go from strength to strength. The dotted format list will become very useful for people using dnsmasq, which allows wildcarding with a dotted format list. |
I discovered an issue with fetching fresh IP blocklist data from blocklist.de: the script was appending to the file each time the files were generated, causing the blocklist.de input data source to grow at an alarming rate. This script error has been addressed in 3427158, and you will notice a dramatic decline in the size of the ips.list file.
@maravento