List Culling and Sorting (WIP) #20

Closed
mitchellkrogza opened this issue Aug 16, 2017 · 66 comments
@mitchellkrogza
Member

Work in progress: a major sort of the input source lists, splitting IPs and domain names into separate lists and removing thousands of duplicates.

Should be done by the end of the day, and it will make managing this repo much easier from here on out.

A lot of the input sources have VERY bad duplication and structure: they use full URLs, including URL parameters and file names, which are totally useless inside a hosts file. So are IP addresses, which now only get generated into the hosts.deny and superhosts.deny files.
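As a rough illustration of the kind of clean-up described here, the sketch below reduces raw source entries to bare hosts and then separates IPv4 addresses from domain names. The function name `split_sources`, its arguments, and the simple dotted-quad pattern are all made up for this example; the custom scripts actually used for this repo are not shown in the thread.

```shell
#!/bin/sh
# Hypothetical helper: split_sources INPUT DOMAINS_OUT IPS_OUT
# Assumes one entry per line and IPv4 only (IPv6 literals would be mangled).
split_sources() {
    input=$1; domains_out=$2; ips_out=$3
    hosts_tmp=$(mktemp)

    # 1. Drop carriage returns, strip the scheme (http://, https://, ...),
    #    cut everything from the first "/", "?", "#" or ":" (path, URL
    #    parameters, fragment, port), delete empty lines, lowercase, dedupe.
    tr -d '\r' < "$input" \
        | sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' \
              -e 's|[/?#:].*$||' \
              -e '/^$/d' \
        | tr '[:upper:]' '[:lower:]' \
        | LC_ALL=C sort -u > "$hosts_tmp"

    # 2. Entries that look like dotted-quad IPv4 addresses go to the IP list,
    #    everything else goes to the domain list.
    ip_re='^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
    grep -E  "$ip_re" "$hosts_tmp" > "$ips_out"     || true
    grep -Ev "$ip_re" "$hosts_tmp" > "$domains_out" || true
    rm -f "$hosts_tmp"
}
```

Something like `split_sources raw.txt domains.txt ips.txt` would then leave two sorted, duplicate-free files (the file names are placeholders).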

@funilrys
Member

👍 You gave me the idea of URL implementation in funceble, but it may be for the future ... :)

By the way, I'm almost ready for the next release 😉
funilrys/funceble#91 ==> You gonna like it 😜

@xxcriticxx

@mitchellkrogza
Member Author

Thanks @xxcriticxx and @funilrys. I have custom scripts to do all of this, but it does take time and cross-checking. Almost done, and you can see the list has shrunk somewhat. By tomorrow morning it will be in a really nice clean state going forward. @xxcriticxx I also found some other false positives during this process today.

@xxcriticxx

@mitchellkrogza will the list be smaller in size(mb) wise?

@mitchellkrogza
Member Author

Yes indeed, it will be smaller and quicker to update once all the dupes are truly gone.

@xxcriticxx

yes, right now it's around 60 MB, should be around 5 MB only

why do you have ALL in front of everything?

@mitchellkrogza
Member Author

mitchellkrogza commented Aug 16, 2017 via email

@mitchellkrogza
Member Author

mitchellkrogza commented Aug 16, 2017 via email

@xxcriticxx

so i think i am pulling wrong list i only need ip or domains

@xxcriticxx

please give me link to correct list

@mitchellkrogza
Member Author

mitchellkrogza commented Aug 16, 2017 via email

@xxcriticxx

i will play with it later on today

@mitchellkrogza
Member Author

Lists all sorted and clean now, been quite a job.
Fixed In: 9816d45

@funilrys
Member

looks really good @mitchellkrogza 👍 good work 💯

@mitchellkrogza
Member Author

mitchellkrogza commented Aug 17, 2017

Thanks @funilrys 👍 it's now the way it should be. The hosts file contains only domains and no IP addresses, which is how it should be given the way DNS works. IPs are now all listed in the hosts.deny file, so the combination of hosts + hosts.deny on any *nix system should keep out 99% of utter garbage 👍 I also stripped out all domains starting with www. because it led to so much duplication and is actually unnecessary, as DNS will reject the whole domain no matter what is in front of the root domain name.
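The www. stripping mentioned above can be sketched as a one-line filter. The helper name `strip_www` is invented for this example, and it assumes a file with one bare domain per line.

```shell
#!/bin/sh
# Hypothetical helper: strip_www DOMAINS_FILE
# Removes a leading "www." from each line, then sorts and deduplicates,
# so "www.example.com" and "example.com" collapse into a single entry.
strip_www() {
    sed 's/^www\.//' "$1" | LC_ALL=C sort -u
}
```

Only a literal leading `www.` is removed, so a name such as `wwwhat.example` is left alone. Whether the remaining bare-domain entry also covers subdomains depends on the resolver that consumes the list; a plain hosts file matches exact names only.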

@mitchellkrogza
Member Author

Now to run the new funceble when it's out against these lists 😬 but I'll do that from my Ubuntu box and not inside Travis

@funilrys
Member

You should wait a bit 😉 😅
I did a lot of refactoring/rewriting so the next version will be 2.0.0 😸 😉 😜

@mitchellkrogza
Member Author

@xxcriticxx the raw hosts file at https://hosts.ubuntu101.co.za/hosts is now down to 54 MB. It's nice and clean now and will get even cleaner once I run it against funceble and strip out all dead and inactive domains. At least now it's a proper hosts file, with domain names only and no IP addresses, as it should be. So, as explained above to @funilrys, if you use the hosts + hosts.deny on any *nix system you will be well protected.
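For readers unsure how the two files differ, here is a minimal sketch of turning a domain list and an IP list into the two formats. The helper names are hypothetical, and `0.0.0.0` as the sink address is an assumption rather than something stated in this thread. In hosts.deny each rule has the form `daemon_list : client_list`, and `ALL` is the wildcard that matches every daemon, which is why every line in that file starts with it.

```shell
#!/bin/sh
# Hypothetical helpers; input files hold one entry per line.

# hosts format: "<sink address> <domain>". 0.0.0.0 is an assumed sink address.
make_hosts() {
    sed 's/^/0.0.0.0 /' "$1"
}

# hosts.deny (TCP wrappers) format: "<daemons>: <clients>".
# "ALL" denies the listed client for every wrapped service.
make_hosts_deny() {
    sed 's/^/ALL: /' "$1"
}
```

With placeholder file names, `make_hosts domains.txt > hosts` and `make_hosts_deny ips.txt > hosts.deny` would produce the two outputs.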

@mitchellkrogza
Member Author

@funilrys no worries, I'm in no rush, I would rather wait until you have perfected it. Going to be massively interesting to see the results when I do get to run it against this list. Once funceble can help clean out all dead and expired domains this should be the squeakiest clean list out there.

@funilrys
Member

I personally can't wait to see if you find some issues 😉 😹

I hope that https://raw.githubusercontent.com/mitchellkrogza/Ultimate.Hosts.Blacklist/9816d45d8f1bc10a8a56271c799580c5910898d0/.input_sources/_urlblacklist.com/spyware/ips.txt will help me improve the handling of IPs

Just discovered that funilrys/funceble#83 is not as fixed as I thought 😭

@xxcriticxx

::: Getting hosts.ubuntu101.co.za list... done
:::   Status: Success (OK)
:::   List updated, transport successful!
::: 
::: Aggregating list of domains... done!
::: Formatting list of domains to remove comments.... done!
::: 2798632 domains being pulled in by gravity...
::: Removing duplicate domains.... done!
::: 2181269 unique domains trapped in the event horizon.

@mitchellkrogza
Member Author

@xxcriticxx please pull it again now and repost your report. I just updated it again as there was a problem with some duplications, which I have now fixed. V1.2017.08.109

@mitchellkrogza
Member Author

mitchellkrogza commented Aug 17, 2017

@funilrys if you look in all the directories in .input_sources you will see the work I did yesterday splitting domains and IPs into separate files. Lots of IPs to test now: 465,685.

Dos2Unix was the savior for fixing the dupes issue I was running into; the lists now have not one dupe as far as I can see.
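A likely reason line endings matter here: a line ending in CRLF and the same line ending in LF differ by one invisible byte, so `sort -u` keeps both. The sketch below uses `tr -d '\r'` as a stand-in for dos2unix (it deletes every carriage return rather than only converting CRLF pairs), and the helper name is invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: dedupe_crlf_safe FILE
# Strips carriage returns first so that "example.com\r" and "example.com"
# compare equal, then sorts and deduplicates.
dedupe_crlf_safe() {
    tr -d '\r' < "$1" | LC_ALL=C sort -u
}

# Demonstration with a throwaway file containing the same domain twice,
# once with a CRLF ending and once with a plain LF ending.
demo=$(mktemp)
printf 'example.com\r\nexample.com\n' > "$demo"
LC_ALL=C sort -u "$demo" | wc -l      # both copies survive
dedupe_crlf_safe "$demo" | wc -l      # collapses to a single line
rm -f "$demo"
```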

@xxcriticxx

::: Getting hosts.ubuntu101.co.za list... done
:::   Status: Success (OK)
:::   List updated, transport successful!
::: 
::: Aggregating list of domains... done!
::: Formatting list of domains to remove comments.... done!
::: 2758639 domains being pulled in by gravity...
::: Removing duplicate domains.... done!
::: 2181269 unique domains trapped in the event horizon.

pihole will take care of duplication

@mitchellkrogza
Member Author

Perfect thanks @xxcriticxx let me know if you find any more false positives.

@mitchellkrogza
Member Author

@xxcriticxx raw hosts file now down to 52 MB.

@funilrys
Member

@mitchellkrogza I saw it 😉 funilrys/funceble#83 is now officially fixed 👍

@funilrys
Member

And for that issue it's a code side issue :D

whois returns Domain Expiration Date: Sun Jul 15 23:59:59 GMT 2018

@mitchellkrogza
Member Author

That's what I need ...... I have to track bad sites manually using a screen recorder on OSX, as some of them do 3-10 redirects in the blink of an eye.

Who knows maybe that guy who registered 0000opengate.biz thought he might live to be 5200 years old 🤣

@mitchellkrogza
Member Author

Does that mean Mitch uncovered another 🐛 😁

whois returns Domain Expiration Date: Sun Jul 15 23:59:59 GMT 2018

@funilrys
Member

Does that mean Mitch uncovered another 🐛 😁

Definitely 😹 👍

@mitchellkrogza
Member Author

By the way, I found a really cool one-liner yesterday to clean lists. Both lists have to be sorted and dupe free .... works very well but ..... knowing you, you will find an even cleverer way just to outwit me 🤣

comm -13 funceble-inactive.txt domains.txt >> domains.sorted
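To spell out what that one-liner does: `comm` compares two sorted files line by line, `-1` suppresses lines found only in the first file and `-3` suppresses lines found in both, so what remains is the lines unique to the second file. A small sketch that also takes care of the sorting requirement follows; the function name `remove_listed` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: remove_listed REMOVE_FILE LIST_FILE
# Prints every line of LIST_FILE that does not appear in REMOVE_FILE.
remove_listed() {
    remove=$1; list=$2
    r_sorted=$(mktemp); l_sorted=$(mktemp)

    # comm needs both inputs sorted in the same collation order,
    # so sort and dedupe both with a fixed locale first.
    LC_ALL=C sort -u "$remove" > "$r_sorted"
    LC_ALL=C sort -u "$list"   > "$l_sorted"

    # -1: drop lines only in the first file; -3: drop lines common to both.
    LC_ALL=C comm -13 "$r_sorted" "$l_sorted"

    rm -f "$r_sorted" "$l_sorted"
}
```

Usage would look like `remove_listed funceble-inactive.txt domains.txt > domains.sorted`. Note that the one-liner above appends with `>>`, so running it twice against the same output file would write the results twice; `>` overwrites instead.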

@mitchellkrogza
Member Author

This is how far it's got in 25 minutes 😬 ..... this is going to take till tomorrow morning.

[screenshot taken 2017-08-17 at 2:39:49 pm]

@mitchellkrogza
Member Author

@xxcriticxx I can see once I remove all these dead and inactive domains I will have my Ultimate hosts down to a much more respectable size for all users and also only filled with stuff that actually exists.

@mitchellkrogza
Member Author

@funilrys I tell you one thing, and this was why I started Ultimate Hosts .... 90% of the lists out there are filled with utterly useless garbage that does not exist anymore and nobody cleans or bothers to clean their lists.

@xxcriticxx

@mitchellkrogza how many hosts can dnsmasq handle?

@mitchellkrogza
Member Author

@xxcriticxx really not sure, don't use it. Try asking on their forums.

@mitchellkrogza
Member Author

@funilrys think I'm going to pull the new dev funceble into Badd-Boyz-Hosts and let it loose tonight. See if I can beat the Travis 50 minute timeout.

@xxcriticxx

@mitchellkrogza that's what pihole is using for the hosts, I know it has to have a limit

@mitchellkrogza
Member Author

@xxcriticxx does your pi-hole ever crash with my list size ???

@mitchellkrogza
Member Author

@xxcriticxx trying to find out for you.

@mitchellkrogza
Member Author

@xxcriticxx

@mitchellkrogza did not crash yet, but I am running it on a regular computer, not a Pi 3

@mitchellkrogza
Member Author

Try it and see what happens, only way to find out. Remember once my funceble output of this list finishes, probably by tomorrow morning sometime there will be a LOT of dead stuff culled out of this hosts list.

@xxcriticxx

@mitchellkrogza check this website out, see if you can add any lists https://filterlists.com/

@mitchellkrogza
Member Author

mitchellkrogza commented Aug 17, 2017 via email

@mitchellkrogza
Member Author

@xxcriticxx added 4 new data sources today so far, so thanks for that link 👍 . Some of them look well maintained, some of them not so much. Also some are useless to a hosts file.

But we are slowly growing this into a really top-notch list, and once @funilrys finishes his work on the dev branch of funceble we will clean this list of all dead and useless stuff and then have a killer accurate hosts list, bar none.

@mitchellkrogza
Member Author

@xxcriticxx added 6 new data sources today, more tomorrow. Then over the weekend, once funceble finishes checking, there will be a big cleanup of dead domains.

@xxcriticxx

@mitchellkrogza is my list good to pull?

@mitchellkrogza
Member Author

@xxcriticxx which one ?? Please point me to it.

@mitchellkrogza
Member Author

Added another new data source today, pre-edited version of someonewhocares.org

@xxcriticxx

@mitchellkrogza did you update the raw list that I use? can I pull the new list?

@mitchellkrogza
Member Author

@xxcriticxx raw lists always up to date immediately after a build completes.

@mitchellkrogza
Member Author

@xxcriticxx just wait 5 minutes, busy with new build and fresh files in 5 minutes

@mitchellkrogza
Member Author

@xxcriticxx all raw files at latest version.

@xxcriticxx

ok i will pull later when am home

@mitchellkrogza
Member Author

Cool, let me know. Also see if you know anyone who can test the Windows version of the hosts file. It seems OK on XP but does require that the DNS client is disabled.


3 participants