This repository has been archived by the owner on Apr 3, 2018. It is now read-only.

Feature Request: Curl Page Title #71

Closed
mitchellkrogza opened this issue Jul 23, 2017 · 22 comments

Comments

@mitchellkrogza

Perhaps doing a simple curl to get the title of each page would help people sort and clean their lists. This could be included in results.txt, or even written to separate output files: pages whose title is something other than 404 or Not Found go into one result file, and pages whose title is 404, Missing or Not Found go into a separate file. This way, especially with lists that contain subdomains of a root domain such as blogspot.com, it would be easy to pick up blogs that no longer exist.

The reason is that lists full of subdomains of one main domain, like this one https://github.com/funilrys/dead-hosts/blob/master/add.2o7Net/tested-list/add.2o7Net.list, always show ACTIVE in the results because the root domain is active. Getting a page title for each tested domain or subdomain would quickly reveal which of those subdomains are no longer there. The same applies to lists that contain hundreds of .blogspot.com domains.

Something like this?

Domain                                                                                               Status      Expiration Date   Source     Analyse Date                      Page Title        
---------------------------------------------------------------------------------------------------- ----------- ----------------- ---------- -------------------------------   ------------------------------------------------------------------
7wind.ru                                                                                             ACTIVE      12-may-2018       WHOIS      Sat Jul 22 14:52:26 UTC 2017      404 Not Found
accentstudio.co.uk                                                                                   ACTIVE      28-dec-2017       WHOIS      Sat Jul 22 14:52:27 UTC 2017      Hello this web site has a Page Title
acenar.com                                                                                           ACTIVE      19-jan-2018       WHOIS      Sat Jul 22 14:52:27 UTC 2017      Hello this web page also has a page title
subdomain.acenar.com                                                                                 ACTIVE      19-jan-2018       WHOIS      Sat Jul 22 14:52:27 UTC 2017      404 Not Found
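
For illustration only, here is a minimal Python sketch of the "Page Title" idea. It is not funceble code: the names `TitleParser`, `extract_title` and `looks_dead` are hypothetical, the HTML is assumed to have been downloaded already (for example with curl), and the "404 / not found / missing" markers are a rough heuristic taken from the wording above that can misfire on a legitimate page whose title happens to contain one of those words.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace so the title fits on one table row.
        return " ".join("".join(self._parts).split())


def extract_title(html):
    """Return the page title, or an empty string when there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def looks_dead(title):
    """Heuristic (assumption): the title itself announces a missing page."""
    lowered = title.lower()
    return any(marker in lowered for marker in ("404", "not found", "missing"))
```

With this sketch, a page titled "404 Not Found" would land in the "dead" result file, while a page with any other title would stay in the main one.
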
@funilrys
Owner

Okay, this is going to need some time, but it's possible 👍

But wouldn't it be better to catch the HTTP status code instead of the page title, @mitchellkrogza?

@funilrys funilrys moved this from Waiting to Possible // TODO in Suggestions/Features Jul 23, 2017
@mitchellkrogza
Author

Yes, it is probably easier to distinguish between 200 OK, 404 Not Found and 403 Forbidden responses than between page titles.

@funilrys
Owner

Clear !!
I'm going to release 1.4.0 and it'll be implemented for the release after 1.4.0 👍

@mitchellkrogza
Author

👍 Awesome looking forward to giving that a test run.

@mitchellkrogza
Author

@funilrys Check this cheap and nasty .csv file output showing the domain, status code, content type and redirect URL (if any).

With this .csv file, one can take anything with a 000, 404, 408, 403, 500 or other strange error code and immediately knock it off the list.

Then one can run a separate test on the redirect URL column to see what those produce 😁, merge the results together, and run a test using a list of the domains that did not redirect plus the ones that did.
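
As a rough sketch of that workflow in Python: the column order (domain, status code, content type, redirect URL), the absence of a header row, and the names `DROP_CODES` and `split_results` are all assumptions made for this example, not a description of the actual results.csv or of the script that produced it.

```python
import csv
import io

# Codes named in the comment above as candidates for removal (assumption:
# "other strange error codes" would be added to this set by hand).
DROP_CODES = {"000", "403", "404", "408", "500"}


def split_results(csv_text):
    """Split rows into domains to keep, domains to drop, and redirect URLs."""
    keep, drop, redirects = [], [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        domain, code, _content_type, redirect_url = (c.strip() for c in row[:4])
        if code in DROP_CODES:
            drop.append(domain)
            continue
        keep.append(domain)
        if redirect_url:
            # These targets can be fed into a second test run later.
            redirects.append(redirect_url)
    return keep, drop, redirects
```

The `redirects` list is what would go into the separate second run before merging everything back together.
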

See:

https://github.com/mitchellkrogza/Stop.Google.Analytics.Ghost.Spam.HOWTO/blob/master/.dev-tools/_output_source/results.csv

The script .... as I said .... cheap and nasty but .... fast and effective. Took Travis CI 26 minutes to produce the .csv on a list of 5451 domains 👍

@mitchellkrogza
Author

@funilrys sent you something on WeTransfer 👍 😀

@funilrys
Owner

Let me check I was out with friends ....

@funilrys
Owner

funilrys commented Jul 24, 2017

Okay, for this issue the hardest part is developing the procedures/features and thinking about how the data will be shown in the HTML file (#62), which I have started to design 😆

About WeTransfer: answered you 👍 😮

@mitchellkrogza
Author

mitchellkrogza commented Jul 24, 2017 via email

@funilrys
Owner

funilrys commented Jul 24, 2017

The following may be merged into the wiki once this issue is implemented, provided we agree on these codes.
Consider the following as not final and open to modification.

For your code list, you should consider the following codes, @mitchellkrogza:

As active

  • 100 - Continue
  • 101 - Switching Protocols
  • 200 - OK
  • 201 - Created
  • 202 - Accepted
  • 203 - Non-Authoritative Information
  • 204 - No Content
  • 205 - Reset Content
  • 206 - Partial Content

As potentially active

  • 000
  • 300 - Multiple Choices
  • 301 - Moved Permanently
  • 302 - Found
  • 303 - See Other
  • 304 - Not Modified
  • 305 - Use Proxy
  • 307 - Temporary Redirect
  • 403 - Forbidden
  • 405 - Method Not Allowed
  • 406 - Not Acceptable
  • 407 - Proxy Authentication Required
  • 408 - Request Timeout
  • 411 - Length Required
  • 413 - Request Entity Too Large
  • 417 - Expectation Failed
  • 500 - Internal Server Error
  • 501 - Not Implemented
  • 502 - Bad Gateway
  • 503 - Service Unavailable
  • 504 - Gateway Timeout
  • 505 - HTTP Version Not Supported

As inactive or potentially inactive

  • 400 - Bad Request
  • 401 - Unauthorized
  • 402 - Payment Required (Not in use but may be seen in the future)
  • 404 - Not Found
  • 409 - Conflict
  • 410 - Gone
  • 412 - Precondition Failed
  • 414 - Request-URI Too Long
  • 415 - Unsupported Media Type
  • 416 - Requested Range Not Satisfiable
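
The grouping above can be sketched as a simple lookup in Python. This is only an illustration of the proposed (and explicitly not final) list, not funceble's implementation: the function name `classify` and the "unknown" fallback for codes that the list does not mention are assumptions of this example.

```python
# Status code groups copied from the list above; 0 stands for curl's "000".
ACTIVE = {100, 101, 200, 201, 202, 203, 204, 205, 206}
POTENTIALLY_ACTIVE = {
    0, 300, 301, 302, 303, 304, 305, 307,
    403, 405, 406, 407, 408, 411, 413, 417,
    500, 501, 502, 503, 504, 505,
}
INACTIVE = {400, 401, 402, 404, 409, 410, 412, 414, 415, 416}


def classify(code):
    """Map an HTTP status code (int or string such as "000") to its group."""
    number = int(code)
    if number in ACTIVE:
        return "active"
    if number in POTENTIALLY_ACTIVE:
        return "potentially active"
    if number in INACTIVE:
        return "inactive or potentially inactive"
    # Codes not covered by the list above (assumption: report them separately).
    return "unknown"
```

For example, `classify("000")` and `classify(403)` both fall into "potentially active", while `classify(404)` falls into "inactive or potentially inactive".
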

@mitchellkrogza
Author

mitchellkrogza commented Jul 24, 2017 via email

@funilrys
Owner

@mitchellkrogza Updated my last comments about the codes

For the "checking redirection" part, imagine that the redirection has a redirection which also has a redirection 😜 🤣 🤣

@mitchellkrogza
Author

@funilrys Yes indeed, those with multiple redirects are the ones I am most interested in. I manually test everything in a browser before it gets added to my lists. I always run a screen recorder to capture what's happening in the URL bar of the browser, as I often test a site and it does 1-7 redirects in a split second. Then I play back the screen recording, capture all those redirect links and add them to my lists ... very time consuming, as you can imagine 😬
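
A hedged sketch of how such a redirect chain could be recorded without a screen recorder, in Python. It only sees HTTP-level redirects (a `Location` header), so JavaScript or meta-refresh redirects that a browser would follow are missed. The names `head_once` and `redirect_chain`, the 10-hop limit and the choice of HEAD requests are assumptions of this example; some servers answer HEAD differently from GET.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def head_once(url, timeout=10):
    """Send one HEAD request without following redirects.

    Returns (status, location), where location may be None.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def redirect_chain(url, fetch=head_once, max_hops=10):
    """Return every URL visited, starting with `url`, one entry per hop."""
    chain = [url]
    seen = {url}
    for _ in range(max_hops):
        status, location = fetch(chain[-1])
        if status not in REDIRECT_CODES or not location:
            break
        next_url = urljoin(chain[-1], location)  # resolve relative Location
        if next_url in seen:
            break  # redirect loop, stop here
        chain.append(next_url)
        seen.add(next_url)
    return chain
```

The `fetch` parameter exists so the chain logic can be tried against a canned table of responses before pointing it at real hosts, and the hop limit keeps a "redirection of a redirection of a redirection" from running forever.
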

@funilrys
Owner

Are you suggesting that I should add redirect following to funceble once we have the curl column? 🤣 🤣 🤣 🤣 🤣

@mitchellkrogza
Author

It would probably kill funceble and Travis too 🤣 ..... this should be a separate project, a redirect-redirect checker.

@funilrys
Owner

Indeed yeah 🤣 imagine a dead-hosts with follow redirection 🤣 🤣 🤣 🤣

@mitchellkrogza
Author

That could go awfully wrong very fast 🤣 🤣 🤣

@funilrys
Owner

A bit of teasing 😜 what do you think of the following ? 😸

[Image: screenshot from 2017-07-25 18-15-39]

@mitchellkrogza
Author

YEAH .... now we are heading in the right direction

@funilrys funilrys moved this from Possible // TODO to In Progress in Suggestions/Features Jul 26, 2017
funilrys added a commit that referenced this issue Jul 26, 2017
@funilrys
Owner

Another teaser 😉
Can you find the new directories? 😸

.
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── funceble
├── iana-domains-db
├── index.html
├── LICENSE
├── output
│   ├── hosts
│   │   ├── ACTIVE
│   │   ├── INACTIVE
│   │   └── INVALID
│   ├── HTTP_Analytic
│   │   ├── ACTIVE
│   │   ├── POTENTIALLY_ACTIVE
│   │   └── POTENTIALLY_INACTIVE
│   ├── logs
│   │   ├── dateFormat
│   │   ├── noReferer
│   │   ├── percentage
│   │   └── whois
│   └── splited
├── README.md
└── tool

funilrys added a commit that referenced this issue Jul 28, 2017
@mitchellkrogza
Author

Nice 👍 Question: will running funceble on my repo, like https://github.com/mitchellkrogza/Stop.Google.Analytics.Ghost.Spam.HOWTO, automatically create the new folders if needed, and populate them with a .keep file so they get added and committed?
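
To make the question concrete, this is roughly what "create the folders and drop a .keep file in each" would look like, sketched in Python. It says nothing about what funceble actually does: the directory names are copied from the tree above, while the function name `ensure_output_tree` and the choice to place a `.keep` only in the leaf directories are assumptions of this example.

```python
from pathlib import Path

# Leaf directories copied from the tree shown above.
OUTPUT_DIRS = [
    "output/hosts/ACTIVE",
    "output/hosts/INACTIVE",
    "output/hosts/INVALID",
    "output/HTTP_Analytic/ACTIVE",
    "output/HTTP_Analytic/POTENTIALLY_ACTIVE",
    "output/HTTP_Analytic/POTENTIALLY_INACTIVE",
    "output/logs/dateFormat",
    "output/logs/noReferer",
    "output/logs/percentage",
    "output/logs/whois",
    "output/splited",
]


def ensure_output_tree(root):
    """Create any missing output directory and give each one a .keep file.

    Git does not track empty directories, so the empty .keep file is what
    allows each folder to be added and committed.
    """
    created = []
    for relative in OUTPUT_DIRS:
        directory = Path(root) / relative
        directory.mkdir(parents=True, exist_ok=True)
        keep_file = directory / ".keep"
        if not keep_file.exists():
            keep_file.touch()
            created.append(keep_file)
    return created
```

Running it a second time on the same repository creates nothing new, so it would be safe to call at the start of every run.
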

Looking forward to what's coming 😁

@funilrys
Owner

Thank you for the question @mitchellkrogza !!

Please refer to #89 😉 👍
