httpspider redesign to only spider each site once #82
Dan, I haven't fully reviewed your proposal yet, but I have one question. In the past we found this a non-trivial problem to solve because scripts' hostrules/portrules are no longer evaluated in one phase with the action functions run in the next. Now that we have a pipeline architecture, the rule is evaluated and then the action function runs immediately (the rule function has become a sort of vestigial historical artifact). So the problem we have is that two scripts which want to spider a site may never run concurrently. Said differently, the second script may not get the chance to register its callback before the first script finishes. Does your proposal address this?
@batrick Yes, because the actual work is done in a completely separate phase from the callback registering. Callbacks are registered in the action of the portrule phase, and the spidering occurs in the action of the hostrule phase. This has the downside of requiring scripts to output results in the host script section, not next to the port they refer to, but I think it could be made to work. Alternatively (and this is getting really wild, with big invasive changes to NSE), we could expose an API for hostrule scripts to add output to the ports directly, or for any script to add output to other ports; this could also be used for things like rpcinfo, mdns-service-discovery, etc., which obtain information about multiple ports on the same host.
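In rough pseudo-Lua, the split described here might look like the following. `register_callback` and `run_spider` are placeholder names for the proposed API, not existing httpspider functions:

```lua
-- portrule action: only registers this script's callback; no network work yet
local function portaction(host, port)
  httpspider.register_callback(SCRIPT_NAME, per_page_callback)
end

-- hostrule action: runs after all portrule actions have completed, so every
-- interested script has had the chance to register before the crawl starts
local function hostaction(host)
  local results = {}
  httpspider.run_spider(SCRIPT_NAME, host, results)
  return results
end
```

The key property is that all registrations happen in one phase and the crawl in a later one, so no script can miss the window.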
After further discussion with @batrick on IRC, we determined that the current design won't work. One alternative that preserves the same sort of idea would be to register the callbacks in the pre-scanning phase (the action for "prerule"), then execute them during the portrule phase. This simplifies the spidering, since the port would be a parameter of the portaction, but it means the data store which accumulates the results for each port cannot be a table which the callbacks are closures over; instead it must be a parameter passed to the httpspider call. This would probably look something like this (in a script):

```lua
local function preaction()
  -- SCRIPT_NAME here and below may not be necessary if there's some other way
  -- to tag these callbacks to this script.
  httpspider.register_callback(SCRIPT_NAME, some_function, {maxdepth=2, other_stuff="whatever"})
end

local function portaction(host, port)
  local results = {}
  -- This function would contain the logic from "hostaction" in the earlier proposal.
  httpspider.run_spider(SCRIPT_NAME, host, port, results)
  return results
end

-- Again, this is just taking the work out of making a dispatch table for the
-- appropriate action. The prerule would probably just be true every time.
prerule, action = httpspider.wrap(preaction, portaction)

portrule = shortport.http -- or something more complicated. Doesn't require wrapping.
```
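The `wrap` call is just sugar for building that dispatch table. A minimal sketch of what it might do, purely illustrative and assuming prerule actions are invoked without host/port arguments:

```lua
-- Hypothetical sketch of httpspider.wrap for the prerule variant: it
-- constructs the shared action so each script doesn't have to.
function wrap(preaction, portaction)
  local prerule = function() return true end
  local action = function(host, port)
    if host == nil then
      -- prerule phase: no host/port were passed
      return preaction()
    end
    -- portrule phase: dispatch to the per-port work
    return portaction(host, port)
  end
  return prerule, action
end
```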
This is one of those instances I wish we had added a "_SCRIPT" table or similar that all threads for a script share. That could be used instead of SCRIPT_NAME. Anyways, what you've described makes sense to me Dan.
I attempted to solve this problem with registry variables that kept track of the crawlers' status and whether different scripts were requesting the same URL. I can't remember why it didn't work, but the code shows the idea behind it. However, I like this approach much better; the callbacks allow more flexibility for script authors.
This has been brought up before, like on the Script Ideas wiki page. I thought a little more and got a theoretical framework down in some very basic pseudo-Lua. It uses a multi-phase approach:

- portrule: true (the callback is registered in the port action)
- hostrule: true if the engine has any callbacks registered for this SCRIPT_NAME+host+port combination (the spider runs in the host action)

Example pseudocode for the httpspider library:
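A possible shape for that library side, in rough pseudo-Lua. Every name here is an illustrative guess at the API, not an existing implementation; `crawl()` stands in for the actual crawling loop:

```lua
-- Hypothetical pseudo-Lua for the library internals. Callbacks are keyed
-- by script name so the engine can tell which scripts are waiting.
local callbacks = {}

function register_callback(script_name, func, options)
  callbacks[script_name] = { func = func, options = options }
end

function run_spider(script_name, host, results)
  local cb = callbacks[script_name]
  if cb == nil then return end
  -- crawl() would fetch each page of the site at most once, no matter how
  -- many scripts are registered, and hand every page to this callback.
  for url, response in crawl(host, cb.options) do
    cb.func(url, response, results)
  end
end
```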
Example of a script using this API:
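A sketch of what such a script could look like; the callback name, the options table, and the three-value return from `wrap` are all guesses for illustration:

```lua
-- Hypothetical script using the API sketched above; SCRIPT_NAME tags the
-- callback so the engine knows which script it belongs to.
local function process(url, response, results)
  -- per-page logic for this script, e.g. record interesting URLs
  if response.status == 200 then
    table.insert(results, url)
  end
end

local function portaction(host, port)
  -- portrule phase: just register interest in this site
  httpspider.register_callback(SCRIPT_NAME, process, {maxdepth = 3})
end

local function hostaction(host)
  -- hostrule phase: the spider does the actual work here
  local results = {}
  httpspider.run_spider(SCRIPT_NAME, host, results)
  return results
end

portrule, hostrule, action = httpspider.wrap(portaction, hostaction)
```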
Note that there are lots of things left to be defined, though I don't anticipate any of them to be particularly hard:
EDITED: I realized my initial proposal code had some errors regarding the port, which is only passed to the portaction, not the hostaction. Not a problem: wherever the callbacks are stored, they'll be retrieved by port. The code is just a mockup, but it should demonstrate basic feasibility.