CN101984429A - Method and device for acquiring destination page, search engine and browser - Google Patents

Info

Publication number
CN101984429A
Authority
CN
China
Prior art keywords
page
path
target page
dom
state path
Prior art date
Legal status
Granted
Application number
CN 201010531460
Other languages
Chinese (zh)
Other versions
CN101984429B (en)
Inventor
潘云泓
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN2010105314609A
Publication of CN101984429A
Application granted
Publication of CN101984429B
Legal status: Active

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method and a device for acquiring a destination page, a search engine, and a browser. In the method, the search engine captures the foundation page corresponding to a received uniform resource locator (URL) together with the script of that page. It then analyzes the captured foundation page and script to generate one or more state paths that correspond to the foundation page and contain dynamic information, and captures the destination page by using the generated state paths. Each state path comprises the URL of the foundation page, position information of the document object model (DOM) event that generates the dynamic information in the foundation page, and a callback function index corresponding to that DOM event. The search engine can thereby capture dynamic content in a page when searching for the destination page.

Description

Method and device for acquiring target page, search engine and browser
Technical Field
The invention relates to Internet technology, and in particular to a method and a device for acquiring a target page, a search engine, and a browser.
Background
With the rapid development of networks, the Internet has become a carrier of a vast amount of information, and extracting and using that information effectively has become a major challenge. Search engines, as tools that help people retrieve information, have become the portal and guide through which users access the Internet. The web crawler (Spider), a program that automatically extracts web pages, is an important component of a search engine.
A traditional web crawler starts from the Uniform Resource Locators (URLs) of one or more initial web pages, captures the basic pages at those URLs, and analyzes the content of the current basic page to obtain the URLs of target pages. It then performs data processing, including building and storing web page summaries, snapshots, and indexes, and returns these to a browser for the user to choose from.
However, a traditional web crawler can capture only static pages when it acquires the URL of a target page. As Internet technology has developed, page content has shifted from static data to dynamically generated data, and traditional crawler technology cannot keep up with this shift: it cannot capture the dynamic content of a page.
Disclosure of Invention
The invention provides a method and a device for acquiring a target page, a search engine and a browser, so that the search engine can capture dynamic content in the page when searching the target page.
The specific technical scheme is as follows:
A method for acquiring a target page comprises the following steps:
A. capturing a basic page corresponding to a received uniform resource locator (URL) and a script of the basic page;
B. analyzing the captured basic page and the script, generating one or more state paths that correspond to the basic page and contain dynamic information, and capturing a target page by using the generated state paths; wherein each state path comprises: the URL of the basic page, position information of the Document Object Model (DOM) event that generates dynamic information in the basic page, and a callback function index corresponding to the DOM event.
Wherein, the step B specifically comprises:
downloading each DOM node while capturing the basic page and the script, executing steps B11 to B13 in sequence on each downloaded DOM node until all DOM nodes have been downloaded, and then executing step B14;
B11. judging whether the currently downloaded DOM node is a script tag; if so, going to step B11 for the next downloaded DOM node; otherwise, executing step B12;
B12. judging whether the currently downloaded DOM node contains a DOM event and a callback function; if not, going to step B11 for the next downloaded DOM node; if so, executing step B13;
B13. generating a state path by using the DOM event contained in the currently downloaded DOM node, storing the generated state path in a state path queue, and going to step B11 for the next downloaded DOM node;
B14. acquiring, one by one, the target page corresponding to each state path in the state path queue, judging whether new page content is generated or a page jump occurs, and determining each state path that generates new page content or causes a page jump as a state path corresponding to the basic page.
Or, the step B specifically includes:
downloading each DOM node while capturing the basic page and the script, and executing steps B21 to B24 in sequence on each downloaded DOM node until all DOM nodes have been downloaded;
B21. judging whether the currently downloaded DOM node is a script tag; if so, going to step B21 for the next downloaded DOM node; otherwise, executing step B22;
B22. judging whether the currently downloaded DOM node contains a DOM event and a callback function; if not, going to step B21 for the next downloaded DOM node; if so, executing step B23;
B23. generating a state path by using the DOM event contained in the currently downloaded DOM node;
B24. acquiring the target page corresponding to the state path and judging whether new page content is generated or a page jump occurs; if so, determining the state path as a state path corresponding to the basic page and going to step B21 for the next downloaded DOM node; otherwise, going directly to step B21 for the next downloaded DOM node.
In the above manner, judging whether a page jump occurs comprises: if the URL of the acquired target page differs from that of the basic page, determining that a page jump has occurred.
Specifically, judging whether new page content is generated comprises: comparing the acquired target page with the basic page by sentence signature or by character string, and determining that new page content is generated if the comparison shows that the two pages have different page content; or,
calculating the similarity between the acquired target page and the basic page, and determining that new page content is generated if the calculation shows that the two pages have different page content.
Wherein the position information of the DOM event comprises: a DOM node identification, the path (Xpath) of the DOM node, and a DOM event identification.
Still further, after the step B, the method further comprises:
C. storing the state path corresponding to the basic page generated in step B and the snapshot of the captured target page, and establishing and storing an index of the target page.
A method for acquiring a target page, based on the above method, comprises the following steps:
after receiving a search request from a browser, matching keywords contained in the search request with indexes of stored target pages, including a state path corresponding to the matched target page in a search result, and returning the search result to the browser, so that the browser can obtain the corresponding target page by using the state path selected by a user.
In addition, the search result may further include: snapshot information of the matched target page;
and after receiving snapshot information of the target page selected by the user and returned by the browser, returning a corresponding snapshot of the target page to the browser.
Furthermore, after the state path corresponding to the matched target page is included in the search result and returned to the browser, the method further includes:
and after receiving the state path selected by the user and sent by the browser, sending a target page request to a target page site according to the state path selected by the user, so that the target page site can push a target page to the browser.
A method for acquiring a target page comprises the following steps:
the browser receives a search result containing a state path returned by a search engine after sending a search request to the search engine;
sending a target page request to a target page site according to a state path selected by a user;
receiving a target page pushed by the target page site;
wherein the search result containing the state path is returned by the search engine using the method of claim 8.
An apparatus for obtaining a target page, the apparatus comprising:
the first grabbing unit is used for grabbing a basic page corresponding to the received uniform resource locator URL and a script of the basic page;
the analysis unit is used for analyzing the basic page and the script captured by the first capture unit and generating one or more state paths that correspond to the basic page and contain dynamic information; wherein each state path comprises: the URL of the basic page, position information of the Document Object Model (DOM) event that generates dynamic information in the basic page, and a callback function index corresponding to the DOM event;
and the second grabbing unit is used for grabbing the target page by using the state path generated by the analysis unit.
Wherein, the analysis unit specifically includes: a first judging module, a second judging module, a first path generation module, and a first path determining module;
the first grabbing unit downloads each DOM node while capturing the basic page and its script, sends the currently downloaded DOM node to the first judging module, and sends a determination notification to the first path determining module after all DOM nodes have been downloaded;
the first judging module is used for judging whether the currently downloaded DOM node is a script tag or not, if so, triggering the first grabbing unit to download the next DOM node, and otherwise, sending a judgment notice to the second judging module;
the second judging module is used for judging whether the currently downloaded DOM node contains a DOM event and a callback function, if not, the first capturing unit is triggered to download the next DOM node, and if so, an execution notice is sent to the first path generating module;
the first path generation module is used for generating a state path by using the currently downloaded DOM node after receiving the execution notification, storing the generated state path in a state path queue and triggering the first capture unit to download the next DOM node;
the first path determining module is configured to, on receiving the determination notification, trigger the second capturing unit to acquire, one by one, the target page corresponding to each state path in the state path queue, determine from the acquisition result of the second capturing unit whether new page content is generated or a page jump occurs, and determine each state path that generates new page content or causes a page jump as a state path corresponding to the basic page.
Alternatively, the analysis unit may include: a third judging module, a fourth judging module, a second path generating module, and a second path determining module;
the first grabbing unit downloads each DOM node in the grabbing process of the basic page and the script thereof, and sends the currently downloaded DOM node to the third judging module until the downloading of all DOM nodes is finished;
the third judging module is used for judging whether the currently downloaded DOM node is a script tag or not, if so, the first grabbing unit is triggered to download the next DOM node, and otherwise, a judgment notice is sent to the fourth judging module;
the fourth judging module is used for judging whether the currently downloaded DOM node contains a DOM event and a callback function, if not, the first capturing unit is triggered to download the next DOM node, and if so, an execution notification is sent to the second path generating module;
the second path generating module is configured to generate a state path by using a DOM event included in a currently downloaded DOM node when receiving the execution notification, and send the generated state path to the second path determining module;
and the second path determining module is used for triggering the second capturing unit to obtain a target page corresponding to the state path when the state path is received, judging whether new page content is generated or page jump is generated according to the obtaining result of the second capturing unit, if so, determining that the state path is the state path corresponding to the basic page, and triggering the first capturing unit to download a next DOM node, otherwise, triggering the first capturing unit to download the next DOM node.
Wherein, judging whether a page jump occurs comprises: if the URL of the acquired target page differs from that of the basic page, determining that a page jump has occurred.
Judging whether new page content is generated comprises: comparing the acquired target page with the basic page by sentence signature or by character string, and determining that new page content is generated if the comparison shows that the two pages have different page content; or,
calculating the similarity between the acquired target page and the basic page, and determining that new page content is generated if the calculation shows that the two pages have different page content.
Specifically, the position information of the DOM event includes: a DOM node identification, the path (Xpath) of the DOM node, and a DOM event identification.
Still further, the apparatus further comprises:
and the storage unit is used for storing the state path corresponding to the basic page generated by the analysis unit and the snapshot of the target page captured by the second capture unit, and establishing and storing the index of the target page.
A search engine, comprising: the above device for acquiring a target page, a user interface unit, and a search processing unit;
the user interface unit is used for receiving a search request from a browser and sending a keyword contained in the search request to the search processing unit; returning the search result sent by the search processing unit to the browser, so that the browser can obtain a corresponding target page by using the state path selected by the user;
and the search processing unit is used for matching the keyword with the index of the target page stored in the storage unit of the device, and sending the state path corresponding to the matched target page to the user interface unit by including the state path in the search result.
Furthermore, the search result further includes: snapshot information of the matched target page;
the user interface unit is also used for sending the snapshot information of the target page selected by the user and returned by the browser to the search processing unit; returning the snapshot of the target page sent by the search processing unit to the browser;
the search processing unit is further configured to obtain a snapshot of the corresponding target page from the storage unit according to the snapshot information of the target page selected by the user, and send the snapshot to the user interface unit.
Still further, the search engine further comprises: a path analysis unit and a network interface unit;
the user interface unit is also used for sending the state path to the path analysis unit after receiving the state path selected by the user and sent by the browser;
the path analysis unit is used for generating a target page request according to the received state path;
and the network interface unit is used for sending the target page request generated by the path analysis unit to a target page site.
A browser, comprising: a network side interface unit, a path analysis unit, and a user side interface unit;
the network side interface unit is used for receiving a search result including a state path sent by the search engine according to claim 19, and for sending the target page request sent by the path analysis unit to a target page site;
the user side interface unit is used for displaying the search result received by the network side interface unit to a user; sending the state path selected by the user to the path analysis unit;
and the path analysis unit is used for generating a target page request according to the state path selected by the user and sending the target page request to the network side interface unit.
In the above technical scheme, the concept of a state path is introduced on the basis of analyzing the basic page and its script: state paths that correspond to the basic page and contain dynamic information are generated, and the target page that a state path points to contains the dynamic content of the page. The search engine can therefore capture the dynamic content in a page when it subsequently searches for the target page.
Drawings
FIG. 1 is a flowchart of the main method provided by the present invention;
FIG. 2 is a flowchart of the detailed method according to a first embodiment of the present invention;
FIG. 3 is a flowchart of generating a state path according to a second embodiment of the present invention;
FIG. 4 is a flowchart of generating a state path according to a third embodiment of the present invention;
FIG. 5 is a flowchart of a browser acquiring a target page according to a fourth embodiment of the present invention;
FIG. 6 is a flowchart of a browser acquiring a target page according to a fifth embodiment of the present invention;
FIG. 7 is a flowchart of a browser acquiring a target snapshot according to a sixth embodiment of the present invention;
FIG. 8 is a schematic diagram of the structure of the device according to the present invention;
FIG. 9 is a schematic diagram of one structure of the analysis unit of FIG. 8;
FIG. 10 is a schematic diagram of another structure of the analysis unit of FIG. 8;
FIG. 11 is a schematic diagram of the structure of a search engine according to the present invention;
FIG. 12 is a schematic diagram of the structure of a browser according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The main method provided by the invention is shown in FIG. 1 and comprises the following steps:
Step 101: capturing a basic page corresponding to the received URL and a script of the basic page.
Step 102: analyzing the captured basic page and the script to generate one or more state paths that correspond to the basic page and contain dynamic information; wherein each state path includes: the URL of the base page, position information of the Document Object Model (DOM) event that generates dynamic information in the base page, and a callback function index corresponding to that DOM event.
Step 103: capturing the target page by using the generated state paths.
The method flow shown in FIG. 1 is performed by a search engine. The search engine further stores the generated state paths so that, after receiving a search request from a browser, it returns a search result including the state paths to the browser, and the browser acquires the corresponding target page by using the state path selected by the user.
The above method is described in detail below by way of specific examples.
Embodiment One
FIG. 2 is a flowchart of the detailed method according to the first embodiment of the present invention. As shown in FIG. 2, the method may specifically include the following steps:
Step 201: the search engine receives the URL.
The search engine may automatically capture URLs in batches in the background.
Step 202: capturing a basic page corresponding to the received URL and a script of the basic page.
The basic page and its script can correspond in either of two ways. First, the script is embedded in an HTML tag contained in the source code of the basic page. Second, an HTML (HyperText Markup Language) tag in the source code of the basic page contains a link that points to a separate script document; that is, the basic page and the script document are two different documents with a reference relationship between them.
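As an illustration of the two cases above, the following is a minimal sketch that separates embedded scripts from linked script documents using Python's standard `html.parser`. The names `ScriptCollector` and `collect_scripts` are hypothetical and do not come from the patent, and the sketch assumes the script is carried in a `script` tag.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ScriptCollector(HTMLParser):
    """Hypothetical helper: splits a base page's scripts into the two cases."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.inline_scripts = []  # case 1: script text embedded in the page
        self.linked_scripts = []  # case 2: URLs of separate script documents
        self._in_inline = False

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve the link against the base page URL.
            self.linked_scripts.append(urljoin(self.base_url, src))
        else:
            self._in_inline = True
            self.inline_scripts.append("")

    def handle_data(self, data):
        if self._in_inline:
            self.inline_scripts[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline = False


def collect_scripts(base_url, html):
    """Return (inline script bodies, linked script URLs) for one base page."""
    collector = ScriptCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.inline_scripts, collector.linked_scripts
```

A crawler built along these lines would still need to download each linked script document before analysis; that step is omitted here.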
Step 203: analyzing DOM nodes downloaded from the captured basic page, judging whether scripts corresponding to DOM events in the DOM nodes generate dynamic information or not, generating more than one state path containing the dynamic information corresponding to the basic page according to the analysis result, and acquiring a target page by using the state paths; wherein the state path includes: the method comprises the steps of obtaining URL of a basic page, position information of a DOM event generating dynamic information in the basic page and a callback function index corresponding to the DOM event generating the dynamic information.
Scripting languages involved in the present invention include, but are not limited to: java script, vbscript, perl, or python.
Wherein, the position information of the DOM event may include: DOM node identification, DOM node path (Xpath) and DOM event identification. Wherein, the DOM node identification may be: ID of DOM node or name of DOM node.
The callback function index in the state path is used to reference the callback function corresponding to the DOM event. Every callback function in the script is assigned an index, and the correspondence between indexes and specific callback functions can be stored in a data structure such as a global function table or a mapping. The callback function corresponding to a DOM event is obtained by using the callback function index in the state path to query this data structure. The callback functions here may be anonymous or non-anonymous.
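One possible shape for such a function table is sketched below. It is only an illustration: `CallbackTable`, the `anonN` naming for anonymous callbacks, and the use of Python callables in place of page script functions are all assumptions, not details given in the patent.

```python
class CallbackTable:
    """Hypothetical global function table: callback index -> callback."""

    def __init__(self):
        self._by_index = {}
        self._next_anonymous = 0

    def register(self, func, name=None):
        """Store a callback and return its index.

        Named callbacks are indexed by name; anonymous ones receive a
        generated index (an assumed convention) so they can be referenced.
        """
        if name is None:
            index = f"anon{self._next_anonymous}"
            self._next_anonymous += 1
        else:
            index = name
        self._by_index[index] = func
        return index

    def lookup(self, index):
        """Resolve the callback index stored in a state path."""
        return self._by_index[index]
```

Storing only the index in the state path keeps the path compact and lets anonymous callbacks, which have no name of their own, be referenced in the same way as named ones.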
For a given state path, the corresponding target page is acquired after the callback function corresponding to the DOM event is compiled and executed.
The specific implementation of this step is described in detail in Embodiments Two and Three.
For a base page, it may correspond to N state paths and to N target pages, where N may be an integer greater than one.
For example, for a base page with a URL of www.baidu.com, the two resulting state paths may be:
{base_url:https://www.baidu.com,id:idsample1,xpath:html/body/a/,event:click,type:new_content,callback:fun1}
{base_url:https://www.baidu.com,id:idsample2,xpath:html/body/li/a/,event:click,type:new_link,callback:fun2}
It should be noted that the present invention does not limit the specific format of the state path; the above is only one example.
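Taking the example format above at face value, a state path could be parsed into a record as sketched below. The `StatePath` class and both helper functions are hypothetical, and the parser assumes that no field value contains a comma, which holds for the two examples but is not guaranteed by the patent.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StatePath:
    """Fields taken from the example state paths in the text."""
    base_url: str
    id: str
    xpath: str
    event: str
    type: str
    callback: str


def parse_state_path(text):
    """Parse '{key:value,...}' into a StatePath (assumes comma-free values)."""
    body = text.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("state path must be wrapped in braces")
    fields = {}
    for part in body[1:-1].split(","):
        # Split on the first colon only, so 'https://...' stays intact.
        key, sep, value = part.partition(":")
        if not sep:
            raise ValueError(f"malformed field: {part!r}")
        fields[key.strip()] = value.strip()
    return StatePath(**fields)


def format_state_path(path):
    """Serialize a StatePath back into the example format."""
    return "{" + ",".join(f"{k}:{v}" for k, v in asdict(path).items()) + "}"
```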
Step 204: storing the state path corresponding to the basic page and the target page snapshot corresponding to the state path, and establishing and storing an index of the target page, so that the search engine can later find the target page and return it to the browser as a search result.
In this embodiment, the base page and its script captured in step 202 may be stored, the state path generated in step 203 may be stored, and the target page snapshot acquired in step 203 may be stored. The storing of the basic page may specifically include: a base page URL, a base page snapshot, etc.
The process of obtaining the target page by the search engine can be executed periodically or manually. When the state path corresponding to the base page is generated each time, if the stored state path exists, the generated state path corresponding to the base page may be compared with the stored state path corresponding to the base page, and if the state path corresponding to the base page is different from the stored state path corresponding to the base page, the stored state path corresponding to the base page is updated in time.
In addition, the search engine can periodically check whether the target page is updated according to the index of the target page, and update the stored index of the target page in time. Similarly, if the target page snapshot acquired each time is different from the stored target page snapshot, the stored target page snapshot may be replaced with the newly acquired target page snapshot.
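The compare-then-update behaviour described for stored state paths and snapshots can be sketched as a single helper. The function name `update_if_changed` and the use of a plain dictionary as the store are assumptions made for illustration only.

```python
def update_if_changed(store, key, fresh):
    """Overwrite store[key] only when the fresh value differs.

    Returns True if the store was updated, False if it was already current.
    'store' is a plain dict standing in for the search engine's storage.
    """
    if key in store and store[key] == fresh:
        return False
    store[key] = fresh
    return True
```

The same helper would apply to state paths keyed by base page URL and to snapshots keyed by state path, since both follow the rule of replacing the stored copy only when the newly acquired one differs.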
The three types of contents stored above may be stored separately or in combination.
The above steps 201 to 204 are all operations of the search engine in the background, and if the search engine receives a search request from the browser, the following steps are continuously executed in the foreground.
Step 205: after receiving a search request from a browser, matching keywords contained in the search request with indexes of all target pages, including a state path corresponding to the matched target page in a search result, and returning the search result to the browser, so that the browser can obtain the corresponding target page by using the state path selected by a user.
When the search engine receives a search request containing a keyword, the indexes of basic pages take part in matching in addition to the indexes of target pages; that is, basic pages are also included in the search result. This is the same as in the prior art and is not described again here.
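A minimal sketch of this matching step follows, using a toy in-memory inverted index. `PageIndex`, its whitespace tokenization, and the record layout are assumptions for illustration; the patent does not specify how the index is built.

```python
from collections import defaultdict


class PageIndex:
    """Toy inverted index over base pages and target pages."""

    def __init__(self):
        self._postings = defaultdict(set)  # keyword -> entry ids
        self._entries = {}                 # entry id -> result record

    def add_page(self, entry_id, text, url, state_path=None):
        """Index a page; target pages carry a state path, base pages do not."""
        self._entries[entry_id] = {"url": url, "state_path": state_path}
        for word in text.lower().split():
            self._postings[word].add(entry_id)

    def search(self, query):
        """Return records of pages containing every keyword in the query."""
        words = query.lower().split()
        if not words:
            return []
        hits = set.intersection(*(self._postings.get(w, set()) for w in words))
        return [self._entries[i] for i in sorted(hits)]
```

In this sketch a result whose `state_path` is `None` is a base page, while a result with a state path is a target page that the browser would reach through that path.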
Furthermore, the search result may also include snapshot information of the target page, or an index of the target page.
For how the browser uses the state path selected by the user to acquire the corresponding target page in this step, see Embodiments Four and Five.
The state paths in step 203 may be generated in either of two ways, described in Embodiments Two and Three.
Embodiment Two
FIG. 3 is a flowchart of generating a state path according to the second embodiment of the present invention. As shown in FIG. 3, the method may specifically include the following steps:
Step 301: downloading each DOM node while capturing the basic page and its script.
Step 302: judging whether downloading of the DOM node is finished, if so, finishing the capturing process of the basic page, and turning to the step 306; otherwise, step 303 is performed on the currently downloaded DOM node.
Step 303: judging whether the currently downloaded DOM node is a script tag, if so, turning to step 302 for the next downloaded DOM node; otherwise, step 304 is performed.
For a node of a script tag, a script corresponding to the script tag may be sent to a script parsing engine for compiling and executing.
Step 304: judging whether the DOM node contains a DOM event and a callback function; if not, skipping the analysis of the DOM node and going to step 302 for the next downloaded DOM node; if so, executing step 305.
If the DOM node contains no DOM event and callback function, it causes neither a page jump nor new page content, that is, it generates no page dynamic information. The DOM node can therefore be skipped, and if a next DOM node exists, analysis of that node begins.
Step 305: generating a state path by using the DOM event contained in the DOM node, and storing the generated state path in the state path queue; going to step 302 for the next downloaded DOM node.
Step 306: acquiring, one by one, the target page corresponding to each state path in the state path queue, judging whether new page content is generated or a page jump occurs, and determining each state path that generates new page content or causes a page jump as a state path corresponding to the basic page.
The state paths that generate new page content or cause a page jump are then stored together with their corresponding target pages.
Whether a page jump occurs may be judged as follows: if the URLs of the target page and the basic page are different, a page jump is determined to have occurred. Whether new page content is generated may be judged as follows: the target page and the basic page are compared by sentence signature or by character string, or their similarity is calculated, and if the comparison result or the similarity result shows that the two pages have different page content, new page content is determined to have been generated. For sentence signature comparison, the signatures may be calculated in any existing manner, such as MD5; no limitation is imposed here.
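These checks can be sketched with the Python standard library as follows. The sentence-splitting rule, the function names, and the 0.95 similarity threshold are illustrative assumptions; the text names MD5 only as one possible signature, and `difflib.SequenceMatcher` is swapped in here as one possible similarity measure.

```python
import hashlib
import re
from difflib import SequenceMatcher


def page_jumped(base_url, target_url):
    """A page jump occurred if the target URL differs from the base URL."""
    return target_url != base_url


def sentence_signatures(text):
    """MD5 signature of each sentence (assumed split on common terminators)."""
    sentences = [s.strip() for s in re.split(r"[.!?。！？\n]+", text) if s.strip()]
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in sentences}


def has_new_content_by_signature(base_text, target_text):
    """New content exists if the target has a sentence the base page lacks."""
    return bool(sentence_signatures(target_text) - sentence_signatures(base_text))


def has_new_content_by_similarity(base_text, target_text, threshold=0.95):
    """New content exists if similarity falls below an assumed threshold."""
    return SequenceMatcher(None, base_text, target_text).ratio() < threshold
```

Signature comparison is cheap and tolerant of reordered sentences, while a similarity ratio gives a graded measure that can ignore trivial differences such as timestamps, at the cost of choosing a threshold.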
In the second embodiment, every state path generated from a DOM event is stored in the state path queue. Because the state path of a DOM event does not necessarily generate page dynamic information, some of the queued state paths may be invalid. Each state path in the queue is therefore checked one by one to determine whether its target page contains dynamic information. Steps 303 to 305 analyze each DOM node to generate state paths preliminarily: they are performed on each downloaded DOM node until all DOM nodes have been downloaded, after which step 306 finally determines the state paths corresponding to the base page.
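The two-phase flow of this embodiment (queue every candidate first, then validate) can be sketched as below. `DomNode`, the dictionary layout of a state path, and the injected `fetch_target` function, which stands in for executing the callback and returning the resulting URL and content, are all assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DomNode:
    """Simplified stand-in for a downloaded DOM node."""
    tag: str
    node_id: str = ""
    xpath: str = ""
    events: dict = field(default_factory=dict)  # event name -> callback index


def queue_candidate_paths(base_url, nodes):
    """Steps 303 to 305: queue a state path for every DOM event found."""
    queue = deque()
    for node in nodes:
        if node.tag == "script":   # step 303: script tags are not analyzed here
            continue
        if not node.events:        # step 304: no DOM event and callback
            continue
        for event, callback in node.events.items():  # step 305
            queue.append({"base_url": base_url, "id": node.node_id,
                          "xpath": node.xpath, "event": event,
                          "callback": callback})
    return queue


def confirm_paths(queue, base_url, base_content, fetch_target):
    """Step 306: keep only paths that jump or produce new content."""
    confirmed = []
    while queue:
        path = queue.popleft()
        target_url, target_content = fetch_target(path)
        if target_url != base_url or target_content != base_content:
            confirmed.append(path)
    return confirmed
```

The content check here is a plain string comparison; a sentence signature or similarity check could be substituted without changing the flow.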
Embodiment Three
FIG. 4 is a flowchart of generating a state path according to the third embodiment of the present invention. As shown in FIG. 4, the method may specifically include the following steps:
Step 401: downloading each DOM node while capturing the basic page and its script.
Step 402: judging whether downloading of the DOM node is finished or not, if so, finishing the grabbing process of the basic page; otherwise, step 403 is performed on the currently downloaded DOM node.
Step 403: judging whether the currently downloaded DOM node is a script tag, if so, turning to step 402 for the next downloaded DOM node; otherwise, step 404 is performed.
For a node of a script tag, a script corresponding to the script tag may be sent to a script parsing engine for compiling and executing.
Step 404: judging whether the DOM node contains a DOM event and a call-back function, if not, skipping the analysis of the DOM node, and turning to the step 402 for the next downloaded DOM node; if so, step 405 is performed.
Step 405: a state path is generated using DOM events in the DOM node.
In this step, a state path may be generated for every DOM event or, more preferably, only for DOM events in a preset DOM event list. The DOM events in the preset list may include onclick, ondblclick, onmouseover, onmousemove, onmouseout, onblur, onfocus, onchange, onsubmit, onselect, and so on, all of which are DOM events that may generate dynamic information for a page.
Step 406: Acquire the target page corresponding to the state path and judge whether new page content is generated or a page jump occurs. If so, perform step 407; otherwise, go to step 402 for the next downloaded DOM node.
Step 407: Determine the state path as a state path corresponding to the base page, store the state path and its corresponding target page, and go to step 402 for the next downloaded DOM node.
Unlike the second embodiment, the third embodiment judges, each time a state path is generated, whether the target page corresponding to that state path contains dynamic information (step 406) and, if so, stores the state path and its corresponding target page. Steps 403 to 407 generate state paths by analyzing each downloaded DOM node; that is, steps 403 to 407 are performed on each downloaded DOM node until all DOM nodes have been downloaded.
This concludes the flow of the third embodiment.
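The loop of steps 401 to 407 can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the invention: `DomNode`, `fetch_target`, and `is_dynamic` are hypothetical names, the state path is simplified to a tuple of base URL, node XPath, and event name, and the callback function index is omitted.

```python
from dataclasses import dataclass, field

# Preset DOM event list (step 405); standard HTML event-handler attribute names.
PRESET_DOM_EVENTS = {
    "onclick", "ondblclick", "onmouseover", "onmousemove", "onmouseout",
    "onblur", "onfocus", "onchange", "onsubmit", "onselect",
}


@dataclass
class DomNode:
    """Hypothetical stand-in for a downloaded DOM node."""
    tag: str
    xpath: str
    attributes: dict = field(default_factory=dict)


def generate_state_paths(base_url, nodes, fetch_target, is_dynamic):
    """Return (state_path, target_page) pairs kept by steps 401 to 407.

    `fetch_target` obtains the target page for a state path and
    `is_dynamic` judges new content or a page jump; both are assumed
    to be supplied by the caller.
    """
    kept = []
    for node in nodes:                       # steps 401 and 402
        if node.tag == "script":             # step 403: skip script tags
            continue
        events = [name for name in node.attributes
                  if name in PRESET_DOM_EVENTS]
        for event in events:                 # step 404 passed, step 405
            state_path = (base_url, node.xpath, event)
            target = fetch_target(state_path)    # step 406
            if is_dynamic(target):
                kept.append((state_path, target))  # step 407
    return kept
```

The second embodiment differs only in ordering: it would append every generated state path to a queue inside the loop and run the step 406 judgment over the queue after the last node has been downloaded.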
In the second and third embodiments, when the target page corresponding to a state path is acquired and it is judged whether new page content is generated or a page jump occurs, the callback function index corresponding to the DOM event is sent to the script parsing engine. The script parsing engine obtains the corresponding callback function according to the index, the target page corresponding to the state path is obtained from the result of compiling and executing that callback function, and it is then judged whether new page content is generated or a page jump occurs. For an anonymous function, the script parsing engine compiles and executes the callback function in real time after obtaining it; for a non-anonymous function, the script parsing engine may reuse the earlier compilation and execution result of the callback function.
The browser may acquire the target page using the state path in one of two ways, depending on whether the browser is able to parse the state path; these are described in the fourth and fifth embodiments, respectively.
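The lookup by callback function index might be sketched as below. The class `ScriptEngine` and its methods are hypothetical, a Python callable stands in for a compiled script function, and caching the first result of a named function is one assumed reading of "reusing the earlier result".

```python
class ScriptEngine:
    """Toy stand-in for the script parsing engine (illustrative only)."""

    def __init__(self):
        self._callbacks = {}   # callback index -> (callable, is_anonymous)
        self._results = {}     # callback index -> cached execution result
        self.executions = 0    # counts real executions, for demonstration

    def register(self, func, anonymous=False):
        """Store a callback and return its callback function index."""
        index = len(self._callbacks)
        self._callbacks[index] = (func, anonymous)
        return index

    def run(self, index):
        """Resolve the index and return the callback's execution result."""
        func, anonymous = self._callbacks[index]
        if not anonymous and index in self._results:
            return self._results[index]      # reuse earlier result
        self.executions += 1
        result = func()                      # executed in real time
        if not anonymous:
            self._results[index] = result
        return result
```

With this arrangement a named callback is executed once however many state paths refer to it, while an anonymous callback is executed on every request.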
Example IV,
When the browser is able to parse the state path, the corresponding flowchart is shown in fig. 5 and includes the following steps:
Step 501: The browser sends a search request (query) containing a keyword to the search engine.
Step 502: The search engine performs step 205 to return search results containing state paths to the browser.
Step 503: The browser sends a target page request to the target page site according to the state path selected by the user.
When the user clicks the state path of a target page, the browser parses the clicked state path and sends a target page request to the target page site accordingly.
Step 504: The target page site pushes the target page to the browser.
Example V,
When the browser is not able to parse the state path, the corresponding flowchart is shown in fig. 6 and includes the following steps:
Step 601: The browser sends a search request containing a keyword to the search engine.
Step 602: The search engine performs step 205 to return search results containing state paths to the browser.
Step 603: The browser sends the state path selected by the user to the search engine.
Step 604: The search engine sends a target page request to the target page site according to the state path selected by the user.
Because the browser has no state path parsing function, it simply sends the state path selected by the user to the search engine; the search engine parses the state path and sends the target page request to the target page site accordingly.
Step 605: The target page site pushes the target page to the browser.
The target page request sent by the search engine contains browser information, so that the target page site can push the target page to the browser.
This concludes the flow of the fifth embodiment.
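The difference between the fourth and fifth embodiments lies only in who parses the state path and issues the target page request. A small sketch of the two message sequences, with the function name and message labels chosen purely for illustration:

```python
def target_page_flow(browser_parses_state_path):
    """Return the (sender, receiver, message) sequence after the user
    selects a state path: steps 503-504 or steps 603-605."""
    if browser_parses_state_path:
        return [
            ("browser", "target page site", "target page request"),
            ("target page site", "browser", "target page"),
        ]
    return [
        ("browser", "search engine", "state path"),
        ("search engine", "target page site",
         "target page request with browser information"),
        ("target page site", "browser", "target page"),
    ]
```

In both sequences the target page is delivered to the browser; in the second, that is possible because the request issued by the search engine carries the browser information.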
In another case, if the search results returned by the search engine in step 205 of the first embodiment include target page snapshot information and the user clicks a target page snapshot, the browser and the search engine may interact as described in the sixth embodiment.
Example VI,
Fig. 7 is a flowchart of a browser obtaining a target page snapshot according to the sixth embodiment. As shown in fig. 7, the method may include the following steps:
Step 701: The browser sends a search request containing a keyword to the search engine.
Step 702: The search engine performs step 205 to return search results containing state paths and target page snapshot information to the browser.
Step 703: The browser sends the target page snapshot information selected by the user to the search engine.
Step 704: The search engine determines the corresponding target page snapshot and returns it to the browser.
Because the search engine stores all target page snapshots locally, no interaction with the target page site is needed; the corresponding target page snapshot is retrieved locally and returned to the browser.
The above is a detailed description of the method provided by the present invention. The apparatus for acquiring a target page provided by the present invention is described in detail below. As shown in fig. 8, the apparatus may include: a first capture unit 800, an analysis unit 810, and a second capture unit 820.
The first capture unit 800 is configured to capture the base page corresponding to the received URL and the script of the base page.
The analysis unit 810 is configured to analyze the base page and the script captured by the first capture unit 800 and to generate one or more state paths that correspond to the base page and contain dynamic information. The state path includes: the URL of the base page, the position information of the DOM event that generates the dynamic information in the base page, and the callback function index corresponding to the DOM event.
The second capture unit 820 is configured to capture the target page using the state path generated by the analysis unit 810.
The analysis unit 810 may adopt either of two structures. The first structure, shown in fig. 9, specifically includes: a first judging module 811, a second judging module 812, a first path generating module 813, and a first path determining module 814.
The first capture unit 800 downloads each DOM node while capturing the base page and its script, sends the currently downloaded DOM node to the first judging module 811, and, once all DOM nodes have been downloaded, sends a determination notification to the first path determining module 814.
The first judging module 811 is configured to judge whether the currently downloaded DOM node is a script tag; if so, to trigger the first capture unit 800 to download the next DOM node; otherwise, to send a judgment notification to the second judging module 812.
The second judging module 812 is configured to judge whether the currently downloaded DOM node contains a DOM event and a callback function; if not, to trigger the first capture unit 800 to download the next DOM node; if so, to send an execution notification to the first path generating module 813.
The first path generating module 813 is configured to, after receiving the execution notification, generate a state path using the DOM event contained in the currently downloaded DOM node, store the generated state path in the state path queue, and trigger the first capture unit 800 to download the next DOM node.
The first path determining module 814 is configured to, on receiving the determination notification, trigger the second capture unit 820 to obtain the target page corresponding to each state path in the state path queue one by one, judge from the result obtained by the second capture unit 820 whether new page content is generated or a page jump occurs, and determine each state path that generates new page content or a page jump as a state path corresponding to the base page.
The second structure of the analysis unit 810, shown in fig. 10, may specifically include: a third judging module 911, a fourth judging module 912, a second path generating module 913, and a second path determining module 914.
The first capture unit 800 downloads each DOM node while capturing the base page and its script, and sends the currently downloaded DOM node to the third judging module 911 until all DOM nodes have been downloaded.
The third judging module 911 is configured to judge whether the currently downloaded DOM node is a script tag; if so, to trigger the first capture unit 800 to download the next DOM node; otherwise, to send a judgment notification to the fourth judging module 912.
The fourth judging module 912 is configured to judge whether the currently downloaded DOM node contains a DOM event and a callback function; if not, to trigger the first capture unit 800 to download the next DOM node; if so, to send an execution notification to the second path generating module 913.
The second path generating module 913 is configured to, on receiving the execution notification, generate a state path using the DOM event contained in the currently downloaded DOM node and send the generated state path to the second path determining module 914.
The second path determining module 914 is configured to, on receiving the state path, trigger the second capture unit 820 to obtain the target page corresponding to the state path and judge from the result obtained by the second capture unit 820 whether new page content is generated or a page jump occurs. If so, it determines the state path as a state path corresponding to the base page and triggers the first capture unit 800 to download the next DOM node; otherwise, it simply triggers the first capture unit 800 to download the next DOM node.
In both structures, judging whether a page jump occurs may include: if the URL of the acquired target page differs from the URL of the base page, determining that a page jump has occurred.
Judging whether new page content is generated may include: comparing the acquired target page with the base page by sentence signature or character string and, if the comparison result shows that the two pages have different page content, determining that new page content is generated; or calculating the similarity between the acquired target page and the base page and, if the calculation result shows that the two pages have different page content, determining that new page content is generated.
The position information of the DOM event in the state path includes: a DOM node identifier, the XPath of the DOM node, and a DOM event identifier.
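The fields of a state path can be pictured as a small record. The sketch below assumes Python dataclasses and a JSON encoding; the class names, field names, and the JSON format are illustrative choices, since the text specifies only which pieces of information a state path carries, not how they are encoded.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EventLocation:
    """Position information of a DOM event in the base page."""
    node_id: str    # DOM node identifier
    xpath: str      # XPath of the DOM node
    event_id: str   # DOM event identifier, e.g. "onclick"


@dataclass(frozen=True)
class StatePath:
    base_url: str            # URL of the base page
    location: EventLocation  # where the DOM event sits in the base page
    callback_index: int      # index of the callback bound to the event

    def to_json(self):
        """Serialize to JSON (an assumed wire format)."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        data = json.loads(text)
        return cls(data["base_url"],
                   EventLocation(**data["location"]),
                   data["callback_index"])
```

Any component that parses a state path, whether in the browser or in the search engine, would need to agree on such an encoding before it could turn a selected state path into a target page request.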
Further, the apparatus may also include a storage unit 830, configured to store the state path corresponding to the base page generated by the analysis unit 810 and the snapshot of the target page captured by the second capture unit 820, and to establish and store an index of the target page.
In addition, the storage unit 830 stores the base page captured by the first capture unit 800; the base page, the state path, and the snapshot of the target page may be stored separately or together.
Fig. 11 is a schematic structural diagram of a search engine provided in the present invention, and as shown in fig. 11, the search engine includes: the apparatus shown in fig. 8, a user interface unit 1101, and a search processing unit 1102.
The user interface unit 1101 is configured to receive a search request from the browser and send the keyword contained in the search request to the search processing unit 1102, and to return the search result sent by the search processing unit 1102 to the browser, so that the browser can obtain the corresponding target page using the state path selected by the user.
The search processing unit 1102 is configured to match the keyword with the index of the target page stored in the storage unit 830, include the state path corresponding to the matched target page in the search result, and send the search result to the user interface unit 1101.
Preferably, the search result may further include snapshot information of the matched target page. In this case, the user interface unit 1101 is further configured to send the snapshot information of the target page selected by the user, as returned by the browser, to the search processing unit 1102, and to return the snapshot of the target page sent by the search processing unit 1102 to the browser.
The search processing unit 1102 is further configured to obtain a snapshot of the corresponding target page from the storage unit 830 according to the snapshot information of the target page selected by the user, and send the snapshot to the user interface unit 1101.
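The cooperation between the storage unit and the search processing unit could look roughly like the sketch below. It assumes a simple word-level inverted index built from the snapshot text and uses the page identifier as the snapshot information; `TargetPageStore` and its methods are hypothetical names, and real indexing would involve far more than whitespace tokenization.

```python
from collections import defaultdict


class TargetPageStore:
    """Illustrative storage unit holding state paths, snapshots, and an index."""

    def __init__(self):
        self._index = defaultdict(set)   # word -> ids of target pages
        self._pages = {}                 # page id -> (state path, snapshot)

    def store(self, page_id, state_path, snapshot_text):
        """Store a target page snapshot with its state path and index it."""
        self._pages[page_id] = (state_path, snapshot_text)
        for word in snapshot_text.lower().split():
            self._index[word].add(page_id)

    def search(self, keyword):
        """Return one result per matched page: state path plus snapshot info."""
        page_ids = sorted(self._index.get(keyword.lower(), ()))
        return [{"state_path": self._pages[p][0], "snapshot_info": p}
                for p in page_ids]

    def snapshot(self, snapshot_info):
        """Return the locally stored snapshot for the selected result."""
        return self._pages[snapshot_info][1]
```

Here `search` plays the role of matching a keyword against the target page index, and `snapshot` corresponds to serving a stored snapshot without contacting the target page site.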
Furthermore, when the browser is not able to parse the state path, the search engine needs to provide this function in order to help push the target page to the browser. In this case, the search engine may further include: a path parsing unit 1103 and a network interface unit 1104.
The user interface unit 1101 is further configured to, after receiving the state path selected by the user and sent by the browser, send the state path to the path parsing unit 1103.
The path parsing unit 1103 is configured to generate a target page request according to the received state path.
The network interface unit 1104 is configured to send the target page request generated by the path parsing unit 1103 to the target page site.
Fig. 12 is a schematic structural diagram of a browser that has a state path parsing function. As shown in fig. 12, the browser may include: a network side interface unit 1201, a path parsing unit 1202, and a user side interface unit 1203.
The network side interface unit 1201 is configured to receive the search result containing state paths sent by the search engine shown in fig. 11, and to send the target page request sent by the path parsing unit 1202 to the target page site.
The user side interface unit 1203 is configured to display the search result received by the network side interface unit 1201 to the user, and to send the state path selected by the user to the path parsing unit 1202.
The path parsing unit 1202 is configured to generate a target page request according to the state path selected by the user and to send the target page request to the network side interface unit 1201.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (22)

1. A method for obtaining a target page is characterized by comprising the following steps:
A. capturing a basic page corresponding to the received uniform resource locator URL and a script of the basic page;
B. analyzing the captured basic page and the script, generating more than one state path containing dynamic information corresponding to the basic page, and capturing a target page by using the generated state paths; wherein the state path comprises: the URL of the basic page, position information of a Document Object Model (DOM) event generating the dynamic information in the basic page, and a callback function index corresponding to the DOM event.
2. The method according to claim 1, wherein step B specifically comprises:
downloading each DOM node in the grabbing process of the basic page and the script, and sequentially executing steps B11 to B13 on the downloaded DOM nodes until the downloading of all the DOM nodes is finished, and then executing step B14;
B11. judging whether the currently downloaded DOM node is a script tag; if so, going to step B11 for the next downloaded DOM node; otherwise, executing step B12;
B12. judging whether the currently downloaded DOM node contains a DOM event and a callback function; if not, going to step B11 for the next downloaded DOM node; if so, executing step B13;
B13. generating a state path by using the DOM event contained in the currently downloaded DOM node, storing the generated state path in a state path queue, and going to step B11 for the next downloaded DOM node;
B14. acquiring, one by one, the target page corresponding to each state path in the state path queue, judging whether new page content is generated or a page jump occurs, and determining each state path that generates new page content or a page jump as a state path corresponding to the basic page.
3. The method according to claim 1, wherein step B specifically comprises:
downloading each DOM node in the grabbing process of the basic page and the script, and sequentially executing the steps B21 to B23 on the downloaded DOM nodes until the downloading of all the DOM nodes is finished;
B21. judging whether the currently downloaded DOM node is a script tag; if so, going to step B21 for the next downloaded DOM node; otherwise, executing step B22;
B22. judging whether the currently downloaded DOM node contains a DOM event and a callback function; if not, going to step B21 for the next downloaded DOM node; if so, executing step B23;
B23. generating a state path by using the DOM event contained in the currently downloaded DOM node;
B24. acquiring the target page corresponding to the state path and judging whether new page content is generated or a page jump occurs; if so, determining that the state path is a state path corresponding to the basic page and going to step B21 for the next downloaded DOM node; otherwise, going to step B21 for the next downloaded DOM node.
4. The method of claim 2 or 3, wherein determining whether a page jump has occurred comprises: if the URL of the acquired target page differs from the URL of the basic page, determining that a page jump has occurred.
5. The method of claim 2 or 3, wherein determining whether new page content is generated comprises: comparing the acquired target page with the basic page by sentence signature or character string, and if the comparison result shows that the target page and the basic page have different page content, determining that new page content is generated; or
calculating the similarity between the acquired target page and the basic page, and if the calculation result shows that the target page and the basic page have different page content, determining that new page content is generated.
6. A method according to any of claims 1 to 3, wherein the position information of the DOM event comprises: a DOM node identifier, the XPath of the DOM node, and a DOM event identifier.
7. A method according to any one of claims 1 to 3, characterized in that after said step B, the method further comprises:
C. storing the state path corresponding to the basic page generated in step B and a snapshot of the captured target page, and establishing and storing an index of the target page.
8. A method for obtaining a target page, characterized in that, after the method according to claim 7 is performed, the method further comprises:
after receiving a search request from a browser, matching keywords contained in the search request with indexes of stored target pages, including a state path corresponding to the matched target page in a search result, and returning the search result to the browser, so that the browser can obtain the corresponding target page by using the state path selected by a user.
9. The method of claim 8, wherein the search results further comprise: snapshot information of the matched target page;
after receiving the snapshot information of the target page selected by the user and returned by the browser, returning the corresponding snapshot of the target page to the browser.
10. The method of claim 8, wherein after including the status path corresponding to the matched target page in the search result and returning the search result to the browser, the method further comprises:
after receiving the state path selected by the user and sent by the browser, sending a target page request to a target page site according to the state path selected by the user, so that the target page site can push the target page to the browser.
11. A method for obtaining a target page is characterized by comprising the following steps:
the browser receives a search result containing a state path returned by a search engine after sending a search request to the search engine;
sending a target page request to a target page site according to a state path selected by a user;
receiving a target page pushed by the target page site;
wherein the search result containing the state path is returned by the search engine using the method of claim 8.
12. An apparatus for obtaining a target page, the apparatus comprising:
the first grabbing unit is used for grabbing a basic page corresponding to the received uniform resource locator URL and a script of the basic page;
the analysis unit is used for analyzing the basic page and the script captured by the first grabbing unit and generating more than one state path which corresponds to the basic page and contains dynamic information; wherein the state path comprises: the URL of the basic page, position information of a Document Object Model (DOM) event generating the dynamic information in the basic page, and a callback function index corresponding to the DOM event;
and the second grabbing unit is used for grabbing the target page by using the state path generated by the analysis unit.
13. The apparatus according to claim 12, wherein the analysis unit comprises in particular: the device comprises a first judgment module, a second judgment module, a first path generation module and a first path determination module;
the first grabbing unit downloads each DOM node in the grabbing process of the basic page and the script thereof, sends the currently downloaded DOM node to the first judging module, and sends a determination notification to the first path determining module after the downloading of all the DOM nodes is finished;
the first judging module is used for judging whether the currently downloaded DOM node is a script tag; if so, triggering the first grabbing unit to download the next DOM node; otherwise, sending a judgment notification to the second judging module;
the second judging module is used for judging whether the currently downloaded DOM node contains a DOM event and a callback function; if not, triggering the first grabbing unit to download the next DOM node; if so, sending an execution notification to the first path generating module;
the first path generating module is used for, after receiving the execution notification, generating a state path by using the DOM event contained in the currently downloaded DOM node, storing the generated state path in a state path queue, and triggering the first grabbing unit to download the next DOM node;
the first path determining module is used for, when receiving the determination notification, triggering the second grabbing unit to obtain, one by one, the target page corresponding to each state path in the state path queue, judging according to the obtaining result of the second grabbing unit whether new page content is generated or a page jump occurs, and determining each state path that generates new page content or a page jump as a state path corresponding to the basic page.
14. The apparatus according to claim 12, wherein the analysis unit comprises in particular: the device comprises a third judging module, a fourth judging module, a second path generating module and a second path determining module;
the first grabbing unit downloads each DOM node in the grabbing process of the basic page and the script thereof, and sends the currently downloaded DOM node to the third judging module until the downloading of all DOM nodes is finished;
the third judging module is used for judging whether the currently downloaded DOM node is a script tag or not, if so, the first grabbing unit is triggered to download the next DOM node, and otherwise, a judgment notice is sent to the fourth judging module;
the fourth judging module is used for judging whether the currently downloaded DOM node contains a DOM event and a callback function, if not, the first capturing unit is triggered to download the next DOM node, and if so, an execution notification is sent to the second path generating module;
the second path generating module is configured to generate a state path by using a DOM event included in a currently downloaded DOM node when receiving the execution notification, and send the generated state path to the second path determining module;
and the second path determining module is used for triggering the second capturing unit to obtain a target page corresponding to the state path when the state path is received, judging whether new page content is generated or page jump is generated according to the obtaining result of the second capturing unit, if so, determining that the state path is the state path corresponding to the basic page, and triggering the first capturing unit to download a next DOM node, otherwise, triggering the first capturing unit to download the next DOM node.
15. The apparatus of claim 13 or 14, wherein determining whether a page jump occurs comprises: if the URL of the acquired target page differs from the URL of the basic page, determining that a page jump has occurred.
16. The apparatus of claim 13 or 14, wherein determining whether new page content is generated comprises: comparing the acquired target page with the basic page by sentence signature or character string, and if the comparison result shows that the target page and the basic page have different page content, determining that new page content is generated; or
calculating the similarity between the acquired target page and the basic page, and if the calculation result shows that the target page and the basic page have different page content, determining that new page content is generated.
17. The apparatus according to any of claims 12 to 14, wherein the position information of the DOM event comprises: a DOM node identifier, the XPath of the DOM node, and a DOM event identifier.
18. The apparatus of any one of claims 12 to 14, further comprising:
and the storage unit is used for storing the state path corresponding to the basic page generated by the analysis unit and the snapshot of the target page captured by the second capture unit, and establishing and storing the index of the target page.
19. A search engine, comprising: the apparatus, user interface unit, and search processing unit of claim 18;
the user interface unit is used for receiving a search request from a browser and sending a keyword contained in the search request to the search processing unit; returning the search result sent by the search processing unit to the browser, so that the browser can obtain a corresponding target page by using the state path selected by the user;
and the search processing unit is used for matching the keyword with the index of the target page stored in the storage unit of the device, and sending the state path corresponding to the matched target page to the user interface unit by including the state path in the search result.
20. The search engine of claim 19, wherein the search result further comprises: snapshot information of the matched target page;
the user interface unit is also used for sending the snapshot information of the target page selected by the user and returned by the browser to the search processing unit; returning the snapshot of the target page sent by the search processing unit to the browser;
the search processing unit is further configured to obtain a snapshot of the corresponding target page from the storage unit according to the snapshot information of the target page selected by the user, and send the snapshot to the user interface unit.
21. The search engine of claim 19, further comprising: a path analysis unit and a network interface unit;
the user interface unit is also used for sending the state path to the path analysis unit after receiving the state path selected by the user and sent by the browser;
the path analysis unit is used for generating a target page request according to the received state path;
and the network interface unit is used for sending the target page request generated by the path analysis unit to a target page site.
22. A browser, comprising: the system comprises a network side interface unit, a path analysis unit and a user side interface unit;
the network side interface unit, configured to receive a search result including a status path sent by the search engine according to claim 19; sending the target page request sent by the path analysis unit to a target page site;
the user side interface unit is used for displaying the search result received by the network side interface unit to a user; sending the state path selected by the user to the path analysis unit;
and the path analysis unit is used for generating a target page request according to the state path selected by the user and sending the target page request to the network side interface unit.
CN2010105314609A 2010-11-04 2010-11-04 Method and device for acquiring destination page, search engine and browser Active CN101984429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105314609A CN101984429B (en) 2010-11-04 2010-11-04 Method and device for acquiring destination page, search engine and browser

Publications (2)

Publication Number Publication Date
CN101984429A true CN101984429A (en) 2011-03-09
CN101984429B CN101984429B (en) 2012-03-14

Family

ID=43641598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105314609A Active CN101984429B (en) 2010-11-04 2010-11-04 Method and device for acquiring destination page, search engine and browser

Country Status (1)

Country Link
CN (1) CN101984429B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070294281A1 (en) * 2006-05-05 2007-12-20 Miles Ward Systems and methods for consumer-generated media reputation management
CN101587488A (en) * 2009-05-25 2009-11-25 深圳市腾讯计算机系统有限公司 Method and device for detecting re-orientation of page in search engine

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150307A (en) * 2011-12-06 2013-06-12 株式会社理光 Method and equipment for searching name related to thematic word from network
CN103268361A (en) * 2013-06-07 2013-08-28 百度在线网络技术(北京)有限公司 Extracting method, device and system of hidden URL (Uniform Resource Locator) in webpage
CN103268361B (en) * 2013-06-07 2019-05-31 百度在线网络技术(北京)有限公司 Extracting method, the device and system of URL are hidden in webpage
CN103645968A (en) * 2013-12-02 2014-03-19 北京奇虎科技有限公司 Browser status restoration method and device
CN103645968B (en) * 2013-12-02 2017-03-15 北京奇虎科技有限公司 A kind of browser status restored method and device
CN103955495A (en) * 2014-04-18 2014-07-30 百度在线网络技术(北京)有限公司 Downloading method and device for page sub-resource
CN105740290A (en) * 2014-12-11 2016-07-06 富士通株式会社 System and method for searching self-adaptive networks of mobile devices
CN104408198B (en) * 2014-12-15 2018-07-17 北京国双科技有限公司 The acquisition methods and device of Webpage content
CN104408198A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Method and device for acquiring webpage contents
CN105867897A (en) * 2015-12-07 2016-08-17 乐视网信息技术(北京)股份有限公司 Page redirection analysis method and apparatus
WO2017124692A1 (en) * 2016-01-20 2017-07-27 百度在线网络技术(北京)有限公司 Method and apparatus for searching for conversion relationship between form pages and target pages
CN105740417A (en) * 2016-01-29 2016-07-06 青岛海信移动通信技术股份有限公司 Webpage based target data search method and module, browser and terminal
CN107025111A (en) * 2017-03-17 2017-08-08 烽火通信科技股份有限公司 The method and system that a kind of browser target pages entire screen switch is shown
CN107169011A (en) * 2017-03-31 2017-09-15 百度在线网络技术(北京)有限公司 The original recognition methods of webpage based on artificial intelligence, device and storage medium
CN107169011B (en) * 2017-03-31 2021-06-11 百度在线网络技术(北京)有限公司 Webpage originality identification method and device based on artificial intelligence and storage medium
CN110874446A (en) * 2018-08-31 2020-03-10 北京京东尚科信息技术有限公司 Page display method and system, computer system and computer readable medium
CN110674427A (en) * 2019-09-20 2020-01-10 北京达佳互联信息技术有限公司 Method, device, equipment and storage medium for responding to webpage access request
CN110674427B (en) * 2019-09-20 2022-04-22 北京达佳互联信息技术有限公司 Method, device, equipment and storage medium for responding to webpage access request
CN111177539A (en) * 2019-12-16 2020-05-19 北京百度网讯科技有限公司 Search result page generation method and device, electronic equipment and storage medium
WO2021218468A1 (en) * 2020-04-29 2021-11-04 百度在线网络技术(北京)有限公司 Data update method and device, search server, terminal, and storage medium
US11803597B2 (en) 2020-04-29 2023-10-31 Baidu Online Network Technology (Beijing) Co., Ltd. Data updating method, apparatus, search server, terminal and storage medium
CN113657076A (en) * 2021-08-17 2021-11-16 中国平安财产保险股份有限公司 Page operation record table generation method and device, electronic equipment and storage medium
CN113657076B (en) * 2021-08-17 2023-08-22 中国平安财产保险股份有限公司 Page operation record table generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN101984429B (en) 2012-03-14

Similar Documents

Publication Publication Date Title
CN101984429B (en) Method and device for acquiring destination page, search engine and browser
US7885950B2 (en) Creating search enabled web pages
CN102043834B (en) Method for realizing searching by utilizing client and search client
US8799262B2 (en) Configurable web crawler
US7536389B1 (en) Techniques for crawling dynamic web content
US8245198B2 (en) Mapping breakpoints between web based documents
US8856263B2 (en) Systems and methods thereto for acceleration of web pages access using next page optimization, caching and pre-fetching techniques
US20110173178A1 (en) Method and system for obtaining script related information for website crawling
US20140286333A1 (en) Method and system for local calling via webpage
CN103279507B (en) Webpage spider operational method and system
WO2010114913A1 (en) Method and system of retrieving ajax web page content
CN104572777B (en) Webpage loading method and device based on UIWebView component
US20040030780A1 (en) Automatic search responsive to an invalid request
CN102054028A (en) Web crawler system with page-rendering function and implementation method thereof
KR101689745B1 (en) Web browsing system and method for rendering dynamic resource URI of script
WO2012071993A1 (en) Processing method and device for world wide web page
Gupta et al. Implementation of web crawler
US9959305B2 (en) Annotating structured data for search
CN104704495B (en) The method and device of a kind of information search
CN109246069B (en) Webpage login method and device and readable storage medium
US20090006481A1 (en) Information providing method and information providing apparatus
Panum et al. Kraaler: A user-perspective web crawler
EP2662785A2 (en) A method and system for non-ephemeral search
CA2538504C (en) Method and system for obtaining script related information for website crawling
Vadivel et al. Component based effective web crawler and indexer using web services

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: BEIJING BAIDU NETWORK INFORMATION TECHNOLOGY CO.,

Free format text: FORMER OWNER: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) CO., LTD.

Effective date: 20111228

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20111228

Address after: 100085 2/F, Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing

Applicant after: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

Address before: 100085 Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

C14 Grant of patent or utility model
GR01 Patent grant