CN109582850B - Webpage crawling method and device, storage medium and electronic equipment - Google Patents
Webpage crawling method and device, storage medium and electronic equipment
- Publication number
- CN109582850B (CN201811467095.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- style data
- file name
- style
- real
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a webpage crawling method and device, a storage medium, and electronic equipment. The method comprises: acquiring style data of a target webpage, wherein the style data is generated by applying an anti-crawling strategy to source data of the target webpage; determining real data corresponding to the style data according to a pre-generated correspondence between style data and real data, and replacing the style data with the corresponding real data; and determining all real content of the target webpage. Because the real data is looked up from the pre-generated correspondence, all real data of the webpage can be acquired quickly and accurately; image recognition does not need to be applied repeatedly to the data in the webpage, which saves a large amount of processing resources and greatly improves crawling speed and efficiency.
Description
Technical Field
The invention relates to the technical field of webpage crawling, and in particular to a webpage crawling method, a webpage crawling device, a storage medium, and electronic equipment.
Background
A traditional crawler starts from one or more initial URLs (uniform resource locators), acquires the URLs and other content on the corresponding webpages, and puts each new URL found on the current page into a queue for further crawling until a stop condition of the system is met. All content captured by the crawler is stored, then classified, analyzed, and filtered according to keywords, text, pictures, audio, video, and the like, and indexed for later query and retrieval.
However, some websites adopt anti-crawling measures to prevent crawlers from acquiring webpage source code, so that crawlers cannot accurately acquire the information of a target webpage. To accurately identify a webpage that adopts an anti-crawling strategy, a common method is to open the webpage, capture and store a screenshot, and recognize the screenshot through optical character recognition (OCR) to obtain all real text data in the webpage. However, OCR occupies a large amount of CPU resources and processing time, so webpage crawling efficiency is low.
Disclosure of Invention
In order to solve the foregoing problems, embodiments of the present invention provide a method, an apparatus, a storage medium, and an electronic device for crawling a web page.
In a first aspect, an embodiment of the present invention provides a method for crawling a web page, including:
acquiring style data of a target webpage, wherein the style data is generated by applying an anti-crawling strategy to source data of the target webpage;
determining real data corresponding to the style data according to a pre-generated correspondence between style data and real data, and replacing the style data with the corresponding real data;
and determining all real content of the target webpage, wherein the real content comprises the real data corresponding to the style data.
In a possible implementation manner, after the acquiring of the style data of the target webpage, the method further includes:
judging whether a correspondence between style data and real data that matches the acquired style data exists, and, when such a matching correspondence exists, determining the real data corresponding to the style data according to the matching correspondence;
and when no such matching correspondence exists, establishing a correspondence between the style data and the real data.
In a possible implementation manner, the judging whether a matching correspondence exists includes:
determining the file name of the style data, and judging whether a historical file name matching the file name exists, wherein a historical file name is the file name of historical style data that has already been analyzed;
when a matching historical file name exists, determining that a matching correspondence exists, wherein the matching correspondence is the correspondence between valid historical style data and real data determined from the analysis result of the valid historical style data, and the valid historical style data is the historical style data corresponding to the historical file name that matches the file name.
In a possible implementation manner, the judging whether a historical file name matching the file name exists includes:
dividing the file name and the historical file name each into a plurality of substrings, and determining the order of each substring within the file name and the order of each substring within the historical file name;
starting from the last substring in that order, judging whether the substring of the file name is the same as the corresponding substring of the historical file name, and determining that the file name does not match the historical file name when the two substrings differ;
and when the two substrings are the same, moving to the next substring in reverse order and repeating the comparison until either the file name is determined not to match the historical file name or every substring of the file name is determined to match the corresponding substring of the historical file name, in which case the file name is determined to match the historical file name.
In a possible implementation manner, the establishing of the correspondence between the style data and the real data includes:
creating a local webpage and loading the style data of the target webpage into the local webpage;
acquiring a webpage image of the local webpage, recognizing the webpage image, and determining the real data in the webpage image;
and establishing a correspondence between the style data and the corresponding recognized real data.
In a possible implementation manner, after the establishing of the correspondence between the style data and the real data, the method further includes:
storing the correspondence between the style data and the real data in a database.
In one possible implementation, the style data includes text style data and/or picture style data.
In a second aspect, an embodiment of the present invention further provides a device for crawling a web page, including:
an acquisition module, configured to acquire style data of a target webpage, wherein the style data is generated by applying an anti-crawling strategy to source data of the target webpage;
a processing module, configured to determine real data corresponding to the style data according to a pre-generated correspondence between style data and real data, and to replace the style data with the corresponding real data;
and a determining module, configured to determine all real content of the target webpage, wherein the real content comprises the real data corresponding to the style data.
In a third aspect, an embodiment of the present invention further provides a storage medium, where the storage medium stores computer-executable instructions, and the computer-executable instructions are configured to perform any one of the above methods for crawling a web page.
In a fourth aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform any one of the above methods for crawling a web page.
In the solution provided by the first aspect of the embodiments of the present invention, when a webpage that adopts an anti-crawling strategy is crawled, the style data in the webpage is extracted and the real data corresponding to the style data is determined according to the pre-generated correspondence between style data and real data, so that all real data of the webpage can be acquired quickly and accurately; image recognition does not need to be applied repeatedly to the data in the webpage, which saves a large amount of processing resources and greatly improves crawling speed and efficiency.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for crawling a web page provided by an embodiment of the present invention;
FIG. 2 is a flow chart of judging whether a correspondence exists in the method for crawling a web page provided by an embodiment of the present invention;
FIG. 3 is a flow chart of establishing the correspondence between style data and real data in the method for crawling a web page provided by an embodiment of the present invention;
FIG. 4 is a flow chart of another method for crawling a web page provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a device for crawling a web page provided by an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of another device for crawling a web page provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of the specific structure of a parsing module in the device for crawling a web page provided by an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device for executing the method for crawling a web page provided by an embodiment of the present invention.
Detailed Description
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
The method for crawling a web page provided by an embodiment of the present invention is shown in FIG. 1 and includes the following steps 101-103:
Step 101: Acquiring style data of the target webpage, wherein the style data is generated by applying an anti-crawling strategy to source data of the target webpage.
In the embodiment of the invention, the target webpage is the webpage to be crawled. The target webpage adopts an anti-crawling strategy, under which it applies special processing to some of its characters and generates style data. Specifically, the source data of the target webpage may be crawled first; the source data may be HTML source code, which includes the nodes of the target webpage, such as meta nodes and body nodes. Because the target webpage is specially processed under the anti-crawling strategy, its source data also includes the style data generated under that strategy, and the style data may be a style file or a style link. The style data may specifically include text style data, picture style data, and the like. Text style data refers to replacing text of the original webpage with other text or with text in another format (or font), in which case a traditional crawling method may acquire incorrect data. For example, the page of the target webpage displays the driving range of a certain vehicle as "12.66 kilometers", but the driving range in the source code of the target webpage is "45.22 kilometers"; the target webpage displays the real driving range of 12.66 kilometers by applying style formatting to the source code. This driving range is a kind of text style data.
Picture style data refers to replacing characters of the original webpage with corresponding pictures, or replacing a picture of the original webpage with another picture; for example, each digit, or each combination of several digits, in a private telephone number is replaced with a corresponding picture. A traditional crawling method cannot acquire the telephone number correctly unless image recognition is used.
Step 102: Determining real data corresponding to the style data according to the pre-generated correspondence between style data and real data, and replacing the style data with the corresponding real data.
In the embodiment of the present invention, a correspondence between style data and real data is generated in advance. The correspondence indicates the real data (real text or a real picture) that a given piece of style data actually corresponds to, for example the real characters corresponding to a certain picture. The real data corresponding to the style data of the target webpage can be determined quickly from this correspondence, and the style data in the target webpage is then replaced with the recognizable real data, so that the target webpage can be conveniently recognized afterwards.
Step 103: Determining all real content of the target webpage, wherein the real content comprises the real data corresponding to the style data.
In the embodiment of the present invention, once the style data in the target webpage has been replaced with the corresponding real data, all real content of the target webpage can be acquired quickly; that is, all real data in the target webpage is crawled, including the real data corresponding to the style data determined in step 102.
The webpage crawling method provided by the embodiment of the invention extracts the style data in the webpage when the webpage adopting the anti-crawling strategy is crawled, and determines the real data corresponding to the style data according to the corresponding relation between the pre-generated style data and the real data, so that all the real data of the webpage can be quickly and accurately acquired; the method does not need to repeatedly use the image recognition technology to recognize the data in the webpage, saves a large amount of processing resources, and greatly improves the capturing speed and the capturing efficiency.
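The replacement in steps 102 and 103 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the pre-generated correspondence is available as a dictionary from displayed-in-source text to real text; the function name `restore_real_content` and the sample page text are hypothetical and do not come from the patent. A single-pass regular-expression substitution is used so that a value that has just been restored is not replaced a second time.

```python
import re
from typing import Dict


def restore_real_content(source_text: str, correspondence: Dict[str, str]) -> str:
    """Replace every piece of style data with its real data (step 102)
    and return the resulting real content of the page (step 103)."""
    if not correspondence:
        return source_text
    # Longest keys first, so a longer obfuscated fragment wins over its prefix.
    keys = sorted(correspondence, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # One pass over the text: each match is looked up exactly once.
    return pattern.sub(lambda match: correspondence[match.group(0)], source_text)


# Hypothetical correspondence, mirroring the driving-range example above.
correspondence = {"45.22": "12.66"}
print(restore_real_content("Driving range: 45.22 kilometers", correspondence))
# prints "Driving range: 12.66 kilometers"
```

No image recognition is involved at this stage; the lookup is the only work done per page once the correspondence exists.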
On the basis of the foregoing embodiment, after "acquiring the style data of the target webpage" in step 101, the method further includes a process of judging whether a correspondence exists. As shown in FIG. 2, the process specifically includes steps 201-203:
Step 201: Judging whether a correspondence between style data and real data that matches the acquired style data exists; when such a matching correspondence exists, continuing to step 202; when no such matching correspondence exists, continuing to step 203.
Step 202: Determining the real data corresponding to the style data according to the matching correspondence.
Step 203: Establishing a correspondence between the style data and the real data.
In the embodiment of the present invention, when the style data of the target webpage is obtained, it is necessary to judge whether a matching correspondence between style data and real data exists, that is, whether a correspondence for this style data has already been set, or equivalently whether this style data has already been analyzed. When a matching correspondence exists, the real data corresponding to the style data can be determined according to it. Step 202 is substantially the same as "determining the real data corresponding to the style data according to the pre-generated correspondence between style data and real data" in step 102; that is, when a matching correspondence exists, step 102 may be executed next.
When no matching correspondence exists, the style data of the target webpage has never been analyzed. In this case the style data needs to be analyzed to determine the real data of the target webpage, and the correspondence between the style data and the real data is established for use in subsequent crawling.
On the basis of the above embodiment, whether a matching correspondence exists is determined by the file name of the style data. Specifically, the judging in step 201 includes: determining the file name of the style data, and judging whether a historical file name matching the file name exists, wherein a historical file name is the file name of historical style data that has already been analyzed.
In the embodiment of the present invention, the style data may specifically be a style file or a style link, and the file name of the style data is the name of the style file or the address of the style link. Because the name of a style file is generally a long character string (such as a 64-character string), the probability that two different pieces of style data share the same name is very low; likewise, different link addresses point to different network resources. In the embodiment of the invention, different file names are therefore treated as corresponding to different style data, and whether a correspondence exists is determined by judging whether the file name has already been analyzed.
Specifically, after each piece of style data is analyzed, it is treated as historical style data and its file name is recorded as a historical file name. When a piece of style data currently needs to be analyzed, it is judged whether the recorded historical file names include one that is consistent with the file name of the style data to be analyzed. If so, the style data has already been analyzed, that is, the correspondence between the style data and the real data exists; otherwise, no correspondence exists. For example, if three pieces of style data with file names f1, f2, and f3 have been analyzed, then f1, f2, and f3 are three pieces of historical style data. If style data with file name f2 now needs to be analyzed, the file name shows that it has already been analyzed, and step 202 follows. If style data with file name f4 now needs to be analyzed and there is no historical style data with file name f4, the style data with file name f4 must be analyzed, that is, step 203 follows.
Specifically, when a matching historical file name exists, it is determined that a matching correspondence exists, and the matching correspondence is the correspondence between the valid historical style data and the real data determined from the analysis result of the valid historical style data. The valid historical style data is the historical style data corresponding to the historical file name that matches the file name.
Specifically, in the historical analysis process, each time a piece of historical style data is analyzed, its analysis result is determined, and the analysis result can be used as the real data corresponding to that historical style data. That is, after historical style data is analyzed, its file name (the historical file name) is marked as analyzed, and the correspondence between the historical style data and its analysis result is taken as the correspondence between style data and real data. At the current stage, when a historical file name matching the file name is found, a matching correspondence exists. The historical style data matching the file name, namely the valid historical style data, is then determined, and the correspondence between the valid historical style data and its analysis result is the correspondence between the current style data and the real data. After this correspondence is determined, step 202 can be executed.
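The decision flow of steps 201-203, keyed on the file name of the style data, can be sketched as follows. This is a minimal illustration under stated assumptions: the class `CorrespondenceStore`, the in-memory dictionary standing in for the database, and the placeholder `fake_parse` function are all hypothetical names introduced here, and the real analysis (step 203) would involve the image recognition described later in the text.

```python
from typing import Callable, Dict

Correspondence = Dict[str, str]


class CorrespondenceStore:
    """In-memory stand-in for the stored correspondences of analyzed style data."""

    def __init__(self, parse: Callable[[str], Correspondence]) -> None:
        self._parse = parse  # hypothetical analysis routine used in step 203
        self._history: Dict[str, Correspondence] = {}  # historical file name -> result
        self.parse_calls = 0

    def get(self, file_name: str) -> Correspondence:
        # Step 201: does a matching historical file name exist?
        if file_name not in self._history:
            # Step 203: analyze once and record the file name as historical.
            self._history[file_name] = self._parse(file_name)
            self.parse_calls += 1
        # Step 202: reuse the correspondence of the valid historical style data.
        return self._history[file_name]


def fake_parse(file_name: str) -> Correspondence:
    # Placeholder for the expensive analysis; returns fixed illustrative data.
    return {"45.22": "12.66"}


store = CorrespondenceStore(fake_parse)
for name in ["f1", "f2", "f3", "f2", "f4"]:
    store.get(name)
print(store.parse_calls)  # prints 4: the second request for f2 is served from history
```

The sketch matches file names by exact dictionary lookup; the segmented reverse-order comparison described in the text is a separate way of performing that match.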
Optionally, in the embodiment of the present invention, the step of judging whether a historical file name matching the file name exists includes:
Step A1: Dividing the file name and the historical file name each into a plurality of substrings, and determining the order of each substring within the file name and the order of each substring within the historical file name.
Step A2: Starting from the last substring in that order, judging whether the substring of the file name is the same as the corresponding substring of the historical file name; when the two differ, continuing to step A3; when the two are the same, continuing to step A4.
Step A3: Determining that the file name does not match the historical file name.
Step A4: Moving to the next substring in reverse order and repeating the comparison of step A2, until either the file name is determined not to match the historical file name or every substring of the file name is determined to match the corresponding substring of the historical file name; in the latter case, determining that the file name matches the historical file name.
In the embodiment of the invention, the file name of the style data is a long character string or a link address, and a link address is generally also a long character string. Directly judging whether two long character strings (the file name and the historical file name) are the same may involve an excessive amount of processing. The file name is therefore divided into several substrings, and the substrings of the file name and of the historical file name are compared segment by segment in order. If two corresponding substrings differ, the file name differs from the historical file name, and the remaining substrings need not be compared, which reduces the amount of processing. If they are the same, the next pair of substrings is compared, until some pair is found to differ (indicating that the file name differs from the historical file name) or all substrings are found to be identical (indicating that the file name is the same as the historical file name).
In addition, for the sake of uniformity, the file names of the style data used within one website may share the same leading part, with the distinguishing part mainly toward the end of the name. In this embodiment the comparison is therefore performed in reverse order, so that differing substrings are found earlier with higher probability, which further improves processing efficiency.
Specifically, after the long character string of a file name is divided into several substrings, the substrings can be ordered by their positions in the file name and then compared in reverse order. For example, suppose the file name is "aabbccddeeff" and the historical file name is "aabbccddyeff". Both are first segmented, for example into groups of three characters: the file name yields four substrings, in order "aab", "bcc", "dde", and "eff", and the historical file name yields "aab", "bcc", "ddy", and "eff".
In the reverse-order comparison, the last substrings are compared first, that is, the "eff" of the file name is compared with the "eff" of the historical file name. If they differed, the file name would differ from the historical file name and no other substrings would need to be compared. Because they are the same, the next substrings in reverse order, "dde" and "ddy", are selected and compared; these differ, so the file name does not match the historical file name. In general, the reverse-order comparison is repeated until it is determined whether the file name is the same as the historical file name. In this embodiment, the file name is divided into several substrings and compared with the historical file name substring by substring, and the comparison stops as soon as a differing substring is found, which reduces the amount of processing and improves processing efficiency; moreover, given the characteristics of the file names, comparing in reverse order locates differing substrings with higher probability, which further improves processing efficiency.
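Steps A1-A4 can be sketched in Python as below. The function names `split_segments` and `names_match`, the default segment size of three characters (taken from the example above), and the early rejection of names with different segment counts are assumptions made for this illustration; the text does not specify how names of unequal length are handled.

```python
from typing import List


def split_segments(name: str, size: int = 3) -> List[str]:
    """Step A1: divide a name into ordered substrings of `size` characters."""
    return [name[i:i + size] for i in range(0, len(name), size)]


def names_match(file_name: str, history_name: str, size: int = 3) -> bool:
    """Steps A2-A4: compare substrings from last to first, stopping at the first difference."""
    current = split_segments(file_name, size)
    history = split_segments(history_name, size)
    # Assumption of this sketch: a different number of substrings means no match.
    if len(current) != len(history):
        return False
    for seg_current, seg_history in zip(reversed(current), reversed(history)):
        if seg_current != seg_history:
            return False  # step A3: no need to look at earlier substrings
    return True  # every pair of substrings matched


print(split_segments("aabbccddeeff"))               # ['aab', 'bcc', 'dde', 'eff']
print(names_match("aabbccddeeff", "aabbccddyeff"))  # False ("dde" differs from "ddy")
print(names_match("aabbccddeeff", "aabbccddeeff"))  # True
```

For the example names, the loop compares "eff" with "eff", then "dde" with "ddy", and returns without examining "bcc" or "aab".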
On the basis of the above embodiment, referring to FIG. 3, step 203, "establishing a correspondence between the style data and the real data", includes steps 2031-2033:
Step 2031: Creating a local webpage and loading the style data of the target webpage into the local webpage.
In the embodiment of the invention, when the style data needs to be analyzed, a webpage file, namely a local webpage, is created locally, and then the unresolved style data in the target webpage is loaded into the local webpage, namely the style data is extracted and visually displayed in a mode that the style data of the target webpage is loaded by the local webpage.
Optionally, when the target webpage includes multiple style data, all the style data may be uniformly loaded to the same local webpage, so that the analysis efficiency of the style data may be improved. In addition, the local webpage can be a formatted webpage and is used for adding a unique mark for a plurality of style data so as to accurately distinguish different style data in the same local webpage; for example, sequence numbers (r), (c), and (c) are added to the local web page, and each style data is loaded to a position corresponding to the sequence number, such as the rear of the sequence number.
Step 2032: Acquire a webpage image of the local webpage, recognize the webpage image, and determine the real data in the webpage image.
Step 2033: Establish a correspondence between the style data and the recognized corresponding real data.
In the embodiment of the invention, the local webpage contains only the displayed style data, so when the webpage image of the local webpage is recognized, the text corresponding to the style data can be recognized accurately; the webpage image may be recognized using OCR (optical character recognition). For example, suppose the traveled mileage "45.22 kilometers" in the source code of the target webpage has special style processing applied, so that the traveled mileage is style data. After the style data is loaded into the local webpage, the mileage visually displayed in the local webpage is 12.66 kilometers. By recognizing the webpage image of the local webpage, the real data is determined to be 12.66 kilometers, and a correspondence can then be established in which the style data "45.22 kilometers" corresponds to the real data "12.66 kilometers". The webpage image may be obtained by opening the local webpage and taking a screenshot, or by other means, which is not limited in this embodiment.
In the embodiment of the invention, another local webpage is established and the style data is loaded, and then the style data is analyzed by utilizing the webpage image, so that the corresponding relation between the style data and the real data can be obtained; in the method, only the style data of the target webpage needs to be identified, and all contents of the target webpage do not need to be identified, so that the processing amount of image identification can be reduced, and the processing efficiency is improved; meanwhile, after the corresponding relation between the style data and the real data is established, image recognition is not needed when the webpage with the style data is crawled subsequently, the real text can be conveniently and quickly determined according to the corresponding relation between the style data and the real data, and the capturing efficiency is greatly improved.
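Steps 2031 to 2033 can be sketched roughly as below. The sketch makes several assumptions that are not in the source: each piece of style data is treated as a raw HTML fragment, the numbered `<p>` rows stand in for the sequence-number marks, and rendering, screenshot capture, and OCR are hidden behind a caller-supplied `recognize` function (hypothetical) that returns one recognized string per row.

```python
from typing import Callable


def build_local_page(style_items: list[str], css: str = "") -> str:
    """Return the HTML of a local page that holds only the style data.

    Each item gets its own numbered row so that recognized text can be
    matched back to the item it came from.
    """
    rows = "\n".join(
        f'<p id="item-{index}">({index}) {item}</p>'
        for index, item in enumerate(style_items, start=1)
    )
    return f"<html><head><style>{css}</style></head><body>\n{rows}\n</body></html>"


def build_correspondence(
    style_items: list[str],
    recognize: Callable[[str], list[str]],
) -> dict[str, str]:
    """Map each piece of style data to the text recognized for its row."""
    page = build_local_page(style_items)
    recognized = recognize(page)  # stands in for render + screenshot + OCR
    if len(recognized) != len(style_items):
        raise ValueError("recognizer must return one result per style item")
    return dict(zip(style_items, recognized))


if __name__ == "__main__":
    # A fake recognizer used only for illustration; a real one would
    # render the page and run OCR on the resulting image.
    fake = lambda page: ["12.66 kilometers"]
    print(build_correspondence(['<span class="m">45.22 kilometers</span>'], fake))
```

Keeping the recognizer as a parameter means the page-building and mapping logic can be exercised without committing to any particular browser or OCR engine.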
On the basis of the above embodiment, after "establishing a correspondence between style data matching the style data and the real data" in step 203, the method further includes: and storing the corresponding relation between the style data matched with the style data and the real data in a database.
In the embodiment of the invention, the database is established to store the corresponding relation between the analyzed style data and the real data, and when the webpage needs to be crawled, the database can be inquired to determine whether the style data in the webpage to be crawled is analyzed. Meanwhile, the corresponding relation between the style data and the real data is stored through the database, so that the corresponding relation is convenient to manage, such as adding, deleting or updating the corresponding relation between the style data and the real data.
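One possible way to store and query the correspondence is sketched below with Python's built-in `sqlite3` module. The table name `correspondence`, its columns, and the helper names are assumptions made for illustration; the source does not specify a schema or a database product.

```python
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the correspondence table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS correspondence ("
        " file_name TEXT NOT NULL,"
        " style_data TEXT NOT NULL,"
        " real_data TEXT NOT NULL,"
        " PRIMARY KEY (file_name, style_data))"
    )
    return conn


def save(conn: sqlite3.Connection, file_name: str, style_data: str, real_data: str) -> None:
    """Add a correspondence, or update it if the key already exists."""
    conn.execute(
        "INSERT OR REPLACE INTO correspondence VALUES (?, ?, ?)",
        (file_name, style_data, real_data),
    )
    conn.commit()


def lookup(conn: sqlite3.Connection, file_name: str, style_data: str) -> Optional[str]:
    """Return the real data for a piece of style data, or None if unknown."""
    row = conn.execute(
        "SELECT real_data FROM correspondence WHERE file_name = ? AND style_data = ?",
        (file_name, style_data),
    ).fetchone()
    return row[0] if row else None


def is_parsed(conn: sqlite3.Connection, file_name: str) -> bool:
    """Report whether any correspondence is stored for this file name."""
    row = conn.execute(
        "SELECT 1 FROM correspondence WHERE file_name = ? LIMIT 1", (file_name,)
    ).fetchone()
    return row is not None
```

Keying rows on the style file name lets the crawler decide with a single query whether the style data of a page has already been parsed.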
The method flow of the web page crawling is described in detail below by an embodiment.
In the embodiment of the invention, after the style data is acquired, it is judged whether the style data has already been parsed, and the corresponding processing flow is then executed. Referring to fig. 4, the method flow of the webpage crawling includes steps 401-408:
Step 401: Acquire style data of the target webpage, wherein the style data is generated based on an anti-crawling strategy.
Step 402: Judge whether a correspondence between style data matching the acquired style data and real data exists. When such a correspondence exists, continue to step 403; when it does not exist, continue to step 404.
Step 403: Determine the real data corresponding to the style data according to the correspondence between the matched style data and the real data, and continue to step 407.
Step 404: Create a local webpage and load the style data of the target webpage into the local webpage.
Step 405: Acquire a webpage image of the local webpage, recognize the webpage image, and determine the real data in the webpage image.
Step 406: Establish a correspondence between the style data and the real data according to the style data and the recognized corresponding real data, and take the recognized real data as the real data corresponding to the style data.
Step 407: Replace the style data with the corresponding real data.
Step 408: Determine all real content of the target webpage.
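The flow of steps 401 to 408 can be sketched as a single function. The sketch assumes that already-parsed correspondences are held in a dictionary keyed by the style file name, that the parsing of steps 404 to 406 is supplied by the caller as a `parse` function, and that replacement is plain string substitution; `crawl_page` and its parameters are hypothetical names, and an exact dictionary lookup stands in for the file-name matching.

```python
from typing import Callable


def crawl_page(
    source: str,
    file_name: str,
    style_items: list[str],
    known: dict[str, dict[str, str]],
    parse: Callable[[list[str]], dict[str, str]],
) -> str:
    """Return the page source with style data replaced by real data."""
    # Step 402: is there a correspondence for this style file name?
    mapping = known.get(file_name)
    if mapping is None:
        # Steps 404-406: parse the style data once and remember the result.
        mapping = parse(style_items)
        known[file_name] = mapping
    # Step 403 / 407: look up the real data and substitute it.
    for style_data, real_data in mapping.items():
        source = source.replace(style_data, real_data)
    # Step 408: the returned text is the real content of the page.
    return source
```

With this structure the expensive parsing path runs only the first time a given style file name is seen; later pages that reuse the same style data go straight to the lookup and replacement.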
With the webpage crawling method provided by the embodiment of the invention, when a webpage that adopts an anti-crawling strategy is crawled, the style data in the webpage is extracted, and the real data corresponding to the style data is determined according to the pre-generated correspondence between style data and real data, so that all the real data of the webpage can be acquired quickly and accurately. The method does not need to apply image recognition repeatedly to the data in the webpage, which saves processing resources and improves crawling speed and efficiency. When the style data is parsed, only the style data of the target webpage needs to be recognized, not the entire content of the target webpage, which reduces the amount of image recognition and improves processing efficiency. Moreover, once the correspondence between the style data and the real data has been established, no image recognition is needed when a webpage with the same style data is crawled later; the real text can be determined directly from the correspondence. Because the file names of style data are not repeated, whether the style data has been parsed can be judged from the file name, and segmenting the file name and comparing the segments in reverse order further improves the efficiency of that judgment.
The above describes in detail the method flow of web page crawling, and the method can also be implemented by a corresponding device, and the structure and function of the device are described in detail below.
The embodiment of the present invention further provides a device for crawling a web page, as shown in fig. 5, including:
the acquiring module 51 is configured to acquire style data of a target webpage, where the style data is data generated from the source data of the target webpage based on an anti-crawling strategy;
the processing module 52 is configured to determine real data corresponding to the style data according to a correspondence between pre-generated style data and the real data, and replace the style data with the corresponding real data;
a determining module 53, configured to determine all real contents of the target webpage, where the real contents include real data corresponding to the style data.
In one possible implementation, referring to fig. 6, the apparatus further includes a judging module 54 and a parsing module 55;
after the acquiring module 51 acquires the style data of the target webpage, the judging module 54 is configured to judge whether a correspondence between style data matching the acquired style data and real data exists;
when there is a correspondence between the style data and the real data that match the style data, the processing module 52 is configured to determine the real data corresponding to the style data according to the correspondence between the matched style data and the real data;
when there is no correspondence between the style data and the real data that match the style data, the parsing module 55 is configured to establish a correspondence between the style data and the real data that match the style data.
In a possible implementation manner, the judging module 54 is specifically configured to: determine the file name of the style data, and judge whether a historical file name matching the file name exists, wherein the historical file name is the file name of historical style data that has already been parsed;
when a matching historical file name exists, determine that a correspondence between style data matching the style data and real data exists, wherein that correspondence is the correspondence between valid historical style data and real data determined based on the parsing result of the valid historical style data; the valid historical style data is the historical style data corresponding to the historical file name that matches the file name.
In a possible implementation manner, the judging module 54 judging whether a historical file name matching the file name exists includes:
dividing the file name and the historical file name into a plurality of sub-character strings respectively, and determining the arrangement sequence of each sub-character string of the file name in the file name and the arrangement sequence of each sub-character string of the historical file name in the historical file name;
judging whether the sub-character string of the file name is the same as the corresponding sub-character string of the historical file name or not from the last sequential sub-character string, and determining that the file name is not matched with the historical file name when the sub-character string of the file name is different from the corresponding sub-character string of the historical file name;
and when the two are the same, determining the next sequential sub-character string in a reverse order, repeating the process of judging whether the sub-character string of the file name is the same as the corresponding sub-character string of the historical file name or not until the file name is determined not to be matched with the historical file name or all the sub-character strings of the file name are determined to be matched with all the sub-character strings of the historical file name, and determining that the file name is matched with the historical file name when all the sub-character strings of the file name are determined to be matched with all the sub-character strings of the historical file name.
In one possible implementation, referring to fig. 7, the parsing module 55 includes:
the preprocessing unit 551 is configured to create a local webpage and load style data of the target webpage into the local webpage;
an identifying unit 552, configured to obtain a web page image of the local web page, identify the web page image, and determine real data in the web page image;
a determining unit 553, configured to establish a correspondence between the style data and the recognized corresponding real data.
In one possible implementation, referring to fig. 6, the apparatus further includes a storage module 56;
after the parsing module 55 establishes a correspondence between the style data and the real data matching the style data, the storage module 56 is configured to store the correspondence between the style data and the real data matching the style data in a database.
In one possible implementation, the style data includes text style data and/or picture style data.
With the webpage crawling device provided by the embodiment of the invention, when a webpage that adopts an anti-crawling strategy is crawled, the style data in the webpage is extracted, and the real data corresponding to the style data is determined according to the pre-generated correspondence between style data and real data, so that all the real data of the webpage can be acquired quickly and accurately. The device does not need to apply image recognition repeatedly to the data in the webpage, which saves processing resources and improves crawling speed and efficiency. When the style data is parsed, only the style data of the target webpage needs to be recognized, not the entire content of the target webpage, which reduces the amount of image recognition and improves processing efficiency. Moreover, once the correspondence between the style data and the real data has been established, no image recognition is needed when a webpage with the same style data is crawled later; the real text can be determined directly from the correspondence. Because the file names of style data are not repeated, whether the style data has been parsed can be judged from the file name, and segmenting the file name and comparing the segments in reverse order further improves the efficiency of that judgment.
Embodiments of the present invention further provide a storage medium, where the storage medium stores computer-executable instructions, which include a program for executing the method for crawling a web page, and the computer-executable instructions may execute the method in any of the method embodiments.
The storage medium may be any available medium or data storage device that can be accessed by a computer, including but not limited to magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, nonvolatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
Fig. 8 shows a block diagram of an electronic device according to another embodiment of the present invention. The electronic device 1100 may be a host server with computing capability, a personal computer (PC), a portable computer, a terminal, or the like. The embodiments of the present invention do not limit the specific implementation of the electronic device.
The electronic device 1100 includes at least one processor 1110, a communication interface 1120, a memory 1130, and a bus 1140. The processor 1110, the communication interface 1120, and the memory 1130 communicate with each other via the bus 1140.
The communication interface 1120 is used for communicating with network elements including, for example, virtual machine management centers, shared storage, etc.
The processor 1110 is configured to execute programs. The processor 1110 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
The memory 1130 stores executable instructions. The memory 1130 may comprise high-speed RAM and may also include non-volatile memory, such as at least one disk memory. The memory 1130 may also be a memory array. The memory 1130 may also be partitioned into blocks, and the blocks may be combined into virtual volumes according to certain rules. The instructions stored in the memory 1130 are executable by the processor 1110 to enable the processor 1110 to perform the method of webpage crawling in any of the method embodiments described above.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method of web page crawling, comprising:
acquiring style data of a target webpage, wherein the style data is generated in source data of the target webpage based on an anti-crawling strategy;
judging whether a corresponding relation between the style data matched with the style data and the real data exists or not, and when the corresponding relation between the style data matched with the style data and the real data exists, determining the real data corresponding to the style data according to the pre-generated corresponding relation between the style data and the real data, and replacing the style data with the corresponding real data; the corresponding relation between the style data and the real data represents real data actually corresponding to certain style data;
determining all real content of the target webpage, wherein the real content comprises real data corresponding to the style data;
wherein the determining whether there is a correspondence between the style data matched with the style data and the real data includes:
determining the file name of the style data, and judging whether a historical file name matched with the file name exists or not, wherein the historical file name is the file name of the analyzed historical style data;
and when the matched historical file names exist, determining that the corresponding relation between the style data matched with the style data and the real data exists.
2. The method according to claim 1, further comprising, after the obtaining style data of the target web page:
and when the corresponding relation between the style data matched with the style data and the real data does not exist, establishing the corresponding relation between the style data matched with the style data and the real data.
3. The method according to claim 1, wherein the correspondence between the style data matching the style data and the real data is: a correspondence between valid historical style data and real data determined based on a result of parsing the valid historical style data; the valid historical style data is the historical style data corresponding to a historical file name matched with the file name.
4. The method of claim 1, wherein determining whether there is a historical filename matching the filename comprises:
dividing the file name and the historical file name into a plurality of sub-character strings respectively, and determining the arrangement sequence of each sub-character string of the file name in the file name and the arrangement sequence of each sub-character string of the historical file name in the historical file name;
judging whether the sub-character string of the file name is the same as the corresponding sub-character string of the historical file name or not from the last sequential sub-character string, and determining that the file name is not matched with the historical file name when the sub-character string of the file name is different from the corresponding sub-character string of the historical file name;
when the two are the same, determining the next sequential sub-character string in a reverse order, and repeating the process of judging whether the sub-character string of the file name is the same as the corresponding sub-character string of the historical file name or not until the file name is determined not to be matched with the historical file name or all the sub-character strings of the file name are determined to be matched with all the sub-character strings of the historical file name; and when all the substrings of the file name are determined to be matched with all the substrings of the historical file name, determining that the file name is matched with the historical file name.
5. The method according to claim 2, wherein the establishing of the correspondence between the style data matching the style data and the real data comprises:
creating a local webpage and loading the style data of the target webpage into the local webpage;
acquiring a webpage image of the local webpage, identifying the webpage image, and determining real data in the webpage image;
and establishing a corresponding relation between the style data and the identified corresponding real data.
6. The method according to claim 2, further comprising, after the establishing of the correspondence between the style data matching the style data and the real data:
and storing the corresponding relation between the style data matched with the style data and the real data in a database.
7. The method according to claim 1, wherein the style data comprises text style data and/or picture style data.
8. An apparatus for web page crawling, comprising:
the acquisition module is used for acquiring style data of a target webpage, wherein the style data is generated in source data of the target webpage based on an anti-crawling strategy;
the processing module is used for determining real data corresponding to the style data according to the corresponding relation between the pre-generated style data and the real data and replacing the style data with the corresponding real data; the corresponding relation between the style data and the real data represents real data actually corresponding to certain style data;
the determining module is used for determining all real contents of the target webpage, wherein the real contents comprise real data corresponding to the style data;
the judging module is used for judging whether a corresponding relation between the style data matched with the style data and the real data exists or not; when the corresponding relation between the style data matched with the style data and the real data exists, the processing module determines the real data corresponding to the style data according to the corresponding relation between the matched style data and the real data;
wherein, the judging module is specifically configured to: determining the file name of the style data, and judging whether a historical file name matched with the file name exists or not, wherein the historical file name is the file name of the analyzed historical style data;
and when the matched historical file names exist, determining that the corresponding relation between the style data matched with the style data and the real data exists.
9. A storage medium storing computer-executable instructions for performing the method of web page crawling of any one of claims 1 to 7.
10. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of web page crawling of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811467095.2A CN109582850B (en) | 2018-12-03 | 2018-12-03 | Webpage crawling method and device, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811467095.2A CN109582850B (en) | 2018-12-03 | 2018-12-03 | Webpage crawling method and device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109582850A CN109582850A (en) | 2019-04-05 |
CN109582850B CN109582850B (en) | 2021-07-02 |
Family
ID=65925947
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811467095.2A Active CN109582850B (en) | 2018-12-03 | 2018-12-03 | Webpage crawling method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109582850B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1797395A (en) * | 2004-12-21 | 2006-07-05 | 鸿富锦精密工业(深圳)有限公司 | Method for searching file under directory of file |
US8990200B1 (en) * | 2009-10-02 | 2015-03-24 | Flipboard, Inc. | Topical search system |
CN104933138A (en) * | 2015-06-16 | 2015-09-23 | 携程计算机技术(上海)有限公司 | Webpage crawler system and webpage crawling method |
CN108595583A (en) * | 2018-04-18 | 2018-09-28 | 平安科技(深圳)有限公司 | Dynamic chart class page data crawling method, device, terminal and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109582850A (en) | 2019-04-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104933056A (en) | Uniform resource locator (URL) de-duplication method and device | |
CN108900554B (en) | HTTP asset detection method, system, device and computer medium | |
CN112115266B (en) | Classification method and device for malicious websites, computer equipment and readable storage medium | |
CN104750791A (en) | Image retrieval method and device | |
CN105550359B (en) | Webpage sorting method and device based on vertical search and server | |
CN112818200A (en) | Data crawling and event analyzing method and system based on static website | |
CN106446123A (en) | Webpage verification code element identification method | |
CN108334800B (en) | Stamp image processing device and method and electronic equipment | |
CN113810375B (en) | Webshell detection method, device and equipment and readable storage medium | |
CN114372267B (en) | Malicious webpage identification detection method based on static domain, computer and storage medium | |
CN114996714A (en) | Vulnerability detection method and device, electronic equipment and storage medium | |
CN111783786B (en) | Picture identification method, system, electronic device and storage medium | |
CN109582850B (en) | Webpage crawling method and device, storage medium and electronic equipment | |
CN116701567A (en) | Electronic book retrieval method and system based on artificial intelligence | |
CN115186240A (en) | Social network user alignment method, device and medium based on relevance information | |
CN114168871A (en) | Method and device for page jump, electronic equipment and storage medium | |
CN110704617B (en) | News text classification method, device, electronic equipment and storage medium | |
CN110825976B (en) | Website page detection method and device, electronic equipment and medium | |
CN114036940A (en) | Sensitive data identification method and device, electronic equipment and storage medium | |
CN115880702A (en) | Data processing method, device, equipment, program product and storage medium | |
CN108153817B (en) | Intelligent web page data acquisition method | |
CN112487398A (en) | Automatic character type identifying code identifying method, terminal equipment and storage medium | |
CN115828023B (en) | Method and system for identifying network content sensitivity through machine model | |
CN111125567A (en) | Equipment marking method and device, electronic equipment and storage medium | |
CN111581950A (en) | Method for determining synonym and method for establishing synonym knowledge base |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||