US6571243B2 - Method and apparatus for creating extractors, field information objects and inheritance hierarchies in a framework for retrieving semistructured information - Google Patents

Info

Publication number
US6571243B2
Authority
US
United States
Prior art keywords
information
listing
stack
semistructured
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US10/000,743
Other versions
US20020062312A1 (en
Inventor
Ashish Gupta
Peter Norvig
Anand Rajaraman
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Amazon Technologies Inc
Original Assignee
Amazon com Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Amazon com Inc filed Critical Amazon com Inc
Priority to US10/000,743 priority Critical patent/US6571243B2/en
Publication of US20020062312A1 publication Critical patent/US20020062312A1/en
Application granted granted Critical
Publication of US6571243B2 publication Critical patent/US6571243B2/en
Assigned to A9.COM, INC. reassignment A9.COM, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: AMAZON.COM, INC.
Assigned to AMAZON.COM, INC. reassignment AMAZON.COM, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: A9.COM, INC.
Assigned to AMAZON TECHNOLOGIES, INC. reassignment AMAZON TECHNOLOGIES, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: AMAZON.COM, INC.
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84Mapping; Conversion
    • G06F16/86Mapping to a database
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99931Database or file accessing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99931Database or file accessing
    • Y10S707/99932Access augmentation or optimizing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99931Database or file accessing
    • Y10S707/99933Query processing, i.e. searching
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99931Database or file accessing
    • Y10S707/99933Query processing, i.e. searching
    • Y10S707/99934Query formulation, input preparation, or translation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99931Database or file accessing
    • Y10S707/99933Query processing, i.e. searching
    • Y10S707/99935Query augmenting and refining, e.g. inexact access
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99931Database or file accessing
    • Y10S707/99933Query processing, i.e. searching
    • Y10S707/99936Pattern matching access
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99941Database schema or data structure
    • Y10S707/99943Generating database or data structure, e.g. via user interface

Definitions

  • This invention relates to structured information retrieval and interpretation from disparate semistructured information resources.
  • a particular application of the invention is extraction of information from public and semipublic databases through worldwide information sources, as facilitated by the Internet.
  • The Internet provides avenues for worldwide communication of information, ideas and messages. Although the Internet has long been in use, public interest has only recently turned to the Internet and the information made available by it.
  • the World Wide Web (or “the Web”) accounts for a significant part of the growth in the popularity of the Internet, due in part to the user-friendly graphical user interfaces (“GUIs”) that are readily available for accessing the Web.
  • the World Wide Web makes hypertext documents available to users over the Internet.
  • a hypertext document does not present information linearly like a book, but instead provides the reader with links or pointers to other locations so that the user may jump from one location to another.
  • the hypertext documents on the Web are written in the Hypertext Markup Language (“HTML”).
  • Although keyword searches are adequate for many applications, they fail miserably for many others. For example, there are numerous web sites that include multiple entries or lists on job openings, houses for sale, and the like. Keyword searches are inadequate to search these sites for many reasons. Keyword searches invariably turn up information that, although matching the keywords, is not of interest. This problem may be alleviated somewhat by narrowing the search parameters, but this has the attendant risk of missing information of interest. Additionally, the search terms supported may not allow identification of information of interest. As an example, one may not be able to specify in a keyword search query to find job listings that require less than three years of experience in computer programming.
  • A system for extracting information from a semistructured information source is provided.
  • the system includes a listing stack for holding extracted information.
  • a means for matching at least one extractor to the semistructured information to return a list of potential matches is also included.
  • the system can also include a means for iterating through the list of potential matches and a means for retrieving information from a particular match in the list of potential matches.
  • a means for adding a particular match into the listing stack can also be part of the system.
  • a method for extracting information from a semistructured information source into a listing stack is provided.
  • the step of matching at least one extractor to the semistructured information in order to return a list of potential matches is included in the method.
  • a step of iterating through the list of potential matches can also be part of the method.
  • Information from a particular match in the list of potential matches can be retrieved in another step.
  • the method can also include a step of adding a particular match into the listing stack. Combinations of these steps can extract information from a semistructured information source.
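  • The extraction steps summarized above might be sketched as follows. This is a minimal illustration, not the patented implementation: the `Extractor` class, the `extract_into_stack` function, the regular expression and the sample text are all assumptions introduced here, and the listing stack is modeled as a plain Python list.

```python
import re
from dataclasses import dataclass


@dataclass
class Extractor:
    """Hypothetical extractor: a named regular expression with named groups."""
    name: str
    pattern: re.Pattern


def extract_into_stack(text, extractors, listing_stack):
    """Match each extractor, iterate over its matches, and push listings."""
    for extractor in extractors:
        # Matching returns a list of potential matches.
        potential_matches = list(extractor.pattern.finditer(text))
        # Iterate through the list of potential matches.
        for match in potential_matches:
            # Retrieve information from this particular match.
            listing = match.groupdict()
            # Add the match to the listing stack.
            listing_stack.append(listing)
    return listing_stack


# Made-up sample of semistructured text (an assumption for illustration).
SAMPLE_TEXT = (
    "Job: Programmer | Experience: 2 years\n"
    "Job: Architect | Experience: 7 years\n"
)

job_extractor = Extractor(
    name="job",
    pattern=re.compile(
        r"Job: (?P<title>[^|]+?) \| Experience: (?P<years>\d+) years"
    ),
)

stack = extract_into_stack(SAMPLE_TEXT, [job_extractor], [])
```

  • Each entry pushed onto `stack` is a dictionary keyed by the extractor's group names, which keeps the retrieved attributes together until they are turned into tuples.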
  • The present invention can use a relational database to organize information obtained from a semistructured source, such as Web pages on the World Wide Web, which offers advantages over conventional Web search techniques.
  • the present invention is easier to use than conventional user interfaces.
  • the present invention can provide a way to automatically propagate information to related tuples. Some embodiments according to the invention are easier for new users to learn than known techniques.
  • the present invention enables data mining to be accomplished using a relational database.
  • FIG. 1A depicts a representative client server relationship in accordance with a particular embodiment of the invention
  • FIG. 1B depicts a functional perspective of the representative client server relationship in accordance with a particular embodiment of the invention
  • FIG. 1C depicts a representative internetworking environment in accordance with a particular embodiment of the invention.
  • FIG. 1D depicts a relationship diagram of the layers of the TCP/IP protocol suite
  • FIG. 2A depicts a flowchart of process steps in producing a wrapper in accordance with a particular embodiment of the invention
  • FIG. 2B depicts a flowchart of process steps in defining a wrapper in accordance with a particular embodiment of the invention
  • FIG. 2C depicts a flowchart of process steps in the execution of a wrapper in accordance with a particular embodiment of the invention
  • FIG. 2D depicts a flowchart of process steps in computing an information closure for a listing stack in a wrapper in accordance with a particular embodiment of the invention.
  • FIG. 2E depicts a flowchart of process steps in computing a selective cross product for determining an information closure for a listing stack in a wrapper in accordance with a particular embodiment of the invention.
  • the present invention provides a system for automated extraction of information from a plurality of semistructured information sources useful for incorporating the tuples into a relational database.
  • Systems according to the present invention enable network programmers to build wrapper programs capable of accessing multiple web sites, extracting information therefrom and incorporating the resulting information into relational databases for search.
  • Table 1 provides a definitional list of terminology used herein.
  • Semistructured information: Information that as a whole does not have a precise information structure; however, elements within the semistructured information have meanings based on their location or surroundings within the semistructured information. The format of semistructured information may be represented by a grammar or by regular expressions, typically nested regular expressions.
  • Site: A location or object including a related, interconnected collection of blocks of text, forms, and the like. For example, a web site may present text as semistructured information in the form of a web page.
  • Agent: A program that serves the information needs of a user. Often an agent will have a visible component. For example, an agent may include a user interface that accepts a user's relational database query and displays the results of the query.
  • Wrapper (or site program): A software layer that provides a relational database interface to information on a site.
  • Mapper: A component responsible for translating the different site vocabularies into one that an agent understands. Mappers generally reside between agents and wrappers, providing a level of insulation between the two.
  • FIG. 1A shows a conventional client-server computer system which includes a server 20 and numerous clients, one of which is shown as client 25.
  • server receives queries from (typically remote) clients, does substantially all the processing necessary to formulate responses to the queries, and provides these responses to the clients.
  • server 20 may itself act in the capacity of a client when it accesses remote databases located at another node acting as a database server.
  • server 20 includes one or more processors 30 which communicate with a number of peripheral devices via a bus subsystem 32.
  • peripheral devices typically include a storage subsystem 35, comprised of memory subsystem 35a and file storage subsystem 35b, which hold computer programs (e.g., code or instructions) and data, a set of user interface input and output devices 37, and an interface to outside networks, which may employ Ethernet, Token Ring, ATM, IEEE 802.3, ITU X.25, Serial Line Internet Protocol (SLIP) or the public switched telephone network.
  • This interface is shown schematically as a “Network Interface” block 40. It is coupled to corresponding interface devices in client computers via a network connection 45.
  • Client 25 has the same general configuration, although typically with less storage and processing capability.
  • the client computer could be a terminal or a low-end personal computer
  • the server computer is generally a high-end workstation or mainframe, such as a SUN SPARC™ server.
  • Corresponding elements and subsystems in the client computer are shown with corresponding, but primed, reference numerals.
  • the user interface input devices typically include a keyboard and may further include a pointing device and a scanner.
  • the pointing device may be an indirect pointing device such as a mouse, trackball, touchpad, or graphics tablet, or a direct pointing device such as a touchscreen incorporated into the display.
  • Other types of user interface input devices, such as voice recognition systems, are also possible.
  • the user interface output devices typically include a printer and a display subsystem, which includes a display controller and a display device coupled to the controller.
  • the display device may be a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), or a projection device.
  • Display controller provides control signals to the display device and normally includes a display memory for storing the pixels that appear on the display device.
  • the display subsystem may also provide non-visual display such as audio output.
  • the memory subsystem typically includes a number of memories including a main random access memory (RAM) for storage of instructions and data during program execution and a read only memory (ROM) in which fixed instructions are stored.
  • the ROM would include portions of the operating system; in the case of IBM-compatible personal computers, this would include the BIOS (basic input/output system).
  • the file storage subsystem provides persistent (non-volatile) storage for program and data files, and typically includes at least one hard disk drive and at least one floppy disk drive (with associated removable media). There may also be other devices such as a CD-ROM drive and optical drives (all with their associated removable media). Additionally, the computer system may include drives of the type with removable media cartridges.
  • the removable media cartridges may, for example, be hard disk cartridges, such as those marketed by Syquest and others, and flexible disk cartridges, such as those marketed by Iomega.
  • One or more of the drives may be located at a remote location, such as in a server on a local area network or at a site of the Internet's World Wide Web.
  • bus subsystem is used generically so as to include any mechanism for letting the various components and subsystems communicate with each other as intended.
  • the other components need not be at the same physical location.
  • portions of the file storage system could be connected via various local-area or wide-area network media, including telephone lines.
  • the input devices and display need not be at the same location as the processor, although it is anticipated that the present invention will most often be implemented in the context of PCs and workstations.
  • Bus subsystem 32 is shown schematically as a single bus, but a typical system has a number of buses such as a local bus and one or more expansion buses (e.g., ADB, SCSI, ISA, EISA, MCA, NuBus, or PCI), as well as serial and parallel ports. Network connections are usually established through a device such as a network adapter on one of these expansion buses or a modem on a serial port.
  • the client computer may be a desktop system or a portable system.
  • The user interacts with the system using interface devices 37′ (or devices 37 in a standalone system).
  • client queries are entered via a keyboard, communicated to client processor 30′, and thence to network interface 40′ over bus subsystem 32′.
  • the query is then communicated to server 20 via network connection 45.
  • results of the query are communicated from the server to the client via network connection 45 for output on one of devices 37′ (say a display or a printer), or may be stored on storage subsystem 35′.
  • FIG. 1B is a functional diagram of the computer system of FIG. 1A.
  • FIG. 1B depicts a server 20, and a representative client 25 of a multiplicity of clients which may interact with the server 20 via the internet 45 or any other communications method. Blocks to the right of the server are indicative of the processing components and functions which occur in the server's program and data storage indicated by block 35a in FIG. 1A.
  • a TCP/IP “stack” 44 works in conjunction with Operating System 42 to communicate with processes over a network or serial connection attaching Server 20 to internet 45.
  • Web server software 46 executes concurrently and cooperatively with other processes in server 20 to make data objects 50 and 51 available to requesting clients.
  • a Common Gateway Interface (CGI) script 55 enables information from user clients to be acted upon by web server 46, or other processes within server 20. Responses to client queries may be returned to the clients in the form of Hypertext Markup Language (HTML) document outputs which are then communicated via internet 45 back to the user.
  • Client 25 in FIG. 1B possesses software implementing functional processes operatively disposed in its program and data storage as indicated by block 35a′ in FIG. 1A.
  • TCP/IP stack 44′ works in conjunction with Operating System 42′ to communicate with processes over a network or serial connection attaching Client 25 to internet 45.
  • Software implementing the function of a web browser 46′ executes concurrently and cooperatively with other processes in client 25 to make requests of server 20 for data objects 50 and 51.
  • the user of the client may interact via the web browser 46′ to make such queries of the server 20 via internet 45 and to view responses from the server 20 via internet 45 on the web browser 46′.
  • FIG. 1C is illustrative of the internetworking of a plurality of clients such as client 25 of FIGS. 1A and 1B and a multiplicity of servers such as server 20 of FIGS. 1A and 1B as described herein above.
  • a network 70 is an example of a Token Ring or frame oriented network.
  • Network 70 links a host 71, such as an IBM RS6000 RISC workstation, which may be running the AIX operating system, to a host 72, which is a personal computer, which may be running Windows 95, IBM OS/2 or a DOS operating system, and a host 73, which may be an IBM AS/400 computer, which may be running the OS/400 operating system.
  • Network 70 is internetworked to a network 60 via a system gateway which is depicted here as router 75, but which may also be a gateway having a firewall or a network bridge.
  • Network 60 is an example of an Ethernet network that interconnects a host 61, which is a SPARC workstation, which may be running the SunOS operating system, with a host 62, which may be a Digital Equipment VAX6000 computer which may be running the VMS operating system.
  • Router 75 is a network access point (NAP) of network 70 and network 60.
  • Router 75 employs a Token Ring adapter and Ethernet adapter. This enables router 75 to interface with the two heterogeneous networks.
  • Router 75 is also aware of the Inter-network Protocols, such as ICMP, ARP and RIP, which are described below.
  • FIG. 1D is illustrative of the constituents of the Transmission Control Protocol/Internet Protocol (TCP/IP) protocol suite.
  • the base layer of the TCP/IP protocol suite is the physical layer 80, which defines the mechanical, electrical, functional and procedural standards for the physical transmission of data over communications media, such as, for example, the network connection 45 of FIG. 1A.
  • the physical layer may comprise electrical, mechanical or functional standards such as whether a network is packet switching or frame-switching; or whether a network is based on a Carrier Sense Multiple Access/Collision Detection (CSMA/CD) or a frame relay paradigm.
  • Overlying the physical layer is the data link layer 82.
  • the data link layer provides the function and protocols to transfer data between network resources and to detect errors that may occur at the physical layer.
  • Operating modes at the datalink layer comprise such standardized network topologies as IEEE 802.3 Ethernet, IEEE 802.5 Token Ring, ITU X.25, or serial (SLIP) protocols.
  • Network layer protocols 84 overlay the datalink layer and provide the means for establishing connections between networks.
  • the standards of network layer protocols provide operational control procedures for internetworking communications and routing information through multiple heterogeneous networks.
  • Examples of network layer protocols are the Internet Protocol (IP) and the Internet Control Message Protocol (ICMP).
  • the Address Resolution Protocol (ARP) is used to correlate an Internet address and a Media Access Control (MAC) address of a particular host.
  • the Routing Information Protocol (RIP) is a dynamic routing protocol for passing routing information between hosts on networks.
  • the Internet Control Message Protocol (ICMP) is an internal protocol for passing control messages between hosts on various networks. ICMP messages provide feedback about events in the network environment or can help determine if a path exists to a particular host in the network environment. The latter is called a “Ping”.
  • IP provides the basic mechanism for routing packets of information in the Internet.
  • IP is a non-reliable communication protocol. It provides a “best efforts” delivery service and does not commit network resources to a particular transaction, nor does it perform retransmissions or give acknowledgments.
  • the transport layer protocols 86 provide end-to-end transport services across multiple heterogeneous networks.
  • the User Datagram Protocol (UDP) provides a connectionless, datagram oriented service which provides a non-reliable delivery mechanism for streams of information.
  • the Transmission Control Protocol (TCP) provides a reliable session-based service for delivery of sequenced packets of information across the Internet. TCP provides a connection oriented reliable mechanism for information delivery.
  • the session, or application layer 88 provides a list of network applications and utilities, a few of which are illustrated here.
  • File Transfer Protocol (FTP) is a standard TCP/IP protocol for transferring files from one machine to another.
  • FTP clients establish sessions through TCP connections with FTP servers in order to obtain files.
  • Telnet is a standard TCP/IP protocol for remote terminal connection.
  • a Telnet client acts as a terminal emulator and establishes a connection using TCP as the transport mechanism with a Telnet server.
  • the Simple Network Management Protocol (SNMP) is a standard for managing TCP/IP networks. SNMP tasks, called “agents”, monitor network status parameters and transmit these status parameters to SNMP tasks called “managers.” Managers track the status of associated networks.
  • a Remote Procedure Call (RPC) is a programming interface which enables programs to invoke remote functions on server machines.
  • the Hypertext Transfer Protocol (HTTP) facilitates the transfer of data objects across networks via a system of uniform resource identifiers (URI).
  • the Hypertext Transfer Protocol is a simple protocol built on top of Transmission Control Protocol (TCP). It is the mechanism which underlies the function of the World Wide Web.
  • the HTTP provides a method for users to obtain data objects from various hosts acting as servers on the Internet.
  • User requests for data objects are made by means of an HTTP request, such as a GET request.
  • a GET request is comprised of 1) the method name, “GET”; followed by 2) the full path of the data object; followed by 3) the name of the data object; followed by 4) an HTTP protocol version, such as “HTTP/1.0”.
  • For example, a request for the data object with a path name of “/pub/” and a name of “MyData.html” takes the form “GET /pub/MyData.html HTTP/1.0”.
  • Processing of a GET request entails the establishing of a TCP/IP connection with the server named in the GET request and receipt from the server of the data object specified. After receiving and interpreting a request message, a server responds in the form of an HTTP RESPONSE message.
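  • As a rough illustration of how such a request message might be assembled, the sketch below builds the text of a GET request for a given path and object name. The function name `build_get_request` is hypothetical, and the layout of the request line (method, object path, protocol version) follows the HTTP/1.0 specification rather than any code given in this document. No network connection is opened.

```python
def build_get_request(path, name, version="HTTP/1.0"):
    """Return an HTTP GET request message as a string.

    The request line has the form: method, object path, protocol version.
    """
    # Normalize the path so that it starts and ends with a slash.
    if not path.startswith("/"):
        path = "/" + path
    if not path.endswith("/"):
        path = path + "/"
    request_line = f"GET {path}{name} {version}"
    # A blank line (CRLF CRLF) terminates the request header section.
    return request_line + "\r\n\r\n"


request = build_get_request("/pub/", "MyData.html")
```

  • The resulting string could then be written to a TCP connection opened to the server; that step is omitted here so the sketch stays self-contained.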
  • The HTTP RESPONSE message includes a status line comprising a protocol version followed by a numeric Status Code and an associated textual Reason Phrase. These elements are separated by space characters.
  • the format of a status line is: HTTP-Version SP Status-Code SP Reason-Phrase, where SP denotes a space character.
  • the status line always begins with a protocol version and status code, e.g., “HTTP/1.0 200”
  • the status code element is a three digit integer result code of the attempt to understand and satisfy a prior request message.
  • the reason phrase is intended to give a short textual description of the status code.
  • the first digit of the status code defines the class of response. There are five categories for the first digit. 1XX is an information response. It is not currently used. 2XX is a successful response, indicating that the action was successfully received, understood and accepted. 3XX is a redirection response, indicating that further action must be taken in order to complete the request. 4XX is a client error response. This indicates a bad syntax in the request. Finally, 5XX is a server error. This indicates that the server failed to fulfill an apparently valid request.
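  • The status line structure and the five response classes described above can be sketched in a few lines. The helper names `parse_status_line` and `status_class`, and the class labels in `STATUS_CLASSES`, are assumptions chosen for this illustration.

```python
# Response classes keyed by the first digit of the status code.
STATUS_CLASSES = {
    "1": "informational",
    "2": "success",
    "3": "redirection",
    "4": "client error",
    "5": "server error",
}


def parse_status_line(line):
    """Split a status line into (version, status code, reason phrase)."""
    # Split on the first two spaces only; the reason phrase may contain spaces.
    parts = line.strip().split(" ", 2)
    if len(parts) < 2:
        raise ValueError(f"malformed status line: {line!r}")
    version, code = parts[0], parts[1]
    reason = parts[2] if len(parts) == 3 else ""
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"status code must be three digits: {code!r}")
    return version, int(code), reason


def status_class(code):
    """Map a status code to its response class using the first digit."""
    return STATUS_CLASSES.get(str(code)[0], "unknown")
```

  • For instance, `parse_status_line("HTTP/1.0 404 Not Found")` yields the version, the integer code and the reason phrase, and `status_class(404)` reports a client error.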
  • a user may generate a relational database query (e.g., SQL query) which operates on the tuples produced by the wrapper. Accordingly, the relational database system views the semistructured information as one or more database tables as a result of the wrapper's processing.
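  • To make the idea of querying wrapper output concrete, the following sketch loads a few tuples into an in-memory SQLite table and runs an SQL query over them. The table name `jobs`, its columns and the sample tuples are invented for illustration; the document does not specify a schema or a database product.

```python
import sqlite3

# Hypothetical tuples, as a wrapper might produce them from job-listing pages.
wrapper_tuples = [
    ("Programmer", "Acme", 2),
    ("Architect", "Initech", 7),
    ("Tester", "Acme", 1),
]

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE jobs (title TEXT, employer TEXT, years_required INTEGER)"
)
connection.executemany("INSERT INTO jobs VALUES (?, ?, ?)", wrapper_tuples)

# A condition a keyword search cannot express: fewer than three years required.
rows = connection.execute(
    "SELECT title, employer FROM jobs WHERE years_required < 3 ORDER BY title"
).fetchall()
```

  • Once the listings are tuples in a table, a numeric comparison such as "less than three years of experience" becomes an ordinary `WHERE` clause.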
  • FIG. 2A depicts a flowchart 101 of a process of defining, generating and using a wrapper to access semistructured information from disparate semistructured information sources.
  • a wrapper may be described using a description language called a Site Description Language (SDL), which provides mechanisms for specifying different types of interactions between the wrapper and data sources.
  • In a step 102, semistructured information is examined to identify patterns including attributes.
  • In a step 104, SDL statements describing the patterns are specified in a definitional file.
  • the definitional file produced in step 104 is acted upon by a compiler or an interpreter to produce a wrapper.
  • multiple wrappers corresponding to different semistructured information are generated for a particular application.
  • one or more mappers may be provided in order to translate attributes within semistructured information to fields in the relational database schema.
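  • A mapper of the kind described above might be sketched as a simple renaming of site-specific attribute names to the fields of a shared schema. The function `make_mapper`, the attribute names and the field names below are all hypothetical.

```python
def make_mapper(attribute_to_field):
    """Return a function that renames site attributes to schema fields."""
    def mapper(record):
        # Attributes with no mapping are dropped rather than guessed at.
        return {
            attribute_to_field[attribute]: value
            for attribute, value in record.items()
            if attribute in attribute_to_field
        }
    return mapper


# Two sites that use different vocabularies for the same information.
site_a_mapper = make_mapper({"position": "title", "exp": "years_required"})
site_b_mapper = make_mapper({"job_name": "title", "experience": "years_required"})

unified = [
    site_a_mapper({"position": "Programmer", "exp": "2", "banner": "ad"}),
    site_b_mapper({"job_name": "Architect", "experience": "7"}),
]
```

  • After mapping, records from both sites share the same field names, so an agent can treat them as rows of one relation.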
  • FIG. 2B depicts a flowchart 201 showing the process steps for defining a wrapper according to step 102 of flowchart 101.
  • In a step 202, the semistructured information is examined for repetitive patterns of interest using lexical analysis techniques, as are well known to persons of ordinary skill in the art. These repetitive patterns of interest include one or more attributes.
  • the occurrences of the patterns in the semistructured information are cataloged by name and position in a nested structure without a priori information, i.e., there is no requirement that a user have prior knowledge or perform any prior programming before the patterns are cataloged.
  • the nested structure is a graph representing the nesting of the attributes within the semistructured information.
  • many of the attributes of the nested structure correspond to fields of a relational database schema.
  • In a step 206, the patterns in the nested structure are examined to identify attributes that correspond to fields of a relational database schema. After these attributes are identified, regular expressions are generated that specify the location of the attributes within the semistructured information. The regular expressions may be generated as soon as these attributes are identified or when the definition of the semistructured information is written to a file. Thus, the generation of the regular expressions need not be performed at any specific time.
  • step 207 the patterns in the nested structure are examined to identify patterns that may be further cataloged. Some patterns of interest may be further broken down into sub-component patterns of interest. Each one of these patterns that is identified is decomposed into its constituent patterns. These constituent patterns are then cataloged in the nested structure for further examination.
  • the patterns in the nested structure are examined to identify links to other semistructured information.
  • the links identified in step 208 point to other semistructured information that may include patterns of interest and attributes.
  • the links are traversed to further semistructured information, which is examined for patterns of interest. If patterns of interest are discovered, they are cataloged in the nested structure.
  • the links are Uniform Resource Locator (“URL”) addresses of web pages.
  • The links may also point to a program which, when executed, will generate semistructured information output. In the latter case, the program is executed and the output is examined.
  • In a decisional step 210, it is determined whether there is more nested information to examine. If more nested information exists, then it is examined to identify attributes corresponding to fields in the relational database schema identified in step 206.
  • Although steps 206, 207 and 208 are shown in a particular order, it is not required that these steps, like many other steps in the flowcharts, be performed in the order shown. Thus, the order shown in the flowcharts is intended to illustrate one embodiment and not to limit the invention.
  • A definition of the semistructured information is produced, which serves as input to a program translator to build a parser.
  • This definition of the semistructured information comprises regular expressions having attributes corresponding to fields of the relational database schema.
  • The regular expressions specify locations of the attributes within the semistructured information that correspond to the relational database schema.
  • The wrapper produced by the program translator includes a parser that is capable of parsing the semistructured information for attributes so that these attributes can be presented to a relational database system as tuples when the wrapper executes.
  • In one embodiment, the program translator is a compiler, which generates a parser by receiving the definition file as input and generating a program (i.e., the parser) for extracting attributes from the semistructured information that correspond to fields of the relational database schema to form tuples.
  • In another embodiment, the program translator is an interpreter, which generates a parser from the definition of the semistructured information and the semistructured information as inputs, by extracting attributes from the semistructured information that correspond to fields of the relational schema to form tuples.
  • FIG. 2C depicts a flowchart 203 showing process steps performed by a typical wrapper in traversing web pages to collect semistructured information according to a particular embodiment of the present invention.
  • A variable root is set to be the root URL of a particular site.
  • A URL is created for a target site of interest with a call to a url( ) function, for example url(“https://www.company.com”) or, for forms that take a relative URL string and a context, url(“next-page.html”, root).
  • A web page corresponding to the URL created in step 222 is fetched with a getString(url) function.
  • In step 224, if step 223 failed to fetch a web page, the routine terminates and processing returns. Otherwise, in a step 225, a regular expression is matched against a string of input representing the web page fetched in step 223 as depicted in line 1 below:
  • The match( ) function returns a list of possible matches which can be cycled through with a next( ) function, as depicted in line 2 above. Each call to next( ) returns the next match of the pattern.
  • Table 2 lists the parameters of the match( ) function, in accordance with a particular embodiment of the present invention:
  • A pattern of the form “$var more pattern” is not implemented in this embodiment; however, statements such as “pattern” + var + “more pattern” are permitted.
  • mask (optional): A mask of options, formed by or-ing together a plurality of bits: I (ignore case), S (single line) and X (extended match). If mask is not specified, it defaults to the value of matchDefault, which is initially Matcher.I.
  • In step 226, the existence of a match is determined, and if no match is found, processing proceeds with step 228, obtaining the next page. Otherwise, if a match has occurred, then in a step 230, information is retrieved from the match.
  • A group(i) function is provided to retrieve matches. A basic pattern for matching is described in line 4 below. Use of the group(i) function is depicted in line 6 below:
  • In step 232, information retrieved from a match is added to the listing stack using a set( ) function, as depicted in line 13 below.
  • In step 234, the contents of the listing stack are placed in a table using the emit( ) function, as depicted in line 14 below:
  • Table 3 lists the parameters of the set( ) function:
  • A set( ) function adds information to the table that is the answer to a query.
  • Two sets of attribute/value pairs are specified: a job title is s1 and an ad text is s2. In this embodiment, from 1 to 6 attribute/value pairs are permitted.
  • The mark( ) and reset( ) functions are used to show the start and end, respectively, of a portion of a listing. Each match of the pattern represents a separate instance of information, here a job listing, the information for which must be kept separate from all others. This is accomplished by using mark( ) at the beginning of each listing, and reset( ) at the end.
  • An emit( ) function specifies that collected data is to be passed to the table as part of the answer to the query. Typically, emit( ) appears at the end of an innermost loop. Emit( ) processing automatically extracts fields that have not been assigned within the course of wrapper processing.
  • A key concept is the listing stack.
  • Each call to set( ) adds a new listing to the listing stack.
  • Each call to emit( ) causes the listing stack to be converted into a sequence of rows in a table, which sequence of rows serves to answer the original query.
  • An individual listing consists of a number of fields (or attributes, or columns), each of which may be empty, may be filled with a single value, or may be filled with a vector of values.
  • An empty column corresponds to “don't know”.
  • A vector of values corresponds to “all of the above”. That is, adding the listing in line 24 below to the stack:
  • A listing stack is a representation of a table having a sequence of rows.
  • The emit function converts a listing stack into a table. Emit function processing can be described by the following rule:
  • A row r is a candidate member of the resulting table if and only if r can be formed by starting with an all-null row and then repeatedly selecting some row s from the listing stack and filling in any null fields in r with the corresponding field values from s.
  • The resulting table is the same as the listing stack.
  • The resulting table has a single row with all the fields combined. That is, the listing stack:
  • The processing steps are that the F, P, and A columns must be filled with the only possible value. Remaining is one group of two rows with values for C, S and T, and another group of two rows with values for N (and either F or P, but these have already been dealt with). Therefore, form the cross product of the two N rows with the two C/S/T rows to get four rows, and then fill in the blanks with the only possible values.
  • Each listing comprises fields from a closed set, in other words, a complete set of fields for which no other values are permitted. For example, City/State/Zip go together, as do Job Title and Job Category.
  • The set of potential values can be restricted by setting all values not of interest to NOVALUE.
  • Information listings could be defined having associated priority or probability meta data.
  • FIG. 2D depicts a flowchart 205 showing the steps for computing an information closure for a set of rows in a listing stack.
  • A cross product is computed for the first row in the listing stack.
  • The cross product computed in step 240 is added to a list of accepted rows.
  • In a decisional step 244, a check is done for any further remaining rows in the listing stack. If a remaining row is found, then in a step 246, a selective cross product is computed on the remaining row and the list of accepted rows started in step 240. Otherwise, if no further rows remain, then in a step 248, the list of accepted rows is reduced by eliminating rows having identical fields.
  • The resulting list of accepted rows is provided as the information closure.
  • The pseudo code for this algorithm is depicted in the lines below:
  • FIG. 2E depicts a flowchart 207 showing the component steps for step 246 of FIG. 2D, computing the selective cross product for a list of accepted rows and a remaining row.
  • In a step 260, an interim result is initialized to empty.
  • In a decisional step 262, a determination is made whether there are any further rows in the list of accepted rows to process. If there are further rows to process, then processing of the next accepted row in the list continues in a step 264, in which a new row r′ is computed from the accepted row extended with non-empty fields of the remaining row passed to the routine.
  • A new row n′ is computed from the remaining row passed to the routine extended with non-empty fields of the accepted row.
  • Rows r′ and n′ are added to the result, and processing continues with the next row in the accepted row list at step 262. If, in step 262, there are no further rows to process, then in a step 270, the routine returns the result as the selective cross product.
  • The pseudo code for this algorithm is depicted in the lines below:
  • The mark( ) and reset( ) methods also operate on the listing stack. They keep track of progress during the traversal of a site, and release information that is no longer of interest, because the corresponding table entries have already been emitted. For example, a site may have some common information, such as a header page (perhaps the contact phone and fax numbers), followed by specific information for each job on separate pages.
  • The corresponding wrapper would be:
  • The reset( ) sets the listing stack back to the state it was in at the previous mark( ), thereby discarding the job just processed, but keeping the common information that came before the mark( ). In certain instances, omitting the mark( ) and reset( ) yields the same results because of the way the information closure algorithm is defined. However, it is more efficient to include mark( )/reset( ) when possible.
  • The described method can be especially useful in the area of electronic commerce applications.
  • An online book or music store could be created using the methods of the invention to process semistructured information about music and book offerings.
  • Semistructured information about other products and services can be processed by embodiments according to the present invention.
  • The present invention provides for a system for automated extraction of information from a plurality of semistructured information sources.
  • An advantage of the present invention is that information is automatically propagated to related tuples.
  • A further advantage of the present invention is that it enables the use of a relational database to organize information obtained from a semistructured source, such as web pages on the world wide web.
  • A yet further advantage of the present invention is that it enables data mining to be accomplished using a relational database.
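
The listing stack operations described above (set( ), mark( ), reset( ) and emit( )), together with the information closure of FIG. 2D and the selective cross product of FIG. 2E, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pseudo code of the specification: the class name ListingStack, the helper names, the use of a missing dictionary key for an empty field, the expansion of every vector-valued listing before combination, and the sample field values are all hypothetical.

```python
from itertools import product


def expand(listing):
    """Expand vector-valued fields into one row per combination of values."""
    keys = list(listing)
    pools = [v if isinstance(v, (list, tuple)) else [v] for v in listing.values()]
    return [dict(zip(keys, combo)) for combo in product(*pools)]


def extend(base, other):
    """Copy base and fill the fields it lacks with values from other."""
    row = dict(other)
    row.update(base)  # fields already present in base are kept
    return row


def selective_cross_product(accepted, remaining):
    """For each accepted row r, produce r' and n' as described for FIG. 2E."""
    result = []
    for r in accepted:
        result.append(extend(r, remaining))  # r': accepted row extended by remaining row
        result.append(extend(remaining, r))  # n': remaining row extended by accepted row
    return result


def dedupe(rows):
    """Drop rows whose fields are all identical to an earlier row."""
    seen, unique = set(), []
    for row in rows:
        key = frozenset(row.items())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def information_closure(listings):
    """Fold every listing into the accepted rows, as described for FIG. 2D."""
    rows = [row for listing in listings for row in expand(listing)]
    if not rows:
        return []
    accepted = [rows[0]]
    for remaining in rows[1:]:
        # Assumption: duplicates are removed on every pass rather than once at
        # the end; the set of distinct rows produced is the same.
        accepted = dedupe(selective_cross_product(accepted, remaining))
    return accepted


class ListingStack:
    """Hypothetical listing stack with set/mark/reset/emit operations."""

    def __init__(self):
        self.listings = []
        self.marks = []

    def set(self, **fields):
        self.listings.append(fields)  # each call adds one listing

    def mark(self):
        self.marks.append(len(self.listings))

    def reset(self):
        # Assumption: reset() discards everything added since the last mark().
        del self.listings[self.marks.pop():]

    def emit(self):
        return information_closure(self.listings)


stack = ListingStack()
stack.set(Phone="555-0100", Fax="555-0101")  # common header information
table = []
for title, city in [("Engineer", ["Palo Alto", "Seattle"]), ("Analyst", "Austin")]:
    stack.mark()
    stack.set(JobTitle=title)
    stack.set(City=city)  # a vector of values means "all of the above"
    table.extend(stack.emit())
    stack.reset()
```

In this sketch the common phone and fax listing survives each reset( ), so every emitted row carries it, while each job's own listings are discarded once its rows have been emitted.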

Abstract

According to the invention, a system and method are provided for extracting information from a semistructured information source. The system includes a listing stack for holding extracted information. A means for matching at least one extractor to the semistructured information to return a list of potential matches is also included. The system can also include a means for iterating through the list of potential matches and a means for retrieving information from a particular match in the list of potential matches. A means for adding a particular match into the listing stack can also be part of the system.

Description

CROSS-REFERENCES TO RELATED APPLICATIONS
This application is a continuation of and claims the benefit of U.S. patent application Ser. No. 09/196,421, filed Nov. 19, 1998, now abandoned, which is a continuation-in-part of and claims the benefit of U.S. Provisional Application No. 60/066,125, filed Nov. 21, 1997, both of which are hereby incorporated by reference. This application is also related to commonly-owned, concurrently filed U.S. application Ser. No. 10/000,235, entitled “Method for Creating an Information Closure Model,” by Ashish Gupta, et al., which is a continuation of U.S. application Ser. No. 09/196,026, filed Nov. 19, 1998, now abandoned. The disclosure of U.S. application Ser. No. 10/000,235 is hereby incorporated by reference.
This application makes reference to the following commonly owned U.S. Patent, which is incorporated herein in its entirety for all purposes:
Copending U.S. Pat. No. 5,826,258, in the name of Ashish Gupta, et al., entitled “Method and Apparatus for Structuring the Querying and Interpretation of Semistructured Information,” relates to information retrieval and interpretation from disparate semistructured information resources.
COPYRIGHT NOTICE
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
BACKGROUND OF THE INVENTION
This invention relates to structured information retrieval and interpretation from disparate semistructured information resources. A particular application of the invention is extraction of information from public and semipublic databases through worldwide information sources, as facilitated by the Internet.
The Internet provides avenues for worldwide communication of information, ideas and messages. Although the Internet has been utilized by academia for decades, recently public interest has turned to the Internet and the information made available by it. The World Wide Web (or “the Web”) accounts for a significant part of the growth in the popularity of the Internet, due in part to the user-friendly graphical user interfaces (“GUIs”) that are readily available for accessing the Web.
The World Wide Web makes hypertext documents available to users over the Internet. A hypertext document does not present information linearly like a book, but instead provides the reader with links or pointers to other locations so that the user may jump from one location to another. The hypertext documents on the Web are written in the Hypertext Markup Language (“HTML”).
As the popularity of the World Wide Web grows, so too does the wealth of information it provides. Accordingly, there may be many sites and pages on the World Wide Web that contain information a user is seeking. However, the Web contains no built-in mechanism for searching for information of interest. Without a searching mechanism, finding sites of interest would literally be like finding a needle in a haystack. Fortunately, there exist a number of web sites (e.g., YAHOO, ALTA VISTA, EXCITE, etc.) that allow users to perform relatively simple keyword searches.
Although keyword searches are adequate for many applications, they fail miserably for many others. For example, there are numerous web sites that include multiple entries or lists on job openings, houses for sale, and the like. Keyword searches are inadequate to search these sites for many reasons. Keyword searches invariably turn up information that, although matching the keywords, is not of interest. This problem may be alleviated somewhat by narrowing the search parameters, but this has the attendant risk of missing information of interest. Additionally, the search terms supported may not allow identification of information of interest. As an example, one may not be able to specify in a keyword search query to find job listings that require less than three years of experience in computer programming.
Ideally, it would be desirable if information like job listings on multiple web sites could appear as a single relational database so that relational database queries could be utilized to find information of interest. However, there is no standard for the structure of information like job listings on the Web. This problem was addressed in a co-owned U.S. Pat. No. 5,826,258, in the name of Ashish Gupta, et al., entitled “Method and Apparatus for Structuring the Querying and Interpretation of Semistructured Information,” which introduced the concept of “Wrappers” for retrieving and interpreting information from disparate semistructured information resources. Wrappers are programs that interact with web sites to obtain information stored in the web site and then to structure it according to a prespecified schema. In a copending U.S. patent application Ser. No. 10/000,235, in the name of Ashish Gupta, et al., entitled “Method for Creating an Information Closure Model,” methods for forming the information closure of information gathered by a wrapper are disclosed. However, the methods for formulating extractors, field objects and inheritance hierarchies in a wrapper framework of the present invention are heretofore not known in the art.
What is needed is a method of formulating extractors, field objects and inheritance hierarchies for retrieving and interpreting information from semistructured resources for incorporation into a relational database.
SUMMARY OF THE INVENTION
According to the invention, a system is provided for extracting information from a semistructured information source. The system includes a listing stack for holding extracted information. A means for matching at least one extractor to the semistructured information to return a list of potential matches is also included. The system can also include a means for iterating through the list of potential matches and a means for retrieving information from a particular match in the list of potential matches. A means for adding a particular match into the listing stack can also be part of the system.
In another aspect of the present invention, a method for extracting information from a semistructured information source into a listing stack is provided. The step of matching at least one extractor to the semistructured information in order to return a list of potential matches is included in the method. A step of iterating through the list of potential matches can also be part of the method. Information from a particular match in the list of potential matches can be retrieved in another step. The method can also include a step of adding a particular match into the listing stack. Combinations of these steps can extract information from a semistructured information source.
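The steps summarized above (matching an extractor, iterating through the potential matches, retrieving information from each match, and adding it to the listing stack) can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample page text, the regular expression used as the extractor, the field names JobTitle and AdText, and the use of a plain list for the listing stack are all hypothetical and are not taken from the specification.

```python
import re

# Hypothetical page text standing in for a fetched web page.
PAGE = """
<li><b>Software Engineer</b> - Builds wrapper programs.</li>
<li><b>Database Analyst</b> - Tunes relational queries.</li>
"""

# Hypothetical extractor: a regular expression with one group per attribute.
EXTRACTOR = re.compile(r"<li><b>(.*?)</b> - (.*?)</li>", re.IGNORECASE)


def extract(page, extractor):
    """Match the extractor against the page and collect one listing per match."""
    listing_stack = []
    for match in extractor.finditer(page):  # iterate through the potential matches
        title = match.group(1)              # retrieve information from this match
        ad_text = match.group(2)
        listing_stack.append({"JobTitle": title, "AdText": ad_text})
    return listing_stack


listings = extract(PAGE, EXTRACTOR)
for listing in listings:
    print(listing)
```

Each dictionary in the returned list plays the role of one listing; converting the listings into rows of a relational table is a separate step.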
Numerous benefits are achieved by way of the present invention for enabling the use of a relational database to organize information obtained from a semistructured source, such as Web pages on the World Wide Web, over conventional Web search techniques. In some embodiments, the present invention is easier to use than conventional user interfaces. The present invention can provide a way to automatically propagate information to related tuples. Some embodiments according to the invention are easier for new users to learn than known techniques. The present invention enables data mining to be accomplished using a relational database. These and other benefits are described throughout the present specification.
A further understanding of the nature and advantages of the inventions herein may be realized by reference to the remaining portions of the specification and the attached drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1A depicts a representative client server relationship in accordance with a particular embodiment of the invention;
FIG. 1B depicts a functional perspective of the representative client server relationship in accordance with a particular embodiment of the invention;
FIG. 1C depicts a representative internetworking environment in accordance with a particular embodiment of the invention;
FIG. 1D depicts a relationship diagram of the layers of the TCP/IP protocol suite;
FIG. 2A depicts a flowchart of process steps in producing a wrapper in accordance with a particular embodiment of the invention;
FIG. 2B depicts a flowchart of process steps in defining a wrapper in accordance with a particular embodiment of the invention;
FIG. 2C depicts a flowchart of process steps in the execution of a wrapper in accordance with a particular embodiment of the invention;
FIG. 2D depicts a flowchart of process steps in computing an information closure for a listing stack in a wrapper in accordance with a particular embodiment of the invention; and
FIG. 2E depicts a flowchart of process steps in computing a selective cross product for determining an information closure for a listing stack in a wrapper in accordance with a particular embodiment of the invention.
DESCRIPTION OF THE SPECIFIC EMBODIMENTS
1.0 Introduction
The present invention provides a system for automated extraction of information from a plurality of semistructured information sources, useful for incorporating the extracted information as tuples into a relational database. Systems according to the present invention enable network programmers to build wrapper programs capable of accessing multiple web sites, extracting information therefrom and incorporating the resulting information into relational databases for search. Table 1 provides a definitional list of terminology used herein.
TABLE 1
LIST OF DEFINITIONAL TERMS
Semistructured information: Information that as a whole does not have a precise structure, however, elements within the semistructured information have meanings based on their location or surroundings within the semistructured information. The format of semistructured information may be represented by a grammar or by regular expressions, typically nested regular expressions.
Site: A location or object including related, interconnected collection of blocks of text, forms, and the like. For example, a web site may present text as semistructured information in the form of a web page.
Agent: A program that serves the information needs of a user. Often an agent will have a visible component. For example, an agent may include a user interface that accepts a user's relational database query and displays the results of the query.
Wrapper (or site program): A software layer that provides a relational database interface to information on a site.
Mapper: A component responsible for translating the different site vocabularies into one that an agent understands. Mappers generally reside between agents and wrappers, providing a level of insulation between the two.
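The role of a mapper defined in Table 1 can be illustrated with a short Python sketch. It assumes, purely for illustration, that each site's vocabulary is a dictionary from site-specific attribute names to the field names an agent understands; the site names, attribute names and the function map_row are hypothetical.

```python
# Hypothetical per-site vocabularies, each mapped to the agent's field names.
SITE_VOCAB = {
    "site_a": {"position": "JobTitle", "town": "City"},
    "site_b": {"role": "JobTitle", "location": "City"},
}


def map_row(site, row):
    """Rename a wrapper's attributes to the agent's field names.

    Attributes with no entry in the site's vocabulary are passed through unchanged.
    """
    vocab = SITE_VOCAB[site]
    return {vocab.get(name, name): value for name, value in row.items()}


row_a = map_row("site_a", {"position": "Engineer", "town": "Austin"})
row_b = map_row("site_b", {"role": "Analyst", "location": "Seattle"})
```

Because both rows come out with the same field names, an agent can query them without knowing which site each came from.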
1.1 Hardware Overview
The system for automated extraction of information from a plurality of semistructured information sources of the present invention is implemented in the Perl and Java programming languages and is operational on a computer system such as shown in FIG. 1A. This invention may be implemented in a client-server environment, but a client-server environment is not essential. FIG. 1A shows a conventional client-server computer system which includes a server 20 and numerous clients, one of which is shown as client 25. The term “server” is used in the context of the invention, wherein the server receives queries from (typically remote) clients, does substantially all the processing necessary to formulate responses to the queries, and provides these responses to the clients. However, server 20 may itself act in the capacity of a client when it accesses remote databases located at another node acting as a database server.
The hardware configurations are in general standard and will be described only briefly. In accordance with known practice, server 20 includes one or more processors 30 which communicate with a number of peripheral devices via a bus subsystem 32. These peripheral devices typically include a storage subsystem 35, comprised of memory subsystem 35a and file storage subsystem 35b, which hold computer programs (e.g., code or instructions) and data, a set of user interface input and output devices 37, and an interface to outside networks, which may employ Ethernet, Token Ring, ATM, IEEE 802.3, ITU X.25, Serial Line Internet Protocol (SLIP) or the public switched telephone network. This interface is shown schematically as a “Network Interface” block 40. It is coupled to corresponding interface devices in client computers via a network connection 45.
Client 25 has the same general configuration, although typically with less storage and processing capability. Thus, while the client computer could be a terminal or a low-end personal computer, the server computer is generally a high-end workstation or mainframe, such as a SUN SPARC™ server. Corresponding elements and subsystems in the client computer are shown with corresponding, but primed, reference numerals.
The user interface input devices typically include a keyboard and may further include a pointing device and a scanner. The pointing device may be an indirect pointing device such as a mouse, trackball, touchpad, or graphics tablet, or a direct pointing device such as a touchscreen incorporated into the display. Other types of user interface input devices, such as voice recognition systems, are also possible.
The user interface output devices typically include a printer and a display subsystem, which includes a display controller and a display device coupled to the controller. The display device may be a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), or a projection device. The display controller provides control signals to the display device and normally includes a display memory for storing the pixels that appear on the display device. The display subsystem may also provide non-visual display such as audio output.
The memory subsystem typically includes a number of memories including a main random access memory (RAM) for storage of instructions and data during program execution and a read only memory (ROM) in which fixed instructions are stored. In the case of Macintosh-compatible personal computers the ROM would include portions of the operating system; in the case of IBM-compatible personal computers, this would include the BIOS (basic input/output system).
The file storage subsystem provides persistent (non-volatile) storage for program and data files, and typically includes at least one hard disk drive and at least one floppy disk drive (with associated removable media). There may also be other devices such as a CD-ROM drive and optical drives (all with their associated removable media). Additionally, the computer system may include drives of the type with removable media cartridges. The removable media cartridges may, for example, be hard disk cartridges, such as those marketed by Syquest and others, and flexible disk cartridges, such as those marketed by Iomega. One or more of the drives may be located at a remote location, such as in a server on a local area network or at a site of the Internet's World Wide Web.
In this context, the term “bus subsystem” is used generically so as to include any mechanism for letting the various components and subsystems communicate with each other as intended. With the exception of the input devices and the display, the other components need not be at the same physical location. Thus, for example, portions of the file storage system could be connected via various local-area or wide-area network media, including telephone lines. Similarly, the input devices and display need not be at the same location as the processor, although it is anticipated that the present invention will most often be implemented in the context of PCs and workstations.
Bus subsystem 32 is shown schematically as a single bus, but a typical system has a number of buses such as a local bus and one or more expansion buses (e.g., ADB, SCSI, ISA, EISA, MCA, NuBus, or PCI), as well as serial and parallel ports. Network connections are usually established through a device such as a network adapter on one of these expansion buses or a modem on a serial port. The client computer may be a desktop system or a portable system.
The user interacts with the system using interface devices 37′ (or devices 37 in a standalone system). For example, client queries are entered via a keyboard, communicated to client processor 30′, and thence to network interface 40′ over bus subsystem 32′. The query is then communicated to server 20 via network connection 45. Similarly, results of the query are communicated from the server to the client via network connection 45 for output on one of devices 37′ (say a display or a printer), or may be stored on storage subsystem 35′.
FIG. 1B is a functional diagram of the computer system of FIG. 1A. FIG. 1B depicts a server 20, and a representative client 25 of a multiplicity of clients which may interact with the server 20 via the internet 45 or any other communications method. Blocks to the right of the server are indicative of the processing components and functions which occur in the server's program and data storage indicated by block 35a in FIG. 1A. A TCP/IP “stack” 44 works in conjunction with Operating System 42 to communicate with processes over a network or serial connection attaching Server 20 to internet 45. Web server software 46 executes concurrently and cooperatively with other processes in server 20 to make data objects 50 and 51 available to requesting clients. A Common Gateway Interface (CGI) script 55 enables information from user clients to be acted upon by web server 46, or other processes within server 20. Responses to client queries may be returned to the clients in the form of Hypertext Markup Language (HTML) document outputs, which are then communicated via internet 45 back to the user.
Client 25 in FIG. 1B possesses software implementing functional processes operatively disposed in its program and data storage as indicated by block 35a′ in FIG. 1A. TCP/IP stack 44′ works in conjunction with Operating System 42′ to communicate with processes over a network or serial connection attaching Client 25 to internet 45. Software implementing the function of a web browser 46′ executes concurrently and cooperatively with other processes in client 25 to make requests of server 20 for data objects 50 and 51. The user of the client may interact via the web browser 46′ to make such queries of the server 20 via internet 45 and to view responses from the server 20 via internet 45 on the web browser 46′.
1.2 Network Overview
FIG. 1C is illustrative of the internetworking of a plurality of clients such as client 25 of FIGS. 1A and 1B and a multiplicity of servers such as server 20 of FIGS. 1A and 1B as described herein above. In FIG. 1C, a network 70 is an example of a Token Ring or frame oriented network. Network 70 links a host 71, such as an IBM RS6000 RISC workstation, which may be running the AIX operating system, to a host 72, which is a personal computer, which may be running Windows 95, IBM 0S/2 or a DOS operating system, and a host 73, which may be an IBM AS/400 computer, which may be running the OS/400 operating system. Network 70 is internetworked to a network 60 via a system gateway which is depicted here as router 75, but which may also be a gateway having a firewall or a network bridge. Network 60 is an example of an Ethernet network that interconnects a host 61, which is a SPARC workstation, which may be running SUNOS operating system with a host 62, which may be a Digital Equipment VAX6000 computer which may be running the VMS operating system.
Router 75 is a network access point (NAP) of network 70 and network 60. Router 75 employs a Token Ring adapter and an Ethernet adapter. This enables router 75 to interface with the two heterogeneous networks. Router 75 is also aware of the inter-network protocols, such as ICMP, ARP and RIP, which are described below.
FIG. 1D is illustrative of the constituents of the Transmission Control Protocol/Internet Protocol (TCP/IP) protocol suite. The base layer of the TCP/IP protocol suite is the physical layer 80, which defines the mechanical, electrical, functional and procedural standards for the physical transmission of data over communications media, such as, for example, the network connection 45 of FIG. 1A. The physical layer may comprise electrical, mechanical or functional standards such as whether a network is packet switching or frame-switching; or whether a network is based on a Carrier Sense Multiple Access/Collision Detection (CSMA/CD) or a frame relay paradigm.
Overlying the physical layer is the data link layer 82. The data link layer provides the function and protocols to transfer data between network resources and to detect errors that may occur at the physical layer. Operating modes at the datalink layer comprise such standardized network topologies as IEEE 802.3 Ethernet, IEEE 802.5 Token Ring, ITU X.25, or serial (SLIP) protocols.
Network layer protocols 84 overlay the datalink layer and provide the means for establishing connections between networks. The standards of network layer protocols provide operational control procedures for internetworking communications and routing information through multiple heterogeneous networks. Examples of network layer protocols are the Internet Protocol (IP) and the Internet Control Message Protocol (ICMP). The Address Resolution Protocol (ARP) is used to correlate an Internet address and a Media Access Control (MAC) address of a particular host. The Routing Information Protocol (RIP) is a dynamic routing protocol for passing routing information between hosts on networks. The Internet Control Message Protocol (ICMP) is an internal protocol for passing control messages between hosts on various networks. ICMP messages provide feedback about events in the network environment or can help determine if a path exists to a particular host in the network environment. The latter is called a “Ping”. The Internet Protocol (IP) provides the basic mechanism for routing packets of information in the Internet. IP is a non-reliable communication protocol. It provides a “best efforts” delivery service and does not commit network resources to a particular transaction, nor does it perform retransmissions or give acknowledgments.
The transport layer protocols 86 provide end-to-end transport services across multiple heterogeneous networks. The User Datagram Protocol (UDP) provides a connectionless, datagram oriented service which provides a non-reliable delivery mechanism for streams of information. The Transmission Control Protocol (TCP) provides a reliable session-based service for delivery of sequenced packets of information across the Internet. TCP provides a connection oriented reliable mechanism for information delivery.
The session, or application layer 88 provides a list of network applications and utilities, a few of which are illustrated here. For example, File Transfer Protocol (FTP) is a standard TCP/IP protocol for transferring files from one machine to another. FTP clients establish sessions through TCP connections with FTP servers in order to obtain files. Telnet is a standard TCP/IP protocol for remote terminal connection. A Telnet client acts as a terminal emulator and establishes a connection using TCP as the transport mechanism with a Telnet server. The Simple Network Management Protocol (SNMP) is a standard for managing TCP/IP networks. SNMP tasks, called “agents”, monitor network status parameters and transmit these status parameters to SNMP tasks called “managers.” Managers track the status of associated networks. A Remote Procedure Call (RPC) is a programming interface which enables programs to invoke remote functions on server machines. The Hypertext Transfer Protocol (HTTP) facilitates the transfer of data objects across networks via a system of Uniform Resource Identifiers (URIs).
The Hypertext Transfer Protocol is a simple protocol built on top of Transmission Control Protocol (TCP). It is the mechanism which underlies the function of the World Wide Web. HTTP provides a method for users to obtain data objects from various hosts acting as servers on the Internet. User requests for data objects are made by means of an HTTP request, such as a GET request. A GET request as depicted below comprises 1) the method name, GET; followed by 2) the full path and name of the data object; followed by 3) an HTTP protocol version, such as “HTTP/1.0”. In the GET request shown below, a request is being made for the data object with a path name of “/pub/” and a name of “MyData.html”:
GET /pub/MyData.html HTTP-Version  (1)
Processing of a GET request entails the establishing of a TCP/IP connection with the server named in the GET request and receipt from the server of the data object specified. After receiving and interpreting a request message, a server responds in the form of an HTTP RESPONSE message.
Response messages begin with a status line comprising a protocol version followed by a numeric Status Code and an associated textual Reason Phrase. These elements are separated by space characters. The format of a status line is depicted in line (2):
Status-Line=HTTP-Version Status-Code Reason-Phrase  (2)
The status line always begins with a protocol version and status code, e.g., “HTTP/1.0 200”. The status code element is a three digit integer result code of the attempt to understand and satisfy a prior request message. The reason phrase is intended to give a short textual description of the status code.
The first digit of the status code defines the class of response. There are five categories for the first digit. 1XX is an information response. It is not currently used. 2XX is a successful response, indicating that the action was successfully received, understood and accepted. 3XX is a redirection response, indicating that further action must be taken in order to complete the request. 4XX is a client error response. This indicates a bad syntax in the request. Finally, 5XX is a server error. This indicates that the server failed to fulfill an apparently valid request.
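By way of illustration only, the status line format of line (2) and the five response classes described above can be sketched in Java as follows. The class name StatusLineSketch and its members are hypothetical and are not part of the described embodiment; the sketch assumes that the three elements of the status line are separated by single space characters.

```java
final class StatusLineSketch {
    final String version;  // e.g. "HTTP/1.0"
    final int code;        // three digit status code
    final String reason;   // textual reason phrase, possibly empty

    private StatusLineSketch(String version, int code, String reason) {
        this.version = version;
        this.code = code;
        this.reason = reason;
    }

    /** Splits "HTTP-Version Status-Code Reason-Phrase" into its three elements. */
    static StatusLineSketch parse(String line) {
        // Limit of 3 keeps any spaces inside the reason phrase intact.
        String[] parts = line.trim().split(" ", 3);
        if (parts.length < 2 || parts[1].length() != 3) {
            throw new IllegalArgumentException("malformed status line: " + line);
        }
        int code = Integer.parseInt(parts[1]);
        String reason = parts.length == 3 ? parts[2] : "";
        return new StatusLineSketch(parts[0], code, reason);
    }

    /** Maps the first digit of the status code to its response class. */
    String responseClass() {
        switch (code / 100) {
            case 1: return "informational";
            case 2: return "success";
            case 3: return "redirection";
            case 4: return "client error";
            case 5: return "server error";
            default: return "unknown";
        }
    }
}
```

For example, parsing “HTTP/1.0 404 Not Found” with this sketch gives a status code of 404, which falls in the client error class.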
2.0 Defining a Wrapper for Semi-structured Information
The process of generating a wrapper for extracting attributes of interest from semistructured information, such as from web data objects, for incorporation into a relational database is more fully described in U.S. Pat. No. 5,826,258, in the name of Ashish Gupta, et. al., entitled “Method and Apparatus for Structuring the Querying and Interpretation of Semistructured Information,” which is incorporated herein by reference for all purposes. The wrapper extracts the attributes of interest from the semistructured information and produces tuples, which may be provided to a relational database system. Once the wrapper for specific semistructured information is executed, a user may generate a relational database query (e.g., SQL query) which operates on the tuples produced by the wrapper. Accordingly, the relational database system views the semistructured information as one or more database tables as a result of the wrapper's processing.
FIG. 2A depicts a flowchart 101 of a process of defining, generating and using a wrapper to access semistructured information from disparate semistructured information sources. As previously taught in the art, a wrapper may be described using a description language called a Site Description Language (SDL), which provides mechanisms for specifying different types of interactions between the wrapper and data sources. In a step 102, semistructured information is examined to identify patterns including attributes. In step 104, SDL statements describing patterns are specified in a definitional file. Next, in a step 106, the definitional file produced in step 104 is acted upon by a compiler or an interpreter to produce a wrapper. Typically, multiple wrappers corresponding to different semistructured information are generated for a particular application. Additionally, one or more mappers may be provided in order to translate attributes within semistructured information to fields in the relational database schema.
FIG. 2B depicts a flowchart 201 showing the process steps for defining a wrapper according to step 102 of flowchart 101. In step 202, the semistructured information is examined for repetitive patterns of interest using lexical analysis techniques, as are well known to persons of ordinary skill in the art. These repetitive patterns of interest include one or more attributes.
In step 204, the occurrences of the patterns in the semistructured information are cataloged by name and position in a nested structure without a priori information, i.e., there is no requirement that a user have prior knowledge or perform any prior programming before the patterns are cataloged. In one embodiment, the nested structure is a graph representing the nesting of the attributes within the semistructured information. Typically, many of the attributes of the nested structure correspond to fields of a relational database schema.
In a step 206, the patterns in the nested structure are examined to identify attributes that correspond to fields of a relational database schema. After these attributes are identified, regular expressions are generated that specify the location of the attributes within the semistructured information. The regular expressions may be generated as soon as these attributes are identified or when the definition of the semistructured information is written to a file. Thus, the generation of the regular expressions need not be performed at any specific time.
In step 207, the patterns in the nested structure are examined to identify patterns that may be further cataloged. Some patterns of interest may be further broken down into sub-component patterns of interest. Each one of these patterns that is identified is decomposed into its constituent patterns. These constituent patterns are then cataloged in the nested structure for further examination.
In a step 208, the patterns in the nested structure are examined to identify links to other semistructured information. The links identified in step 208 point to other semistructured information that may include patterns of interest and attributes. The links are traversed to further semistructured information, which is examined for patterns of interest. If patterns of interest are discovered, they are cataloged in the nested structure. Typically, the links are Uniform Resource Locator (“URL”) addresses of web pages. However, the links may also point to a program which, when executed, will generate semistructured information output. In the latter case, the program is executed and the output is examined.
In a decisional step 210, it is determined whether there is more nested information to examine. If more nested information exists, then it is examined to identify attributes corresponding to fields in the relational database schema identified in step 206. Although steps 206, 207 and 208 are shown in a particular order, it is not required that these steps, like many other steps in the flowcharts, be performed in the order shown. Thus, the order shown in the flowcharts is to illustrate one embodiment and not intended to limit the invention.
Otherwise, if there is not more nested information to examine, then in a step 212, a definition of the semistructured information is provided, which serves as input to a program translator to build a parser. This definition of the semistructured information comprises regular expressions having attributes corresponding to fields of the relational database schema. The regular expressions specify locations of the attributes within the semistructured information that correspond to the relational database schema. Thus, the wrapper, produced by the program translator, includes a parser that is capable of parsing the semistructured information for attributes so that these attributes can be presented to a relational database system as tuples when the wrapper executes.
In one embodiment, the program translator is a compiler, which generates a parser by receiving the definition file as input and generating a program (i.e., the parser) for extracting attributes from the semistructured information that correspond to fields of the relational database schema to form tuples. In an alternative embodiment, the program translator is an interpreter, which generates a parser from the definition of the semistructured information and the semistructured information as inputs, by extracting attributes from the semistructured information that correspond to fields of the relational schema to form tuples.
2.1 Defining a Wrapper to Collect Information
FIG. 2C depicts a flowchart 203 showing process steps performed by a typical wrapper in traversing web pages to collect semistructured information according to a particular embodiment of the present invention. In a step 221, a variable root is set to be the root URL of a particular site. In a step 222, a URL is created for a target site of interest with a call to a url( ) function. For example, url(“https://www.company.com”) or, for forms that take a relative URL string and a context, url(“next-page.html”, root). Next, in a step 223, a web page corresponding to the URL created in step 222 is fetched with a getString(url) function. In decisional step 224, if step 223 failed to fetch a web page, the routine terminates and processing returns. Otherwise, in a step 225, a regular expression is matched against a string of input representing the web page fetched in step 223 as depicted in line 1 below:
1. Matches m = match(1, string, “pattern”);
2. while(m.next()) {/* do something*/}
The match( ) function returns a list of possible matches which can be cycled through with a next( ) function, as depicted in line 2 above. Each call to next( ) returns the next match of the pattern. Table 2 lists the parameters of the match( ) function, in accordance with a particular embodiment of the present invention:
TABLE 2
MATCH FUNCTION PARAMETERS
id: An integer identifier for this match. It is used two ways: first, compiled patterns are cached under this index, eliminating the need to re-compile patterns unless the pattern actually changes. Second, if the id is a negative number, debugging output is available for this match. The idea is that when debugging, instant feedback for one pattern match is available by adding a “-”.
input: Either a string or a URL. If it is a URL, a function getString() is applied to it to get the contents of the URL page.
pattern: A matching pattern which uses the pattern syntax of the Perl 5 programming language. Note that two backslashes are used wherever Perl uses one, because of the way Java defines strings. Also note that Perl variable interpolation (“pattern $var more pattern”) is not implemented in this embodiment; however, statements such as “pattern” + var + “more pattern” are permitted.
mask: (optional) A mask of options, formed by or-ing together a plurality of bits: I (ignore case), S (single line) and X (extended match). If mask is not specified, it defaults to a value of matchDefault, which is initially Matcher.I|Matcher.S.
There is also a match1 function, which performs a single match (i.e., does not iterate) and a matchIt function, which performs a single match, containing a single pair of parens, and returns a string matching the parens. Finally, a substitute function is available for making changes based on a regular expression match.
In a decisional step 226, the existence of a match is determined, and if no match is found, processing proceeds with step 228, obtaining the next page. Otherwise, if a match has occurred, then in a step 230, information is retrieved from the match. A group(i) function is provided to retrieve matches. A basic pattern for matching is described in line 4 below. Use of the group(i) function is depicted in line 6 below:
4 Matches m = match(1, string, “1:(.*?)2:(.*?)3”, I|S);
5 while(m.next()) {
6 String s1 = m.group(1), s2 = m.group(2);
7 /* do something */
8 }
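The loop of lines 4-8 can be approximated with the standard java.util.regex package, as sketched below. This is an illustrative assumption and not the match( ) implementation of the embodiment: the class MatchSketch and the method extractPairs are hypothetical, Pattern.CASE_INSENSITIVE and Pattern.DOTALL stand in for the I and S mask bits, and find( ) plays the role of next( ). Note that java.util.regex.Matcher is a different class from the Matcher referred to in Table 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchSketch {
    /** Returns, for each match of the pattern, the text captured by groups 1 and 2. */
    static List<String[]> extractPairs(String input, String pattern) {
        // CASE_INSENSITIVE and DOTALL approximate the I and S options.
        Pattern p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(input);
        List<String[]> pairs = new ArrayList<>();
        // Each successful find() advances to the next match, like next().
        while (m.find()) {
            pairs.add(new String[] { m.group(1), m.group(2) });
        }
        return pairs;
    }
}
```

Called with the pattern of line 4, "1:(.*?)2:(.*?)3", on an input containing two such sequences, the sketch returns two pairs, one per match, each holding the text between the markers.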
In a step 232, information retrieved from a match is added to the listing stack using a set( ) function, as depicted in line 13 below. In a step 234, the contents of the listing stack are placed in a table using the emit ( ) function, as depicted in line 14 below:
 9 Matches m = match(1, string, “Title:(.*?)Description:(.*?)Other”, I|S);
10 while(m.next()) {
11 mark();
12 String s1 = m.group(1), s2 = m.group(2);
13 set(LITERAL|JOB_TITLE, s1, AD_TEXT, s2);
14 emit();
15 reset();
16 }
Table 3 lists the parameters of the set( ) function:
TABLE 3
SET FUNCTION PARAMETERS
LITERAL|JOB_TITLE: Indicates that the job title should be processed literally, i.e., not normalized with a rule. If EXTRACT|JOB_TITLE is specified, a normalization occurs. If just JOB_TITLE is specified, then results depend on the value of a parameter extractDefault. It is initially set to EXTRACT, but is user modifiable.
A set( ) function adds information to the table that is the answer to a query. In this example, two sets of attribute/value pairs are specified: a job title is s1 and an ad text is s2. In this embodiment, from 1 to 6 attribute/value pairs are permitted. There is also a version of set( ) that takes an array of fields and an array of String values. The mark( ) and reset( ) functions are used to show the starting and ending, respectively, of a portion of a listing. Each match of the pattern represents a separate instance of information, here a job listing, the information for which must be kept separate from all others. This is accomplished by using mark( ) at the beginning of each listing, and reset( ) at the end. An emit( ) function specifies that collected data is to be passed to the table as part of the answer to the query. Typically, emit( ) appears at the end of an innermost loop. Emit( ) processing automatically extracts fields that have not been assigned within the course of wrapper processing.
The foregoing is provided as an example of a particular embodiment and not intended to be limiting of the invention to a particular order of processing. For example, the processing could have been described as in lines 17-23 below:
17 Matches m = match(1, string, “Title:(.*?)Description:(.*?)Other”,I|S);
18 while(m.next()) {
19 mark();
20 setFromMatch(m,LITERAL|JOB_TITLE, AD_TEXT);
21 emit();
22 reset();
23 }
3.0 The Listing Stack and the Execution Model
The foregoing example depicts a typical processing of a wrapper in a particular embodiment. A key concept is called a listing stack. Each call to set( ) adds a new listing to the listing stack. Each call to emit( ) causes the listing stack to be converted into a sequence of rows in a table, which sequence of rows serves to answer the original query.
3.1 The Listing Class and the Set Method
An individual listing consists of a number of fields (or attributes, or columns), each of which may be empty, may be filled with a single value, or may be filled with a vector of values. An empty column corresponds to “don't know”. A vector of values corresponds to “all of the above”. That is, adding the listing in line 24 below to the stack:
24 [A: a1, B: [b1, b2]]
using, for example, set(A,a1, B,new Object[ ]{b1, b2}), is equivalent to adding the two listings of lines 25-26 below:
25 [A: a1, B: b1]
26 [A: a1, B: b2]
If there are multiple vectors in a listing, the effect is to compute the cross product: a separate listing for every possible combination. That is, adding the following listing to the stack:
27 [A: [a1, a2], B: [b1, b2], C: c]
is equivalent to adding the following four listings:
28 [A: a1, B: b1, C: c]
29 [A: a1, B: b2, C: c]
30 [A: a2, B: b1, C: c]
31 [A: a2, B: b2, C: c]
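The expansion of vector-valued fields into a cross product of single-valued listings can be sketched in Java as follows. The sketch is illustrative only and makes assumptions not stated in the text: a listing is modeled as an insertion-ordered map from field name to a list of values, a single value is represented as a one-element list, and the class name ListingExpansionSketch and method expand are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ListingExpansionSketch {
    /** Expands one listing with vector-valued fields into single-valued listings. */
    static List<Map<String, String>> expand(Map<String, List<String>> listing) {
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(new LinkedHashMap<>());  // start from one row with no fields
        for (Map.Entry<String, List<String>> field : listing.entrySet()) {
            List<Map<String, String>> next = new ArrayList<>();
            // Every row built so far is combined with every value of this field.
            for (Map<String, String> row : rows) {
                for (String value : field.getValue()) {
                    Map<String, String> copy = new LinkedHashMap<>(row);
                    copy.put(field.getKey(), value);
                    next.add(copy);
                }
            }
            rows = next;
        }
        return rows;
    }
}
```

Applied to the listing of line 27, expand returns the four listings of lines 28-31, in that order.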
3.2 The Listing Stack Class and the Emit Method
A listing stack is a representation of a table having a sequence of rows. The emit function converts a listing stack into a table. Emit function processing can be described by the following rule:
Given a listing stack ls, a row r is a candidate member of the resulting table if and only if r can be formed by starting with an all-null row and then repeatedly selecting some row s from ls and filling in any null fields in r with the corresponding field values from s.
After all possible candidate rows r have been generated, remove any duplicates, as well as any rows that are subsumed by another row. (A row s subsumes a row r if they are the same except that in one or more fields r has null and s has a non-null value.) The resulting set of rows derived in this way is the information closure of ls. The rule is a little abstract, so let's look at some examples. First, the simple cases:
When the elements of the listing stack all have the same fields filled, like the four-element stack above, the resulting table is the same as the listing stack. When the elements of the listing stack all have different fields filled, the resulting table has a single row with all the fields combined. That is, the listing stack:
32 [A: a1, B: b1  ]
33 [   C: c1  ]
34 [    D: d1]
is equivalent to the single row table:
35 [A: a1, B: b1, C: c1, D: d1]
When there are several listings with one set of fields, and other listings with another set of fields, they combine as follows:
36 [A: a1, B: b1  ]
37 [A: a2, B: b2  ]
38 [  C: c1, D: d1]
39 [  C: c2, D: d2]
is equivalent to the four row table:
40 [A: a1, B: b1, C: c1, D: d1]
41 [A: a1, B: b1, C: c2, D: d2]
42 [A: a2, B: b2, C: c1, D: d1]
43 [A: a2, B: b2, C: c2, D: d2]
Note that it is not equivalent to the 16-row table that would result from the full cross product. If two attribute/value pairs appear together in a row (like [A: a1, B: b1]), they will stay together.
In every case, the creation of the resulting table can be described as “group together the similar listings, and form the cross product of the groups”, with the understanding that an empty column in a listing does not mean “no entries” (or else the cross product would be empty), rather it means “unknown entries”.
Now for the complicated case: when there are listings that overlap in the fields they have filled. First we'll limit ourselves to exactly two listings, called L1 and L2. The resulting table is then formed by filling in the empty columns of L1 with the corresponding columns of L2, and similarly filling the empty columns of L2 from L1. For example:
44 [A: a1, B: b1, C: c1  ]
45 [  B: b2, C: c2, D: d1]
is equivalent to the table:
46 [A: a1, B: b1, C: c1, D: d1]
47 [A: a1, B: b2, C: c2, D: d1]
In a more complex example, the following listing stack:
48 [N: Name1, F: Fax1]
49 [N: Name2, P: Phone2]
50 [A: Area]
51 [C: Sunnyvale, S: CA, T: Manager]
52 [C: Boston, S: MA, T: Programmer]
is equivalent to this table:
53 [N: Name1, F: Fax1, P: Phone2, A: Area, C: Sunnyvale, S: CA, T: Manager]
54 [N: Name2, F: Fax1, P: Phone2, A: Area, C: Sunnyvale, S: CA, T: Manager]
55 [N: Name1, F: Fax1, P: Phone2, A: Area, C: Boston, S: MA, T: Programmer]
56 [N: Name2, F: Fax1, P: Phone2, A: Area, C: Boston, S: MA, T: Programmer]
The processing steps are that the F, P, and A columns must be filled with the only possible value. Remaining is one group of two rows with values for C, S and T, and another group of two rows with values for N (and either F or P, but these have already been dealt with). Therefore, form the cross-product of the two N rows with the two C/S/T rows to get four rows, and then fill in the blanks with the only possible values.
This result may or may not be exactly what was desired. If the fax and phone numbers are for the office in general, and just happened to be listed near the two names, then this is correct. But if the fax is associated with one name and the phone with another, then it is necessary to set P to NOVALUE for Name1, and set F to NOVALUE for Name2. The result is the listing stack:
57 [N: Name1, F: Fax1, P: NOVALUE]
58 [N: Name2, F: NOVALUE, P: Phone2]
59 [A: Area]
60 [C: Sunnyvale, S: CA, T: Manager]
61 [C: Boston, S: MA, T: Programmer]
which is equivalent to this table:
62 [N: Name1, F: Fax1, P: NOVALUE, A: Area, C: Sunnyvale, S: CA, T: Manager]
63 [N: Name2, F: NOVALUE, P: Phone2, A: Area, C: Sunnyvale, S: CA, T: Manager]
64 [N: Name1, F: Fax1, P: NOVALUE, A: Area, C: Boston, S: MA, T: Programmer]
65 [N: Name2, F: NOVALUE, P: Phone2, A: Area, C: Boston, S: MA, T: Programmer]
In general, each listing comprises fields from a closed set—in other words, a complete set of fields for which no other values are permitted. For example, City/State/Zip go together, as do Job Title and Job category. The set of potential values can be restricted by setting all values not of interest to NOVALUE.
Further defaults are possible. In a particular embodiment, further defaults will be provided by allowing listings that are marked as “default” in some way. Then, for example, a listing could give “408” as the default area code, and fill in this value for listings that were missing an area code, but would not propagate the value to listings that did have an area code. In terms of the information closure algorithm, values from a default listing can be copied into a row r only if there is no other row that could fill that field, given the current state of r.
Information listings could be defined having associated priority or probability meta data.
3.2 Listing Stack to Table Algorithm
FIG. 2D depicts a flowchart 205 showing the steps for computing an information closure for a set of rows in a listing stack. In a step 240, a cross product is computed for the first row in the listing stack. In a step 242, the cross product computed in step 240 is added to a list of accepted rows. In a decisional step 244, a check is done for any further remaining rows in the listing stack. If a remaining row is found, then in a step 246, a selective cross product is computed on the remaining row and the list of accepted rows started in step 240. Otherwise, if no further rows remain, then in a step 248, the list of accepted rows is reduced by eliminating rows having identical fields. Finally, in a step 250, the resulting list of accepted rows is provided as the information closure. The pseudo code for this algorithm is depicted in the lines below:
66 function getRows(Listing stack ls) {
67   row1 = pop off the top element of ls
68   rows = the cross product of fields in row1
69   // This is just {row1} if row1 has no Vector-valued fields.
70   for each remaining row r in ls {
71     rows = selectiveCrossProduct(rows, r)
72   }
73   eliminate duplicates from rows
74   return rows
75 }
FIG. 2E depicts a flowchart 207 showing the component steps for step 246 of FIG. 2D, computing the selective cross product for a list of accepted rows and a remaining row. In a step 260, an interim result is initialized to empty. Next, in a decisional step 262, a determination is made whether there are any further rows in the list of accepted rows to process. If there are further rows to process, then processing of the next accepted row in the list continues in a step 264, in which a new row r′ is computed from the accepted row extended with non-empty fields of the remaining row passed to the routine. Then, in a step 266, a new row n′ is computed from the remaining row passed to the routine extended with non-empty fields of the accepted row. Next, in a step 268, rows r′ and n′ are added to the result, and processing continues with the next row in the accepted row list at step 262. If in step 262, there are no further rows to process, then in a step 270, the routine returns the result as the selective cross product. The pseudo code for this algorithm is depicted in the lines below:
76 function selectiveCrossProduct(rows, newRow) {
77 result = empty table
78 for each row r in rows {
79 r′ = r extended with the non-empty fields in newRow
80 n′ = newRow extended with the non-empty fields in r
81 add rows r′ and n′ to result
82 }
83 return result
84 }
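The pseudo code of getRows and selectiveCrossProduct can be rendered as runnable Java roughly as follows. The sketch rests on several assumptions not found in the text: rows are modeled as maps in which an absent key denotes an empty field, every field is single-valued (vector-valued fields are taken to have been expanded beforehand), the first element of the list is treated as the first row, and duplicates are dropped after each step by collecting rows in a set rather than once at the end. The class name InformationClosureSketch and the helper extend are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class InformationClosureSketch {
    /** Copies base, then fills only the fields that base leaves empty from filler. */
    static Map<String, String> extend(Map<String, String> base, Map<String, String> filler) {
        Map<String, String> out = new LinkedHashMap<>(base);
        for (Map.Entry<String, String> e : filler.entrySet()) {
            out.putIfAbsent(e.getKey(), e.getValue());
        }
        return out;
    }

    /** For each accepted row r, adds r' (r filled from newRow) and n' (newRow filled from r). */
    static Set<Map<String, String>> selectiveCrossProduct(
            Set<Map<String, String>> rows, Map<String, String> newRow) {
        Set<Map<String, String>> result = new LinkedHashSet<>();  // a set drops duplicates
        for (Map<String, String> r : rows) {
            result.add(extend(r, newRow));
            result.add(extend(newRow, r));
        }
        return result;
    }

    /** Folds every listing of the stack into the accepted rows. */
    static List<Map<String, String>> getRows(List<Map<String, String>> listingStack) {
        Set<Map<String, String>> rows = new LinkedHashSet<>();
        for (Map<String, String> listing : listingStack) {
            if (rows.isEmpty()) {
                rows.add(new LinkedHashMap<>(listing));  // first row is accepted as-is
            } else {
                rows = selectiveCrossProduct(rows, listing);
            }
        }
        return new ArrayList<>(rows);
    }
}
```

Applied to the two-listing stack of lines 44-45, this sketch yields the two rows of lines 46-47; applied to the four-listing stack of lines 36-39, it yields the four rows of lines 40-43.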
3.3 The mark( ) and reset( ) Methods
The mark( ) and reset( ) methods also operate on the listing stack. They keep track of progress during the traversal of a site, and release information that is no longer of interest, because the corresponding table entries have already been emitted. For example, a site may have some common information, such as a header page (perhaps the contact phone and fax numbers), followed by specific information for each job on separate pages. The corresponding wrapper would be:
85 // get and set the common information
86 set(PHONE, s1, FAX, s2);
87 while(...) {
88 mark();
89 // get and set the information for this job
90 set(TITLE, s3, LOCATION, s4, ...);
91 emit();
92 reset();
93 }
The reset( ) sets the listing stack back to the state it was in at the previous mark( ), thereby discarding the job just processed, but keeping the common information that came before the mark( ). In certain instances, omitting the mark( ) and reset( ) yields the same results because of the way the information closure algorithm is defined. However, it is more efficient to include mark( )/reset( ) when possible.
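The behavior of mark( ) and reset( ) on the listing stack can be sketched in Java as follows. The sketch is illustrative only: the class ListingStackSketch is hypothetical, listings are modeled as simple maps, and two details are assumptions rather than statements of the embodiment, namely that reset( ) consumes the most recent mark and that a reset( ) without a prior mark( ) empties the stack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

final class ListingStackSketch {
    private final List<Map<String, String>> listings = new ArrayList<>();
    private final Deque<Integer> marks = new ArrayDeque<>();

    /** Adds a listing to the top of the stack. */
    void set(Map<String, String> listing) {
        listings.add(listing);
    }

    /** Remembers the current depth of the stack. */
    void mark() {
        marks.push(listings.size());
    }

    /** Discards every listing added since the most recent mark(). */
    void reset() {
        int depth = marks.isEmpty() ? 0 : marks.pop();
        while (listings.size() > depth) {
            listings.remove(listings.size() - 1);
        }
    }

    /** Number of listings currently on the stack. */
    int size() {
        return listings.size();
    }
}
```

In the wrapper above, the common PHONE and FAX listing sits below the mark and therefore survives each reset( ), while the TITLE and LOCATION listing of each job is discarded once it has been emitted.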
The described method can be especially useful in the area of electronic commerce applications. For example, an online book or music store could be created using the methods of the invention to process semistructured information about music and book offerings. The person of ordinary skill in the art will appreciate that semistructured information about other products and services can be processed by embodiments according to the present invention.
4.0 Conclusion
In conclusion, the present invention provides a system for automated extraction of information from a plurality of semistructured information sources. An advantage of the present invention is that information is automatically propagated to related tuples. A further advantage of the present invention is that it enables the use of a relational database to organize information obtained from a semistructured source, such as web pages on the world wide web. A yet further advantage of the present invention is that it enables data mining to be accomplished using a relational database.
Other embodiments of the present invention and its individual components will become readily apparent to those skilled in the art from the foregoing detailed description. As will be realized, the invention is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the spirit and the scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive. It is therefore not intended that the invention be limited except as indicated by the appended claims.

Claims (30)

What is claimed is:
1. A system for extracting information from a semistructured information source comprising:
a listing stack for holding extracted information;
a means for matching at least one extractor to said semistructured information to return a list of potential matches;
a means for iterating through said list of potential matches;
a means for retrieving information from a particular match in said list of potential matches;
a means for adding a particular match into said listing stack;
means for computing a cross product of fields in a first row from said listing stack;
means for adding said cross product of the fields in said first row to a list of accepted rows;
means for computing a selective cross product from a remaining row r and the list of accepted rows, for each remaining row r in a plurality of remaining rows in said listing stack; and
means for removing from the list of accepted rows at least one of a plurality of rows having identical fields.
2. The system of claim 1 further comprising:
a means for indicating the start of a group of associated information in said listing stack.
3. The system of claim 2 further comprising:
a means for indicating the end of a group of associated information in said listing stack.
4. The system of claim 1 further comprising:
a table for holding information;
a means for converting information stored in said listing stack to a plurality of rows in said table.
5. The system of claim 1 further comprising:
a means for returning a string for a URL.
6. The system of claim 1 wherein said means for matching further comprises a mask for controlling the matching process.
7. The system of claim 1 wherein said semistructured information comprises real estate listings.
8. The system of claim 1 wherein said semistructured information comprises job listings.
9. The system of claim 1 wherein said semistructured information comprises items for purchase or sale.
10. A method for extracting information from a semistructured information source into a listing stack comprising:
examining said semistructured information to identify patterns of interest;
examining the patterns of interest to identify attributes that correspond to fields of a relational database schema;
generating a wrapper based upon the attributes;
matching at least one extractor to said semistructured information to return a list of potential matches using the wrapper;
iterating through said list of potential matches;
retrieving information from a particular match in said list of potential matches; and
adding a particular match into said listing stack.
11. The method of claim 10 further comprising:
indicating the start of a group of associated information in said listing stack.
12. The method of claim 10 further comprising:
indicating the end of a group of associated information in said listing stack.
13. The method of claim 10 further comprising:
converting information stored in said listing stack to a plurality of rows in a table.
14. The method of claim 10 further comprising:
returning a string for a URL.
15. The method of claim 10 wherein said matching further comprises:
controlling the matching process using a mask.
16. The method of claim 10 wherein said semistructured information comprises real estate listings.
17. The method of claim 10 wherein said semistructured information comprises job listings.
18. The method of claim 10 wherein said semistructured information comprises items for purchase or sale.
19. A computer programming product for extracting information from a semistructured information source and storing said information so extracted into a listing stack comprising:
code for examining said semistructured information to identify patterns of interest;
code for examining the patterns of interest to identify attributes that correspond to fields of a relational database schema;
code for generating a wrapper based upon the attributes;
code for matching at least one extractor to said semistructured information using the wrapper to return a list of potential matches;
code for iterating through said list of potential matches;
code for retrieving information from a particular match in said list of potential matches;
code for adding a particular match into said listing stack; and
a computer readable storage medium for holding said codes.
20. The computer programming product of claim 19 further comprising:
code for indicating the start of a group of associated information in said listing stack.
21. The computer programming product of claim 20 further comprising:
code for indicating the end of a group of associated information in said listing stack.
22. The computer programming product of claim 19 further comprising:
code for converting information stored in said listing stack to a plurality of rows in a table.
23. The computer programming product of claim 19 further comprising:
code for returning a string for a URL.
24. The computer programming product of claim 19 wherein said code for matching further comprises code for controlling the matching process under a mask.
25. The computer programming product of claim 19 wherein said semistructured information comprises real estate listings.
26. The computer programming product of claim 19 wherein said semistructured information comprises job listings.
27. The computer programming product of claim 19 wherein said semistructured information comprises items for purchase or sale.
28. The system of claim 1 wherein the at least one extractor is a regular expression.
29. The method of claim 10 wherein the at least one extractor is a regular expression.
30. The computer programming product of claim 19 wherein the at least one extractor is a regular expression.
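The steps recited in the claims can be sketched informally as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `START` and `END` markers, the `EXTRACTORS` dictionary (standing in for a wrapper), the sample `TEXT`, the rule that a blank line separates listings, and all function names are hypothetical. The sketch covers matching regular-expression extractors, iterating through matches, adding matches to a listing stack with group markers, converting the stack to rows, computing the cross product of the fields in a row, and removing rows with identical fields. It does not attempt the selective cross product of claim 1, which is not defined in this section.

```python
import re
from itertools import product

# Hypothetical markers for the start and end of a group of associated information.
START, END = "<start>", "<end>"

# Hypothetical source text; assumption: a blank line separates listings.
TEXT = """\
Job: Engineer
City: Seattle
City: Portland

Job: Designer
City: Seattle
"""

# Hypothetical extractors: one regular expression per field.
EXTRACTORS = {
    "job": re.compile(r"^Job:\s*(.+)$", re.M),
    "city": re.compile(r"^City:\s*(.+)$", re.M),
}


def extract(text, extractors):
    """Match each extractor and add every match to a listing stack."""
    stack = []
    for block in text.strip().split("\n\n"):
        stack.append(START)
        for field, pattern in extractors.items():
            for match in pattern.finditer(block):  # iterate potential matches
                stack.append((field, match.group(1)))
        stack.append(END)
    return stack


def stack_to_rows(stack):
    """Convert each marked group into a row mapping field -> list of values."""
    rows, current = [], None
    for item in stack:
        if item == START:
            current = {}
        elif item == END:
            rows.append(current)
            current = None
        else:
            field, value = item
            current.setdefault(field, []).append(value)
    return rows


def cross_product(row):
    """Expand a row with multi-valued fields into one flat tuple per combination."""
    fields = sorted(row)
    return [
        tuple(zip(fields, combo))
        for combo in product(*(row[f] for f in fields))
    ]


def dedupe(rows):
    """Remove rows having identical fields, keeping the first occurrence."""
    return list(dict.fromkeys(rows))


if __name__ == "__main__":
    rows = stack_to_rows(extract(TEXT, EXTRACTORS))
    accepted = dedupe([flat for row in rows for flat in cross_product(row)])
    for flat in accepted:
        print(dict(flat))
```

In this sketch the first listing has two city values, so its cross product yields two flat rows that share the same job value.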
US10/000,743 1997-11-21 2001-11-30 Method and apparatus for creating extractors, field information objects and inheritance hierarchies in a framework for retrieving semistructured information Expired - Lifetime US6571243B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/000,743 US6571243B2 (en) 1997-11-21 2001-11-30 Method and apparatus for creating extractors, field information objects and inheritance hierarchies in a framework for retrieving semistructured information

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US6612597P 1997-11-21 1997-11-21
US19642198A 1998-11-19 1998-11-19
US10/000,743 US6571243B2 (en) 1997-11-21 2001-11-30 Method and apparatus for creating extractors, field information objects and inheritance hierarchies in a framework for retrieving semistructured information

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US19642198A Continuation 1997-11-21 1998-11-19

Publications (2)

Publication Number Publication Date
US20020062312A1 US20020062312A1 (en) 2002-05-23
US6571243B2 true US6571243B2 (en) 2003-05-27

Family

ID=26746388

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/000,235 Expired - Lifetime US6539378B2 (en) 1997-11-21 2001-11-30 Method for creating an information closure model
US10/000,743 Expired - Lifetime US6571243B2 (en) 1997-11-21 2001-11-30 Method and apparatus for creating extractors, field information objects and inheritance hierarchies in a framework for retrieving semistructured information

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US10/000,235 Expired - Lifetime US6539378B2 (en) 1997-11-21 2001-11-30 Method for creating an information closure model

Country Status (1)

Country Link
US (2) US6539378B2 (en)

US20060206448A1 (en) * 2005-03-11 2006-09-14 Adam Hyder System and method for improved job seeking
US20060206517A1 (en) * 2005-03-11 2006-09-14 Yahoo! Inc. System and method for listing administration
US20060212466A1 (en) * 2005-03-11 2006-09-21 Adam Hyder Job categorization system and method
US8135704B2 (en) 2005-03-11 2012-03-13 Yahoo! Inc. System and method for listing data acquisition
US7707203B2 (en) * 2005-03-11 2010-04-27 Yahoo! Inc. Job seeking system and method for managing job listings
US7702674B2 (en) 2005-03-11 2010-04-20 Yahoo! Inc. Job categorization system and method
US7680855B2 (en) 2005-03-11 2010-03-16 Yahoo! Inc. System and method for managing listings
US7680854B2 (en) 2005-03-11 2010-03-16 Yahoo! Inc. System and method for improved job seeking
US7765183B2 (en) * 2005-04-23 2010-07-27 Cisco Technology, Inc. Hierarchical tree of deterministic finite automata
US20060242123A1 (en) * 2005-04-23 2006-10-26 Cisco Technology, Inc. A California Corporation Hierarchical tree of deterministic finite automata
US8977618B2 (en) 2005-05-23 2015-03-10 Monster Worldwide, Inc. Intelligent job matching system and method
US8375067B2 (en) 2005-05-23 2013-02-12 Monster Worldwide, Inc. Intelligent job matching system and method including negative filtration
US20060265267A1 (en) * 2005-05-23 2006-11-23 Changsheng Chen Intelligent job matching system and method
US8433713B2 (en) 2005-05-23 2013-04-30 Monster Worldwide, Inc. Intelligent job matching system and method
US8527510B2 (en) 2005-05-23 2013-09-03 Monster Worldwide, Inc. Intelligent job matching system and method
US20060265270A1 (en) * 2005-05-23 2006-11-23 Adam Hyder Intelligent job matching system and method
US9959525B2 (en) 2005-05-23 2018-05-01 Monster Worldwide, Inc. Intelligent job matching system and method
US20060265266A1 (en) * 2005-05-23 2006-11-23 Changsheng Chen Intelligent job matching system and method
US20070055664A1 (en) * 2005-09-05 2007-03-08 Cisco Technology, Inc. Pipeline sequential regular expression matching
US7499941B2 (en) 2005-09-05 2009-03-03 Cisco Technology, Inc. Pipeline regular expression matching
US7584159B1 (en) 2005-10-31 2009-09-01 Amazon Technologies, Inc. Strategies for providing novel recommendations
US8571930B1 (en) 2005-10-31 2013-10-29 A9.Com, Inc. Strategies for determining the value of advertisements using randomized performance estimates
US7542951B1 (en) 2005-10-31 2009-06-02 Amazon Technologies, Inc. Strategies for providing diverse recommendations
US10181116B1 (en) 2006-01-09 2019-01-15 Monster Worldwide, Inc. Apparatuses, systems and methods for data entry correlation
US7958164B2 (en) 2006-02-16 2011-06-07 Microsoft Corporation Visual design of annotated regular expression
US7860881B2 (en) * 2006-03-09 2010-12-28 Microsoft Corporation Data parsing with annotated patterns
US20070214134A1 (en) * 2006-03-09 2007-09-13 Microsoft Corporation Data parsing with annotated patterns
US10387839B2 (en) 2006-03-31 2019-08-20 Monster Worldwide, Inc. Apparatuses, methods and systems for automated online data submission
US20080195646A1 (en) * 2007-02-12 2008-08-14 Microsoft Corporation Self-describing web data storage model
US9830575B1 (en) 2008-04-21 2017-11-28 Monster Worldwide, Inc. Apparatuses, methods and systems for advancement path taxonomy
US9779390B1 (en) 2008-04-21 2017-10-03 Monster Worldwide, Inc. Apparatuses, methods and systems for advancement path benchmarking
US10387837B1 (en) 2008-04-21 2019-08-20 Monster Worldwide, Inc. Apparatuses, methods and systems for career path advancement structuring
US20140181633A1 (en) * 2012-12-20 2014-06-26 Stanley Mo Method and apparatus for metadata directed dynamic and personal data curation
US11995613B2 (en) 2014-05-13 2024-05-28 Monster Worldwide, Inc. Search extraction matching, draw attention-fit modality, application morphing, and informed apply apparatuses, methods and systems

Also Published As

Publication number Publication date
US20020062312A1 (en) 2002-05-23
US6539378B2 (en) 2003-03-25
US20020062222A1 (en) 2002-05-23

Similar Documents

Publication Publication Date Title
US6571243B2 (en) Method and apparatus for creating extractors, field information objects and inheritance hierarchies in a framework for retrieving semistructured information
US5963949A (en) Method for data gathering around forms and search barriers
US6102969A (en) Method and system using information written in a wrapper description language to execute query on a network
US6292802B1 (en) Methods and system for using web browser to search large collections of documents
US6954778B2 (en) System and method for accessing directory service via an HTTP URL
US7072984B1 (en) System and method for accessing customized information over the internet using a browser for a plurality of electronic devices
US6665662B1 (en) Query translation system for retrieving business vocabulary terms
US6353830B1 (en) Graphical interface for object-relational query builder
JP3548098B2 (en) Method and system for providing a native language query service
US6571232B1 (en) System and method for browsing database schema information
US6658624B1 (en) Method and system for processing documents controlled by active documents with embedded instructions
US6609121B1 (en) Lightweight directory access protocol interface to directory assistance systems
US6671681B1 (en) System and technique for suggesting alternate query expressions based on prior user selections and their query strings
US5956720A (en) Method and apparatus for web site management
US20090125809A1 (en) System and Method for Adapting Information Content for an Electronic Device
US20020042789A1 (en) Internet search engine with interactive search criteria construction
US20040068498A1 (en) Parallel tree searches for matching multiple, hierarchical data structures
US20040205076A1 (en) System and method to automate the management of hypertext link information in a Web site
US6397206B1 (en) Optimizing fixed, static query or service selection and execution based on working set hints and query signatures
US20040220946A1 (en) Techniques for transferring a serialized image of XML data
WO1998002813A9 (en) Object-oriented method and apparatus for information delivery
WO1998002813A1 (en) Object-oriented method and apparatus for information delivery
JP2005182835A (en) Method of creating data server for different kind of data source
WO2002087135A2 (en) System and method for adapting information content for an electronic device
US20040049495A1 (en) System and method for automatically generating general queries

Legal Events

Date Code Title Description
STCF Information on status: patent grant

Free format text: PATENTED CASE

AS Assignment

Owner name: A9.COM, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:AMAZON.COM, INC.;REEL/FRAME:015065/0818

Effective date: 20040810

AS Assignment

Owner name: A9.COM, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:AMAZON.COM, INC.;REEL/FRAME:015341/0120

Effective date: 20040810

FPAY Fee payment

Year of fee payment: 4

AS Assignment

Owner name: AMAZON.COM, INC., WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:A9.COM, INC.;REEL/FRAME:018627/0142

Effective date: 20061213

FPAY Fee payment

Year of fee payment: 8

FPAY Fee payment

Year of fee payment: 12

AS Assignment

Owner name: AMAZON TECHNOLOGIES, INC., WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:AMAZON.COM, INC.;REEL/FRAME:034742/0773

Effective date: 20150107