
Web mining

Web Mining
Robert Jesuraj K
Subject: Data and text mining
Department of Computer Science and Engineering (Computational Linguistics)
Government Engineering College Sreekrishnapuram, Palakkad
April 18, 2012

CONTENTS
1 Introduction
2 Web content mining
   Crawlers
3 Web Search mining
   PageRank
   Clever
4 Web Usage mining
   Preprocessing
   Pattern Discovery
   Pattern Analysis

1 Introduction

Size of the World Wide Web
- 1999: 350 million pages
- growth rate: about 1 million pages a day

Web mining
- data mining applied to the World Wide Web
- mining of the data related to the Web

Classification of Web data
- Content: the actual Web page
- Intrapage structure: includes the HTML or XML code of the page
- Interpage structure: the actual linkage structure between Web pages
- Usage data: describe how Web pages are accessed by visitors
- User profiles: demographic and registration information obtained about users; can also include information found in cookies

Classes in Web mining tasks (taxonomy figure)

Classes explained:
- Web content mining: contents include text as well as graphics data
  - Web page content mining: traditional searching of Web pages via content
  - Search result mining: further search of pages found from a previous search
- Web structure mining: information obtained from the actual organization of pages on the Web
- Web usage mining: uses the logs of Web access
  - General access pattern tracking: looks at the history of Web pages visited
  - Customized usage tracking: usage may be general or may be targeted to specific uses or users

2 Web content mining

Most search engines are keyword-based; Web content mining goes beyond this basic IR technology.

How can traditional search engine techniques be improved?
- concept hierarchies
- synonyms
- user profiles
- analysing the links between pages

Web crawlers: search engines must have crawlers that traverse the Web and gather information. Data mining techniques can provide efficiency, effectiveness, and scalability in searching.

Taxonomy of Web content mining: two types
- Agent-based: software systems (agents) perform the content mining (intelligent search agents, information filtering, personalized Web agents); these use not only keywords but also content and user profile data.
- Database approach: views Web data as belonging to a database; multilevel databases and query languages that target the Web.

Techniques to summarize the information found
- Inverted file index: search engines retrieve relevant information using keyword-based techniques.
- A problem with retrieving data from Web documents is that they are not structured as in traditional databases: there is no schema or division into attributes.
- Web pages are defined using the HyperText Markup Language (HTML). HTML is only semistructured, which makes querying more difficult.
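The inverted file index mentioned above can be sketched minimally. This is an illustrative stand-in, not a search-engine implementation: the `pages` dictionary and naive whitespace tokenization are assumptions, and real indexes also store positions, term weights, and use compression.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each keyword to the set of page ids containing it.

    `pages` is a toy {page_id: text} dict standing in for crawled
    documents; tokenization here is plain lowercase whitespace split.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

pages = {
    "p1": "web mining applies data mining to the web",
    "p2": "crawlers gather web pages",
}
index = build_inverted_index(pages)
# keyword lookup is now a dictionary access:
# index["web"] -> {"p1", "p2"}
```

Keyword-based retrieval then reduces to set operations over the posting sets, which is what makes the structure efficient despite the lack of a schema.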
The Extensible Markup Language (XML), in contrast, yields structured documents and facilitates easier mining.

Crawlers

- Robots (spiders, crawlers) are programs that traverse the hypertext structure of the Web.
- Seed URL: the starting URL of the crawl.
- URLs to be visited are kept in a queue data structure.
- Information is collected from each page, e.g. keywords are extracted and stored in indices for users.
- Traditional crawlers visit the entire Web and replace the index.
- Periodic crawlers visit a certain number of pages and then stop, build an index, and later update that index; this needs less human intervention.
- Incremental crawlers selectively search the Web and update the index incrementally.
- Focused crawlers visit only pages related to a particular focus topic.

Content mining is a type of text mining.

Focused crawlers
- A focused crawler only visits the links from a page if that page is determined to be relevant.
- The classifier is static after the learning phase.
- Components:
  - Classifier: assigns a relevance score to each page based on the crawl topic; classifiers are used to relate documents to the topic and also to determine how useful outgoing links are.
  - Distiller: identifies hub pages; hub pages contain links to many relevant pages and must be visited even if they do not have a high relevance score.
- The crawler visits pages based on the classifier and distiller scores.
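The queue-based traversal from a seed URL described above can be sketched as follows. The `links` dictionary is a toy link graph standing in for real HTTP fetches and link extraction; a focused crawler would additionally score each dequeued page before following its links.

```python
from collections import deque

def crawl(seed, links, max_pages=10):
    """Breadth-first crawl starting from a seed URL.

    `links` is an assumed {url: [outgoing urls]} map standing in
    for fetching a page and extracting its links.
    """
    queue = deque([seed])            # URLs waiting to be visited
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue                 # skip already-visited pages
        visited.append(url)          # "index" the page here
        for out in links.get(url, []):
            if out not in visited:
                queue.append(out)    # enqueue newly found links
    return visited

links = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A"]}
# crawl("A", links) visits pages in breadth-first order:
# ['A', 'B', 'C', 'D']
```

Swapping the FIFO queue for a priority queue ordered by classifier/distiller scores turns this generic crawler into the focused variant.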
Hierarchical classification tree
- The user browses the Web and identifies documents of interest.
- Nodes marked "good" indicate that the node has document(s) of interest associated with it.
- These documents are then used as the seed documents to begin the focused crawling.
- Each document is classified into a leaf node of the taxonomy tree.

Focus
- Hard focus: follows links from a page only if an ancestor of that page's node has been marked good.
- Soft focus: identifies the probability that a page d is relevant as

    R(d) = sum over c with good(c) of P(c|d)        (1)

  where c is a node in the tree and good(c) indicates that c has been labeled to be of interest.
- The priority of visiting a page not yet visited is the maximum relevance of the visited pages that point to it.

Hierarchical classification approach
- Uses a hierarchical taxonomy and a naive Bayes classifier:

    P(c_i | d) = P(c_{i-1} | d) P(c_i | c_{i-1}, d)        (2)

- Using Bayes rule:

    P(c_i | c_{i-1}, d) = P(c_i | c_{i-1}) P(d | c_i) / sum over s of P(d | s)        (3)

  where s ranges over the siblings of c_i, and P(d | c_i) is found using a Bernoulli model.

Context focused crawler (CFC)
- Crawling happens in two steps:
  1 construction of context graphs and classifiers, using the seed documents as training samples
  2 crawling, performed using the classifiers to guide it
- Context graphs: the root represents a seed document, and nodes at each level link to nodes at the next higher level; a node at depth n represents a document with a path of length n to the seed document (an indirect link).
- CFC uses the TF-IDF technique.
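The soft-focus relevance of Eq. (1) and the visiting-priority rule can be sketched as below. The per-node probabilities are assumed inputs here; in the real system they come from the trained hierarchical classifier.

```python
def soft_focus_relevance(page_probs, good):
    """R(d) = sum of P(c|d) over taxonomy nodes c marked good (Eq. 1).

    `page_probs` is an assumed {node: P(c|d)} map for one page d,
    as a trained classifier would produce; `good` is the set of
    nodes the user marked as interesting.
    """
    return sum(p for c, p in page_probs.items() if c in good)

def crawl_priority(linking_page_relevances):
    """Priority of an unvisited page: the maximum relevance among the
    already-visited pages that point to it."""
    return max(linking_page_relevances)

# illustrative classifier output for one page
probs = {"sports": 0.6, "football": 0.3, "news": 0.1}
r = soft_focus_relevance(probs, good={"sports", "football"})
# r sums only the "good" nodes: 0.6 + 0.3
```

Pages with relevance above a threshold would have their out-links enqueued, each at a priority given by `crawl_priority`.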
Context graph (figure)

Harvest System
- Based on the use of caching, indexing, and crawling.
- Harvest uses gatherers and brokers:
  - a gatherer obtains information for indexing from an Internet service provider
  - a broker provides the index and the query interface
- Harvest gatherers use the Essence system to collect data; Essence classifies documents using a semantic index.

Virtual Web View
- A Multiple Layered DataBase (MLDB) is built on top of the Web.
- Each layer of the database is more generalized (and smaller) and more centralized than the one beneath it.
- The upper layers of the MLDB are structured and can be accessed with SQL-type queries.
- Translation tools convert Web documents to XML; extraction tools extract the desired information and place it in the first layer of the MLDB.
- Higher levels contain more summarized data, obtained through generalization of the lower levels.

WebML
- WebML is a Web data mining query language: documents are accessed using data mining operations and lists of keywords.
- Four primitive operations based on concept hierarchies:
  1 COVERS: one concept covers another if it is higher (an ancestor) in the hierarchy
  2 COVERED BY: the reverse of COVERS
  3 LIKE: the concept is a synonym
  4 CLOSE TO: a sibling in the hierarchy

WebML example
This query finds all documents at the level of "www.engr.smu.edu" that have a keyword that covers the keyword cat:

    SELECT *
    FROM document IN "www.engr.smu.edu"
    WHERE ONE OF keywords COVERS "cat"

Personalization
- Web access or contents are tuned to better fit the desires of each user.
- Manual techniques identify a user's preferences based on profiles or demographics.
- Collaborative filtering identifies preferences based on ratings from similar users.
- Content-based filtering retrieves pages based on the similarity between pages and user profiles.
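At its core, WebML's COVERS primitive is an ancestor test on the concept hierarchy. A minimal sketch, with the hierarchy represented as an assumed child-to-parent map (WebML itself operates on documents and keywords, not on this toy structure):

```python
def covers(hierarchy, a, b):
    """True if concept `a` COVERS concept `b`, i.e. a is an ancestor
    of b in the concept hierarchy.

    `hierarchy` is an assumed {child: parent} map; walking parents
    upward from b either reaches a (covered) or the root (not).
    """
    node = hierarchy.get(b)
    while node is not None:
        if node == a:
            return True
        node = hierarchy.get(node)   # climb one level
    return False

# toy hierarchy: animal > mammal > cat
h = {"cat": "mammal", "mammal": "animal"}
# covers(h, "animal", "cat") -> True (ancestor)
# covers(h, "cat", "animal") -> False (COVERED BY is the reverse)
```

LIKE and CLOSE TO would similarly be lookups against a synonym table and a sibling test on the same hierarchy.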
3 Web Search mining

Web search mining is used to classify Web pages and to create similarity measures between documents.

PageRank (Google)
- Increases the effectiveness of search engines and improves their efficiency.
- Used to measure the importance of a page and to prioritize the pages returned.
- The PageRank value of a page is calculated from the number of pages that point to it: a measure based on the number of backlinks of a page.

PageRank
Given a page p, let B_p be the set of pages that point to p, and F_p the set of links out of p. Then

    PR(p) = c * sum over q in B_p of PR(q) / N_q        (4)

where N_q = |F_q| and c is a value between 0 and 1 used for normalization.

PageRank problem: rank sink
- A rank sink occurs when a cyclic reference makes the PR values of the pages in the cycle grow. The fix adds a source of rank:

    PR'(p) = c * sum over q in B_p of PR(q) / N_q + c E(v)        (5)

  where E(v) is a vector that adds artificial links.
- PageRank does not count all links the same: values are normalized by the number of links on the page.
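Equations (4) and (5) are usually evaluated iteratively. A minimal sketch on a toy link graph, using a uniform source term in place of E(v) (a common choice, assumed here) so rank sinks cannot accumulate:

```python
def pagerank(links, c=0.85, iters=50):
    """Iterate PR'(p) = c * sum_{q in B_p} PR(q)/N_q + (1-c)/n.

    `links` is an assumed {page: [outgoing pages]} dict; the
    uniform (1-c)/n term plays the role of the E(v) vector.
    """
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # uniform start
    for _ in range(iters):
        new = {}
        for p in pages:
            backlink_sum = sum(               # sum over B_p
                pr[q] / len(links[q]) for q in pages if p in links[q]
            )
            new[p] = c * backlink_sum + (1 - c) / n
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
# C gathers rank from both A and B, so it scores highest
```

Because every page distributes its full rank and the source term is uniform, the scores stay normalized (they sum to 1), matching the role of c as a normalization constant in Eq. (4).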
Clever
- A system developed at IBM that finds both authoritative pages and hubs.
- Authoritative pages: highly important pages; the best sources for the requested information.
- Hub pages: pages that contain links to highly important pages.
- The search is done by finding the best hubs and authorities, creating weights for each.

HITS
- Hyperlink-Induced Topic Search (HITS) finds hubs and authoritative pages:
  1 retrieve relevant pages using keywords (the query)
  2 hub and authority measures are computed and returned for these pages

4 Web Usage mining

- Web usage mining performs mining on Web usage data, or Web logs.
- A Web log is a listing of page reference data, also called clickstream data because each entry corresponds to a mouse click.
- Logs may be maintained on the client or the server side.
- Example: the sequence of visited pages <A, B, A, C>.

User profile
- A profile of a user, created from the sequence of pages visited, enables personalization:
  - predicting desired pages
  - improving the overall performance of future accesses
  - improving the design of Web pages and guiding modifications to sites
  - targeted advertisement
  - statistics on Web page usage

Activities in Web usage mining
1 Preprocessing: reformatting the Web log data before processing
2 Pattern discovery: finding hidden patterns within the log data
3 Pattern analysis: viewing and interpreting the results of the discovery activities

Issues in Web log mining
1 Exact identification of a user is not possible from Web logs (proxy servers, client-side caching, firewalls).
2 It is difficult to recover the exact sequence of pages as visited by the user.
3 Legal issues.

Preprocessing
- The Web usage log is reformatted and cleansed.
- Steps: cleansing, user identification, session identification, path completion, and formatting.
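The cleansing step, which turns raw log lines into <user, page, timestamp> triples and drops image requests, can be sketched as follows. The whitespace-separated line format is an assumption for illustration; real server logs (e.g. Common Log Format) need fuller parsing.

```python
def parse_log(lines):
    """Cleanse a toy access log into <user, page, timestamp> triples.

    Entries for image files (gif, jpg, png) are dropped, as the
    preprocessing step requires. The three-field line format is an
    assumed simplification of a real log format.
    """
    triples = []
    for line in lines:
        user, page, ts = line.split()
        if page.endswith((".gif", ".jpg", ".png")):
            continue                       # strip image requests
        triples.append((user, page, int(ts)))
    return triples

log = [
    "u1 /index.html 100",
    "u1 /logo.gif 101",
    "u2 /about.html 102",
]
# parse_log(log) ->
# [('u1', '/index.html', 100), ('u2', '/about.html', 102)]
```

Later steps (user and session identification, path completion) would group these triples by user and time gap before pattern discovery runs.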
Definition. Let P be a set of literals, called pages or clicks, and U be a set of users. A log is a set of triples {<u1,p1,t1>, ..., <un,pn,tn>} where ui belongs to U, pi belongs to P, and ti is a timestamp.

Preprocessing, continued
- Source and destination sites are listed as a URL or an IP address: the source site gives the user ID, the destination site the page ID.
- Web browsing information is also included.
- Log entries for image files (gif, jpg, png) are removed.
- Session logins and logouts are noted for each page; cookie information is used.
- Path completion: adding missing pages. E.g., if a user visits page A and then page C, but there is no link from A to C, then at least one page in this path is missing; algorithms are used to infer these missing pages.

Data structures
- Trie: common prefixes of strings are shared.
- Suffix tree: a compressed version of the trie. Properties:
  - each internal node except the root has at least two children
  - each edge represents a nonempty subsequence
  - subsequences represented by sibling edges begin with different symbols
- Example string set: {ABOUT, CAT, CATEGORY}

Sample tries (figure)

Pattern Discovery
Finding the traversal patterns of Web pages by users:
- association rules
- duplicate page references
- sequential patterns

Pattern Analysis
- Identify the useful information among the discovered patterns.
- MINT query language: a g-sequence is a vector consisting of a page-visiting sequence and wildcards (e.g. b*c, some number of b's followed by c).
- Data from short-time visitors are neglected.
- Web pages are abstracted to concepts (concept hierarchies).
- In e-commerce, the g-sequences of two patterns are compared if their first n pages are the same.

...........THANK YOU...........