GitHub - Chandramani/MR_Design


This project implements the design pattern algorithms from "Data-Intensive Text Processing with MapReduce" by Jimmy Lin and Chris Dyer (University of Maryland, College Park).
WordCountInMapperCombining.java

IMCDP -> In-Memory Combiner Design Pattern
Implementation: global IMCDP

In the global IMCDP approach, instead of using an associative array per key-value input, we use one associative array per mapper.
The global IMCDP approach can run into a memory limit: if the associative array grows until memory runs out, the mapper task will crash.
To avoid this, the implementation flushes the map regularly. The reducer is the Hadoop library class IntSumReducer, imported into the Java file; no custom reducer is required.
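
The flush-on-threshold idea described above can be sketched in plain Java without the Hadoop API. This is only an illustration, not the code in WordCountInMapperCombining.java: the class name InMapperCombinerSketch, the flushThreshold parameter, and the emitted list (standing in for the mapper's output) are all assumptions made for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hadoop-free sketch of in-mapper combining with a periodic flush.
// All names here are hypothetical and chosen for illustration only.
class InMapperCombinerSketch {
    // Assumed policy: flush once the map holds this many distinct words.
    private final int flushThreshold;
    // One associative array for the whole mapper, not one per input record.
    private final Map<String, Integer> counts = new HashMap<>();
    // Stands in for the (word, partial count) pairs a real mapper would write out.
    final List<Map.Entry<String, Integer>> emitted = new ArrayList<>();

    InMapperCombinerSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // Plays the role of map(): called once per input line.
    void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // blank line
            }
            counts.merge(word, 1, Integer::sum);
        }
        if (counts.size() >= flushThreshold) {
            flush(); // keep memory bounded
        }
    }

    // Plays the role of cleanup(): emit whatever is still buffered.
    void cleanup() {
        flush();
    }

    private void flush() {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            emitted.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        counts.clear();
    }
}
```

Because a flush can emit several partial counts for the same word, the reducer still has to sum them, which is why a plain summing reducer is enough.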

Word Co-occurrence

We focus on the problem of building word co-occurrence matrices from large corpora, a common task in corpus linguistics and statistical natural language processing. It provides the starting point for many other algorithms, e.g., computing statistics such as pointwise mutual information, or unsupervised sense clustering.

ComputeCooccurrenceMatrixPairs.java

The Pairs Design Pattern

The mapper processes each input document and emits intermediate key-value pairs with each co-occurring word pair as the key and the integer one (i.e., the count) as the value. This is straightforwardly accomplished by two nested loops: the outer loop iterates over all words (the left element in the pair), and the inner loop iterates over all neighbors of the first word (the right element in the pair). The neighbors of a word can either be defined in terms of a sliding window or some other contextual unit such as a sentence.
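
The two nested loops can be sketched in plain Java as follows. This is a Hadoop-free illustration rather than the code in ComputeCooccurrenceMatrixPairs.java: the class PairsSketch, the method mapPairs, the "left,right" string encoding of a pair, and the symmetric window parameter are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hadoop-free sketch of the pairs mapper. Names are hypothetical.
class PairsSketch {
    // Returns one "left,right" key per co-occurrence; each key stands for
    // an emitted (pair, 1) record. Neighbors are the words at most
    // `window` positions away on either side (an assumed definition).
    static List<String> mapPairs(String line, int window) {
        String[] words = line.trim().split("\\s+");
        List<String> pairs = new ArrayList<>();
        // Outer loop: the left element of the pair.
        for (int i = 0; i < words.length; i++) {
            if (words[i].isEmpty()) {
                continue; // blank line
            }
            int lo = Math.max(0, i - window);
            int hi = Math.min(words.length - 1, i + window);
            // Inner loop: every neighbor of words[i] is a right element.
            for (int j = lo; j <= hi; j++) {
                if (j == i) {
                    continue; // a word is not its own neighbor
                }
                pairs.add(words[i] + "," + words[j]);
            }
        }
        return pairs;
    }
}
```

For instance, mapPairs("a b c", 1) yields the keys a,b then b,a then b,c then c,b. Summing the ones per key, as a word-count reducer would, gives the co-occurrence counts.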

ComputeCooccurrenceMatrixStripes.java

The Stripes Design Pattern

Like the pairs approach, co-occurring word pairs are generated by two nested loops. However, the major difference is that instead of emitting intermediate key-value pairs for each co-occurring word pair, co-occurrence information is first stored in an associative array. The mapper emits key-value pairs with words as keys and corresponding associative arrays as values, where each associative array encodes the co-occurrence counts of the neighbors of a particular word (i.e., its context).
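
A plain Java sketch of the stripes idea is shown here. It is not the code in ComputeCooccurrenceMatrixStripes.java: the class StripesSketch, the methods mapStripes and sumStripes, the symmetric window parameter, and the choice to build one stripe per distinct word in a line are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hadoop-free sketch of the stripes pattern. Names are hypothetical.
class StripesSketch {
    // Builds, for each word in the line, an associative array ("stripe")
    // mapping each neighbor to its co-occurrence count. A real mapper
    // would emit each (word, stripe) entry as a key-value pair.
    static Map<String, Map<String, Integer>> mapStripes(String line, int window) {
        String[] words = line.trim().split("\\s+");
        Map<String, Map<String, Integer>> stripes = new HashMap<>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].isEmpty()) {
                continue; // blank line
            }
            Map<String, Integer> stripe =
                    stripes.computeIfAbsent(words[i], k -> new HashMap<>());
            int lo = Math.max(0, i - window);
            int hi = Math.min(words.length - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (j != i) {
                    stripe.merge(words[j], 1, Integer::sum);
                }
            }
        }
        return stripes;
    }

    // Reducer-side step: element-wise sum of two stripes for the same word.
    static Map<String, Integer> sumStripes(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> out = new HashMap<>(a);
        b.forEach((k, v) -> out.merge(k, v, Integer::sum));
        return out;
    }
}
```

Compared with pairs, this produces fewer but larger intermediate records, since all neighbors of a word travel together in one value.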

Chandramani/MR_Design_Patterns
