GitHub - shaobohou/madios: My implementation of the ADIOS algorithm

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 48 Commits
maths		maths
test		test
utils		utils
.gitignore		.gitignore
ADIOSUtils.h		ADIOSUtils.h
BasicSymbol.cpp		BasicSymbol.cpp
BasicSymbol.h		BasicSymbol.h
EquivalenceClass.cpp		EquivalenceClass.cpp
EquivalenceClass.h		EquivalenceClass.h
LexiconUnit.h		LexiconUnit.h
License.txt		License.txt
ModifiedADIOS.pro		ModifiedADIOS.pro
ParseTree.h		ParseTree.h
RDSGraph.cpp		RDSGraph.cpp
RDSGraph.h		RDSGraph.h
RDSNode.cpp		RDSNode.cpp
RDSNode.h		RDSNode.h
README		README
SearchPath.cpp		SearchPath.cpp
SearchPath.h		SearchPath.h
SignificantPattern.cpp		SignificantPattern.cpp
SignificantPattern.h		SignificantPattern.h
SpecialLexicons.cpp		SpecialLexicons.cpp
SpecialLexicons.h		SpecialLexicons.h
main.cpp		main.cpp

Repository files navigation

This is an implementation of the ADIOS algorithm. This implementation is more efficient than the original author's but it can produce different outputs due to undocumented approximations used in the original author's implementation.

One possible modification to the original ADIOS algorithm is to favour Significant Patterns that are shorter in length and therefore more likely to represent basic grammar units. Applying this approach to sentences at word level, the algorithm does learn grammars that are more consistent with intuitions but when applied to sentences at character level, the algorithm tend to find non-sensical word fragments that do not correspond to intuition. I suspect that additional weighting based on the pattern's length and number of occurrences is required. The relevant code is commented out in RDSGraph.cpp at lines 356-359 and 362-365, if you wish to experiment.

Although it does not affect the correctness of the grammar produced by the current implementation, it can be made more efficient by grouping identitical distilled paths into a single production rule.


Usage:
ModifiedADIOS <filename> <eta> <alpha> <context_size> <coverage> ---OPTIONAL--- <number_of_new_sequences>

<filename>,     file name of the input corpus
<eta>,          threshold of detecting divergence in the RDS graph, usually set to 0.9
<alpha>,        significance test threshold, usually set to 0.01 or 0.1
<context_size>, size of the context window used for search for Equivalence Class, usually set to 5 or 4, a value less 3 will mean that no                 equivalence class can be found.
<coverage>,     threhold for bootstrapping Equivalence classes, usually set to 0.65. Higher values will result in less bootstrapping.

the optional parameter <number_of_new_sequences> asks the executable to generate new sentences from the distilled grammar.


An example raw corpus file (* and # are the start and end delimiters):

 *  Cindy believes that Joe believes that to please is easy  #
 *  Cindy believes that Cindy believes that to read is tough  #
 *  Pam thinks that Jim believes that to please is tough  #
 *  Beth believes that George believes that to please is easy  #
 *  Pam believes that Cindy believes that to read is easy  #
 *  Beth thinks that Beth thinks that to read is tough  #
 *  that Cindy is easy to read annoys Cindy  #
 *  that the cat is eager to please disturbs the cat  #
 *  that the cow is easy to read annoys the horse  #
 *  that Cindy is eager to please bothers the dog  #
 *  that the horse is easy to please annoys Jim  #
 *  Pam thinks that Cindy thinks that to please is easy  #
 *  that the dog is easy to read worries the horse  #
 *  Cindy believes that George believes that to read is easy  #
 *  Pam thinks that Pam believes that to read is easy  #
 *  Beth believes that Joe believes that to read is tough  #
 *  Cindy believes that Cindy believes that to please is easy  #
 *  that the horse is tough to please disturbs the cat  #
 *  Beth believes that Joe believes that to read is tough  #
 *  Cindy thinks that Joe believes that to please is easy  #



with eta = 0.9, alpha = 0.01, context_size = 5 and coverage = 0.65, the following grammar is distilled:

Lexicon 0: START   ------->  0  []
Lexicon 1: END   ------->  0  []
Lexicon 2: Cindy   ------->  1  [31]
Lexicon 3: believes   ------->  1  [27]
Lexicon 4: that   ------->  1  [28]
Lexicon 5: Joe   ------->  1  [31]
Lexicon 6: to   ------->  2  [30   35]
Lexicon 7: please   ------->  1  [29]
Lexicon 8: is   ------->  2  [30   37]
Lexicon 9: easy   ------->  1  [36]
Lexicon 10: read   ------->  1  [29]
Lexicon 11: tough   ------->  0  []
Lexicon 12: Pam   ------->  0  []
Lexicon 13: thinks   ------->  1  [27]
Lexicon 14: Jim   ------->  1  [31]
Lexicon 15: Beth   ------->  1  [31]
Lexicon 16: George   ------->  0  []
Lexicon 17: annoys   ------->  0  []
Lexicon 18: the   ------->  1  [34]
Lexicon 19: cat   ------->  1  [33]
Lexicon 20: eager   ------->  1  [36]
Lexicon 21: disturbs   ------->  0  []
Lexicon 22: cow   ------->  1  [33]
Lexicon 23: horse   ------->  1  [33]
Lexicon 24: bothers   ------->  0  []
Lex

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

License

shaobohou/madios

Folders and files

Latest commit

History

Repository files navigation