GitHub - vitalius/BinaryTree: Testing Binary-Tree look-up performance on word keys

What is it?

Attempt to empirically show the difference in search time between Ordered and Unordered Binary Tree.

Setup

We're going to populate Ordered and Unordered Binary Trees with nodes that use words (strings of text) for keys, see Node.java. In case of Ordered tree, String.compareTo() is used to order keys based on lexicographical order. While Unordered tree is populated by randomly picking a path (left/right) from root to the leaf level. See Ordered.java and Unordered.java

The expected result for searching Ordered Binary Tree is O(clog(n)), and O(cn) for Unordered (pre-order tree traversal). Where n is the number of nodes (unique terms) in the Binary Tree and c is some constant representing the beefines of my computer.

Procedures

Build Ordered/Unordered Binary Tree with sub-set (size n) of terms from the corpus.
Compute the average search time by searching for every term in the tree and then taking the average.
Perform the experiment as the size of sub-set expands from 1,...,entire corpus.

Experiment 1

A short story by Ellena Ashley, The Dragon Rock is used for the first experiment (dragon-rock.txt). Dragon Rock contains 636 unique terms out of 1429 total, which means total of 636 nodes in a Binary Tree, and depth of 7 when tree is balanced.

The difference between Ordered and Unordered search is visible, however, due to much noise from Java's VM and unbalace in trees, it is hard to approximate the function that best fits the results. We need more data.

Since the number of unique terms grows approximately linearly with the total number of terms

Let's go ahead and use all of the terms.

Looks better but still too noisy for approximation. How about we copy/paste the story 9 times.

Looks like we now have enough data to start approximating O(n) equation, however the Binary Tree stops growing after all unique terms are exhausted (after about 2000 terms).

Clearly, we need a better corpus.

Experiment 2

The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (holmes.txt) contains 9580 unique words out of 120376 total!

Search in Unordered Binary Tree looks linear with respect to n while Ordered is way at the bottom accross entire graph.

We can now approximate Unordered Binary Tree search. Linear function should look something like t = c*n where we know n (number of nodes) and t (milliseconds) average time it takes to find a node. Averaging c by looking at few experiment data points we find that on this machine, c ~= 0.000016.

Similarly when we zoom-in on Ordered Binary Tree data, it begins to resemble a logarithmic curve: t = c*log(n) where c ~= 0.000055.

Wrap-up

We have empirically shown the difference in search times between Unordered (pre-order tree traversal) and Ordered Binary Trees.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
data		data
graphs		graphs
out/production/BinaryTree		out/production/BinaryTree
src		src
BinaryTree.iml		BinaryTree.iml
README.mkd		README.mkd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

What is it?

Setup

Procedures

Experiment 1

Experiment 2

Wrap-up

About

Releases

Packages

Languages

vitalius/BinaryTree

Folders and files

Latest commit

History

Repository files navigation

What is it?

Setup

Procedures

Experiment 1

Experiment 2

Wrap-up

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages