Skip to content

This repository contains an educational project aimed to solve a real-world problem involving core concepts of computer science and programming.

Notifications You must be signed in to change notification settings

hyperfence/dsodn-cpp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Storage over Distributed Networks (C++)



Note: This is an educational project created for the sole purpose of self learning.


Introduction

Let's consider a scenario where the data is not located on a single machine but rather stored on multiple machines geo-distributed across the Internet. In such a scenario, searching or retrieval of data is very challenging, as it is hard to determine on which machine the required data is stored. This challenge can be solved by using a Data Structure that can efficiently search for data stored on a single machine. We will be mainly using Hash Tables along with Linked Lists and Binary Search Trees to design an efficient solution for this scenario.


Concept Illustration

Concept Illustration



Implementation

This section explains the logical implementation of this project.

Identifier Space Concept

In our implementation we assume that all the machines geo-distrubuted across the internet are present in an identifier space and linked with each other in the form of a circle. Maximum number of machines in this circle depends upon the size of the identifier space which is defined once initially in the beginning. A more professional term used for such arrangement is called "Ring DHT". Below is an arbitrary example of machines arranged according to above specification.

Identifier Space

On the right side of above picture we have six machines and on the left side the black dots represent these machines in the identifier space. The above identifier space has a size of 4 bits which means it can hold a maximum of 2^size = 2^4 = 16 machines. Notice that our identifier space has 6 out of 16 machines active which are connected with each other in the form of a circle. Due to this circular property of our space we call it the Ring DHT where DHT stands for Distributed Hash Tables(we'll discuss about DHTs later). In the above picture, each active machine has a unique id assigned to it which tells it apart from other machines in the identifier space. In above case 3, 5, 8, 10, 13 and 14 are the ids assigned to the machines respectively.


Data Storage Concept

These machines in the identifier space are responsible for holding some sort of data which can be videos, images, music, text files, etc. For the sake of simplicity the data is stored in the form of key/value pairs. These key/value pairs are taken input from the user and are arranged in the form of a Binary Search Tree. Every active machine in the above identifier space has a single binary tree associated with it which is used to store data key/value pairs of that machine. The key or value can by anything, it can be in the form of a string or a number, for example, a user can enter "Good Movie" as key and "Toy Story 3" as its value. Now this key will be hashed through a hash function and the result from that function will map the value "Toy Story 3" to a specific active machine in the identifier space. This result from hash function will have specific rules, firstly that result will be an integer value so that it can easily be compared with the machine ids and lastly this result will always be a value that falls in the size of the identifier space(eg. in the case above the result will be from 0 to 15 only). In our case consider that key hashed to 10, if you look in the identifier space above there is an active machine with id 10, so, the value "Toy Story 3" will be stored on that machine. But what if the key hashed to a value whose machine is not active like 11, 12 or 0? The answer to this question is quite simple, earlier we said that a machine can store more than one data key/value pairs now let us see how exactly it can store them. Let's assume that after entering the "Good Movie" as key and "Toy Story 3" as value the user enters the following key/value pairs:

  • Key: "Awesome Game"   Value: "Witcher III"         [Hashed Key = 0]
  • Key: "A Document"         Value: "My Passport"      [Hashed Key = 9]
  • Key: "Best TV Show"       Value: "Peaky Blinders"   [Hashed Key = 15]

The data "Witcher III", "My Passport" & "Peaky Blinders" will be stored on machines 3, 10 & 3 respectively. From this we can deduce that each active machine is responsible for holding data whose hashed key is either equal to the current machine ID or less than current machine but greater than the ID of predecessor machine. eg. the key "A Document" hashes to 9 but as there is no active machine with id 9, so, the data is stored on the next active machine which is 10.


Data Collision Concept

Above we discussed how the data is stored on the active machines of the identifier space but what if more than one key hashes to a same id(eg. consider "Awesome Game" & "Good Movie" now both hash to value of 10)? In order solve this problem we use the concept of chaining lists. Every node of the binary search tree of a machine holds a singly linked list instead of just holding a single key/value pair. Whenever more than two keys hash to the same id they are inserted in the relevant linked list of BST(Binary Search Tree) of the machine. And to keep track or distinguish a key/value pair of that linked list we use a unique "encryption key" which is associated with every key/value pair of that linked list. This "encryption key" is generated from the same hash function which hashes the simple key to hashed-key. The term DHT which was used earlier is just the concept of hashing a key and then storing its value somewhere. We have already discussed this hashing and storing in detail above.



Development Team

The entire project was developed and coded by the following team of computer science students:


Source

This problem statement was sourced from the semester project given by our university.

University: National University of Computer & Emerging Science
Problem Statement: PDF File

About

This repository contains an educational project aimed to solve a real-world problem involving core concepts of computer science and programming.

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages