
Task 2 Due: 4.07.2017

Christopher Schmidt edited this page Jun 23, 2017 · 4 revisions

As a second step, implement a hash join on two tables. Again, measure the execution time in different settings: when both tables are stored in local memory, when both tables are remote, and when one table is local and the other is remote. Should the table in remote memory be copied to local memory before executing the join? What influence do the table sizes have on this decision (e.g., a large table of entries is joined with a small fact table)? Where should we execute the join?
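For orientation, a hash join can be sketched as a build phase followed by a probe phase. The snippet below is only a minimal illustration in Python, not a reference solution: the function name `hash_join`, the row layout (tuples) and the parameter names are assumptions made for this example, and it ignores memory placement entirely.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of tuples; the keys are column indices (hypothetical API)."""
    # Build phase: index the (ideally smaller) table by its join column.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: look up every row of the other table in the hash table.
    result = []
    for row in probe_rows:
        for match in table.get(row[probe_key], ()):
            result.append(match + row)
    return result


# Tiny usage example: join column 0 of `small` with column 1 of `large`.
small = [(1, "a"), (2, "b"), (2, "c")]
large = [(10, 2), (11, 3), (12, 1)]
joined = hash_join(small, large, build_key=0, probe_key=1)
```

Building the hash table on the smaller input keeps it compact, which is one of the aspects the size question above asks you to reason about.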

At a minimum, consider two tables: a large table with 100 integer columns and 10 million entries, and a smaller table with 10 integer columns and a varying number of entries. The sizes of the smaller table should include 2,000; 20,000; 200,000; and 2,000,000 entries, joined via one column with 1,000; 10,000; 100,000; and 1,000,000 distinct values, respectively. These values find a join partner in the large table with 10 million entries.
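One possible way to generate such test data is sketched below. It is a hedged example only: `make_tables`, its parameters, and the choice of column 0 as the join column are assumptions for illustration, and the sizes in the usage line are scaled down so the snippet runs quickly. It assumes `large_rows >= distinct_keys`, so that every distinct key occurs in the large table.

```python
import random


def make_tables(small_rows, distinct_keys, large_rows,
                small_cols=10, large_cols=100, seed=0):
    """Return (small, large) as lists of integer rows; column 0 is the join column."""
    rng = random.Random(seed)

    def make(rows, cols):
        # Cycle through the key range so each distinct key value is used,
        # then fill the remaining columns with arbitrary integers.
        table = [[i % distinct_keys] + [rng.randrange(1000) for _ in range(cols - 1)]
                 for i in range(rows)]
        rng.shuffle(table)  # avoid a sorted key order
        return table

    return make(small_rows, small_cols), make(large_rows, large_cols)


# Scaled-down example: 20 rows with 10 distinct keys, joined against 100 rows.
small, large = make_tables(20, 10, 100, small_cols=3, large_cols=5)
```

With the task's ratio of two entries per distinct value, each key appears exactly twice in the smaller table.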

Bonus: How do your measurements scale with an increased number of entries in each of the tables (100 million vs 20 million)?

Deliverable: We expect an executable that compiles and runs the benchmarks and plots the corresponding results as box-plot graphs (PDF).
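Box plots need several timing samples per setting rather than a single measurement. The sketch below shows one way to collect repeated timings and the five values a box plot displays. The helper names `time_runs` and `five_number_summary` are hypothetical, only the Python standard library is used, and the actual plotting to PDF is left out.

```python
import statistics
import time


def time_runs(fn, repeats=10):
    """Call fn() `repeats` times and return the wall-clock durations in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def five_number_summary(samples):
    """Return (min, lower quartile, median, upper quartile, max) of the samples."""
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return min(samples), q1, median, q3, max(samples)


# Usage example with a stand-in workload instead of a real join.
samples = time_runs(lambda: sum(range(1000)), repeats=5)
summary = five_number_summary(samples)
```

Each benchmark setting (local/local, remote/remote, mixed, per table size) would contribute one such list of samples, and thus one box in the plot.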
