
Large-scale data processing is a crucial topic in the Web 2.0 era. This project provides a Hadoop implementation of Join, one of the most important operations in databases.

logicmd/Join

Join via Hadoop

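Large-scale data processing matters a great deal in the Web 2.0 era. This project aims to provide a Join operation (one of the most important database operations) implemented with Hadoop.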

Structure

The paper is in doc, the experimental results are in exp, and the source code is in src.

─edu
  └─pku
      ├─broadcast
      │      BroadcastJoin.java
      │      BroadcastMapper.java
      │
      ├─mapside
      │      MapSideJoin.java
      │      SequenceFileIO.java
      │      Sort.java
      │
      ├─reduceside
      │      JoinReducer.java
      │      ReduceSideJoin.java
      │      TableOneMapper.java
      │      TableTwoMapper.java
      │
      ├─reducesidenew
      │      ReduceSideJoinNew.java
      │
      ├─test
      │      ConcatTest.java
      │      DatasetGen.java
      │      JobConfTest.java
      │
      └─util
              DatasetFactory.java
              TableOneParser.java
              TableTwoParser.java
              TextPair.java

All source code lives under the edu.pku.* package.

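edu.pku.util is the toolkit used by the whole Join framework. It contains a Parser for each of the two tables, which encapsulates the parsing so that the map phase stays clean. DatasetFactory generates the test data. TextPair is a custom data type that holds a pair of Text values.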
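As a rough illustration of these two utility roles, the sketch below uses plain Java with no Hadoop dependencies. The names TextPairSketch and RowParserSketch, the "key,value" row layout, and the ordering rule are assumptions made for this example. They are not taken from the repository's TextPair, TableOneParser, or TableTwoParser.

```java
/** Hypothetical stand-in for a pair-of-text key; not the repository's TextPair. */
final class TextPairSketch implements Comparable<TextPairSketch> {
    final String first;   // assumed role: the join key
    final String second;  // assumed role: a tag or secondary field

    TextPairSketch(String first, String second) {
        this.first = first;
        this.second = second;
    }

    /** Orders by first, then by second, so pairs with equal first sort together. */
    @Override
    public int compareTo(TextPairSketch other) {
        int byFirst = first.compareTo(other.first);
        return byFirst != 0 ? byFirst : second.compareTo(other.second);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TextPairSketch)) return false;
        TextPairSketch p = (TextPairSketch) o;
        return first.equals(p.first) && second.equals(p.second);
    }

    @Override
    public int hashCode() {
        return 31 * first.hashCode() + second.hashCode();
    }

    @Override
    public String toString() {
        return first + "\t" + second;
    }
}

/** Hypothetical parser for rows shaped like "key,value"; the real table layouts may differ. */
final class RowParserSketch {
    private String key;
    private String value;

    /** Returns false for malformed lines so that a mapper could skip them. */
    boolean parse(String line) {
        int comma = line.indexOf(',');
        if (comma <= 0) return false;          // no delimiter, or empty key
        key = line.substring(0, comma).trim();
        value = line.substring(comma + 1).trim();
        return true;
    }

    String getKey() { return key; }

    String getValue() { return value; }
}
```

Keeping the parsing in its own class means a mapper only has to call parse and read the fields, which is the kind of separation the paragraph above describes.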

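  • edu.pku.mapside implements the map-side join, including the Mapper and the join procedure.
  • edu.pku.reduceside implements the reduce-side join, including the Mappers, the Reducer, and the join procedure.
  • edu.pku.reducesidenew implements the reduce-side join using the new API, including the Mapper, the Reducer, and the join procedure.
  • edu.pku.broadcast implements the broadcast join, including the Mapper and the join procedure.
  • edu.pku.test contains a few intermediate tests.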
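
The three join strategies named above can be compared in a small plain-Java sketch. It deliberately uses no Hadoop classes: a HashMap stands in for the broadcast table, a TreeMap stands in for the shuffle, and a two-pointer merge stands in for reading two pre-sorted inputs. The class JoinSketch, its method names, and the String[]{key, value} row shape are assumptions for illustration only, and the code is not the repository's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative inner joins over rows shaped as String[]{key, value}. */
final class JoinSketch {

    /** Broadcast join: hash the small table in memory, then stream the large one past it. */
    static List<String> broadcastJoin(List<String[]> small, List<String[]> large) {
        Map<String, List<String>> lookup = new HashMap<>();
        for (String[] row : small) {
            lookup.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        List<String> out = new ArrayList<>();
        for (String[] row : large) {            // plays the role of the map phase
            List<String> matches = lookup.get(row[0]);
            if (matches == null) continue;      // inner join: drop unmatched rows
            for (String s : matches) out.add(row[0] + "," + s + "," + row[1]);
        }
        return out;
    }

    /** Reduce-side join: tag rows by source, group by key, then pair them up per key. */
    static List<String> reduceSideJoin(List<String[]> one, List<String[]> two) {
        // Index 0 holds table-one values for a key, index 1 holds table-two values.
        Map<String, List<List<String>>> groups = new TreeMap<>();
        tagAndGroup(groups, one, 0);
        tagAndGroup(groups, two, 1);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<List<String>>> e : groups.entrySet()) {
            for (String a : e.getValue().get(0)) {      // plays the role of reduce()
                for (String b : e.getValue().get(1)) {
                    out.add(e.getKey() + "," + a + "," + b);
                }
            }
        }
        return out;
    }

    private static void tagAndGroup(Map<String, List<List<String>>> groups,
                                    List<String[]> rows, int tag) {
        for (String[] row : rows) {
            List<List<String>> sides = groups.computeIfAbsent(row[0], k -> {
                List<List<String>> s = new ArrayList<>();
                s.add(new ArrayList<>());
                s.add(new ArrayList<>());
                return s;
            });
            sides.get(tag).add(row[1]);
        }
    }

    /** Map-side style merge join: both inputs must already be sorted by key. */
    static List<String> mergeJoin(List<String[]> one, List<String[]> two) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < one.size() && j < two.size()) {
            int cmp = one.get(i)[0].compareTo(two.get(j)[0]);
            if (cmp < 0) {
                i++;
            } else if (cmp > 0) {
                j++;
            } else {
                String key = one.get(i)[0];
                int jStart = j;
                // For each table-one row with this key, replay the matching table-two run.
                for (; i < one.size() && one.get(i)[0].equals(key); i++) {
                    for (j = jStart; j < two.size() && two.get(j)[0].equals(key); j++) {
                        out.add(key + "," + one.get(i)[1] + "," + two.get(j)[1]);
                    }
                }
            }
        }
        return out;
    }
}
```

All three methods produce the same set of joined rows for the same inputs. They differ in what they require: the broadcast variant needs one table to fit in memory, the merge variant needs both inputs sorted by key beforehand, and the reduce-side variant needs neither but pays for grouping every row by key.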
