Skip to content

Spark GraphX - Pregel, PageRank and Dijkstra on a social graph

License

Notifications You must be signed in to change notification settings

artem0/spark-graphx

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Graph analysis with Pregel and PageRank algorithms

Implemented operations

  1. Most connected user in social graph based on graph degrees.
  2. Degree of separation for single user based on Breadth-first search with Pregel. Input is user's id - output is list of tuple with users and degree of separation between them.
  3. Degree of separation between two defined users, as degree of separation for the single user, it's based on Breadth first search with Pregel. Input is two user's ids - output is degree of separation between them.
  4. Measuring importance/rating of users with PageRank with two strategies: iterative and until convergence (Pregel based).
  5. Connected component for social graph - under the hood it delegates to Pregel.
  6. Triangle count - the number of triangles passing through each vertex.
  7. Page Rank for defined users: dynamic and iterative versions.

PageRank

PageRank measures the importance of each vertex in a social graph. Spark allows to build PageRank with two strategies: dynamically, this implementation uses the Pregel interface and runs PageRank until convergence and iterative, it runs PageRank for a fixed number of iterations. Dynamically approach with strategy until convergence denotes that computation will continue until [R(t+1)-R(t)] < e. Convergence is achieved when the error rate for any vertex in the graph falls below a given threshold. The error rate of a vertex is defined as the difference between the “real” score of the vertex R(Vi) and the score computed at iteration k, R^K(Vi) Since the real score is not known apriori, this error rate is approximated with the difference between the scores computed at two successive iterations: R(t+1) and R(t).

//dynamic version
val dynamicRank = socialGraph.graph.pageRank(tol = 0.0001)

//iterative version
val iterativeRank = socialGraph.graph.staticPageRank(numIter = 20)

Running

Launching with SBT like a simple batch job:

>> sbt

>> runMain com.github.graphx.pregel.jobs.social.SocialGraphJoB

Shortest path problem with Dijkstra’s algorithm

In graph theory, the shortest path problem is the problem of finding a path between two vertices (or nodes) in a graph such that the sum of the weights of its constituent edges is minimized.

This project solves shortest path problem with Dijkstra's algorithm with relying on Pregel algorithm for propagating messages.

The next code snapshot demonstrates how easy you can implement similar algorithms:

initialGraph.pregel(Double.PositiveInfinity)(
      (_, dist, newDist) => math.min(dist, newDist),
      triplet => {
        //Distance accumulator
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        } else {
          Iterator.empty
        }
      },
      (a, b) => math.min(a, b)
    )

Connected Components

Connected Components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex. For example:

val component = graph.connectedComponents().vertices

Variable component has type Graph[VertexId, ED] and contains tuple of vertex id and lowest vertex id in a component.

Running

Launching with SBT:

sbt runMain com.github.graphx.pregel.ssp.demo.ShortestPathProblemDemo

Graph storing/representation

Social graph is represented in the next form: one user -> multiple friends Datasets has the next structure: user_id array of related user's ids, file is stored in resources directory - UserGraph.txt. For example, user with id 5988 has friends with ids 748 1722 3752 will be represented in pretty obvious form: 5988 748 1722 3752

Tested data is stored in resource directory - UserNames.tsv - file in format user_id -> user_name, necessary for joining with graph of contacts. It can be treated as a simple tuple - id from UserGraph.txt and name of vertex of graph. Names of user are random, any coincidence are accidental.

Licence: GNU General Public License v3.0