Mageswaran1989/aja

AJA - Accomplish Joyful Adventures

Data Science with Spark and Scala


### Topics Explored

  • Scala ~ Tantra
  • Spark ~ Tej
  • Data Science
  • SBT build tool
  • Distributed database query engines

#### Basics

  • Scala foundation
  • Features of Scala
  • Set up Spark and Scala on Ubuntu and Windows
  • Install IDEs for Scala
  • Run Scala code in the Scala shell
  • Understanding Data types in Scala
  • Implementing Lazy Values
  • Control Structures
  • Looping Structures
  • Functions
  • Procedures
  • Collections
  • Arrays and Array Buffers
  • Maps, Tuples and Lists
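
As a rough illustration of a few of the basics listed above (lazy values, arrays and array buffers, maps, tuples and lists), here is a minimal sketch. The object name `BasicsSketch` and all values in it are made up for this example and are not taken from the repository.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical example object; none of these names come from the repository.
object BasicsSketch {
  // Counts how many times the lazy initializer body actually runs.
  var initCount = 0

  // A lazy val is evaluated on first access and cached afterwards.
  lazy val expensive: Int = {
    initCount += 1
    42
  }

  // Array has a fixed size; ArrayBuffer can grow.
  val fixed: Array[Int] = Array(1, 2, 3)
  val growable: ArrayBuffer[Int] = ArrayBuffer(1, 2, 3)
  growable += 4

  // An immutable Map, a tuple, and a List transformed with map.
  val ages: Map[String, Int] = Map("alice" -> 30, "bob" -> 25)
  val pair: (String, Int) = ("alice", 30)
  val doubled: List[Int] = List(1, 2, 3).map(_ * 2)

  def main(args: Array[String]): Unit = {
    println(expensive)            // first access runs the initializer
    println(expensive)            // second access reuses the cached value
    println(initCount)            // the initializer ran only once
    println(growable.mkString(", "))
    println(ages.getOrElse("carol", -1)) // default for a missing key
    println(pair._1 + " is " + pair._2)
    println(doubled)
  }
}
```

The same expressions can be pasted one at a time into the Scala shell to see each result interactively.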

#### Object Oriented Programming in Scala

  • Implementing Classes
  • Implementing Getter & Setter
  • Object & Object Private Fields
  • Implementing Nested Classes
  • Using Auxiliary Constructors
  • Primary Constructor
  • Companion Object
  • Apply Method
  • Understanding Packages
  • Override Methods
  • Type Checking
  • Casting
  • Abstract Classes
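
Several of the topics above (abstract classes, primary and auxiliary constructors, companion objects with `apply`, overriding, type checking and casting) fit in one short sketch. The `Shape` and `Rectangle` names are hypothetical and chosen only for illustration.

```scala
// Abstract class: `area` has no body, so subclasses must define it.
abstract class Shape {
  def area: Double
  def describe: String = "shape with area " + area
}

// Primary constructor takes width and height; `val` generates getters.
class Rectangle(val width: Double, val height: Double) extends Shape {
  // Auxiliary constructor: a square delegates to the primary constructor.
  def this(side: Double) = this(side, side)

  def area: Double = width * height

  // Overriding a concrete method requires the `override` keyword.
  override def describe: String = "rectangle with area " + area
}

// Companion object: same name, same file as the class.
object Rectangle {
  // `apply` lets callers write Rectangle(2.0, 3.0) without `new`.
  def apply(width: Double, height: Double): Rectangle =
    new Rectangle(width, height)
}

object OopSketch {
  def main(args: Array[String]): Unit = {
    val shape: Shape = Rectangle(2.0, 3.0)

    // Type check followed by a cast.
    if (shape.isInstanceOf[Rectangle]) {
      val rect = shape.asInstanceOf[Rectangle]
      println(rect.width)
    }

    // Pattern matching is the more idiomatic way to do the same thing.
    shape match {
      case r: Rectangle => println(r.describe)
      case other        => println(other.describe)
    }

    println(new Rectangle(4.0).area)
  }
}
```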

#### Functional Programming in Scala

  • Understanding Functional programming in Scala
  • Implementing Traits
  • Layered Traits
  • Rich Traits
  • Anonymous Functions
  • Higher Order Functions
  • Closures and Currying
  • Performing File Processing
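
To make a few of the items above concrete (layered traits, anonymous and higher-order functions, closures, currying), here is a small sketch. All names in it (`Logger`, `Upper`, `Bracketed`, `FpSketch`) are invented for this example.

```scala
// Base trait with a concrete method that the layers below refine.
trait Logger {
  def log(msg: String): String = msg
}

// Each layer transforms the message, then passes it on via super.
trait Upper extends Logger {
  override def log(msg: String): String = super.log(msg.toUpperCase)
}

trait Bracketed extends Logger {
  override def log(msg: String): String = super.log("[" + msg + "]")
}

// Layers are applied right to left: Bracketed first, then Upper.
object StackedLogger extends Logger with Upper with Bracketed

object FpSketch {
  // Higher-order function: takes another function as a parameter.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Closure: the returned function captures and mutates `n`.
  def makeCounter(): () => Int = {
    var n = 0
    () => { n += 1; n }
  }

  // Curried function: two parameter lists.
  def add(a: Int)(b: Int): Int = a + b

  // Supplying only the first list yields a function of the second.
  val addFive: Int => Int = add(5)

  def main(args: Array[String]): Unit = {
    println(applyTwice(x => x + 3, 1)) // anonymous function as argument
    val next = makeCounter()
    println(next())
    println(next())
    println(addFive(10))
    println(StackedLogger.log("hi"))
  }
}
```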

#### Breeze ~ Linear Algebra

What is Spark?

  • Review: From Hadoop MapReduce to Spark
  • Review: HDFS
  • Review: YARN
  • Spark Overview

Spark Basics

  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark

#### Working with RDDs in Spark

  • Creating RDDs
  • Other General RDD Operations

#### Aggregating Data with Pair RDDs

  • Key-Value Pair RDDs
  • Map-Reduce
  • Other Pair RDD Operations

Writing and Deploying Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Hands-On Exercise: Write and Run a Spark Application
  • Configuring Spark Properties
  • Logging

Parallel Processing

  • Review: Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks

Spark RDD Persistence

  • RDD Lineage
  • RDD Persistence Overview
  • Distributed Persistence

#### Spark Streaming

  • Spark Streaming Overview
  • Example: Streaming Request Count
  • DStreams
  • Developing Spark Streaming Applications
  • Multi-Batch Operations
  • State Operations
  • Sliding Window Operations
  • Advanced Data Sources

#### Common Patterns in Spark Data Processing

  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
  • Example: k-means

Improving Spark Performance

  • Shared Variables: Broadcast Variables
  • Shared Variables: Accumulators
  • Common Performance Issues
  • Diagnosing Performance Problems

#### Spark SQL and DataFrames

  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • DataFrames and RDDs
  • Comparing Spark SQL, Impala and Hive-on-Spark

#### Spark Machine Learning

#### GraphX

## Data Science

#### Machine Learning in Scala


Project Structure

  • Android : Android + Scala integration!
  • docs : All reference materials
  • data : Datasets used in the implementation

## Build Environment: Ubuntu Linux 12.04+

Git Links

## Wiki

Contribution

Let us begin our journey from here!

Contact: [email protected]