melmarsezio/Big-Data-Management

Big-Data-Management

All projects and assignments for the course COMP9313 Big Data Management will be pushed here.

Topics covered in this course:

  • Data Processing and Management
    • Volume/ Velocity/ Variety
    • Veracity/ Visibility/ Value
    • Big Data Processes
      • Data Management
        • Acquisition and Recording
        • Extraction, Cleaning and Annotation
        • Integration, Aggregation and Representation
      • Analytics
        • Modelling and Analysis
        • Interpretation
    • Architecture: Cloud Computing (SaaS/ PaaS/ IaaS)
  • Hadoop
    • HDFS
    • YARN
    • MapReduce
    • Data Access (HBase, Hive, Pig, Mahout)/ Tools (Hue, Sqoop)
  • Data Curation
    • Ingestion/ Validation/ Transformation/ Correction/ Consolidation/ Visualization
    • Tools: Data Tamer/ ZenCrowd/ CrowdDB/ Talend/ Pentaho Data Integration
  • Hadoop Security
    • Authentication
      • Kerberos (TGT/ TGS)
    • Authorization
      • HDFS/ YARN
    • Encryption
    • Monitoring and Auditing
      • Jobs on NameNodes and JobTrackers/ Authorization Failures/ Authentication Failures
  • Spark
    • Spark SQL/ Spark Streaming/ GraphX/ MLlib
    • Spark Workflow
      • SparkContext
      • Cluster manager
      • Spark executor
    • RDDs (Resilient Distributed Datasets)
      • Traits: In-Memory/ Immutable/ Lazily evaluated/ Cacheable/ Parallel/ Typed/ Partitioned
    • RDD Operations
      • Transformation (returns new RDD)
      • Action (triggers evaluation and returns a value)
    • Lineage Graph
    • RDD Persistence: Cache/ Persist
    • DAG of operators
    • Narrow/ Wide Transformation
  • Apache Pig
    • Architecture: Parser/ Optimizer/ Compiler/ Execution Engine
    • Execution Modes: Local Mode/ MapReduce Mode/ Tez Mode/ Spark Mode
    • Pig Data Model: Atom/ Tuple/ Bag/ Map/ Relation
    • Grunt/ Pig Latin
  • NoSQL/ ElasticSearch
    • CAP Theorem: Consistency, Availability, Partition-tolerance
    • NoSQL Taxonomy
      • Key-Value stores: DynamoDB
      • Column stores: BigTable (Google), HBase (Apache)
      • Document stores: MongoDB, ElasticSearch (supports JSON, XML, etc.)
      • Graph databases: Neo4j, FlockDB
    • ElasticSearch
      • REST API
      • ElasticSearch Elements: Cluster, Node, Shard, Index, Type, Mapping, Document, Replicas
      • Search APIs
  • Process Mining
    • Petri nets/ BPMN
    • Event logs, alpha-algorithm, conformance checking
    • Decision trees
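
The RDD items in the Spark section (transformations return a new RDD, actions trigger evaluation, and the lineage graph records how each RDD was derived) can be illustrated with a minimal sketch. It uses only the Python standard library and runs in a single process. `MiniRDD` and its helper names are hypothetical and are not part of Spark's API; the sketch only mimics the lazy-evaluation idea.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class MiniRDD:
    """Hypothetical single-process stand-in for an RDD (NOT Spark's API)."""

    def __init__(self, source: Optional[Iterable] = None,
                 parent: Optional["MiniRDD"] = None,
                 op: str = "source",
                 fn: Optional[Callable[[Iterator], Iterator]] = None):
        self._source = source   # data held only by the root node
        self._parent = parent   # previous node in the lineage
        self._op = op           # name of the step that produced this node
        self._fn = fn           # deferred work applied to the parent's items

    # Transformations: record the step and return a new MiniRDD. Nothing runs yet.
    def map(self, f: Callable) -> "MiniRDD":
        return MiniRDD(parent=self, op="map", fn=lambda it: (f(x) for x in it))

    def filter(self, pred: Callable) -> "MiniRDD":
        return MiniRDD(parent=self, op="filter",
                       fn=lambda it: (x for x in it if pred(x)))

    def _compute(self) -> Iterator:
        # Walk back to the source, then apply each recorded step in order.
        if self._parent is None:
            return iter(self._source)
        return self._fn(self._parent._compute())

    # Actions: force evaluation of the whole chain and return a plain value.
    def collect(self) -> List:
        return list(self._compute())

    def count(self) -> int:
        return sum(1 for _ in self._compute())

    def lineage(self) -> List[str]:
        # Step names from the source down to this node.
        chain, node = [], self
        while node is not None:
            chain.append(node._op)
            node = node._parent
        return list(reversed(chain))


calls = []  # records every element the mapped function actually sees


def square(x: int) -> int:
    calls.append(x)
    return x * x


numbers = MiniRDD(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = list(calls)   # still empty: transformations are lazy
print(evens_squared.lineage())      # ['source', 'map', 'filter']
print(evens_squared.collect())      # [4, 16]
```

Because this sketch has no caching, every action recomputes the whole chain from the source; that recomputation is the cost that the Cache/ Persist item in the outline is meant to avoid.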
