Skip to content

Latest commit

 

History

History
158 lines (136 loc) · 5.75 KB

README.MD

File metadata and controls

158 lines (136 loc) · 5.75 KB

Greenplum

The Greenplum Database (GPDB) is an advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte scale data volumes. Uniquely geared toward big data analytics, Greenplum Database is powered by the world’s most advanced cost-based query optimizer delivering high analytical query performance on large data volumes. https://www.greenplum.org

How to use Greenplum with PXF to read data from Hadoop (Cloudera)

This repository demonstrates how to use Greenplum with PXF to read text data from hadoop.

Table of Contents

  1. Pre-requisites
  2. Start Docker-compose
  3. Configure Greenplum
  4. Configure Cloudera
  5. Use PSQL to read data via PXF

Pre-requisites:

Start Docker compose

Once you have cloned this repository, you can run the command ./runDocker.sh -t usecase1 -c up, in order to start both Greenplum and Cloudera docker instances.

The assumption: docker and docker-compose are already installed on your machine.

Run command to start both Greenplum and Cloudera instances

$ ./runDocker.sh -t usecase1 -c up
WARNING: Found orphan containers (postgresql) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
Recreating gpdbsne ... done
Recreating quickstart.cloudera ... done
Attaching to gpdbsne, quickstart.cloudera
gpdbsne     | /etc/sysconfig/run-parts
gpdbsne     | /usr/bin/run-parts
gpdbsne     | Running /docker-entrypoint.d
gpdbsne     | /docker-entrypoint.d/startInit.sh:
gpdbsne     |
gpdbsne     | init is running
gpdbsne     | /docker-entrypoint.d/startSSH.sh:

How to access Greenplum docker instance:

You can use this command docker exec -it gpdbsne bin/bash to access Greenplum docker instance. Or you can run this script accessDockerGPDBSNE.sh.

For example:

$ docker exec -it gpdbsne bin/bash
[root@gpdbsne /]#

Configure Greenplum

Once you have access to Greenplum docker instance, you can create database, table with some sample data.

  1. Use gpadmin user by using su - gpadmin
  2. Disable user impersonation by editing /home/gpadmin/pxf/conf/pxf-env.sh
[gpadmin@gpdbsne conf]$ vi pxf-env.sh

Make sure you change PXF_USER_IMPERSONATION=false

export PXF_USER_IMPERSONATION=false

Or you can run this script /code/usecase1/pxf/disablePXFUserImpersonation.sh that performs the same operation. 3. Create a configuration for Hadoop cluster by copying template files

$ cp /home/gpadmin/pxf/templates/*.xml /home/gpadmin/pxf/servers/default
  1. Change /home/gpadmin/pxf/servers/default/core-site.xml to use "quickstart.cloudera".
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs:https://quickstart.cloudera:8020</value>
    </property>
</configuration>

Or you can run this script /code/usecase1/pxf/addClouderaToPXFCoreSiteXML.sh that performs the same operation.

  1. Create sample database with the name "pxf_db" The scripts to create database and sample data is found at /code/usercase1/pxf/.

Next, run the command /code/usecase1/pxf/setupDB.sh

$ /code/usecase1/pxf/setupDB.sh        
GRANT
GRANT
CREATE EXTERNAL TABLE
  1. Verify database and table is created Use `psql -U gpadmin -d pxf_db -c "\dE"``
[gpadmin@gpdbsne ~]$ psql -U gpadmin -d pxf_db -c "\dE"
                     List of relations
 Schema |        Name         | Type  |  Owner  | Storage  
--------+---------------------+-------+---------+----------
 public | pxf_hdfs_textsimple | table | gpadmin | external
(1 row)

Configure Cloudera

This section describes how to setup Cloudera.

Setup example data on cloudera

  1. You can use this command docker exec -it quickstart.cloudera bin/bash to access Cloudera docker instance. For example:
$ docker exec -it quickstart.cloudera bin/bash
[root@quickstart /]#
  1. This example shows how to use a script to creates sample text file.
[root@quickstart cloudera]# /code/usecase1/cloudera/setupHdfsTextSimpleExample.sh
Using root to run this script.
Found hadoop /data directory
Recursively delete files under /data
Deleted /data
## pxf_hdfs_simple.txt is created under this directory /data/pxf_examples/
Found 1 items
-rw-r--r--   1 root supergroup        100 2018-05-07 22:29 /data/pxf_examples/pxf_hdfs_simple.txt
[root@quickstart cloudera]#

Use PSQL to read data via PXF

Make sure you are accessing GPDB docker instance.

  1. Use psql on GPDB docker instance
$
$ psql -U gpadmin -d pxf_db -c "select * from pxf_hdfs_textsimple;"
location   | month | num_orders | total_sales
-------------+-------+------------+-------------
Prague      | Jan   |        101 |     4875.33
 Rome      | Mar   |         87 |     1557.39
 Bangalore | May   |        317 |     8936.99
 Beijing   | Jul   |        411 |    11600.67
(4 rows)
  1. Verify the number of records in this table pxf_hdfs_textsimple
$ psql -U gpadmin -d pxf_db -c "select count(*) from pxf_hdfs_textsimple;"
count
-------
    4
(1 row)

Links:

Howto: