DataLakeHouse: Trino (PrestoSQL) with S3 on Bare-Metal Kubernetes

Trino, Hive Metastore Standalone, MinIO S3, and a Postgres DB in a Kind cluster

Based on and expanded from https://github.com/joshuarobinson/trino-on-k8s and https://joshua-robinson.medium.com/presto-powered-s3-data-warehouse-on-kubernetes-aea89d2f40e8

Main changes: Kind cluster, MinIO S3, updated Trino (and the changes that follow from it), updated Hive standalone Dockerfile, Postgres database

What is a DataLakeHouse?

This is a mouthful. Most people have heard of a Data Warehouse; the old-school kind is built on Hadoop and the HDFS file system, which is hard to operate and maintain.

The cool kid on the block is S3. A Data Lake is data on S3; a Data Lakehouse is data on S3 with SQL tables and schemas kept on top of it, so the data can be queried.

System Diagram

Components

  • Trino - a distributed SQL query engine
  • Hive - we use only the Hive Metastore standalone here; it persists the table structure to a database (we use Postgres)
  • S3 - holds the data as S3 objects; the metadata structure for them is stored in Hive
  • Parquet - the data in S3 is stored in Apache Parquet, a binary, compressed, columnar data format

[System diagram: DataLake components]

Step 0: Install Kind cluster

Create a Kind cluster with a memory limitation:

kind create cluster --config /home/alex/coding/preso_hive/kind-cluster-config.yaml
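
For reference, the Kind config is a small YAML file. A minimal sketch of its shape (the node layout here is an assumption; use the repo's kind-cluster-config.yaml, which also carries the memory settings):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker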

Install MinIO for S3

Via Helm (chart source: https://github.com/minio/minio/tree/master/helm/minio)

helm install mino-test -f mino/values.yaml minio/minio
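
This assumes the minio Helm repo has already been added; the repo URL below is the usual one, but the chart from the GitHub tree linked above works as well:

helm repo add minio https://charts.min.io/
helm repo update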

Create a bucket named test from the UI

kubectl port-forward svc/mino-test-minio-console 9001
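
Alternatively, the bucket can be created from the command line with the MinIO client mc (the alias name local and the port-forward to localhost:9000 are assumptions):

kubectl port-forward svc/mino-test-minio 9000 &
mc alias set local http://localhost:9000 minio minio123
mc mb local/test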

Step 1: Install Postgres in Kubernetes with Kubegres Operator

Postgres in Kubernetes https://www.kubegres.io/

https://www.kubegres.io/doc/getting-started.html

kubectl apply -f https://raw.githubusercontent.com/reactive-tech/kubegres/v1.15/kubegres.yaml
kubectl apply -f postgres/postgres-secret.yaml
kubectl apply -f postgres/kubegres-porstrescluster.yaml
kubectl get pods
NAME             READY   STATUS    RESTARTS   AGE
mypostgres-1-0   1/1     Running   0          22m
mypostgres-2-0   1/1     Running   0          22m
mypostgres-3-0   1/1     Running   0          22m
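
For reference, postgres/postgres-secret.yaml follows the shape from the Kubegres getting-started guide; a sketch (the secret name and the replication password are assumptions):

apiVersion: v1
kind: Secret
metadata:
  name: mypostgres-secret
  namespace: default
type: Opaque
stringData:
  superUserPassword: postgresSuperUserPsw
  replicationUserPassword: postgresReplicaPsw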

Manually create a DB after installing Postgres

 kubectl exec -it mypostgres-1-0 -- /bin/sh
 psql -U postgres
 <password from the secret>

postgres=# create database metadata;
CREATE DATABASE

Check if the DB is created

postgres=# \list
                                 List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges   
-----------+----------+----------+------------+------------+-----------------------
 metadata  | postgres | UTF8     | en_US.utf8 | en_US.utf8 | 
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 | 
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres

Other commands

\c metadata (connect to metadata DB)

\dt (list the tables - After Step 2.1)

Get the Postgres Service

Update the connection string in hive/metastore-cfg.yaml and hive/hive-initschema.yaml accordingly.
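
Assuming the default namespace and the metadata database created above, the Service resolves in-cluster like this:

kubectl get svc mypostgres
# in-cluster DNS name: mypostgres.default.svc.cluster.local
# JDBC form used in the configs: jdbc:postgresql://mypostgres:5432/metadata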

Step 2: Install Hive Metastore Standalone

Step 2.1

Run the hive/hive-initschema.yaml Job to initialize the schema in the Postgres database:

kubectl apply -f hive/hive-initschema.yaml
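
The Job is essentially a wrapper around Hive's schematool. A sketch of its core part, assuming the image from Step 2.3 and the Postgres Service above (paths and credentials are assumptions; the repo's hive-initschema.yaml is authoritative):

apiVersion: batch/v1
kind: Job
metadata:
  name: hive-initschema
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: hive-initschema
        image: alexcpn/hivemetastore:3.1.3.5
        command: ["/opt/hive-metastore/bin/schematool"]
        args: ["-dbType", "postgres", "-initSchema",
               "-userName", "postgres",
               "-passWord", "postgresSuperUserPsw",
               "-url", "jdbc:postgresql://mypostgres:5432/metadata"]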

Then verify that the tables were created properly:

 kubectl exec -it mypostgres-1-0 -- /bin/sh
 psql -U postgres
 postgresSuperUserPsw

\list
\c metadata
\dt
\q

To drop the database while it is in use, first terminate the open connections to it:

SELECT
    pg_terminate_backend(pid)
FROM
    pg_stat_activity
WHERE
    datname = 'metadata' AND pid <> pg_backend_pid();

drop database metadata;

Step 2.2

Create the S3 secrets for Hive

kubectl create secret generic my-s3-keys --from-literal=access-key='minio' --from-literal=secret-key='minio123'

Get the MinIO/S3 service

kubectl get svc
kubernetes                ClusterIP   10.96.0.1       <none>        443/TCP    7d
metastore                 ClusterIP   10.96.189.200   <none>        9083/TCP   4m8s
mino-test-minio           ClusterIP   10.96.149.113   <none>        9000/TCP   6d18h
mino-test-minio-console   ClusterIP   10.96.236.45    <none>        9001/TCP   6d18h
mino-test-minio-svc       ClusterIP   None            <none>        9000/TCP   6d18h
mypostgres                ClusterIP   None            <none>        5432/TCP   29m
mypostgres-replica        ClusterIP   None            <none>        5432/TCP   6d23h
trino                     ClusterIP   10.96.249.19    <none>        8080/TCP   63s

Update hive/metastore-cfg.yaml for S3 and Postgres

Note especially the property below. We point metastore.warehouse.dir to the S3 location; all schemas and tables will then get created in S3.

   <property>
      <name>metastore.warehouse.dir</name>
      <value>s3a://test/warehouse</value>
   </property>
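
Besides the warehouse dir, the S3A connector needs the MinIO credentials and path-style access. A sketch of the related properties (values are assumptions matching the my-s3-keys secret above):

   <property>
      <name>fs.s3a.access.key</name>
      <value>minio</value>
   </property>
   <property>
      <name>fs.s3a.secret.key</name>
      <value>minio123</value>
   </property>
   <property>
      <name>fs.s3a.path.style.access</name>
      <value>true</value>
   </property>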

NOTE: Using the Kubernetes Service name in the Kind cluster gives read timeouts from Hive when it is trying to write to MinIO. So for MinIO the endpoint IP is mentioned directly in metastore-cfg.yaml. This means that every time the Kind cluster restarts, the endpoint IP has to be reset and Hive redeployed, for now.

$ kubectl get ep | grep mini
mino-test-minio                10.244.1.14:9000,10.244.1.15:9000,10.244.1.2:9000 + 1 more...   13d

   <property>
      <name>fs.s3a.endpoint</name>
      <value>http://10.244.1.14:9000</value>
   </property>

Step 2.3

First, build the Hive Metastore standalone image:

docker build -t hivemetastore:3.1.3.5 -f hive/Dockerfile ./hive
docker tag hivemetastore:3.1.3.5 alexcpn/hivemetastore:3.1.3.5
docker push alexcpn/hivemetastore:3.1.3.5

Step 2.4

Install/re-install the Hive Metastore server:

kubectl apply -f hive/metastore-cfg.yaml && kubectl delete -f hive/hive-meta-store-standalone.yaml  && kubectl create -f hive/hive-meta-store-standalone.yaml

Step 3: Install Trino (PrestoSQL)

Configure the Postgres, S3, and Metastore services first in trino/trino_cfg.yaml
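
The important piece is the hive catalog inside that config. A sketch of what it typically contains (the values are assumptions; connector.name is hive-hadoop2 on older Trino releases and hive on newer ones):

# catalog/hive.properties (sketch)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore:9083
hive.s3.endpoint=http://mino-test-minio:9000
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
hive.s3.path-style-access=true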

kubectl apply -f trino/trino_cfg.yaml
kubectl apply -f trino/trino.yaml

Port forward to see the UI

kubectl   port-forward svc/trino 8080  &

Part 2

Create a table in S3 via Trino

Access Trino CLI

Give the Trino Service name in the --server argument below

kubectl exec -it trino-cli -- /bin/bash
/bin/trino --server trino:8080 --catalog hive --schema default

Try to create a schema using S3

We use the built-in test catalog tpcds to create the tables; it is auto-populated.

show schemas from tpcds;
show tables  from tpcds.tiny;
trino:default> CREATE SCHEMA hive.tpcds WITH (location = 's3a://test/warehouse/tpcds/');
trino:default> CREATE TABLE tpcds.store_sales AS SELECT * FROM tpcds.tiny.store_sales;
CREATE TABLE: 120527 rows

Query 20220617_125702_00006_sqada, FINISHED, 3 nodes
Splits: 14 total, 14 done (100.00%)
20.24 [121K rows, 0B] [5.95K rows/s, 0B/s]

You can see that the files are written in S3

[Screenshot: files in S3]

trino:default> select count(*) from tpcds.store_sales;
 _col0  
--------
 120527 
(1 row)
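
The table is now ordinary Parquet data behind the hive catalog, so any Trino SQL works against it. For example, a small aggregation (column names per the standard TPC-DS store_sales schema):

SELECT ss_store_sk, count(*) AS sales_count, sum(ss_quantity) AS total_qty
FROM tpcds.store_sales
GROUP BY ss_store_sk
ORDER BY sales_count DESC
LIMIT 5;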

You can see the queries getting executed via the Trino UI

[Screenshot: Trino UI showing query execution]

Handy commands

kubectl apply -f hive/metastore-cfg.yaml && kubectl delete -f hive/hive-meta-store-standalone.yaml  && kubectl create -f hive/hive-meta-store-standalone.yaml

kubectl apply -f trino/trino_cfg.yaml && kubectl delete -f trino/trino.yaml && kubectl create -f trino/trino.yaml

kubectl   port-forward svc/trino 8080 
kubectl port-forward svc/mino-test-minio-console 9001

kubectl exec -it trino-cli -- /bin/bash
/bin/trino --server trino:8080 --catalog hive --schema default

Optional: Install Redash

From https://github.com/getredash/contrib-helm-chart

Redash is a GUI for executing SQL queries against various data sources; Trino is also supported, and it can be used to analyze the data.

A few minor changes are needed below for a proper installation.

helm repo add redash https://getredash.github.io/contrib-helm-chart/

Get the template instead of directly installing:

helm template  myredash -f redash/my-values.yaml  redash/redash >> redash/deployment.yaml

Then update the following images to more recent tags, as the older ones gave image pull errors; also make the passwords in my-values.yaml shorter, as the longer ones caused errors:

- name: myredash-postgresql
  image: docker.io/bitnami/postgresql:14.4.0-debian-11-r1
- name: redis
  image: docker.io/bitnami/redis:6.0.16-debian-11-r7

Once installed, port-forward to see the GUI:

kubectl port-forward svc/myredash 8081:80

You can configure the Trino data source as below, and use Redash for query execution and visualization

[Screenshot: Redash Trino data source configuration]

Query Execution and Visualization

[Screenshot: Redash query execution and visualization]

Further tests are described here: More tests, Transactions, Update, CSV etc.
