cuXfilter

cuXfilter is inspired by the Crossfilter library, a fast, browser-based mechanism for filtering across multiple dimensions that also supports groupby operations on top of those dimensions. One major limitation of Crossfilter is that it keeps data in memory in the client-side browser, which makes it inefficient for processing large datasets. cuXfilter instead uses cuDF on the server side and keeps the dataframe in GPU memory throughout the session. This results in sub-second response times for histogram calculations, groupby operations, and queries on datasets in the range of 10M to 200M rows (multiple columns).
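
To make the idea of crossfiltering concrete, here is a minimal pure-Python sketch. It is not cuXfilter's implementation (which runs on cuDF); it only illustrates Crossfilter's convention that the groupby counts for a dimension reflect the filters set on every other dimension but not its own. The `rows` data and the `group_counts` helper are hypothetical.

```python
from collections import Counter

# Hypothetical toy dataset; cuXfilter would hold this as a cuDF dataframe on the GPU.
rows = [
    {"state": "CA", "year": 2015},
    {"state": "CA", "year": 2016},
    {"state": "NY", "year": 2015},
    {"state": "NY", "year": 2016},
    {"state": "TX", "year": 2016},
]


def group_counts(rows, dimension, filters):
    """Count rows per value of `dimension`, applying the inclusive range
    filters set on all *other* dimensions (Crossfilter-style)."""
    counts = Counter()
    for row in rows:
        keep = all(
            low <= row[col] <= high
            for col, (low, high) in filters.items()
            if col != dimension  # a dimension ignores its own filter
        )
        if keep:
            counts[row[dimension]] += 1
    return dict(counts)


# Restrict the "year" dimension to 2016, then group by each dimension.
filters = {"year": (2016, 2016)}
by_state = group_counts(rows, "state", filters)  # sees the year filter
by_year = group_counts(rows, "year", filters)    # ignores its own filter
```

With the filter above, `by_state` counts only the 2016 rows, while `by_year` still shows both years so that a chart bound to it can keep displaying the full range.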

Installation

To build using Docker:

Edit the config.json file to reflect accurate IP, dataset name, and mapbox token values:
1. Add your server IP address to the server_ip property in the format http://server.ip.addr
2. Add demo_mapbox_token for running the GTC demo
3. Download the dataset 146M_predictions_v2.arrow from here

Then build and run the container:

docker build -t user_name/viz .
docker run --runtime=nvidia -d -p 3000:3000 -p 3004:3004 -p 3005:3005 -p 3009:3009 --name rapids_viz -v /folder/with/data:/usr/src/app/node_server/uploads user_name/viz

Config.json Parameters:

server_ip: ip address of the server machine, needs to be set before building the docker container
cuXfilter_port_external: port on which the cuXfilter API can be accessed from outside the docker container; default is 3000. (Internally, cuXfilter runs on port 3000.) The port must be published when running the container (-p 3000:3000).
demos_serve_port_external: port on which the examples are served externally; default is 3004. (Internally, the demos are served on port 3004.) The port must be published when running the container (-p 3004:3004).
gtc_demo_port_external: port on which the mortgage demo is served externally; default is 3005. (Internally, the mortgage demo runs on port 3005.) The port must be published when running the container (-p 3005:3005).
sanic_server_port_cudf_internal: port on which sanic_server (cudf) runs. It is internal to the container and can only be accessed by the node_server. Do not publish this port.
sanic_server_port_pandas_internal: port on which sanic_server (pandas) runs. It is internal to the container and can only be accessed by the node_server. Do not publish this port.
whitelisted_urls_for_clients: list of whitelisted URLs from which clients may access node_server. Before building the container, add the origin URLs you plan to develop on, to avoid CORS issues.
jupyter_port: port on which the jupyter integration example with cuXfilter will run
demo_mapbox_token: mapbox token for the mortgage demo. Can be created for free here
demo_dataset_name: dataset name for the example and mortgage demo. Default value: '146M_predictions_v2'. Can be downloaded from here
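
As an illustration of how these parameters relate to the `docker run` command, the following sketch derives the `-p` publish flags from a config. The key names come from the list above; the value types, the internal sanic ports (taken from the Architecture section), and the assumption that the jupyter port maps to the same port inside the container are illustrative guesses, and `publish_flags` is a hypothetical helper, not part of cuXfilter.

```python
import json

# Example config using the parameter names listed above. The three external
# ports use the documented defaults; the remaining values are assumptions.
EXAMPLE_CONFIG = json.loads("""
{
  "server_ip": "http://server.ip.addr",
  "cuXfilter_port_external": 3000,
  "demos_serve_port_external": 3004,
  "gtc_demo_port_external": 3005,
  "sanic_server_port_cudf_internal": 3002,
  "sanic_server_port_pandas_internal": 3003,
  "jupyter_port": 3009,
  "demo_dataset_name": "146M_predictions_v2"
}
""")

# Container-internal port for each externally published service, as documented.
INTERNAL_PORTS = {
    "cuXfilter_port_external": 3000,
    "demos_serve_port_external": 3004,
    "gtc_demo_port_external": 3005,
}


def publish_flags(config):
    """Build the `-p external:internal` arguments for `docker run`.

    The two sanic_server ports are deliberately never published.
    """
    flags = []
    for key, internal in INTERNAL_PORTS.items():
        flags += ["-p", f"{config[key]}:{internal}"]
    # Assumption: jupyter uses the same port inside and outside the container.
    flags += ["-p", f"{config['jupyter_port']}:{config['jupyter_port']}"]
    return flags


print(" ".join(publish_flags(EXAMPLE_CONFIG)))
```

If an external port is changed in the config, only the left-hand side of the corresponding `-p` mapping changes.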

With the default settings:

Access the crossfilter demos at http://server.ip.addr:3004/demos/examples/index.html

Access the GTC demos at http://server.ip.addr:3005/

Access the jupyter integration demo at http://server.ip.addr:3009/
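
If the ports in config.json are changed, the addresses above change accordingly. A small sketch of that relationship, assuming `server_ip` already includes the scheme (as in the format shown under Installation); `demo_urls` is a hypothetical helper:

```python
def demo_urls(server_ip, demos_port=3004, gtc_port=3005, jupyter_port=3009):
    """Return the demo addresses for a given server_ip and set of ports.

    The default ports are the ones used with the default settings.
    """
    base = server_ip.rstrip("/")
    return {
        "examples": f"{base}:{demos_port}/demos/examples/index.html",
        "gtc": f"{base}:{gtc_port}/",
        "jupyter": f"{base}:{jupyter_port}/",
    }


urls = demo_urls("http://server.ip.addr")
```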

Architecture

Docker container(python_sanic <--> node) SERVER <<<===(socket.io)===>>> browser(client-side JS)

Server-side

Sanic server

The sanic server interacts with the node_server and keeps dataframe objects in memory throughout the user session. Two instances of the sanic_server run at all times: one on port 3002 (handling all cudf dataframe queries) and the other on port 3003 (handling all pandas dataframe queries, in case anyone wants to compare performance). This server is not exposed to the cuXfilter-client.js library and is accessible only to the node_server, which acts as a load balancer between the cuXfilter-client.js library and the sanic server.
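
The choice between the two instances can be pictured as a lookup from engine name to port. The sketch below is only an illustration in Python (the real node_server is written in JavaScript); the `sanic_url` helper, the `localhost` host, and the endpoint names are assumptions.

```python
# Ports of the two sanic_server instances described above.
SANIC_PORTS = {"cudf": 3002, "pandas": 3003}


def sanic_url(engine, endpoint, host="localhost"):
    """Return the container-internal URL for a query on the given engine.

    `endpoint` is a hypothetical route name; the actual routes are defined
    in app/views.py.
    """
    if engine not in SANIC_PORTS:
        raise ValueError(f"unknown engine: {engine!r}")
    return f"http://{host}:{SANIC_PORTS[engine]}/{endpoint.lstrip('/')}"


cudf_url = sanic_url("cudf", "/groupby")
pandas_url = sanic_url("pandas", "/groupby")
```

Sending the same request to both URLs is one way to compare cudf and pandas performance side by side.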

Files:
1. app/views.py -> handles all routes, and appends each response with calculation time
2. app/utilities/cuXfilter_utils.py -> all cudf crossfilter functions
3. app/utilities/numbaHistinMem.py -> histogram calculations using numba for a cudf.Series(ndarray)
4. app/utilities/pandas_utils.py -> all pandas crossfilter functions
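
The pattern of appending a calculation time to each response can be sketched with a decorator. This is a plain-Python illustration, not the code in app/views.py: the `with_calc_time` name, the response keys, and the pure-Python histogram (standing in for the numba kernel) are all assumptions.

```python
import functools
import time


def with_calc_time(handler):
    """Wrap a handler so its result is returned together with the time it took."""
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        data = handler(*args, **kwargs)
        elapsed = time.perf_counter() - start
        return {"data": data, "calc_time_seconds": elapsed}
    return wrapper


@with_calc_time
def histogram(values, bins, low, high):
    """Fixed-width histogram over [low, high]; values outside are ignored."""
    if bins <= 0 or high <= low:
        raise ValueError("need bins > 0 and high > low")
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # The upper edge falls into the last bin.
            idx = min(int((v - low) / width), bins - 1)
            counts[idx] += 1
    return counts


response = histogram(list(range(11)), bins=5, low=0, high=10)
```

Keeping the timing in a wrapper means every route reports it the same way without each handler measuring itself.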

Node server

The Node server is exposed to the cuXfilter-client.js library and handles incoming socket.io requests and their responses. It manages user sessions and also lets the client side perform cross-browser and cross-tab crossfiltering.

Files:
1. routes/cuXfilter.js -> handles all socket.io routes, and appends to each response the amount of time spent by the node_server on that request
2. routes/utilities/cuXfilter_utils.js -> utility functions for communicating with the sanic_server and handling the responses

Client-side

cuXfilter-client.js

A JavaScript (ES6) client library that provides crossfilter functionality for creating interactive visualizations right from the browser.

Documentation and examples can be found here

Memory Limitations

Currently, there are a few memory limitations for running cuXfilter.

The dataset should be no larger than half of the total available GPU memory, because GPU memory usage spikes to around 2X during groupby operations.

This will not be an issue once the dask_gdf engine is implemented (assuming the user has access to multiple GPUs).
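
The rule of thumb above amounts to a simple check before loading a dataset. A minimal sketch, where the 2X spike factor is the approximate figure stated above and the helper names are hypothetical:

```python
GIB = 1024 ** 3


def max_dataset_bytes(gpu_bytes, spike_factor=2.0):
    """Largest dataset size that leaves room for the groupby memory spike."""
    return int(gpu_bytes / spike_factor)


def fits_on_gpu(dataset_bytes, gpu_bytes, spike_factor=2.0):
    """True if the dataset stays within GPU memory even at peak usage."""
    return dataset_bytes * spike_factor <= gpu_bytes


# Example: a GPU with 16 GiB of available memory.
limit = max_dataset_bytes(16 * GIB)
ok_small = fits_on_gpu(7 * GIB, 16 * GIB)
ok_large = fits_on_gpu(10 * GIB, 16 * GIB)
```

Here a 7 GiB dataset passes the check and a 10 GiB dataset does not, since its peak usage would be roughly 20 GiB.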

Troubleshooting

In case the server becomes unresponsive, here are the steps you can take to resolve it:

Check whether the GPU memory is full using the nvidia-smi command. If GPU memory usage appears full and frozen, the cause may be a cudf out-of-memory error, which can happen if the dataset is too large to fit into GPU memory. Refer to Memory Limitations when choosing datasets.

A docker container restart might solve the issue temporarily.
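
The memory check can also be scripted. The sketch below calls nvidia-smi with its CSV query options and flags GPUs whose memory is nearly full; the 95% threshold and the helper names are assumptions, and the script must run on a machine where nvidia-smi is installed.

```python
import subprocess

# Ask nvidia-smi for used and total memory (in MiB) of every GPU, one per line.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(output):
    """Parse lines such as '15800, 16160' into (used_mib, total_mib) tuples."""
    usage = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        usage.append((used, total))
    return usage


def gpu_memory_usage():
    """Run nvidia-smi and return a list of (used_mib, total_mib) per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(result.stdout)


def nearly_full(usage, threshold=0.95):
    """Return one flag per GPU: True if used/total is at or above threshold."""
    return [used / total >= threshold for used, total in usage]


# Parsing a captured sample, so the example also works without a GPU.
sample = parse_memory("15800, 16160\n512, 16160\n")
flags = nearly_full(sample)
```

On the server itself, `nearly_full(gpu_memory_usage())` gives the same flags for the live GPUs.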

File Conversion

Currently, cuXfilter supports only the Arrow file format as input. The python_scripts folder in the root directory provides a helper script to convert CSV files to Arrow files. For more information, follow this link
