This is a Google App Engine application for streaming JSON into BigQuery, inspired by Luke Cempre's post on AnalyticsPros.
When you request the URL `/bq-streamer` with the parameter `?bq={'json':'example'}`, the app starts a new task that puts that JSON into a date-partitioned table in BigQuery. The task is activated via a POST request to `/bq-task` with the same JSON that was passed to `/bq-streamer`.
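A minimal sketch of what the two endpoints exchange, assuming hypothetical field names (your table schema must contain matching columns):

```python
import json

# Hypothetical example fields for illustration.
hit = {"field_name": "field_value", "field_name2": "field_value2"}

# /bq-streamer receives this string as the `bq` parameter...
bq_param = json.dumps(hit)

# ...and the /bq-task handler decodes the same string before streaming
# the row into the partitioned BigQuery table.
decoded = json.loads(bq_param)
assert decoded == hit
```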
See the Google App Engine Samples for how to deploy.
- Create a dataset and a date-partitioned BigQuery table to receive the hits. In production you will probably want to expire old data after some time.
- Create empty table > set table name > add schema > Options: set Partitioning to "DAY"
- Add any other fields to the table that you wish to send in. The script by default also adds `ts` as a STRING holding a UNIX timestamp, so add that field too. Any fields you don't set in a hit simply won't appear.
- Edit the `app.yaml` `env_variables` field with your BigQuery details and your secret code word:
Example:

```yaml
runtime: python27
api_version: 1
threadsafe: yes

handlers:
- url: .*
  script: main.app

#[START env]
env_variables:
  DATASET_ID: tests
  TABLE_ID: realtime
  SECRET_SALT: changethistosomethingunique
#[END env]
```
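Inside the app, these `env_variables` values are read from the process environment; a minimal sketch (the defaults here are only fallbacks for local testing, not part of the app):

```python
import os

# env_variables from app.yaml appear in os.environ under the same keys.
dataset_id = os.environ.get("DATASET_ID", "tests")
table_id = os.environ.get("TABLE_ID", "realtime")
secret_salt = os.environ.get("SECRET_SALT", "changethistosomethingunique")
```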
- Deploy the app (see below)
- Call
  `https://your-app-id.appspot.com/bq-streamer?bq={"field_name":"field_value", "field_name2":"field_value2"}`
  to add the fields to your BigQuery table. For testing you can call the URL in the browser via GET, but for production call via POST with the JSON in the body's `bq` field.
Other examples:
https://your-app-id.appspot.com/bq-streamer?bq={'bar':'blah5','foo':'hi'}
- The data won't appear in the BigQuery table preview quickly, but you can query the table via something like `SELECT * FROM dataset.tableID` to see the realtime hits seconds after the hit is made. Turn off USE CACHED RESULTS. The app also adds a `ts` field with a UNIX timestamp of when the hit was sent to BigQuery.
- View the logs for any errors at https://console.cloud.google.com/logs/viewer
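The `ts` value the app writes can be reproduced like this (a sketch; the exact formatting in the app may differ slightly):

```python
import time

# ts is stored as a STRING column holding a UNIX timestamp,
# e.g. "1483705837.05" in the sample output later in this README.
ts = str(time.time())
```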
- Download the Google App Engine Python SDK for your platform.
- Open a terminal, then browse to the folder containing `app.yaml`
- The app requires extra libraries to be installed. Install the dependencies with `pip`; this installs the libraries into a new `lib` folder in the app directory (most likely nothing beyond what is in `requirements.txt` is needed):

  `pip install -t lib -r requirements.txt`
- Deploy via:

  `gcloud app deploy --project [YOUR_PROJECT_ID]`
Optional flags:
- Include the `--project` flag to specify an alternate Cloud Platform Console project ID to the one you initialized as the default in the gcloud tool. Example: `--project [YOUR_PROJECT_ID]`
- Include the `-v` flag to specify a version ID; otherwise one is generated for you. Example: `-v [YOUR_VERSION_ID]`
- Visit https://your-app-id.appspot.com to view your application.
For more information on App Engine:
For more information on Python on App Engine:
- Maximum 32MB per HTTP request
- concurrent task queues: 1000M if paid, 100k if free
- 500 tasks per second per queue = 1.8M per hour = 43.2M per day
- 100k rows per second per BQ table
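The task-rate arithmetic quoted above can be sanity-checked directly:

```python
# 500 tasks per second per queue, scaled up to hourly and daily totals.
per_second = 500
per_hour = per_second * 60 * 60
per_day = per_hour * 24
assert per_hour == 1800000    # 1.8M per hour
assert per_day == 43200000    # 43.2M per day
```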
Included is also a class to query the entire BigQuery table; in production you would want to limit the query to rows with `ts` greater than some timestamp to avoid the result being too large.
Visiting https://your-app-id.appspot.com/bq-get will get you the BQ table in JSON format. It has no caching enabled, so it will also return the freshest results.
To protect privacy, a hash also has to be supplied. This is generated from the secret salt that you should change in setup to something unique to you; generate it in Python via:
```python
import hashlib

# your query
q = "SELECT * FROM %s.%s"
# the unique secret word in the app.yaml environment vars
salt = "SECRETWORD"
# use this in the API call for the `hash` parameter
hashlib.sha224(q + salt).hexdigest()
```
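Presumably the server recomputes the same digest and compares it against the supplied `hash`; a hypothetical sketch of that check (function names are illustrative, not the app's actual code):

```python
import hashlib

def expected_hash(query, salt):
    # Same digest the client computes: sha224 over query + salt.
    # .encode() is required on Python 3; the app itself targets Python 2.7.
    return hashlib.sha224((query + salt).encode("utf-8")).hexdigest()

def hash_ok(supplied, query, salt):
    return supplied == expected_hash(query, salt)
```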
Or view the expected hash in the error logs when you attempt a connection. By default the endpoint returns the most recent record of the table; pass `limit=X` to get more.
Example:
https://your-app-id.appspot.com/bq-get?limit=1000&hash=63780cbd6c3f6e632b57d9f8f70ea7edcd3c6eb5cbdd1b3183ba28b6
The output is a list of lists like this:

```json
[
  ["ts", "1483705837.05", null],
  ["blah4", "1483710114.48", null],
  ["1", "4", null],
  ["1", "4", null],
  ["this_is_json", "ts", null],
  ["h2i", "1483714626.43", "blah6"],
  ["y", "e", null],
  ["blah3", "1483707843.84", null],
  ["1", "4", null],
  ["y", "e", null],
  ["y", "e", null],
  ["blah3", "1483709017.45", null],
  ["y", "e", null],
  ["h2i", "1483714325.09", "blah6"],
  ["blah3", "1483710007.28", null],
  ["1", "4", null],
  ["h23i", "1483716480.93", "bl8h7"],
  ["hi", "1483710547.94", "blah5"],
  ["h2i", "1483716200.83", "blah7"]
]
```
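Consumers can load that response with any JSON parser; a small sketch using two rows in the same shape as the sample above:

```python
import json

# JSON null marks columns that were never set for that hit.
raw = '[["h2i", "1483714626.43", "blah6"], ["y", "e", null]]'
rows = json.loads(raw)
assert rows[0][2] == "blah6"
assert rows[1][2] is None  # unset field comes back as None
```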
By default the query is:

`query = 'SELECT * FROM %s.%s LIMIT %s' % (datasetId, tableId, limit)`
You can use your own query by supplying a `q` parameter in the URL. It uses Standard SQL, not legacy. Use `%s.%s` in your query; it will be filled in with the correct dataset and table ID. The `limit` URL parameter is ignored when supplying your own SQL.
https://your-app-id.appspot.com/bq-get?hash=XXXXX&q=SELECT * FROM %s.%s LIMIT 10
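The `%s.%s` substitution works like ordinary Python %-formatting; a sketch using the example dataset and table IDs from the `app.yaml` above:

```python
# User-supplied q with %s.%s placeholders, filled in server-side with
# the configured dataset and table IDs.
dataset_id, table_id = "tests", "realtime"
q = "SELECT * FROM %s.%s LIMIT 10"
query = q % (dataset_id, table_id)
```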
Applications can then use this data for display.
- Poll every minute from Google Sheets: https://cloud.google.com/solutions/real-time/fluentd-bigquery
- Js: https://epochjs.github.io/epoch/
- https://www.quora.com/What-s-a-good-real-time-data-visualization-framework
- https://stackoverflow.com/questions/33480302/creating-a-shiny-app-with-real-time-data
- https://shiny.rstudio.com/gallery/reactive-poll-and-file-reader.html
- https://stackoverflow.com/questions/40424407/real-time-chart-on-r-shiny
```html
<script>
  var bqArray = {};
  bqArray["fieldname"] = "{{dataLayer}}";
  bqArray["fieldname2"] = "{{dataLayer2}}";
  jQuery.post("https://YOUR-PROJECT-ID.appspot.com/bq-streamer", {"bq": JSON.stringify(bqArray)});
</script>
```