Skip to content

Substrate ETL (inspired by ethereum-etl) exposes Substrate blockchain data (presently Polkadot and Kusama networks) for loading into convenient formats like CSVs and relational databases. All finalized data is published in a public dataset in BigQuery daily to hourly.

Notifications You must be signed in to change notification settings

flipchan/substrate-etl

 
 

Repository files navigation

Substrate ETL

IMPORTANT UPDATE!!! As of May 18, 2023, substrate-etl datasets are now in Google's BigQuery Public Datasets: bigquery-public-data.crypto_polkadot and bigquery-public-data.crypto_kusama!

This project is part of the Polkadot Data Bounty Treasury-funded bounty proposal, with curation led by the Parity data team, as part of a broader "Polkadot Data Alliance" (see select * from polkadot/Dotlake).

Using Substrate ETL, users can query Polkadot and Kusama networks for large scale analysis data of blocks, extrinsics, events, balances, logs, transfers and xcmtransfers. Substrate ETL relies on Colorful Notion's Polkaholic.io indexing of Polkadot + Kusama networks into public Google BigQuery datasets (one dataset for each relay chain).

Network Summary: (All-time, Monthly, Daily)

Included in each summary are sample queries and a complete breakdown. Chain data is appended daily.

A summary dashboard of the last 30 days and the last 1 hour is available:

Quick Start

bigquery-public-data is a public project within BigQuery which anyone can access. You see hundreds of BigQuery tables in the public crypto_polkadot and crypto_kusama datasets. Just open the query editor and try some of the queries below:

See all available tables/views via AAA_tableschema (Schema)

select * from `bigquery-public-data.crypto_polkadot.AAA_tableschema`

See all available tables/views of paraid 2000

select * from `bigquery-public-data.crypto_polkadot.AAA_tableschema` where table_id like '%2000'

Get blocks of paraid 2000 (Schema):

select * from `bigquery-public-data.crypto_polkadot.blocks2000` where DATE(block_time) >= "2023-04-01" and DATE(block_time) <= "2023-04-30"

Get extrinsics of paraid 2000 (Schema):

select * from `bigquery-public-data.crypto_polkadot.extrinsics2000` where DATE(block_time) >= "2023-04-01" and DATE(block_time) <= "2023-04-30"

Get XCM Transfers of Polkadot Network (Schema):

select * from `bigquery-public-data.crypto_polkadot.xcmtransfers` where DATE(origination_ts) >= "2023-04-01" and DATE(origination_ts) <= "2023-04-30"

Schemas for several of the most common source tables are listed below -- a full list of schemas can be found here with further details below.

Public Datasets in BigQuery

Substrate data for each chain is held in 9 tables in one of 2 public datasets, with one dataset for each relay chain and all its parachains. By convention, relaychain data is considered "paraid=0".

Project: (Location: US)

  • bigquery-public-data

Datasets:

  • crypto_polkadot
  • crypto_kusama

Tables: (replace {paraID} with a specific para ID, e.g. 2000 for acala)

  • Blocks: bigquery-public-data.crypto_${relayChain}.blocks${paraID} (date-partitioned by block_time) - Schema
  • Extrinsics: bigquery-public-data.crypto_${relayChain}.extrinsics${paraID} (date-partitioned by block_time) - Schema
  • Events: bigquery-public-data.crypto_${relayChain}.events${paraID} (date-partitioned by block_time) - Schema
  • Transfers: bigquery-public-data.crypto_${relayChain}.transfers${paraID} (date-partitioned by block_time) - Schema
  • Balances: bigquery-public-data.crypto_${relayChain}.balances${paraID} (date-partitioned by ts) - Schema
  • Active Accounts: bigquery-public-data.crypto_${relayChain}.accountsactive${paraID} (date-partitioned by ts) - Schema
  • Passive Accounts: bigquery-public-data.crypto_${relayChain}.accountspassive${paraID} (date-partitioned by ts) - Schema
  • New Accounts: bigquery-public-data.crypto_${relayChain}.accountsnew${paraID} (date-partitioned by ts) - Schema
  • Reaped Accounts: bigquery-public-data.crypto_${relayChain}.accountsreaped${paraID} (date-partitioned by ts) - Schema
  • Assets: bigquery-public-data.crypto_${relayChain}.assets (relaychain-wide table: filter on para_id as needed) - Schema
  • XCM Assets: bigquery-public-data.crypto_${relayChain}.xcmassets (relaychain-wide table: filter on para_id as needed) - Schema
  • XCM Transfers: bigquery-public-data.crypto_${relayChain}.xcmtransfers (relaychain-wide table: filter on origination_para_id or destination_para_id as needed; date-partitioned by origination_ts) - Schema
  • XCM Messages: bigquery-public-data.crypto_${relayChain}.xcm (relaychain-wide table: filter on origination_para_id or destination_para_id as needed; date-partitioned by origination_ts) - Schema

Thus polkadot relay chain blocks are held in bigquery-public-data.crypto_polkadot.blocks0, acala blocks are stored in bigquery-public-data.crypto_polkadot.blocks2000, and similarly for any chain / table name.

See Definitions for how the tables are constructed and tentative definitions.

Every chain has a auto generated README with the chains tables explicitly enumerated, and includes sample queries.

Notes:

  • System tables (chains, assets, xcmassets, xcmtransfers, xcm) are not specific to any parachain and apply to the whole relay chain.
  • All tables (except for chains, assets and xcmassets) are date-partitioned to support low cost, high speed scans.
  • If a parachain has a renewal, the first paraid assigned is used for subsequent renewals.

AAA_tableschema.json

See:

Field BigQuery Type
table_id STRING
time_partitioning_field STRING
table_cols STRING REPEATED
table_schema INTEGER

For a quick overview of the available tables/views within the dataset, please query AAA_tableschema view

chains.json

See:

Field Type BigQuery Type
para_id bigint INTEGER
chain_name string STRING
id hex_string STRING
ss58_prefix bigint INTEGER
symbol string INTEGER
decimals string INTEGER

blocks.json

See:

Field Type BigQuery Type
hash hex_string STRING
parent_hash hex_string STRING
number bigint INTEGER
state_root hex_string STRING
extrinsics_root hex_string STRING
block_time bigint TIMESTAMP
author_ss58 string STRING
author_pub_key string STRING
spec_version bigint INTEGER
relay_block_number bigint INTEGER
relay_state_root hex_string STRING
extrinsic_count bigint INTEGER
event_count bigint INTEGER
transfer_count bigint INTEGER

extrinsics.json

See:

Field Type BigQuery Type
hash hex_string STRING
extrinsic_id string STRING
block_time bigint TIMESTAMP
block_number bigint INTEGER
block_hash hex_string STRING
lifetime JSON JSON
section string STRING
method string STRING
params JSON JSON
fee bigint INTEGER
weight bigint INTEGER
signed boolean BOOLEAN
signer_ss58 string STRING
signer_pub_key hex_string STRING

events.json

Field Type BigQuery Type
event_id string STRING
section string STRING
method string STRING
data JSON JSON
extrinsic_id string STRING
extrinsic_hash hex_string STRING
block_time bigint TIMESTAMP
block_number bigint INTEGER
block_hash hex_string STRING

transfers.json

Field Type BigQuery Type
event_id string STRING
section string STRING
method string STRING
data JSON JSON
extrinsic_id string STRING
extrinsic_hash hex_string STRING
block_time bigint TIMESTAMP
block_number bigint INTEGER
block_hash hex_string STRING
from_ss58 string STRING
to_ss58 string STRING
from_pub_key string STRING
to_pub_key string STRING
asset string STRING
price_usd float FLOAT64
amount_usd float FLOAT64
symbol string STRING
decimals int8 INTEGER
amount float FLOAT64
raw_amount bigint INTEGER

balances.json

Field Type BigQuery Type
symbol string STRING
address_ss58 string STRING
address_pubkey string STRING
ts bigint TIMESTAMP
id string STRING
chain_name string STRING
para_id int INTEGER
free bigint FLOAT
free_usd float FLOAT
reserved float FLOAT
reserved_usd float FLOAT
misc_frozen float FLOAT
misc_frozen_usd float FLOAT
frozen float FLOAT
frozen_usd float FLOAT

xcmtransfers.json

Field Type BigQuery Type
symbol string STRING
price_usd float FLOAT
origination_ts bigint TIMESTAMP
origination_chain_name string STRING
origination_id string STRING
origination_extrinsic_hash hex_string STRING
origination_extrinsic_id string STRING
origination_transfer_index bigint INTEGER
origination_xcm_index bigint INTEGER
origination_transaction_hash hex_string STRING
origination_msg_hash hex_string STRING
origination_is_msg_sent boolean BOOLEAN
origination_block_number bigintn INTEGER
origination_section string STRING
origination_method string STRING
origination_para_id bigint INTEGER
origination_sender_ss58 string STRING
origination_sender_pub_key string STRING
destination_para_id bigint INTEGER
origination_amount_sent float FLOAT
origination_amount_sent_usd float FLOAT
origination_tx_fee float FLOAT
origination_tx_fee_usd float FLOAT
origination_tx_fee_symbol string STRING
origination_is_fee_item boolean BOOLEAN
origination_sent_at bigint INTEGER
destination_execution_status string STRING
destination_chain_name string STRING
destination_para_id integer INTEGER
destination_beneficiary_ss58 string STRING
destination_beneficiary_pub_key string STRING
destination_extrinsic_id string STRING
destination_event_id string STRING
destination_block_number bigint INTEGER
destination_ts bigint TIMESTAMP
destination_amount_received float FLOAT
destination_amount_received_usd float FLOAT
destination_teleport_fee float FLOAT
destination_teleport_fee_usd float FLOAT
destination_teleport_fee_symbol string STRING
xcm_info JSON JSON
xcm_info_last_update_time bigint TIMESTAMP

Notes:

See this blog post for a tutorial with representative queries covering the above.

Reporting / Missing data

Generally the data is complete as can be, but the sole data source is the Polkaholic.io indexer.

From this single source, the primary causes of missing data stem from:

  • Chains that do not provide a public RPC node. Most of the time, these chains are new with very little actual activity.
  • Chains that have a public RPC node, but no RPC Endpoint is an archive nodes
  • Chains that are being onboarded
  • Some blocks fail decoding due to chain halting, or are missing an up-to-date node.js API package for type definitions.

A daily/hourly github actions process summarizes the state of the index for:

and for every single chain that is being indexed. See the report Issues column for chains with systemic issues or blocks that are missing. Generally the last 24 hrs have blocks that are missing that are filled in by the end of the day.

Design choices:

  • All temporal BigQuery datasets are date-partitioned and split into multiple tables by {paraId} to enable low-cost low-latency BigQuery scans for specific date, parachain combinations. Timestamped data use BigQuery TIMESTAMP date types.
  • Addresses are provided in “public key” (signer_pub_key) and SS58 Address (signer_ss58) form to support multi-chain queries with wild card table selection egselect * from polkadot.extrinsics* where signer_pub_key='<pubkey>' aggregates multi-chain transactional history for a given account.
  • When assets are mentioned (transfers, xcmtransfers), we "decimalize" the output and include basic USD price valuation if possible. Many assets are not valued with USD values in this way.

Roadmap

Spring/Summer 2023

  • Initial table designs {blocks, extrinsics, events, trasnfers, balances, xcmTransfers}
  • Daily/hourly dump via Github workflow
  • Hourly summary report for all reachable parachains
  • On-chain activity metrics: {active, passive, reaped, new} users, {numSignedExtrinsics, numTransfers, numXCMTransfers} modeled in bigQuery
  • GKE systematization, Reliability Improvements
  • Integration with XCM Global-Asset Registry repo

Fall/Winter 2023

  • Basic Wasm contract support (psp22, events, bytecode)
  • XCM Message table redesign
  • Full XCMv3 Multilocation support
  • New functionality based on community feedback

Spring/Summer 2024

  • Bridgehub integration
  • DEX table for Statemine/Statemint's DotSwap trade volume analytics
  • Reporting on Comparison to other ecosystems also modelled in BigQuery
  • Solochain, testnet integration based on community feedback
  • New functionality based on community feedback

Your feedback and your ideas are important -- please submit an issue or reach out to us on Telegram (@sourabhniyogi) or Matrix.

Contributions

Contributions are welcome. Contributors will be invited to a dedicated Telegram group and are held to the Polkadot communities' Code of Conduct.

Gcloud docs:

https://cloud.google.com/docs/authentication/getting-started

About

Substrate ETL (inspired by ethereum-etl) exposes Substrate blockchain data (presently Polkadot and Kusama networks) for loading into convenient formats like CSVs and relational databases. All finalized data is published in a public dataset in BigQuery daily to hourly.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • JavaScript 97.2%
  • Shell 2.8%