
Configuring Power BI Direct Query to Azure Cosmos DB via Apache Spark (HDI)


A powerfully fun way to visualize your data in Azure Cosmos DB is to use Power BI. While there is an ODBC driver (refer to Connect to Azure Cosmos DB using BI analytics tools with the ODBC driver), that approach requires you to download all of the data from Azure Cosmos DB into Power BI.

To work around this issue, one technique is to use the azure-cosmosdb-spark connector, which allows you to use Apache Spark as the bridge between Power BI and Azure Cosmos DB. Power BI has direct query capabilities to Apache Spark, so with the azure-cosmosdb-spark connector you can create direct connectivity from Power BI to Azure Cosmos DB.

Power BI Direct Query > Apache Spark > Azure Cosmos DB

Note: these are alpha working instructions; over time we will simplify this process to make it easier to configure.

Setup

You will need the following components:

  • An Azure Cosmos DB account (with a collection to query)
  • An Azure HDInsight (HDI) Spark cluster
  • Power BI Desktop
  • The azure-cosmosdb-spark JARs (see Configuration below)

Configuration

The key configuration step is copying the azure-cosmosdb-spark JARs to the head and worker nodes of your HDI cluster.

Getting the azure-cosmosdb-spark JARs

Note: these instructions are necessary because we have not yet developed a script action to automatically copy the JARs to the head and worker nodes. We are tracking the item Create script action to copy the azure-cosmosdb-spark JARs to all HDI head and worker nodes #79; once addressed, it will reduce the steps below.

To get the JARs, build the code using mvn clean package, or download them from the releases folder. As of this writing, the latest version of the JARs can be found in azure-cosmosdb-spark_2.1.0_2.11-1.0.0.
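
For example, building from source looks something like this (a minimal sketch; with Maven, the packaged JARs land under target/):

git clone https://github.com/Azure/azure-cosmosdb-spark.git
cd azure-cosmosdb-spark
mvn clean package
# the connector JAR and its dependencies are written to ./target
ls target/*.jar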

Grab these JARs and be prepared to upload them to the head and worker nodes of your HDI cluster.

Copying the JARS to your HDI cluster worker nodes

The goal here is to copy the azure-cosmosdb-spark JARs to /usr/hdp/current/spark2-client/jars on the head and worker nodes of your cluster.

Obtain the head and worker node IP addresses

To get this information, you will need to log into your Azure HDI cluster and copy down a list of your head and worker nodes. To do this, first log into the Azure Portal and connect to your HDI cluster.

Click on Cluster Dashboard.

From here, click on HDInsight Cluster Dashboard.

Then click on Hosts to see the list of head nodes (prefix hn) and worker nodes (prefix wn).

Record this list of nodes, as you will need to copy the JARs to each one of them.

Copy the JARs to the sshuser folder of your active head node

Note: configuring your HDI cluster with a public/private key pair will make things easier, as you will be able to scp the files without specifying a password.

You will run the scp command below from your own machine to copy the files to the active head node. The command should look something like this:

scp azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar sshuser@<cluster-name>-ssh.azurehdinsight.net:~/

**IMPORTANT:** Do this for all the JARs except the uber-jar, OR just copy the uber-jar.

Copy the JARs to the sshuser folder of all the other worker nodes and the backup head node

You will run these commands after you ssh into the active head node.

The commands should look something like this:

scp azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar sshuser@<backup-head-node-IP>:~/
scp azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar sshuser@<worker-node-0-IP>:~/
scp azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar sshuser@<worker-node-1-IP>:~/
scp azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar sshuser@<worker-node-2-IP>:~/
scp azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar sshuser@<worker-node-3-IP>:~/

Do this for all JARs for all nodes.
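
Rather than repeating the command per node, a small loop can help. This is a sketch, assuming key-based ssh is set up and that you save your recorded node IPs to a hypothetical file ~/hdi-nodes.txt (one IP per line):

# Hypothetical helper file: one head/worker node IP per line,
# taken from the Hosts list you recorded earlier.
cat > ~/hdi-nodes.txt <<'EOF'
10.0.0.18
10.0.0.21
EOF

# Copy every connector JAR in the current directory to each node
# (assumes passwordless, key-based ssh from the active head node).
for node in $(cat ~/hdi-nodes.txt); do
  scp azure-cosmosdb-spark-*.jar "sshuser@${node}:~/"
done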

SSH into each node and copy the files from the user folder to /usr/hdp/current/spark2-client/jars

This next step copies the JARs from your home directory (~/) to /usr/hdp/current/spark2-client/jars. Note that you will need to ssh into each node to run this command. It should look something like this:

sudo cp azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar /usr/hdp/current/spark2-client/jars

Remember to do this for all nodes of your HDI cluster.
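
If you set up key-based ssh and the hypothetical ~/hdi-nodes.txt list from the previous step, you can also run the copy remotely instead of logging into each node by hand; a sketch:

# Run the copy on each node over ssh (-t allocates a terminal in
# case sudo prompts for a password on your cluster).
for node in $(cat ~/hdi-nodes.txt); do
  ssh -t "sshuser@${node}" 'sudo cp ~/azure-cosmosdb-spark-*.jar /usr/hdp/current/spark2-client/jars/'
done

Don't forget to run the same sudo cp on the active head node itself.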

Restart your Spark service

Now that all the JARs are copied to the nodes, you will need to restart your Spark service. To do this, go to Azure Portal > HDI Cluster > Cluster Dashboard > HDInsight Cluster Dashboard > Services > Spark2, then click Service Actions > Restart All.
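
Alternatively, the Ambari REST API can stop and start the service from the command line; a sketch, assuming admin credentials and with <cluster-name> and <password> as placeholders:

# Stop Spark2 (state INSTALLED means "stopped" in Ambari terms)
curl -u admin:<password> -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Stop SPARK2"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
  https://<cluster-name>.azurehdinsight.net/api/v1/clusters/<cluster-name>/services/SPARK2

# Start Spark2 again
curl -u admin:<password> -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Start SPARK2"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
  https://<cluster-name>.azurehdinsight.net/api/v1/clusters/<cluster-name>/services/SPARK2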

Creating Hive (Spark) table with Cosmos DB Data Source

Now that your HDI cluster can connect to your Azure Cosmos DB account, you will need to build a table whose source is a Cosmos DB query. This way, from the Power BI perspective, it is only connecting to a Spark SQL table (without realizing that the underlying data source is Cosmos DB).

The easiest way to do this is to ssh into your active head node and then run spark-shell, e.g.

spark-shell --master yarn

Then you can run a command to create an external table such as the one below:

spark.sql("create table flights using com.microsoft.azure.cosmosdb.spark options (endpoint 'https://doctorwho.documents.azure.com:443/', database 'DepartureDelays', collection 'flights_pcoll', masterkey '$masterkey')")

If you then run the show tables command, the output should look something like this:

spark.sql("show tables").show()
+---------------+-----------+
|      tableName|isTemporary|
+---------------+-----------+
|        flights|      false|
|hivesampletable|      false|
+---------------+-----------+

And now you can test it by running Spark SQL queries against it, such as:

// Test that the table works
spark.sql("select * from flights limit 10").show()
spark.sql("select date, delay, distance, origin, destination from flights limit 10").show()

// Test a group by statement
spark.sql("select destination, count(1) from flights where origin = 'SEA' group by destination").show()

Connect Power BI to Spark

Now, you can connect Power BI to Spark as noted in the instructions Apache Spark BI using data visualization tools with Azure HDInsight.

The only difference is that you will now connect to a table (such as flights above) whose underlying source is an Azure Cosmos DB query.

Tips and Tricks

Based on the initial tests, here are some quick tips:

  • Because Power BI can issue a lot of queries (especially if you have a lot of widgets), you may want to shrink the scope of each query by creating a view. For example, if you only care about flights departing from Seattle, you can create the view below, which limits Power BI to requesting only flights departing Seattle. You reduce the amount of data transferred and take advantage of Cosmos DB indexing at the same time (a quick test follows the view definition below).
spark.sql("CREATE VIEW originSEA AS SELECT date, distance, delay, origin, destination FROM flights WHERE origin = 'SEA'")

Talk to us

As noted, this is an early preview and we will be updating this document as we optimize query performance as well as the installation process. Meanwhile, please do not hesitate to submit issues or PRs, or contact us at [email protected].