Writing Test Runs

Below are the results of several writing test runs using the different write methods of the Spark to Cosmos DB connector.

CosmosDB Configurations

Below are the Cosmos DB configurations used:

  • Partitioned collections: 250,000 RU/s, 50 partitions
    • Collection 1: customerfactors_part, 110 GB, 19.80M documents
    • Collection 2: customerfactors, 462.44 GB, 64.00M documents

Apache Spark Configurations

Below are the Apache Spark configurations used:

  • HDI cluster: Spark 2.1 on HDI 3.5, multi-node Spark cluster with 2 masters, 2 ZooKeeper nodes, and 10g driver memory
    • Cluster 1: 4 workers (8 cores, 28GB memory), 32 cores, 112GB memory
      • Config 1.1: executor: 7 cores, 22GB memory
      • Config 1.2: executor: 2 cores, 6GB memory
    • Cluster 2: 9 workers (8 cores, 56GB memory), 72 cores, 504GB memory
      • Config 2.1: executor: 7 cores, 22GB memory
    • Cluster 3: 14 workers (4 cores, 28GB memory), 56 cores, 392GB memory
      • Config 3.1: executor: 3 cores, 22GB memory
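
For reference, below is a minimal sketch of how the executor settings of Config 1.1 could be expressed as standard Spark properties when building the session. This is not from the original runs: on an HDI cluster these values are typically supplied through spark-submit/spark-shell options or Ambari, and the application name is only a placeholder.

import org.apache.spark.sql.SparkSession

// Sketch only: Config 1.1 (10g driver memory, 7 executor cores, 22g executor memory)
// expressed as standard Spark properties. On HDI these are usually passed as
// spark-submit/spark-shell options or set via Ambari rather than in code.
val spark = SparkSession.builder()
  .appName("cosmosdb-writing-tests")      // placeholder application name
  .config("spark.driver.memory", "10g")
  .config("spark.executor.cores", "7")
  .config("spark.executor.memory", "22g")
  .getOrCreate()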

Writing scenarios

Data migration: reading from a CosmosDB collection and writing directly to another CosmosDB collection

import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config

// Connection settings for the source collection (endpoint and master key omitted here).
val readConfig = Config(Map(
    "Endpoint" -> "",
    "Masterkey" -> "",
    "Database" -> "writingtests",
    "Collection" -> "customerfactors",
    "PageSize" -> "2000000",
    "preferredRegions" -> "North Europe;",
    "SamplingRatio" -> "1.0"))

// Read the source collection into a DataFrame.
val readDf = spark.sqlContext.read.cosmosDB(readConfig)

// Connection settings for the target collection; WritingBatchSize is the number of
// documents submitted per write batch.
val writeConfig = Config(Map(
    "Endpoint" -> "",
    "Masterkey" -> "",
    "Database" -> "writingtests",
    "Collection" -> "customerfactors_shadow",
    "WritingBatchSize" -> "500",
    "preferredRegions" -> "North Europe;",
    "SamplingRatio" -> "1.0"))

// Write the source DataFrame directly to the target collection.
readDf.write.cosmosDB(writeConfig)
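
As a quick sanity check after a migration run, the target collection can be read back with the same connector and its document count compared against the source. This snippet is a sketch and was not part of the original test runs; the writtenDf name is only illustrative.

// Sketch: verify the migration by reading the target collection back and
// comparing document counts with the source DataFrame.
val writtenDf = spark.sqlContext.read.cosmosDB(writeConfig)
println(s"source docs: ${readDf.count()}, target docs: ${writtenDf.count()}")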

Writing performance

Below are the writing performance results for the various setups. The Cluster, Config, and Collection columns refer to the clusters, executor configurations, and collections described above.

| Cluster | Config | Collection | # of documents | Writing batch size | Run time (hh:mm:ss) | Avg. task time | Batches per executor | Peak RU/s | End-to-end throughput (docs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.1 | 2 | 64M | 500 | 01:18:00 | 00:35:00 | 2 | 120k | 13782 |
| 1 | 1.1 | 1 | 19M | 1000 | 00:18:00 | 00:09:06 | 2 | | 18333 |
| 1 | 1.2 | 1 | 19M | 500 | 01:42:00 | | 2 | | 3235 |
| 1 | 1.2 | 1 | 19M | 2500 | 02:06:00 | 01:00:00 | 2 | 70k | 2619 |
| 2 | 2.1 | 1 | 19M | 1000 | 00:09:12 | 00:09:12 | 1 | 120k | 35869 |
| 3 | 3.1 | 1 | 19M | 1000 | 00:18:00 | 00:09:30 | 2 | 200k | 18333 |
| 3 | 3.1 | 2 | 64M | 1000 | 01:18:00 | 00:30:00 | 2 | 200k | 13782 |
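
The end-to-end throughput column appears to be the document count divided by the run time in seconds. For example, the Cluster 2 run works out as follows (a quick sanity check on the numbers, not part of the original measurements):

// Cluster 2 / Config 2.1 run: 19.80M documents written in 9 minutes 12 seconds.
val docs = 19.8e6
val runTimeSeconds = 9 * 60 + 12          // 552 seconds
val throughput = docs / runTimeSeconds    // ≈ 35,869 docs/s, matching the table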