ICU-22419 Performance improvements of collated string comparison #2505

Krechals · 2023-06-18T13:18:44Z

Checklist

Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22419
Required: The PR title must be prefixed with a JIRA Issue number.
Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
Required: Each commit message must be prefixed with a JIRA Issue number.
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

Currently, collated string comparisons are optimized to find the first byte where the prefix doesn't match by performing a byte comparison.

However, a comparison of one byte at a time is used to evaluate the prefix mismatch.

Making a double word comparison to discover the prefix mismatch is one way to improve the performance. In this PR, the double word comparison is performed only when both strings have the same alignment:

(uintptr_t)left % sizeof(uint64_t) == (uintptr_t)right % sizeof(uint64_t)

This leads to a 10x performance improvement when the strings have some binary equal prefix. Results based on google/benchmark: ICU Comparison Benchmarks

Benchmark:

#include <iostream>

#include <benchmark/benchmark.h>
#include <unicode/stsearch.h>
#include <unicode/tblcoll.h>
#include <unicode/ustring.h>
#include <unicode/coll.h>
#include <unicode/coleitr.h>

int compare(const std::string& strA, const std::string& strB, icu::Collator* collator) {
    UErrorCode status;
    icu::StringPiece viewA(strA.data(), strB.length());
    icu::StringPiece viewB(strB.data(), strA.length());
    return collator->compareUTF8(viewA, viewB, status);
}

static void BM_Run(benchmark::State& state) {
  const size_t size = state.range(0);
  std::string strA(size, 'a');
  std::string strB(size, 'a');

  UErrorCode status = U_ZERO_ERROR;
  icu::Locale locale("en");
  icu::Collator* collator = icu::Collator::createInstance(locale, status);
  if (U_FAILURE(status)) {
    std::cerr << "Failed to create Collator instance: " << u_errorName(status) << std::endl;
    return;
  }  

  for (auto _ : state) {
    if (compare(strA, strB, collator) != 0) {
	  std::cerr << "Wrong result" << std::endl;
	  return;
    }
  }
}

// Register the function as a benchmark
BENCHMARK(BM_Run)->RangeMultiplier(2)->Range(1, 1 << 15);

// Run the benchmark
BENCHMARK_MAIN();

jira-pull-request-webhook · 2023-06-22T17:48:07Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

See unicode-org#2505

jira-pull-request-webhook · 2023-06-22T17:49:08Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

yumaoka assigned markusicu Jun 22, 2023

Krechals force-pushed the performance_improvements_string_comparison branch from 2c060dd to 032a011 Compare June 22, 2023 17:48

ICU-22419 Performance improvements of collated string comparison

5097ed3

See unicode-org#2505

Krechals force-pushed the performance_improvements_string_comparison branch from 032a011 to 5097ed3 Compare June 22, 2023 17:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ICU-22419 Performance improvements of collated string comparison #2505

ICU-22419 Performance improvements of collated string comparison #2505

Krechals commented Jun 18, 2023 •

edited

jira-pull-request-webhook bot commented Jun 22, 2023

jira-pull-request-webhook bot commented Jun 22, 2023

ICU-22419 Performance improvements of collated string comparison #2505

Are you sure you want to change the base?

ICU-22419 Performance improvements of collated string comparison #2505

Conversation

Krechals commented Jun 18, 2023 • edited

Checklist

jira-pull-request-webhook bot commented Jun 22, 2023

jira-pull-request-webhook bot commented Jun 22, 2023

Krechals commented Jun 18, 2023 •

edited