Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22419 Performance improvements of collated string comparison #2505

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Krechals
Copy link

@Krechals Krechals commented Jun 18, 2023

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22419
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

Currently, collated string comparisons are optimized to find the first byte where the prefix doesn't match by performing a byte comparison.

However, a comparison of one byte at a time is used to evaluate the prefix mismatch.

Making a double word comparison to discover the prefix mismatch is one way to improve the performance. In this PR, the double word comparison is performed only when both strings have the same alignment:

(uintptr_t)left % sizeof(uint64_t) == (uintptr_t)right % sizeof(uint64_t)

This leads to a 10x performance improvement when the strings have some binary equal prefix. Results based on google/benchmark: ICU Comparison Benchmarks

Benchmark:

#include <iostream>

#include <benchmark/benchmark.h>
#include <unicode/stsearch.h>
#include <unicode/tblcoll.h>
#include <unicode/ustring.h>
#include <unicode/coll.h>
#include <unicode/coleitr.h>

int compare(const std::string& strA, const std::string& strB, icu::Collator* collator) {
    UErrorCode status;
    icu::StringPiece viewA(strA.data(), strB.length());
    icu::StringPiece viewB(strB.data(), strA.length());
    return collator->compareUTF8(viewA, viewB, status);
}

static void BM_Run(benchmark::State& state) {
  const size_t size = state.range(0);
  std::string strA(size, 'a');
  std::string strB(size, 'a');

  UErrorCode status = U_ZERO_ERROR;
  icu::Locale locale("en");
  icu::Collator* collator = icu::Collator::createInstance(locale, status);
  if (U_FAILURE(status)) {
    std::cerr << "Failed to create Collator instance: " << u_errorName(status) << std::endl;
    return;
  }  

  for (auto _ : state) {
    if (compare(strA, strB, collator) != 0) {
	  std::cerr << "Wrong result" << std::endl;
	  return;
    }
  }
}

// Register the function as a benchmark
BENCHMARK(BM_Run)->RangeMultiplier(2)->Range(1, 1 << 15);

// Run the benchmark
BENCHMARK_MAIN();

@Krechals Krechals force-pushed the performance_improvements_string_comparison branch from 2c060dd to 032a011 Compare June 22, 2023 17:48
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@Krechals Krechals force-pushed the performance_improvements_string_comparison branch from 032a011 to 5097ed3 Compare June 22, 2023 17:49
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants