Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible to combine "on conflict do nothing" and "stream_to"? #881

Open
chao-mu opened this issue Aug 12, 2024 · 12 comments
Open

Possible to combine "on conflict do nothing" and "stream_to"? #881

chao-mu opened this issue Aug 12, 2024 · 12 comments

Comments

@chao-mu
Copy link

chao-mu commented Aug 12, 2024

Hello,

I am inserting a massive (hundreds of millions) of rows. Ideally, I want to use the streaming api, however there will likely be conflicts from past iterations. I had an implementation where I would process the workload in chunks and check ahead of time whether something existed before inserting, but that is rife with issues around concurrency and seems far too complex.

Is there a way to get stream_to::table to ignore duplicates?

Thank you,
Me.

@jtv
Copy link
Owner

jtv commented Aug 12, 2024

I doubt ON CONFLICT would work with streams, but I can't think of a better way either so it's worth a try.

@chao-mu
Copy link
Author

chao-mu commented Aug 12, 2024

Oh my goodness, thank you so much for responding! I'm in deep water with a job that currently will take over 200 hours to finish each time it runs if I don't optimize the DB side! I'm processing 90+ million chess games and the moves of each of those games. Needless to say, that's a LOT of inserts.

How would I supply that?

This is how I do it elsewhere:

  const pqxx::table_path games_table_path = {"games"};
  auto games_table = pqxx::stream_to::table(tx, games_table_path, {"column1", "column2", ... }

  for (auto const &game : games) {
    auto row = std::make_tuple(...)
   games_table << row;
}

Also, currently my non stream version of the real work of the process (saving the individual chess boards per move) prepares a statement then with a pqxx::work and runs exec_prepared separately for each row I need to insert. Is there a way to send them all at once?

@chao-mu
Copy link
Author

chao-mu commented Aug 12, 2024

I feel like there's probably a better way than what I'm doing.

How I'm preparing the query

void prepare_insert_boards(pqxx::connection &conn) {
  conn.prepare(
      "insert_boards",
      "insert into boards (hash, white_bishops, white_rooks, white_queens, "
      "white_knights, white_king, white_pawn, black_bishops, black_rooks, "
      "black_queens, black_knights, black_king, black_pawn, white_to_move, "
      "castle_rights, enpassant_sq) values ($1, $2, $3, $4, $5, $6, $7, $8, "
      "$9, $10, $11, $12, $13, $14, $15, $16) on conflict (hash) do nothing");
}

How I'm inserting all the boards in a batch of games

void insert_boards(pqxx::work &tx, const std::vector<Game> &games) {
  std::map<std::string, bool> seen;
  // Add missing boards
  for (const auto &game : games) {
    for (const auto &continuation : game.continuations) {
      const auto &board = continuation.board;

      if (seen[board.hash]) {
        continue;
      }

      tx.exec_prepared("insert_boards", board.hash,
                       to_bits(board.white_bishops), to_bits(board.white_rooks),
                       to_bits(board.white_queens),
                       to_bits(board.white_knights), to_bits(board.white_king),
                       to_bits(board.white_pawn), to_bits(board.black_bishops),
                       to_bits(board.black_rooks), to_bits(board.black_queens),
                       to_bits(board.black_knights), to_bits(board.black_king),
                       to_bits(board.black_pawn), board.white_to_move,
                       board.castle_rights, board.enpassant_sq);

      seen[board.hash] = true;
    }
  }
}

And here's the to_bits code. (The columns used for it are bit(64) in postgresql

std::string to_bits(std::uint64_t value) {
  std::bitset<64> bits(value);

  return bits.to_string();
}

@jtv
Copy link
Owner

jtv commented Aug 14, 2024

A prepared statement is probably not saving you much time, if anything.

Instead, you could try composing a big INSERT statement ad hoc, which inserts multiple rows at once.

@jtv
Copy link
Owner

jtv commented Aug 14, 2024

There are also some tricks that only work under very specific circumstances, such as temporarily disabling constraints or temporarily making the table unlogged. But there are risks to those tricks so I'd start away from them as long as possible.

@chao-mu
Copy link
Author

chao-mu commented Aug 14, 2024

Is there a way, in a where clause, to express "in ($1, $2, $3 ... $n)" without building that part of the query manually? For example, in some frameworks you can say "in (?)" and pass an array for that value.

Also, I found my above code failed with an error regarding copy missing column data (for some reason it truncated after the first column I wrote, despite it working fine for a different query) until I rewrote it to use write_values instead of <<.

@jtv
Copy link
Owner

jtv commented Aug 15, 2024

Is there a way, in a where clause, to express "in ($1, $2, $3 ... $n)" without building that part of the query manually? For example, in some frameworks you can say "in (?)" and pass an array for that value.

Have you tried just putting the values in a std::vector and passing that as a single parameter?

Also, I found my above code failed with an error regarding copy missing column data (for some reason it truncated after the first column I wrote, despite it working fine for a different query) until I rewrote it to use write_values instead of <<.

The operator<<() way of inserting is kind of on its way to deprecation. Streams in general are to me a bit of a failed experiment from the 1990s. It sounds to me like there was some kind of confusion with types.

@jtv
Copy link
Owner

jtv commented Aug 16, 2024

(Some of the same questions came up in #879.)

@Dich0tomy
Copy link

I'm going to chime in about the code as well.
You seem to be using a bitset to convert from a uint64 to a string byte representation and using an std::map to check for seen pieces.

One simple optimization that would already help a bit would be changing the std::map to std::unordered_map, going further you could even use a library that provides a faster implementation for a lookup table.

If it's possible I'd also check if it's possible to make board.hash be a natural number, in which case you could use a contiguous range for marking seen boards which should also improve the runtime.

Maybe it's also possible to eliminate duplications before inserting so that you don't have to do that as well?

As for converting the the representation of whatever these white/black_pieces things are to bits, why do that? Simply saving the number as-is and then reading it will also yield the bits underneath and you can convert back to a string representation when (and if) you actually need it. I'm also not entirely sure if using a bitset for that is optimal. It's been a while since I run some benchmarks with it, but std::bitset is not generally praised as an optimal solution for converting to binary representations.

In the end, what I'm proposing may not yield significant results, so I'd benchmark anyway if you care that much about speed, but I wanted to leave an insight, since it's an important matter.

@jtv
Copy link
Owner

jtv commented Oct 5, 2024

I guess we're kind of stuck on this one. I don't think there's a way to "upsert" using COPY.

@chao-mu
Copy link
Author

chao-mu commented Oct 6, 2024

The core motivation was to be able to use the bitwise operators in queries as matching on bit boards with masks and such is standard. Was hoping to so that db side

@jtv
Copy link
Owner

jtv commented Nov 10, 2024

Sorry, late-occurring thought @chao-mu: have you considered streaming the data into a temporary table, and then doing the merging from the temporary table into the destination table?

@jtv jtv changed the title Possible to combine "on clonflict do nothing" and "stream_to"? Possible to combine "on conflict do nothing" and "stream_to"? Nov 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants