Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor handling of embedded magic.mgc. #4989

Open
wants to merge 21 commits into
base: dev
Choose a base branch
from

Conversation

teo-tsirpanis
Copy link
Member

SC-25167
SC-47655
SC-47656
SC-47657
SC-47658

This PR overhauls the facilities to embed and load the magic.mgc file that is needed by libmagic:

  • The most important change is the removal of magic_mgc_gzipped.bin.tar.bz2. This file contained a copy of magic.mgc that was compressed, converted to escaped C characters, packed and compressed again to take less space, and stored in source control, so that at build time to get unpacked and #included in mgc_dict.cc. Because this file was being prepared ahead of time by a manually invoked C++ program, this approach had the disadvantage that it tied the Core to a specific version of libmagic. This was made evident in Update libmagic to version 5.45 #4673, where just updating libmagic was not enough; we also had to update magic_mgc_gzipped.bin.tar.bz2.

    What we do now is rely on CMake to find magic.mgc and perform its entire preparation at build time. The C++ program was rewritten to be a CMake script, which makes it much simpler and enables it to run on cross-compilation scenarios. The script accepts the uncompressed magic.mgc file, compresses it and produces a header file of the following format:

    static const unsigned char magic_mgc_compressed_bytes[] = {
    0x28, 0xb5, 0x2f, 0xfd, …
    };
    constexpr size_t magic_mgc_compressed_size = sizeof(magic_mgc_compressed_bytes);
    // Editorial note: we used to prepend the decompressed size at the start of the
    // binary blob, but this was non-standard and could not be easily done by CMake.
    constexpr size_t magic_mgc_decompressed_size = 7041352;
  • The algorithm to compress magic.mgc was changed from gzip to zstd, resulting in a 17.9% reduction of the compressed size (from 333067 το 273500 bytes).

  • Tests for mgc_dict were also updated to use Catch2, and were wired to run along with the other standalone unit tests.

    • This necessitated to make an object library for mgc_dict, which was done as well.

Validated by successfully running unit_mgc_dict locally.


TYPE: BUILD
DESC: Improve embedding of magic.mgc and allow compiling with any libmagic version.

It was rewritten from C++ to CMake, and compression is now being done with CMake's commands.
This allows us to pack the upstream `magic.mgc` file and stop keeping a pre-compressed and pre-escaped one in source control.
The size of the uncompressed file used to be kept at the start of the binary file. We no longer have the capability to easily modify binary files with CMake, so the script generates a complete header, alongside a constexpr variable with the uncompressed size.
It was also simplified a bit and `gzip_wrappers.cc` is now unused and got removed.
Compressed size dropped from 333067 to 270578 bytes.
Changes to the gzip compressor were reverted. The script was also renamed and slightly updated.
Higher levels require CMake 3.26+.
@teo-tsirpanis teo-tsirpanis force-pushed the teo/mgc-dict-refactor branch 2 times, most recently from d649ecc to fdf5e45 Compare June 6, 2024 23:39
Copy link
Contributor

@eric-hughes-tiledb eric-hughes-tiledb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs updated copyright dates throughout.

There's a whole lot of old C-style code here. Since the code is being changed, it needs to be updated as well. This is particularly true for test code.

Conversion of tests to Catch2 seems incomplete. There is still too much residue of the old integrated C-style test code. A bunch of the special return values and uses of errcnt need to be broken out somehow.

The conversion script needs to ensure that it's temporary files are only the build tree.

Comment on lines 112 to 113
ConstBuffer* input_buffer,
PreallocatedBuffer* output_buffer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new function. It shouldn't use pointers.

There are just two uses of this. The one above has the same check that's copypasted below, and the new one below takes addresses.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. I also removed the duplicate check above.

Comment on lines 112 to 113
ConstBuffer* input_buffer,
PreallocatedBuffer* output_buffer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ConstBuffer* input_buffer,
PreallocatedBuffer* output_buffer) {
ConstBuffer& input_buffer,
PreallocatedBuffer& output_buffer) {

string(MAKE_C_IDENTIFIER ${INPUT_FILENAME} INPUT_VARIABLE)

message(DEBUG "Compressing ${INPUT_FILE} to ${compressed_file}")
file(ARCHIVE_CREATE OUTPUT "${compressed_file}" PATHS ${INPUT_FILE} FORMAT raw COMPRESSION Zstd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks as though this temporary file is being created inside the source tree. We need to ensure that it's happening in the build directory somewhere. Ordinarily it won't be a problem, but we should avoid the possibility entirely. It's not a big change from what's already here.

Most reliable would seem to be to add WORKING_DIRECTORY to the add_custom_command call and not use REPLACE_FILENAME here, which puts the temporary in the same directory as the source.

Unfortunately CMake doesn't seem to have a facility for creating temporary directories.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks as though this temporary file is being created inside the source tree.

No, it is being created in the same directory as the output file, which for our cases is specified to be in the build directory. The syntax highlighting of line 37 is a bit confusing, OUTPUT_FILE is the name of the variable, not an argument name.

@@ -38,8 +38,8 @@ using tiledb::common::Status;

class magic_dict {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class should be deleted. The only constructor is getting deleted and there's only static functions left. In order to keep the code in tiledb_filestore.cc compiling, it can become a namespace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


/** holds the expanded data until application exits. */
static shared_ptr<tiledb::sm::ByteVecValue> expanded_buffer_;
static const tiledb::sm::ByteVecValue& expanded_buffer();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a free function, not a class member.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 59 to 61
FILE* infile = nullptr;
infile = fopen(TILEDB_PATH_TO_MAGIC_MGC, "rb");
if (!infile) {
fprintf(stderr, "ERROR: Unable to open %s\n", TILEDB_PATH_TO_MAGIC_MGC);
return 1;
}
REQUIRE(infile);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really prefer that these tests be rewritten with C++ I/O rather than C I/O, but I also admit that it might be a small scope expansion. There's not much I/O, though, so it's not a significant expansion. There's already a need in this PR to do more to update these tests, in addition, so this can go along with that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewritten.

fprintf(stderr, "ERROR: Unable to open %s\n", TILEDB_PATH_TO_MAGIC_MGC);
return 1;
}
REQUIRE(infile);

fseek(infile, 0L, SEEK_END);
uint64_t magic_mgc_len = ftell(infile);
fseek(infile, 0L, SEEK_SET);

char* magic_mgc_data = tdb_new_array(char, magic_mgc_len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is never a need to allocate a local variable when the size is known in advance. Since the lengths are all constexpr, we can use std::array here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uncompressed file is 6.7MB big. Won't an std::array here overflow the stack?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the lengths are all constexpr

I would prefer to keep the generated header an implementation detail and include it only in magic_mgc.cc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make it a vector; at present it leaks memory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to vector.

fprintf(stderr, "NO errors encountered in mgc_dict validation\n");
return 0;
}
REQUIRE(errcnt == 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We won't need this if all the tests to which it might apply are executed with CHECK or REQUIRE. That will take some more rewrites.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. We now make a proper CHECK instead of keeping the count of errors.

Comment on lines 456 to 459
REQUIRE(proc_list(file_data_sizes1, true) == 0);
REQUIRE(proc_list(file_data_sizes2, true) == 0);
REQUIRE(proc_list(file_data_sizes1, false) == 0);
REQUIRE(proc_list(file_data_sizes2, false) == 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests look like they're independent and should be in separate SECTION at least, if not separate TEST_CASE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually thinking of removing all but the first section. The two file data lists have the same content but in different order, and testing both with and without reusing the magic_t is not very worthwhile IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After offline feedback I removed all but the first case, and further simplified the test.

fprintf(stderr, "ERROR reading data from %s\n", TILEDB_PATH_TO_MAGIC_MGC);
return 4;
}
REQUIRE(fread(magic_mgc_data, 1, magic_mgc_len, infile) == magic_mgc_len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a check that the file size is equal to the length we expect it to be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens a bit below, in line 72.

@teo-tsirpanis
Copy link
Member Author

All feedback has been addressed, this is ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants