-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor handling of embedded magic.mgc
.
#4989
base: dev
Are you sure you want to change the base?
Conversation
c83c7ea
to
7ba9432
Compare
It was rewritten from C++ to CMake, and compression is now being done with CMake's commands. This allows us to pack the upstream `magic.mgc` file and stop keeping a pre-compressed and pre-escaped one in source control. The size of the uncompressed file used to be kept at the start of the binary file. We no longer have the capability to easily modify binary files with CMake, so the script generates a complete header, alongside a constexpr variable with the uncompressed size.
It was also simplified a bit and `gzip_wrappers.cc` is now unused and got removed.
Compressed size dropped from 333067 to 270578 bytes. Changes to the gzip compressor were reverted. The script was also renamed and slightly updated.
Higher levels require CMake 3.26+.
…thout needing a pool.
…icate that this is a CMake script.
d649ecc
to
fdf5e45
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs updated copyright dates throughout.
There's a whole lot of old C-style code here. Since the code is being changed, it needs to be updated as well. This is particularly true for test code.
Conversion of tests to Catch2 seems incomplete. There is still too much residue of the old integrated C-style test code. A bunch of the special return values and uses of errcnt
need to be broken out somehow.
The conversion script needs to ensure that it's temporary files are only the build tree.
ConstBuffer* input_buffer, | ||
PreallocatedBuffer* output_buffer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a new function. It shouldn't use pointers.
There are just two uses of this. The one above has the same check that's copypasted below, and the new one below takes addresses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. I also removed the duplicate check above.
ConstBuffer* input_buffer, | ||
PreallocatedBuffer* output_buffer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ConstBuffer* input_buffer, | |
PreallocatedBuffer* output_buffer) { | |
ConstBuffer& input_buffer, | |
PreallocatedBuffer& output_buffer) { |
string(MAKE_C_IDENTIFIER ${INPUT_FILENAME} INPUT_VARIABLE) | ||
|
||
message(DEBUG "Compressing ${INPUT_FILE} to ${compressed_file}") | ||
file(ARCHIVE_CREATE OUTPUT "${compressed_file}" PATHS ${INPUT_FILE} FORMAT raw COMPRESSION Zstd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks as though this temporary file is being created inside the source tree. We need to ensure that it's happening in the build directory somewhere. Ordinarily it won't be a problem, but we should avoid the possibility entirely. It's not a big change from what's already here.
Most reliable would seem to be to add WORKING_DIRECTORY
to the add_custom_command
call and not use REPLACE_FILENAME
here, which puts the temporary in the same directory as the source.
Unfortunately CMake doesn't seem to have a facility for creating temporary directories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks as though this temporary file is being created inside the source tree.
No, it is being created in the same directory as the output file, which for our cases is specified to be in the build directory. The syntax highlighting of line 37 is a bit confusing, OUTPUT_FILE
is the name of the variable, not an argument name.
tiledb/sm/misc/mgc_dict.h
Outdated
@@ -38,8 +38,8 @@ using tiledb::common::Status; | |||
|
|||
class magic_dict { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class should be deleted. The only constructor is getting deleted and there's only static functions left. In order to keep the code in tiledb_filestore.cc
compiling, it can become a namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tiledb/sm/misc/mgc_dict.h
Outdated
|
||
/** holds the expanded data until application exits. */ | ||
static shared_ptr<tiledb::sm::ByteVecValue> expanded_buffer_; | ||
static const tiledb::sm::ByteVecValue& expanded_buffer(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a free function, not a class member.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tiledb/sm/misc/test/unit_mgc_dict.cc
Outdated
FILE* infile = nullptr; | ||
infile = fopen(TILEDB_PATH_TO_MAGIC_MGC, "rb"); | ||
if (!infile) { | ||
fprintf(stderr, "ERROR: Unable to open %s\n", TILEDB_PATH_TO_MAGIC_MGC); | ||
return 1; | ||
} | ||
REQUIRE(infile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would really prefer that these tests be rewritten with C++ I/O rather than C I/O, but I also admit that it might be a small scope expansion. There's not much I/O, though, so it's not a significant expansion. There's already a need in this PR to do more to update these tests, in addition, so this can go along with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewritten.
tiledb/sm/misc/test/unit_mgc_dict.cc
Outdated
fprintf(stderr, "ERROR: Unable to open %s\n", TILEDB_PATH_TO_MAGIC_MGC); | ||
return 1; | ||
} | ||
REQUIRE(infile); | ||
|
||
fseek(infile, 0L, SEEK_END); | ||
uint64_t magic_mgc_len = ftell(infile); | ||
fseek(infile, 0L, SEEK_SET); | ||
|
||
char* magic_mgc_data = tdb_new_array(char, magic_mgc_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is never a need to allocate a local variable when the size is known in advance. Since the lengths are all constexpr
, we can use std::array
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The uncompressed file is 6.7MB big. Won't an std::array
here overflow the stack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the lengths are all
constexpr
I would prefer to keep the generated header an implementation detail and include it only in magic_mgc.cc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make it a vector; at present it leaks memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to vector.
tiledb/sm/misc/test/unit_mgc_dict.cc
Outdated
fprintf(stderr, "NO errors encountered in mgc_dict validation\n"); | ||
return 0; | ||
} | ||
REQUIRE(errcnt == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We won't need this if all the tests to which it might apply are executed with CHECK
or REQUIRE
. That will take some more rewrites.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. We now make a proper CHECK
instead of keeping the count of errors.
tiledb/sm/misc/test/unit_mgc_dict.cc
Outdated
REQUIRE(proc_list(file_data_sizes1, true) == 0); | ||
REQUIRE(proc_list(file_data_sizes2, true) == 0); | ||
REQUIRE(proc_list(file_data_sizes1, false) == 0); | ||
REQUIRE(proc_list(file_data_sizes2, false) == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests look like they're independent and should be in separate SECTION
at least, if not separate TEST_CASE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually thinking of removing all but the first section. The two file data lists have the same content but in different order, and testing both with and without reusing the magic_t
is not very worthwhile IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After offline feedback I removed all but the first case, and further simplified the test.
tiledb/sm/misc/test/unit_mgc_dict.cc
Outdated
fprintf(stderr, "ERROR reading data from %s\n", TILEDB_PATH_TO_MAGIC_MGC); | ||
return 4; | ||
} | ||
REQUIRE(fread(magic_mgc_data, 1, magic_mgc_len, infile) == magic_mgc_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a check that the file size is equal to the length we expect it to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens a bit below, in line 72.
Testing the case when they are opened and closed each time has little value.
All feedback has been addressed, this is ready for review. |
SC-25167
SC-47655
SC-47656
SC-47657
SC-47658
This PR overhauls the facilities to embed and load the
magic.mgc
file that is needed by libmagic:The most important change is the removal of
magic_mgc_gzipped.bin.tar.bz2
. This file contained a copy ofmagic.mgc
that was compressed, converted to escaped C characters, packed and compressed again to take less space, and stored in source control, so that at build time to get unpacked and#include
d inmgc_dict.cc
. Because this file was being prepared ahead of time by a manually invoked C++ program, this approach had the disadvantage that it tied the Core to a specific version of libmagic. This was made evident in Update libmagic to version 5.45 #4673, where just updating libmagic was not enough; we also had to updatemagic_mgc_gzipped.bin.tar.bz2
.What we do now is rely on CMake to find
magic.mgc
and perform its entire preparation at build time. The C++ program was rewritten to be a CMake script, which makes it much simpler and enables it to run on cross-compilation scenarios. The script accepts the uncompressedmagic.mgc
file, compresses it and produces a header file of the following format:The algorithm to compress
magic.mgc
was changed from gzip to zstd, resulting in a 17.9% reduction of the compressed size (from 333067 το 273500 bytes).Tests for
mgc_dict
were also updated to use Catch2, and were wired to run along with the other standalone unit tests.mgc_dict
, which was done as well.Validated by successfully running
unit_mgc_dict
locally.TYPE: BUILD
DESC: Improve embedding of
magic.mgc
and allow compiling with any libmagic version.