stats command writes output file even when --output is not set #1794

Closed
mhkeller opened this issue May 5, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@mhkeller

mhkeller commented May 5, 2024

Describe the bug

When running qsv stats --typesonly my_file.csv, I get the stats in stdout but it also writes two files next to the file I am reading in:

my_file.stats.csv
my_file.stats.csv.json

I would prefer to not write any files.

To Reproduce
Steps to reproduce the behavior:

  1. Download this file iris.csv
  2. Run this command qsv stats iris.csv --typesonly
  3. See that the files have been written to disk

Expected behavior

I'm not sure if this is a bug but the behavior is surprising and it would be great if there were an option to not write out any files.

The docs describe an --output flag to write output. I would expect this function to only create output if set via a flag.

If these files are necessary for other qsv commands, it would be helpful to include a flag to optionally not write them.

Desktop (please complete the following information):

  • OS: macOS
  • qsv Version

qsv 0.127.0-mimalloc-apply;fetch;foreach;geocode;Luau 0.622;python-3.12.3 (main, Apr 9 2024, 08:09:14) [Clang 15.0.0 (clang-1500.3.9.4)];to;polars-0.39.2;self_update-10-10;12.80 GiB-1016.88 MiB-4.13 GiB-16.00 GiB (aarch64-apple-darwin compiled with Rust 1.78) compiled

@jqnatividad
Owner

Hi @mhkeller ,
stats is the heart of qsv and the main reason why I wrote it.

As you inferred, it's used by other commands to do metadata and schema inferencing, among other things.

It's also used by stats itself to cache stats calculations. When you compute stats on a very large file with expensive settings like --everything and --infer-dates, it first checks whether a previous stats calculation is available and still valid. It does that by looking at those two tiny files: .stats.csv holds the latest stats result, and .stats.csv.json holds the metadata of the previous stats run.

In the field I work in, we deal primarily with large, historical CSVs that are typically static once exported from transaction systems. Caching helps there because stats returns instantaneously when valid cached results are available.
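The cache lookup described above can be sketched roughly as follows. The two file names follow the ones reported at the top of this issue (my_file.stats.csv and my_file.stats.csv.json). The thread does not say how qsv decides that a cache is "valid", so the freshness rule here (both cache files exist and are at least as new as the input) is purely an assumption for illustration, and the function names are hypothetical.

```python
from pathlib import Path


def cache_paths(csv_path: Path) -> tuple[Path, Path]:
    """Return the two sidecar paths named in this issue for a given CSV."""
    base = csv_path.stem  # "my_file.csv" -> "my_file"
    stats_csv = csv_path.with_name(base + ".stats.csv")
    stats_json = csv_path.with_name(base + ".stats.csv.json")
    return stats_csv, stats_json


def cached_stats_look_fresh(csv_path: Path) -> bool:
    """ASSUMED rule, not qsv's actual check: both cache files must exist
    and be at least as recent as the input file."""
    stats_csv, stats_json = cache_paths(csv_path)
    if not (stats_csv.is_file() and stats_json.is_file()):
        return False
    input_mtime = csv_path.stat().st_mtime
    oldest_cache_mtime = min(stats_csv.stat().st_mtime, stats_json.stat().st_mtime)
    return oldest_cache_mtime >= input_mtime
```

A modification-time comparison is only one plausible heuristic; the real check may well rely on what is recorded in the metadata file instead.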

Anyway, I'll add an option to suppress generating these cache files. I'll also add some logic to skip caching when the potential savings are too small (say, a stats run of less than 5 seconds) to bother with.

jqnatividad added the enhancement label May 5, 2024
mhkeller changed the title from "BUG? stats command writes output file even when --output is not set" to "stats command writes output file even when --output is not set" May 6, 2024
@mhkeller
Author

mhkeller commented May 6, 2024

Thanks for the quick and thorough reply! I figured it had to do with some internal usage. That makes a lot of sense. An option to skip writing out the files would be great for my use case.

With the --cache-threshold strategy, I would set the threshold to a high number that it would likely never reach, I'm guessing?

Thanks in general for your work on this library. I have a more general question that I'll post over in Discussions.

@jqnatividad
Owner

No worries... big fan of the data journalism work you and your team are doing at NYTimes BTW... 💯

FYI, during the first few days of the pandemic, I wrote a Selenium scraper to retrieve data from NYTimes and petitioned to have it released as open data instead, to which the team responded quickly. 😄

Anyways, as for the new --cache-threshold option - it will have a default of 5000 ms. And when set to zero, it will suppress cache generation altogether, so you don't have to guesstimate a high threshold.

@mhkeller
Author

mhkeller commented May 6, 2024

Ah thank you – that's so nice of you to say. And I'm glad you were able to get the data you were after – that tracking was a huge effort. (I was not involved but was a great admirer.)

That option for the flag makes sense and I'm looking forward to trying it out! I was looking for a fast, portable way to check csv types so qsv is perfect.

@mhkeller
Author

mhkeller commented May 7, 2024

Thanks for merging this so quickly!

@chadbaldwin

Heh, this is perfect timing for me. I'm working with a directory of csv files that sort of works like an auto-load folder. If a CSV file is dropped into the directory, a process picks it up and tries to load it. I was hoping to find a way to disable the cache files as I only need to run qsv stats once per file and likely won't need it ever again afterward, so the cache isn't all that helpful for me.
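For a one-pass folder like this, a minimal sketch might look like the following. It assumes a qsv build that already includes the --cache-threshold option discussed earlier in the thread, where a value of 0 suppresses cache generation. The helper names (stats_command, types_for_directory) are hypothetical, and the second helper needs qsv on the PATH to do anything useful.

```python
import subprocess
from pathlib import Path


def stats_command(csv_path: Path) -> list[str]:
    """Build a qsv invocation that prints column types and writes no cache files.

    --typesonly comes from the original report; --cache-threshold 0 is the
    "suppress caching" setting described in this thread.
    """
    return ["qsv", "stats", "--typesonly", "--cache-threshold", "0", str(csv_path)]


def types_for_directory(directory: Path) -> dict[str, str]:
    """Run the command once per CSV in the directory and collect stdout by file name."""
    results: dict[str, str] = {}
    for csv_path in sorted(directory.glob("*.csv")):
        completed = subprocess.run(
            stats_command(csv_path),
            capture_output=True,
            text=True,
            check=True,  # raise if qsv exits with a non-zero status
        )
        results[csv_path.name] = completed.stdout
    return results
```

Note that the *.csv glob would also pick up any leftover .stats.csv cache files from earlier runs, so a real loader would probably want to filter those out.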

@jqnatividad
Owner

jqnatividad commented May 15, 2024

That's good to know @chadbaldwin !

You may be interested to know that I added a new negative setting to --cache-threshold for your use case.

qsv/src/cmd/stats.rs

Lines 143 to 153 in 15d0072

-c, --cache-threshold <arg>  When greater than 1, the threshold in milliseconds before caching
                             stats results. If a stats run takes longer than this threshold,
                             the stats results will be cached.
                             Set to 0 to suppress caching.
                             Set to 1 to force caching.
                             Set to a negative number to automatically create an index
                             when the input file size is greater than abs(arg) in bytes.
                             If the negative number ends with 5, it will delete the index
                             file and the stats cache file after the stats run. Otherwise,
                             the index file and the cache files are kept.
                             [default: 5000]

For example, if you set --cache-threshold -10005, stats will automatically create an index (which unlocks parallel processing and makes stats run at least 2-3x faster) when the input file size is greater than 10,005 bytes.

Further, after the stats run, it will auto-delete the index and the stats cache files because the --cache-threshold value ends with 5.
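To make the different --cache-threshold modes easier to compare, here is a small sketch that maps a threshold value to the behavior described in the help text above. It is a reading of that help text, not qsv's actual Rust code: CachePlan and plan_for_threshold are hypothetical names, and treating every negative value as "cache files are written" is an assumption based on the help text saying they are either kept or deleted afterward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePlan:
    write_cache: bool        # stats cache files get written
    auto_index: bool         # an index is created before computing stats
    cleanup_after_run: bool  # index and cache files are deleted after the run


def plan_for_threshold(threshold: int, elapsed_ms: int, file_size_bytes: int) -> CachePlan:
    """Interpret a --cache-threshold value as described in the help text."""
    if threshold == 0:
        # 0 suppresses caching entirely.
        return CachePlan(write_cache=False, auto_index=False, cleanup_after_run=False)
    if threshold == 1:
        # 1 forces caching regardless of how long the run took.
        return CachePlan(write_cache=True, auto_index=False, cleanup_after_run=False)
    if threshold > 1:
        # Positive values are a run-time threshold in milliseconds.
        return CachePlan(write_cache=elapsed_ms > threshold, auto_index=False, cleanup_after_run=False)
    # Negative values: abs(threshold) is a file-size cutoff in bytes for auto-indexing.
    size_cutoff = abs(threshold)
    return CachePlan(
        write_cache=True,  # ASSUMPTION: cache is written, then kept or deleted
        auto_index=file_size_bytes > size_cutoff,
        cleanup_after_run=size_cutoff % 10 == 5,  # "ends with 5" means clean up afterward
    )
```

With the -10005 example, a 20,000-byte input would be indexed and everything cleaned up afterward, while -10000 would keep the index and cache files.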

3 participants