-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#431: Moving and adding sampling to backend calculations #438
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Kunal Agarwal <[email protected]>
Signed-off-by: Kunal Agarwal <[email protected]>
Signed-off-by: Kunal Agarwal <[email protected]>
Signed-off-by: Kunal Agarwal <[email protected]>
Signed-off-by: Kunal Agarwal <[email protected]>
Signed-off-by: Kunal Agarwal <[email protected]>
Signed-off-by: Kunal Agarwal <[email protected]>
@@ -443,16 +436,16 @@ def compute_data_type(self, ldf: LuxDataFrame): | |||
ldf._data_type[attr] = "geographical" | |||
elif pd.api.types.is_float_dtype(ldf.dtypes[attr]): | |||
|
|||
if ldf.cardinality[attr] != len(ldf) and (ldf.cardinality[attr] < 20): | |||
if ldf.cardinality[attr] != ldf._length and (ldf.cardinality[attr] < 20): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between _length
and len(df)
? It is probably more general to use the latter since the _length
might not be maintained correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed _length
in the metadata for a LuxDataFrame, and found that it was not being used anywhere in the code base (as far as I could tell). On Line 544 of this file, I changed it to be the length of the sampled DataFrame. This is necessary since we don't save the sampled DataFrame after the metadata is computed, but the length of the sampled DataFrame is necessary for future calculations, especially ones related to cardinality, like the one here.
The name of the attribute is probably not the best, so I could maybe change it to _sampled_length
instead?
@@ -538,11 +531,17 @@ def _is_datetime_number(series): | |||
return False | |||
|
|||
def compute_stats(self, ldf: LuxDataFrame): | |||
# use sample to compute statistics | |||
if ldf._sampled is None: | |||
ldf_sampled = PandasExecutor.execute_sampling(ldf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will the config parameters that we are using for sampling for metadata and the visualization be the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, currently they are the same (sampling_thresh
). Should we maybe use different parameters?
Signed-off-by: Kunal Agarwal <[email protected]>
Signed-off-by: Kunal Agarwal <[email protected]>
75e5f01
to
c696b2b
Compare
Signed-off-by: Kunal Agarwal <[email protected]>
Signed-off-by: Kunal Agarwal <[email protected]>
Overview
This is a branch that builds off of the work done in #432. This moves the sampling to after the Filter and also adds sampling for metadata computation.
Changes
Changes the
execute
function to move sampling after the Filtering is done, and so the sampling is done on each of the data visualizations. It also editscompute_data
to sample the data before computing the metadata. All metadata is the metadata associated with the sample, not the full dataset.Example Output
N/A