chore: introduce ruff as a linter #167

Merged
merged 4 commits on May 3, 2023
Changes from 1 commit
add precommit
Signed-off-by: Avik Basu <[email protected]>
ab93 committed May 3, 2023
commit 977aa6043d111df04dcbf0c3272ba3a20058f107
2 changes: 1 addition & 1 deletion .gitignore
@@ -166,4 +166,4 @@ cython_debug/
# Mac related
*.DS_Store

.python-version
30 changes: 30 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,30 @@
default_language_version:
python: python3.9
repos:
- repo: https://github.com/psf/black
rev: 22.10.0
hooks:
- id: black
args: [ --check ]
- repo: https://github.com/charliermarsh/ruff-pre-commit
# Ruff version.
rev: 'v0.0.264'
hooks:
- id: ruff
args: [ --fix, --exit-non-zero-on-fix ]
- repo: https://github.com/adamchainz/blacken-docs
rev: "1.13.0"
hooks:
- id: blacken-docs
additional_dependencies:
- black==22.12.0
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.4.0
hooks:
- id: end-of-file-fixer
- id: trailing-whitespace
- id: check-toml
- id: check-added-large-files
- id: check-ast
- id: check-case-conflict
- id: check-docstring-first
1 change: 0 additions & 1 deletion CHANGELOG.md
@@ -212,4 +212,3 @@
### Contributors

* Kushal Batra

2 changes: 1 addition & 1 deletion CODE_OF_CONDUCT.md
@@ -1,3 +1,3 @@
# Contributor Covenant Code of Conduct

Please refer to [Code of Conduct](https://github.com/numaproj/numaproj/blob/main/CODE_OF_CONDUCT.md)
46 changes: 23 additions & 23 deletions README.md
@@ -9,27 +9,27 @@


## Background
Numalogic is a collection of ML models and algorithms for operational data analytics and AIOps.
At Intuit, we use Numalogic at scale for continuous real-time data enrichment, including
anomaly scoring. We assign an anomaly score (ML inference) to every time-series
datum/event/message we receive on our streaming platform (say, Kafka). 95% of our
data sets are time series, and we have a complex flowchart for executing ML inference on
our high-throughput sources. We run multiple models on the same datum: say, one model that is
sensitive towards positive sentiments, another tuned more towards negative sentiments, and another
optimized for neutral sentiments. We also have a couple of ML models trained on the same
data source to provide more accurate scores based on the data density in our model store.
An ensemble of models is required because some composite keys in the data tend to be less
dense than others; e.g., a forgot-password interaction is less frequent than a status-check
interaction. At runtime, for each datum that arrives, models are picked based on a conditional
forwarding filter set on the data density. ML engineers need to worry about only their
inference container; they do not have to worry about data movement and quality assurance.

## Numalogic realtime training
For an always-on ML platform, the key requirement is the ability to train or retrain models
automatically based on the incoming messages. The composite key built at per-message runtime
looks for a matching model, and if the model turns out to be stale or missing, retraining
is automatically triggered. The conditional forwarding feature of the platform improves the
development velocity of the ML developer when they have to decide whether to forward
the result further or drop it after a trigger request.
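
To make the trigger flow concrete, below is a minimal, purely illustrative sketch of the lookup-and-retrain decision described above; the registry interface and function names are hypothetical, not numalogic's actual API.

```python
from datetime import datetime, timedelta

STALENESS_THRESHOLD = timedelta(hours=24)  # illustrative staleness cutoff


def trigger_retraining(key: str) -> None:
    # Placeholder: a real pipeline would publish an asynchronous train request here
    print(f"retraining triggered for {key}")


def lookup_model(registry: dict, composite_key: str, now: datetime):
    """Return a model for this composite key, triggering retraining if stale or missing."""
    entry = registry.get(composite_key)
    if entry is None or now - entry["trained_at"] > STALENESS_THRESHOLD:
        trigger_retraining(composite_key)
        return None  # conditional forwarding: drop or reroute until a model exists
    return entry["model"]
```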


@@ -59,9 +59,9 @@ For set-up information and running your first pipeline using numalogic, please s
Numalogic requires Python 3.8 or higher.

### Prerequisites
Numalogic needs [PyTorch](https://pytorch.org/) and
[PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/stable/) to work.
But since these packages are platform dependent,
they are not included in the numalogic package itself. Kindly install them first.
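
As a quick sanity check before installing numalogic, you can verify that both packages are importable (illustrative only):

```python
# Confirm that PyTorch and PyTorch Lightning are installed and importable
import torch
import pytorch_lightning as pl

print(torch.__version__, pl.__version__)
```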

Numalogic supports the following pytorch versions:
@@ -103,7 +103,7 @@ pip install numalogic[mlflow]
```
make test
```
5. To format code style using black and ruff:
```
make lint
```
2 changes: 1 addition & 1 deletion benchmarks/README.md
@@ -3,7 +3,7 @@
This section contains some benchmarking results of numalogic's algorithms on real as well
as synthetic data. Datasets here are publicly available from their respective repositories.

Note that no real effort has been made at hyperparameter tuning. This is just to give users an
idea on how each algorithm is suitable for different kinds of data, and shows how they can do
their own benchmarking too.

8 changes: 4 additions & 4 deletions benchmarks/kpi/README.md
@@ -1,11 +1,11 @@
## KPI Anomaly dataset

The KPI anomaly dataset consists of KPI (key performance indicator) time series data from
many real scenarios of Internet companies, with ground truth labels.
The dataset can be found [here](https://github.com/NetManAIOps/KPI-Anomaly-Detection).

The full dataset contains multiple KPI IDs. Different KPI time series have different structures
and patterns.
For our purpose, we are running anomaly detection for some of these KPI indices.

The performance table is shown below, although note that the hyperparameters have not been tuned.
@@ -26,4 +26,4 @@ Full credit to Zeyan Li et al. for constructing large-scale real world benchmark
Author = {Zeyan Li and Nengwen Zhao and Shenglin Zhang and Yongqian Sun and Pengfei Chen and Xidao Wen and Minghua Ma and Dan Pei},
Title = {Constructing Large-Scale Real-World Benchmark Datasets for AIOps},
Year = {2022},
Eprint = {arXiv:2208.03938},
@@ -15,4 +15,4 @@
"seq_len": 10
}
}
}
6 changes: 3 additions & 3 deletions docs/README.md
@@ -1,9 +1,9 @@
# Numalogic


Numalogic is a collection of ML models and algorithms for real-time data analytics and AIOps, including anomaly detection.

Numalogic can be installed as a library and used to build end-to-end ML pipelines. For streaming real-time data processing, it can also be paired with our streaming data platform [Numaflow](https://numaflow.numaproj.io/).

## Key Features

@@ -23,4 +23,4 @@ Numalogic can be installed as a library and used to build end-to-end ML pipeline

## Getting Started

For set-up information and running your first pipeline using numalogic, please see our [getting started guide](./quick-start.md).
33 changes: 16 additions & 17 deletions docs/autoencoders.md
@@ -1,6 +1,6 @@
# Autoencoders

An Autoencoder is a type of Artificial Neural Network used to learn efficient data representations (encodings) of unlabeled data.

It mainly consists of 2 components: an encoder and a decoder. The encoder compresses the input into a lower-dimensional code; the decoder then reconstructs the input using only this code.

@@ -18,12 +18,12 @@ datamodule = TimeseriesDataModule(12, train_data, batch_size=128)

## Autoencoder Trainer

Numalogic provides a subclass of the PyTorch Lightning Trainer module specifically for Autoencoders.
This trainer provides a mechanism to train, validate and infer on data, with all the parameters supported by Lightning Trainer.

Here we are using `VanillaAE`, a Vanilla Autoencoder model.

```python
from numalogic.models.autoencoder.variants import VanillaAE
from numalogic.models.autoencoder import AutoencoderTrainer

@@ -34,7 +34,7 @@ trainer.fit(model, datamodule=datamodule)

## Autoencoder Variants

Numalogic currently supports 2 variants of Autoencoders.
More details can be found [here](https://www.deeplearningbook.org/contents/autoencoders.html).

### 1. Autoencoders
@@ -46,7 +46,7 @@ architectures are used, i.e. the latent space dimension being less than the inpu
Examples would be `VanillaAE`, `Conv1dAE`, `LSTMAE` and `TransformerAE`

### 2. Sparse autoencoders
A Sparse Autoencoder is a type of autoencoder that employs sparsity to achieve an information bottleneck.
Specifically, the loss function is constructed so that activations within a layer are penalized.
By adding this sparsity regularization, we can stop the neural network from simply copying the input and reduce overfitting.
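
As a rough sketch of the idea (not numalogic's exact implementation; its sparse variants may use a different penalty, such as a KL-divergence term), a sparsity constraint can be expressed as an L1 penalty on the latent activations added to the reconstruction loss:

```python
import torch
import torch.nn.functional as F


def sparse_ae_loss(
    x: torch.Tensor, recon: torch.Tensor, latent: torch.Tensor, weight: float = 1e-3
) -> torch.Tensor:
    """Reconstruction loss plus an L1 sparsity penalty on the latent code."""
    recon_loss = F.mse_loss(recon, x)
    sparsity = latent.abs().mean()  # pushes latent activations toward zero
    return recon_loss + weight * sparsity
```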

@@ -64,16 +64,16 @@ Vanilla Autoencoder model comprising only fully connected layers.
from numalogic.models.autoencoder.variants import VanillaAE

model = VanillaAE(seq_len=12, n_features=2)
```

#### Convolutional

Conv1dAE is a 1D convolutional autoencoder.

The encoder network consists of convolutional layers and max pooling layers.
The decoder network tries to reconstruct the same input shape by corresponding transposed
convolutional and upsampling layers.

```python
from numalogic.models.autoencoder.variants import SparseConv1dAE

@@ -88,12 +88,11 @@ An LSTM (Long Short-Term Memory) Autoencoder is an implementation of an autoenco
from numalogic.models.autoencoder.variants import LSTMAE

model = LSTMAE(seq_len=12, no_features=2, embedding_dim=15)

```

#### Transformer

The transformer-based Autoencoder model was inspired by the [Attention is all you need](https://arxiv.org/abs/1706.03762) paper.

It consists of an encoder and a decoder, both of which are stacks of residual attention blocks, i.e. layers arranged so that the output of one layer is added to the output of another layer deeper in the block.

@@ -103,10 +102,10 @@ These blocks can process an input sequence of variable length n without exhibiti
from numalogic.models.autoencoder.variants import TransformerAE

model = TransformerAE(
    num_heads=8,
    seq_length=12,
    dim_feedforward=64,
    num_encoder_layers=3,
    num_decoder_layers=1,
)
```
10 changes: 5 additions & 5 deletions docs/data-generator.md
@@ -1,6 +1,6 @@
# Data Generator

Numalogic provides a data generator to create synthetic time series data that can be used as train or test data sets.

Using the synthetic data, we can:

@@ -28,7 +28,7 @@ ts_generator = SyntheticTSGenerator(
)

# shape: (8000, 3) with column names [s1, s2, s3]
ts_df = ts_generator.gen_tseries()

# Split into test and train
train_df, test_df = ts_generator.train_test_split(ts_df, test_size=1000)
@@ -37,7 +37,7 @@ train_df, test_df = ts_generator.train_test_split(ts_df, test_size=1000)

### Inject anomalies

Now, once we have generated the synthetic data as above, we can inject anomalies into the test data set using `AnomalyGenerator`.

`AnomalyGenerator` supports the following types of anomalies:

@@ -52,7 +52,7 @@ You can also use `anomaly_ratio` to adjust the ratio of anomalous data points w
from numalogic.synthetic import AnomalyGenerator

# columns to inject anomalies
injected_cols = ["s1", "s2"]
anomaly_generator = AnomalyGenerator(
train_df, anomaly_type="contextual", anomaly_ratio=0.3
)
@@ -61,4 +61,4 @@ outlier_test_df = anomaly_generator.inject_anomalies(
)
```

![Outliers](./assets/outliers.png)
2 changes: 1 addition & 1 deletion docs/forecasting.md
@@ -32,4 +32,4 @@ model.fit(train_df)
pred_df = model.predict(test_df)
r2_score = model.r2_score(test_df)
anomaly_score = model.score(test_df)
```
2 changes: 1 addition & 1 deletion docs/inference.md
@@ -16,4 +16,4 @@ test_anomaly_score = model.score(X_test)
```

![Reconstruction](./assets/recon.png)
![Anomaly Score](./assets/anomaly_score.png)
8 changes: 3 additions & 5 deletions docs/ml-flow.md
@@ -37,7 +37,7 @@ registry.save(
skeys=static_keys,
dkeys=dynamic_keys,
primary_artifact=model,
secondary_artifacts={"preproc": scaler},
)
```

@@ -47,11 +47,9 @@ Once the models are saved to MLflow, the `load` function of `MLflowRegistry` can

```python
registry = MLflowRegistry(tracking_uri="http:https://0.0.0.0:8080")
artifact_dict = registry.load(skeys=static_keys, dkeys=dynamic_keys)
scaler = artifact_dict["secondary_artifacts"]["preproc"]
model = artifact_dict["primary_artifact"]
```

For more details, please refer to [MLflow Model Registry](https://www.mlflow.org/docs/latest/model-registry.html#)
20 changes: 9 additions & 11 deletions docs/post-processing.md
@@ -1,6 +1,6 @@
# Post Processing
After the raw scores have been generated, we might need to do some additional postprocessing
for various reasons.

### Tanh Score Normalization
The tanh normalization step is optional; it normalizes the anomaly scores to a 0-10 range, mostly to make the scores easier to interpret.
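
Conceptually, this kind of normalization squashes a raw score $x$ through a scaled hyperbolic tangent, e.g.

$$\text{score}_{norm} = 10 \cdot \tanh\left(\frac{x}{\tau}\right),$$

where the scale factor $\tau$ controls how quickly scores saturate toward 10. The exact scaling used by numalogic's normalizer is implementation-specific; this is only the general shape of the transform.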
@@ -25,24 +25,22 @@ norm_score = norm.fit_transform(raw_score)
```

### Exponentially Weighted Moving Average
The Exponentially Weighted Moving Average (EWMA) serves as an effective smoothing function,
emphasizing more recent anomaly scores over earlier elements within a sliding window.

This approach proves particularly beneficial in streaming inference scenarios, as it allows for
earlier increases in anomaly scores when a new outlier data point is encountered.
Consequently, the EWMA enables a more responsive and dynamic assessment of streaming data,
facilitating timely detection and response to potential anomalies.
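
In its textbook form, an EWMA with smoothing factor $\beta \in (0, 1)$ updates as

$$s_t = \beta \, s_{t-1} + (1 - \beta) \, x_t,$$

where a larger $\beta$ weights history more heavily and a smaller $\beta$ reacts faster to the newest score $x_t$. This is the generic recurrence; numalogic's `ExpMovingAverage` may differ in initialization and bias-correction details, and the sample output below reflects its actual behavior.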

```python
import numpy as np
from numalogic.postprocess import ExpMovingAverage

raw_score = np.array([1.0, 1.5, 1.2, 3.5, 2.7, 5.6, 7.1, 6.9, 4.2, 1.1]).reshape(-1, 1)

postproc_clf = ExpMovingAverage(beta=0.5)
out = postproc_clf.transform(raw_score)

# out: [[1.3], [1.433], [1.333], [2.473], [2.591], [4.119], [5.621], [6.263], [5.229], [3.163]]
```