chore: introduce ruff as a linter #167

Merged
merged 4 commits on May 3, 2023
Changes from 1 commit
add precommit
Signed-off-by: Avik Basu <[email protected]>
ab93 committed May 3, 2023
commit 977aa6043d111df04dcbf0c3272ba3a20058f107
2 changes: 1 addition & 1 deletion .gitignore
@@ -166,4 +166,4 @@ cython_debug/
# Mac related
*.DS_Store

.python-version
30 changes: 30 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,30 @@
default_language_version:
python: python3.9
repos:
- repo: https://github.com/psf/black
rev: 22.10.0
hooks:
- id: black
args: [ --check ]
- repo: https://github.com/charliermarsh/ruff-pre-commit
# Ruff version.
rev: 'v0.0.264'
hooks:
- id: ruff
args: [ --fix, --exit-non-zero-on-fix ]
- repo: https://github.com/adamchainz/blacken-docs
rev: "1.13.0"
hooks:
- id: blacken-docs
additional_dependencies:
- black==22.12.0
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.4.0
hooks:
- id: end-of-file-fixer
- id: trailing-whitespace
- id: check-toml
- id: check-added-large-files
- id: check-ast
- id: check-case-conflict
- id: check-docstring-first
1 change: 0 additions & 1 deletion CHANGELOG.md
@@ -212,4 +212,3 @@
### Contributors

* Kushal Batra

2 changes: 1 addition & 1 deletion CODE_OF_CONDUCT.md
@@ -1,3 +1,3 @@
# Contributor Covenant Code of Conduct

Please refer to [Code of Conduct](https://github.com/numaproj/numaproj/blob/main/CODE_OF_CONDUCT.md)
46 changes: 23 additions & 23 deletions README.md
@@ -9,27 +9,27 @@


## Background
Numalogic is a collection of ML models and algorithms for operational data analytics and AIOps.
At Intuit, we use Numalogic at scale for continuous real-time data enrichment, including
anomaly scoring. We assign an anomaly score (ML inference) to every time-series
datum/event/message we receive on our streaming platform (say, Kafka). 95% of our
data sets are time series, and we have a complex flowchart for executing ML inference on
our high-throughput sources. We run multiple models on the same datum: say, one model that is
sensitive towards positive sentiments, another tuned more towards negative sentiments, and another
optimized for neutral sentiments. We also have a couple of ML models trained on the same
data source to provide more accurate scores based on the data density in our model store.
An ensemble of models is required because some composite keys in the data tend to be less
dense than others; e.g., a forgot-password interaction is less frequent than a status-check
interaction. At runtime, for each datum that arrives, models are picked based on a conditional
forwarding filter set on the data density. ML engineers need to worry about only their
inference container; they do not have to worry about data movement and quality assurance.

## Numalogic realtime training
For an always-on ML platform, the key requirement is the ability to train or retrain models
automatically based on the incoming messages. The composite key built at per-message runtime
looks for a matching model, and if the model turns out to be stale or missing, retraining
is automatically triggered. The conditional forwarding feature of the platform improves the
development velocity of the ML developer when they have to decide whether to forward
the result further or drop it after a trigger request.
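
To make the trigger flow concrete, below is a minimal, purely illustrative sketch of the lookup-and-retrain decision described above; the registry interface and function names are hypothetical, not numalogic's actual API.

```python
from datetime import datetime, timedelta

STALENESS_THRESHOLD = timedelta(hours=24)  # illustrative staleness cutoff


def trigger_retraining(key: str) -> None:
    # Placeholder: a real pipeline would publish an asynchronous train request here
    print(f"retraining triggered for {key}")


def lookup_model(registry: dict, composite_key: str, now: datetime):
    """Return a model for this composite key, triggering retraining if stale or missing."""
    entry = registry.get(composite_key)
    if entry is None or now - entry["trained_at"] > STALENESS_THRESHOLD:
        trigger_retraining(composite_key)
        return None  # conditional forwarding: drop or reroute until a model exists
    return entry["model"]
```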


@@ -59,9 +59,9 @@ For set-up information and running your first pipeline using numalogic, please s
Numalogic requires Python 3.8 or higher.

### Prerequisites
Numalogic needs [PyTorch](https://pytorch.org/) and
[PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/stable/) to work.
But since these packages are platform dependent,
they are not included in the numalogic package itself. Kindly install them first.
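
As a quick sanity check before installing numalogic, you can verify that both packages are importable (illustrative only):

```python
# Confirm that PyTorch and PyTorch Lightning are installed and importable
import torch
import pytorch_lightning as pl

print(torch.__version__, pl.__version__)
```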

Numalogic supports the following pytorch versions:
@@ -103,7 +103,7 @@ pip install numalogic[mlflow]
```
make test
```
5. To format code style using black and ruff:
```
make lint
```
2 changes: 1 addition & 1 deletion benchmarks/README.md
@@ -3,7 +3,7 @@
This section contains some benchmarking results of numalogic's algorithms on real as well
as synthetic data. Datasets here are publicly available from their respective repositories.

Note that no real effort has been made at hyperparameter tuning. This is just to give users an
idea on how each algorithm is suitable for different kinds of data, and shows how they can do
their own benchmarking too.

8 changes: 4 additions & 4 deletions benchmarks/kpi/README.md
@@ -1,11 +1,11 @@
## KPI Anomaly dataset

The KPI anomaly dataset consists of KPI (key performance indicator) time series data from
many real scenarios of Internet companies, with ground truth labels.
The dataset can be found [here](https://github.com/NetManAIOps/KPI-Anomaly-Detection).

The full dataset contains multiple KPI IDs. Different KPI time series have different structures
and patterns.
For our purpose, we are running anomaly detection for some of these KPI indices.

The performance table is shown below, although note that the hyperparameters have not been tuned.
@@ -26,4 +26,4 @@ Full credit to Zeyan Li et al. for constructing large-scale real world benchmark
Author = {Zeyan Li and Nengwen Zhao and Shenglin Zhang and Yongqian Sun and Pengfei Chen and Xidao Wen and Minghua Ma and Dan Pei},
Title = {Constructing Large-Scale Real-World Benchmark Datasets for AIOps},
Year = {2022},
Eprint = {arXiv:2208.03938},
@@ -15,4 +15,4 @@
"seq_len": 10
}
}
}
6 changes: 3 additions & 3 deletions docs/README.md
@@ -1,9 +1,9 @@
# Numalogic


Numalogic is a collection of ML models and algorithms for real-time data analytics and AIOps, including anomaly detection.

Numalogic can be installed as a library and used to build end-to-end ML pipelines. For streaming real-time data processing, it can also be paired with our streaming data platform [Numaflow](https://numaflow.numaproj.io/).

## Key Features

@@ -23,4 +23,4 @@ Numalogic can be installed as a library and used to build end-to-end ML pipeline

## Getting Started

For set-up information and running your first pipeline using numalogic, please see our [getting started guide](./quick-start.md).
33 changes: 16 additions & 17 deletions docs/autoencoders.md
@@ -1,6 +1,6 @@
# Autoencoders

An Autoencoder is a type of Artificial Neural Network used to learn efficient data representations (encodings) of unlabeled data.

It mainly consists of 2 components: an encoder and a decoder. The encoder compresses the input into a lower-dimensional code; the decoder then reconstructs the input using only this code.

@@ -18,12 +18,12 @@ datamodule = TimeseriesDataModule(12, train_data, batch_size=128)

## Autoencoder Trainer

Numalogic provides a subclass of the PyTorch Lightning Trainer module specifically for Autoencoders.
This trainer provides a mechanism to train, validate and infer on data, with all the parameters supported by Lightning Trainer.

Here we are using `VanillaAE`, a Vanilla Autoencoder model.

```python
from numalogic.models.autoencoder.variants import VanillaAE
from numalogic.models.autoencoder import AutoencoderTrainer

@@ -34,7 +34,7 @@ trainer.fit(model, datamodule=datamodule)

## Autoencoder Variants

Numalogic currently supports 2 variants of Autoencoders.
More details can be found [here](https://www.deeplearningbook.org/contents/autoencoders.html).

### 1. Autoencoders
@@ -46,7 +46,7 @@ architectures are used, i.e. the latent space dimension being less than the inpu
Examples would be `VanillaAE`, `Conv1dAE`, `LSTMAE` and `TransformerAE`

### 2. Sparse autoencoders
A Sparse Autoencoder is a type of autoencoder that employs sparsity to achieve an information bottleneck.
Specifically, the loss function is constructed so that activations within a layer are penalized.
By adding this sparsity regularization, we can stop the neural network from simply copying the input and reduce overfitting.
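
As a rough sketch of the idea (not numalogic's exact implementation; its sparse variants may use a different penalty, such as a KL-divergence term), a sparsity constraint can be expressed as an L1 penalty on the latent activations added to the reconstruction loss:

```python
import torch
import torch.nn.functional as F


def sparse_ae_loss(
    x: torch.Tensor, recon: torch.Tensor, latent: torch.Tensor, weight: float = 1e-3
) -> torch.Tensor:
    """Reconstruction loss plus an L1 sparsity penalty on the latent code."""
    recon_loss = F.mse_loss(recon, x)
    sparsity = latent.abs().mean()  # pushes latent activations toward zero
    return recon_loss + weight * sparsity
```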

@@ -64,16 +64,16 @@ Vanilla Autoencoder model comprising only fully connected layers.
from numalogic.models.autoencoder.variants import VanillaAE

model = VanillaAE(seq_len=12, n_features=2)
```

#### Convolutional

Conv1dAE is a 1D convolutional autoencoder.

The encoder network consists of convolutional layers and max pooling layers.
The decoder network tries to reconstruct the same input shape by corresponding transposed
convolutional and upsampling layers.

```python
from numalogic.models.autoencoder.variants import SparseConv1dAE

@@ -88,12 +88,11 @@ An LSTM (Long Short-Term Memory) Autoencoder is an implementation of an autoenco
from numalogic.models.autoencoder.variants import LSTMAE

model = LSTMAE(seq_len=12, no_features=2, embedding_dim=15)

```

#### Transformer

The transformer-based Autoencoder model was inspired by the [Attention is all you need](https://arxiv.org/abs/1706.03762) paper.

It consists of an encoder and a decoder, both of which are stacks of residual attention blocks, i.e. layers arranged so that the output of one layer is added to the output of another layer deeper in the block.

@@ -103,10 +102,10 @@ These blocks can process an input sequence of variable length n without exhibiti
from numalogic.models.autoencoder.variants import TransformerAE

model = TransformerAE(
    num_heads=8,
    seq_length=12,
    dim_feedforward=64,
    num_encoder_layers=3,
    num_decoder_layers=1,
)
```
10 changes: 5 additions & 5 deletions docs/data-generator.md
@@ -1,6 +1,6 @@
# Data Generator

Numalogic provides a data generator to create synthetic time series data that can be used as train or test data sets.

Using the synthetic data, we can:

@@ -28,7 +28,7 @@ ts_generator = SyntheticTSGenerator(
)

# shape: (8000, 3) with column names [s1, s2, s3]
ts_df = ts_generator.gen_tseries()

# Split into test and train
train_df, test_df = ts_generator.train_test_split(ts_df, test_size=1000)
@@ -37,7 +37,7 @@ train_df, test_df = ts_generator.train_test_split(ts_df, test_size=1000)

### Inject anomalies

Now, once we have generated the synthetic data as above, we can inject anomalies into the test data set using `AnomalyGenerator`.

`AnomalyGenerator` supports the following types of anomalies:

@@ -52,7 +52,7 @@ You can also use `anomaly_ratio` to adjust the ratio of anomalous data points w
from numalogic.synthetic import AnomalyGenerator

# columns to inject anomalies
injected_cols = ["s1", "s2"]
anomaly_generator = AnomalyGenerator(
train_df, anomaly_type="contextual", anomaly_ratio=0.3
)
@@ -61,4 +61,4 @@ outlier_test_df = anomaly_generator.inject_anomalies(
)
```

![Outliers](./assets/outliers.png)
2 changes: 1 addition & 1 deletion docs/forecasting.md
@@ -32,4 +32,4 @@ model.fit(train_df)
pred_df = model.predict(test_df)
r2_score = model.r2_score(test_df)
anomaly_score = model.score(test_df)
```
2 changes: 1 addition & 1 deletion docs/inference.md
@@ -16,4 +16,4 @@ test_anomaly_score = model.score(X_test)
```

![Reconstruction](./assets/recon.png)
![Anomaly Score](./assets/anomaly_score.png)
8 changes: 3 additions & 5 deletions docs/ml-flow.md
@@ -37,7 +37,7 @@ registry.save(
skeys=static_keys,
dkeys=dynamic_keys,
primary_artifact=model,
secondary_artifacts={"preproc": scaler},
)
```

@@ -47,11 +47,9 @@ Once the models are saved to MLflow, the `load` function of `MLflowRegistry` can

```python
registry = MLflowRegistry(tracking_uri="http:https://0.0.0.0:8080")
artifact_dict = registry.load(skeys=static_keys, dkeys=dynamic_keys)
scaler = artifact_dict["secondary_artifacts"]["preproc"]
model = artifact_dict["primary_artifact"]
```

For more details, please refer to [MLflow Model Registry](https://www.mlflow.org/docs/latest/model-registry.html#)
20 changes: 9 additions & 11 deletions docs/post-processing.md
@@ -1,6 +1,6 @@
# Post Processing
After the raw scores have been generated, we might need to do some additional postprocessing
for various reasons.

### Tanh Score Normalization
The tanh normalization step is optional; it normalizes the anomaly scores to a 0-10 range, mostly to make the scores easier to interpret.
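
Conceptually, this kind of normalization squashes a raw score $x$ through a scaled hyperbolic tangent, e.g.

$$\text{score}_{norm} = 10 \cdot \tanh\left(\frac{x}{\tau}\right),$$

where the scale factor $\tau$ controls how quickly scores saturate toward 10. The exact scaling used by numalogic's normalizer is implementation-specific; this is only the general shape of the transform.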
@@ -25,24 +25,22 @@ norm_score = norm.fit_transform(raw_score)
```

### Exponentially Weighted Moving Average
The Exponentially Weighted Moving Average (EWMA) serves as an effective smoothing function,
emphasizing more recent anomaly scores over earlier elements within a sliding window.

This approach proves particularly beneficial in streaming inference scenarios, as it allows for
earlier increases in anomaly scores when a new outlier data point is encountered.
Consequently, the EWMA enables a more responsive and dynamic assessment of streaming data,
facilitating timely detection and response to potential anomalies.
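
In its textbook form, an EWMA with smoothing factor $\beta \in (0, 1)$ updates as

$$s_t = \beta \, s_{t-1} + (1 - \beta) \, x_t,$$

where a larger $\beta$ weights history more heavily and a smaller $\beta$ reacts faster to the newest score $x_t$. This is the generic recurrence; numalogic's `ExpMovingAverage` may differ in initialization and bias-correction details, and the sample output below reflects its actual behavior.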

```python
import numpy as np
from numalogic.postprocess import ExpMovingAverage

raw_score = np.array([1.0, 1.5, 1.2, 3.5, 2.7, 5.6, 7.1, 6.9, 4.2, 1.1]).reshape(-1, 1)

postproc_clf = ExpMovingAverage(beta=0.5)
out = postproc_clf.transform(raw_score)

# out: [[1.3], [1.433], [1.333], [2.473], [2.591], [4.119], [5.621], [6.263], [5.229], [3.163]]
```