
feat(pipeline): gsub processor #4121

Merged Jun 17, 2024 · 61 commits

Commits
63491d9
chore: add log http ingester scaffold
paomian May 22, 2024
4d2ec3b
chore: add some example code
paomian May 22, 2024
2e51c16
chore: add log inserter
paomian May 27, 2024
fbc66ec
chore: add log handler file
paomian May 27, 2024
cd4d83d
chore: add pipeline lib
paomian May 27, 2024
2bc1937
chore: import log handler
paomian May 29, 2024
2b16ef9
chore: add pipelime http handler
paomian May 29, 2024
f1350cd
chore: add pipeline private table
paomian May 30, 2024
1d52cad
chore: add pipeline API
paomian May 31, 2024
8c69abb
chore: improve error handling
paomian May 31, 2024
7e0a9ad
Merge branch 'main' into feat/log-handler
shuiyisong Jun 3, 2024
73432dc
chore: merge main
shuiyisong Jun 3, 2024
9d7284c
Merge pull request #6 from shuiyisong/chore/merge_main
paomian Jun 3, 2024
1a03b7e
chore: add multi content type support for log handler
paomian Jun 3, 2024
a2f1230
Merge branch 'main' into feat/log-handler
shuiyisong Jun 4, 2024
6a0998d
refactor: remove servers dep on pipeline
shuiyisong Jun 3, 2024
443eaf9
refactor: move define_into_tonic_status to common-error
shuiyisong Jun 3, 2024
c8ce4ee
refactor: bring in pipeline 3eb890c551b8d7f60c4491fcfec18966e2b210a4
shuiyisong Jun 4, 2024
eb9cd22
chore: fix typo
shuiyisong Jun 4, 2024
8d0595c
refactor: bring in pipeline a95c9767d7056ab01dd8ca5fa1214456c6ffc72c
shuiyisong Jun 4, 2024
061b14e
chore: fix typo and license header
shuiyisong Jun 4, 2024
c152472
refactor: move http event handler to a separate file
shuiyisong Jun 4, 2024
ddea3c1
chore: add test for pipeline
paomian Jun 4, 2024
162e92f
Merge branch 'main' into feat/log-handler
shuiyisong Jun 4, 2024
5a7a5be
chore: update
shuiyisong Jun 4, 2024
423e51e
chore: fmt
shuiyisong Jun 4, 2024
51df233
Merge pull request #7 from shuiyisong/refactor/log_handler
paomian Jun 4, 2024
8066eb3
refactor: bring in pipeline 7d2402701877901871dd1294a65ac937605a6a93
shuiyisong Jun 4, 2024
e2a2e50
refactor: move `pipeline_operator` to `pipeline` crate
shuiyisong Jun 4, 2024
209a1a3
chore: minor update
shuiyisong Jun 4, 2024
c110adb
refactor: bring in pipeline 1711f4d46687bada72426d88cda417899e0ae3a4
shuiyisong Jun 5, 2024
1047dd7
chore: add log
shuiyisong Jun 5, 2024
2ff2fda
chore: add log
shuiyisong Jun 5, 2024
8b6a652
chore: remove open hook
shuiyisong Jun 5, 2024
6ca15ad
Merge pull request #8 from shuiyisong/refactor/log
paomian Jun 5, 2024
1298b0a
chore: minor update
shuiyisong Jun 5, 2024
ea548b0
chore: fix fmt
shuiyisong Jun 5, 2024
fb13278
Merge pull request #9 from shuiyisong/refactor/log
paomian Jun 5, 2024
6c88b89
chore: minor update
shuiyisong Jun 5, 2024
eeed85e
chore: rename desc for pipeline table
shuiyisong Jun 5, 2024
f77d20b
refactor: remove updated_at in pipelines
shuiyisong Jun 5, 2024
38ed6bb
Merge pull request #10 from shuiyisong/chore/polish_code
paomian Jun 5, 2024
5815675
chore: add more content type support for log inserter api
paomian Jun 5, 2024
c84ef0e
Merge pull request #11 from paomian/feat/log-handler-v2
paomian Jun 5, 2024
2e69655
chore: introduce pipeline crate
shuiyisong Jun 5, 2024
ca9525d
Merge branch 'chore/introduce_pipeline' into feat/log-handler
shuiyisong Jun 5, 2024
85a4c32
Merge branch 'main' into feat/log-handler
shuiyisong Jun 6, 2024
77ef015
chore: update upload pipeline api
paomian Jun 6, 2024
43a57a7
chore: fix by pr commit
paomian Jun 6, 2024
3560285
chore: add some doc for pub fn/struct
paomian Jun 6, 2024
4872c8a
chore: some minro fix
paomian Jun 6, 2024
11933b0
chore: add pipeline version support
paomian Jun 6, 2024
92a2bda
chore: impl log pipeline version
paomian Jun 7, 2024
09b5f60
gsub prosessor
yuanbohan Jun 8, 2024
3f8b9ce
chore: merge main
shuiyisong Jun 11, 2024
e764564
chore: merge main
shuiyisong Jun 12, 2024
5c7052d
chore: merge main
shuiyisong Jun 12, 2024
69f5eca
chore: merge log-handler
shuiyisong Jun 12, 2024
e4b2c2a
chore: add test
shuiyisong Jun 12, 2024
470dffa
chore: merge log-handler
shuiyisong Jun 17, 2024
233ece0
chore: update commit
shuiyisong Jun 17, 2024
229 changes: 229 additions & 0 deletions src/pipeline/src/etl/processor/gsub.rs
@@ -0,0 +1,229 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

use regex::Regex;

use crate::etl::field::{Field, Fields};
use crate::etl::processor::{
yaml_bool, yaml_field, yaml_fields, yaml_string, Processor, FIELDS_NAME, FIELD_NAME,
IGNORE_MISSING_NAME,
};
use crate::etl::value::{Array, Map, Value};

pub(crate) const PROCESSOR_GSUB: &str = "gsub";

const REPLACEMENT_NAME: &str = "replacement";
const PATTERN_NAME: &str = "pattern";

/// A processor that replaces all matches of a pattern in a string with a replacement.
/// Only string values and arrays of string values are supported.
#[derive(Debug, Default)]
pub struct GsubProcessor {
fields: Fields,
pattern: Option<Regex>,
replacement: Option<String>,
ignore_missing: bool,
}

impl GsubProcessor {
fn with_fields(&mut self, fields: Fields) {
self.fields = fields;
}

fn with_ignore_missing(&mut self, ignore_missing: bool) {
self.ignore_missing = ignore_missing;
}

fn try_pattern(&mut self, pattern: &str) -> Result<(), String> {
self.pattern = Some(Regex::new(pattern).map_err(|e| e.to_string())?);
Ok(())
}

fn with_replacement(&mut self, replacement: impl Into<String>) {
self.replacement = Some(replacement.into());
}

fn check(self) -> Result<Self, String> {
if self.pattern.is_none() {
return Err("pattern is required".to_string());
}

if self.replacement.is_none() {
return Err("replacement is required".to_string());
}

Ok(self)
}

fn process_string_field(&self, val: &str, field: &Field) -> Result<Map, String> {
let replacement = self.replacement.as_ref().unwrap();
let new_val = self
.pattern
.as_ref()
.unwrap()
.replace_all(val, replacement)
.to_string();
let val = Value::String(new_val);

let key = match field.target_field {
Some(ref target_field) => target_field,
None => field.get_field(),
};

Ok(Map::one(key, val))
}

fn process_array_field(&self, arr: &Array, field: &Field) -> Result<Map, String> {
let key = match field.target_field {
Some(ref target_field) => target_field,
None => field.get_field(),
};

let re = self.pattern.as_ref().unwrap();
let replacement = self.replacement.as_ref().unwrap();

let mut result = Array::default();
for val in arr.iter() {
match val {
Value::String(haystack) => {
let new_val = re.replace_all(haystack, replacement).to_string();
result.push(Value::String(new_val));
}
_ => {
return Err(format!(
"{} processor: expected a string or an array of strings, but got {val:?}",
self.kind()
))
}
}
}

Ok(Map::one(key, Value::Array(result)))
}
}

impl TryFrom<&yaml_rust::yaml::Hash> for GsubProcessor {
type Error = String;

fn try_from(value: &yaml_rust::yaml::Hash) -> Result<Self, Self::Error> {
let mut processor = GsubProcessor::default();

for (k, v) in value.iter() {
let key = k
.as_str()
.ok_or(format!("key must be a string, but got {k:?}"))?;
match key {
FIELD_NAME => {
processor.with_fields(Fields::one(yaml_field(v, FIELD_NAME)?));
}
FIELDS_NAME => {
processor.with_fields(yaml_fields(v, FIELDS_NAME)?);
}
PATTERN_NAME => {
processor.try_pattern(&yaml_string(v, PATTERN_NAME)?)?;
}
REPLACEMENT_NAME => {
processor.with_replacement(yaml_string(v, REPLACEMENT_NAME)?);
}

IGNORE_MISSING_NAME => {
processor.with_ignore_missing(yaml_bool(v, IGNORE_MISSING_NAME)?);
}

_ => {}
}
}

processor.check()
}
}

impl crate::etl::processor::Processor for GsubProcessor {
fn kind(&self) -> &str {
PROCESSOR_GSUB
}

fn ignore_missing(&self) -> bool {
self.ignore_missing
}

fn fields(&self) -> &Fields {
&self.fields
}

fn exec_field(&self, val: &Value, field: &Field) -> Result<Map, String> {
match val {
Value::String(val) => self.process_string_field(val, field),
Value::Array(arr) => self.process_array_field(arr, field),
_ => Err(format!(
"{} processor: expected a string or an array of strings, but got {val:?}",
self.kind()
)),
}
}
}

#[cfg(test)]
mod tests {
use crate::etl::field::Field;
use crate::etl::processor::gsub::GsubProcessor;
use crate::etl::processor::Processor;
use crate::etl::value::{Map, Value};

#[test]
fn test_string_value() {
let mut processor = GsubProcessor::default();
processor.try_pattern(r"\d+").unwrap();
processor.with_replacement("xxx");

let field = Field::new("message");
let val = Value::String("123".to_string());
let result = processor.exec_field(&val, &field).unwrap();

assert_eq!(
result,
Map::one("message", Value::String("xxx".to_string()))
);
}

#[test]
fn test_array_string_value() {
let mut processor = GsubProcessor::default();
processor.try_pattern(r"\d+").unwrap();
processor.with_replacement("xxx");

let field = Field::new("message");
let val = Value::Array(
vec![
Value::String("123".to_string()),
Value::String("456".to_string()),
]
.into(),
);
let result = processor.exec_field(&val, &field).unwrap();

assert_eq!(
result,
Map::one(
"message",
Value::Array(
vec![
Value::String("xxx".to_string()),
Value::String("xxx".to_string())
]
.into()
)
)
);
}
}
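The dispatch in `exec_field` above — rewrite a string directly, map over an array of strings element-wise, and reject anything else — can be mirrored in a dependency-free sketch. This is not the crate's actual code: the `Value` enum is reduced to three illustrative variants, and `str::replace` (literal substring substitution) stands in for `regex::Regex::replace_all`:

```rust
// Simplified stand-in for the pipeline Value type (illustrative only).
#[derive(Debug, PartialEq)]
enum Value {
    String(String),
    Array(Vec<Value>),
    Int(i64),
}

// Mirror of the processor's dispatch: strings are rewritten in place,
// arrays element-wise (string elements only), everything else errors.
fn gsub(val: &Value, pattern: &str, replacement: &str) -> Result<Value, String> {
    match val {
        Value::String(s) => Ok(Value::String(s.replace(pattern, replacement))),
        Value::Array(items) => {
            let mut out = Vec::with_capacity(items.len());
            for v in items {
                match v {
                    Value::String(s) => out.push(Value::String(s.replace(pattern, replacement))),
                    other => return Err(format!("gsub: expected string element, got {other:?}")),
                }
            }
            Ok(Value::Array(out))
        }
        other => Err(format!("gsub: expected string or array of strings, got {other:?}")),
    }
}

fn main() {
    let v = Value::Array(vec![
        Value::String("a.b".to_string()),
        Value::String("c.d".to_string()),
    ]);
    let out = gsub(&v, ".", "-").unwrap();
    assert_eq!(
        out,
        Value::Array(vec![
            Value::String("a-b".to_string()),
            Value::String("c-d".to_string()),
        ])
    );
    assert!(gsub(&Value::Int(1), ".", "-").is_err());
}
```

The real processor compiles `pattern` into a `regex::Regex` once at construction time (in `try_pattern`), so the per-value work is only the replacement, not a recompile.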
3 changes: 3 additions & 0 deletions src/pipeline/src/etl/processor/mod.rs
@@ -17,6 +17,7 @@ pub mod csv;
pub mod date;
pub mod dissect;
pub mod epoch;
pub mod gsub;
pub mod letter;
pub mod regex;
pub mod urlencoding;
@@ -29,6 +30,7 @@ use csv::CsvProcessor;
use date::DateProcessor;
use dissect::DissectProcessor;
use epoch::EpochProcessor;
use gsub::GsubProcessor;
use letter::LetterProcessor;
use regex::RegexProcessor;
use urlencoding::UrlEncodingProcessor;
@@ -163,6 +165,7 @@ fn parse_processor(doc: &yaml_rust::Yaml) -> Result<Arc<dyn Processor>, String>
date::PROCESSOR_DATE => Arc::new(DateProcessor::try_from(value)?),
dissect::PROCESSOR_DISSECT => Arc::new(DissectProcessor::try_from(value)?),
epoch::PROCESSOR_EPOCH => Arc::new(EpochProcessor::try_from(value)?),
gsub::PROCESSOR_GSUB => Arc::new(GsubProcessor::try_from(value)?),
letter::PROCESSOR_LETTER => Arc::new(LetterProcessor::try_from(value)?),
regex::PROCESSOR_REGEX => Arc::new(RegexProcessor::try_from(value)?),
urlencoding::PROCESSOR_URL_ENCODING => Arc::new(UrlEncodingProcessor::try_from(value)?),
12 changes: 12 additions & 0 deletions src/pipeline/src/etl/value/array.rs
@@ -45,6 +45,12 @@ impl std::ops::Deref for Array {
}
}

impl std::ops::DerefMut for Array {
fn deref_mut(&mut self) -> &mut Self::Target {
&mut self.values
}
}

impl IntoIterator for Array {
type Item = Value;

@@ -54,3 +60,9 @@ impl IntoIterator for Array {
self.values.into_iter()
}
}

impl From<Vec<Value>> for Array {
fn from(values: Vec<Value>) -> Self {
Array { values }
}
}
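The new `From<Vec<Value>>` impl is what lets the unit tests above build an `Array` with a bare `.into()`. A minimal stand-alone mirror (with `Value` reduced to a one-field stand-in, purely for illustration):

```rust
// Illustrative stand-in for the pipeline Value type.
#[derive(Debug, PartialEq)]
struct Value(String);

#[derive(Debug, Default, PartialEq)]
struct Array {
    values: Vec<Value>,
}

// Same shape as the impl added in this diff.
impl From<Vec<Value>> for Array {
    fn from(values: Vec<Value>) -> Self {
        Array { values }
    }
}

fn main() {
    // `.into()` resolves to the From impl above, keeping call sites terse.
    let arr: Array = vec![Value("123".to_string()), Value("456".to_string())].into();
    assert_eq!(arr.values.len(), 2);
}
```

Together with the new `DerefMut` impl, this lets callers both construct an `Array` from a plain `Vec<Value>` and mutate it through `Vec`'s own methods.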
70 changes: 70 additions & 0 deletions src/pipeline/tests/gsub.rs
@@ -0,0 +1,70 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

use greptime_proto::v1::value::ValueData::TimestampMillisecondValue;
use greptime_proto::v1::{ColumnDataType, ColumnSchema, SemanticType};
use pipeline::{parse, Content, GreptimeTransformer, Pipeline, Value};

#[test]
fn test_gsub() {
let input_value_str = r#"
[
{
"reqTimeSec": "1573840000.000"
}
]
"#;
let input_value: Value = serde_json::from_str::<serde_json::Value>(input_value_str)
.expect("failed to parse input value")
.try_into()
.expect("failed to convert input value");

let pipeline_yaml = r#"
---
description: Pipeline for Akamai DataStream2 Log

processors:
- gsub:
field: reqTimeSec
pattern: "\\."
replacement: ""
- epoch:
field: reqTimeSec
resolution: millisecond
ignore_missing: true

transform:
- field: reqTimeSec
type: epoch, millisecond
index: timestamp
"#;

let yaml_content = Content::Yaml(pipeline_yaml.into());
let pipeline: Pipeline<GreptimeTransformer> =
parse(&yaml_content).expect("failed to parse pipeline");
let output = pipeline.exec(input_value).expect("failed to exec pipeline");

let expected_schema = vec![ColumnSchema {
column_name: "reqTimeSec".to_string(),
datatype: ColumnDataType::TimestampMillisecond.into(),
semantic_type: SemanticType::Timestamp.into(),
datatype_extension: None,
}];

assert_eq!(output.schema, expected_schema);
assert_eq!(
output.rows[0].values[0].value_data,
Some(TimestampMillisecondValue(1573840000000))
);
}
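The integration test relies on a small arithmetic fact: stripping the dot from `"1573840000.000"` (epoch seconds with a three-digit millisecond fraction) yields the same instant expressed in milliseconds, which is exactly what the `epoch` processor then parses. A dependency-free check of that step (`str::replace` stands in for the gsub regex `"\\."`):

```rust
fn main() {
    let req_time_sec = "1573840000.000"; // epoch seconds, millisecond fraction

    // gsub with pattern "\\." and empty replacement removes the dot,
    // leaving a 13-digit millisecond epoch string.
    let millis_str = req_time_sec.replace('.', "");
    let millis: i64 = millis_str.parse().expect("numeric after gsub");

    assert_eq!(millis, 1_573_840_000_000);
}
```

This only holds because the fraction is exactly three digits; the pipeline pairs the gsub step with `resolution: millisecond` on the `epoch` processor to match.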
4 changes: 0 additions & 4 deletions src/pipeline/tests/pipeline.rs
@@ -19,10 +19,6 @@ use greptime_proto::v1::value::ValueData::{
use greptime_proto::v1::Value as GreptimeValue;
use pipeline::{parse, Content, GreptimeTransformer, Pipeline, Value};

// use pipeline::transform::GreptimeTransformer;
// use pipeline::value::Value;
// use pipeline::{parse, Content, Pipeline};

#[test]
fn main() {
let input_value_str = r#"