Our code works with data presented in the following format:
- Reports directory, where every report is stored in a separate JSON file named `id.json`, where `id` is the report id.
- CSV file with events of attaching a report to a certain group.
Every report in the reports directory is stored in a separate file in the following format:
```json
{
  "id": 566,
  "timestamp": 1234567891234,
  "class": "java.lang.Exception",
  "frames": [
    [
      "java.util.ArrayList.get",
      "com.company.Class1.method1",
      "com.company.Class2.method2",
      "com.company.Class1.method2"
    ]
  ]
}
```
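For illustration, reports stored in this format can be read with a few lines of Python. The following is a minimal sketch, not part of the repository; `load_reports` and the `reports_dir` path are illustrative names.

```python
import json
from pathlib import Path

def load_reports(reports_dir):
    """Read every <id>.json file and return a dict mapping report id to report."""
    reports = {}
    for path in Path(reports_dir).glob("*.json"):
        with open(path) as f:
            report = json.load(f)
        reports[report["id"]] = report
    return reports

reports = load_reports("reports_dir")
# Each report holds the exception class and a list of stack frame lists.
print(reports[566]["class"], reports[566]["frames"][0][:2])
```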
Every event in the events file means that the report with id `rid` was attached to the group with id `iid` at the time moment `ts`. The additional column `label` shows whether the event was done automatically or manually: `label=True` means manual labeling, and `label=False` means an automated attach.
Example of an events file:
```csv
ts,rid,iid,label
906750420,755,1,True
917921167,1106,23,True
922132797,1329,45,False
922133018,1331,31,True
```
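As an illustration only (not a repository script), the events file can be read with pandas, assuming the column layout shown above:

```python
import pandas as pd

# Columns follow the format above: ts, rid, iid, label.
events = pd.read_csv("path_to_events.csv")

# pandas parses the True/False values in the label column as booleans,
# so manual and automated attach events can be separated directly.
manual = events[events["label"]]
automatic = events[~events["label"]]

print(len(manual), "manual events,", len(automatic), "automated events")
```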
The NetBeans dataset was introduced in the "S3M: Siamese Stack (Trace) Similarity Measure" paper and is stored on Figshare in JSON format.
To convert this JSON file to our format, please use the following converter:
```bash
python src/scripts/state_to_events_converter.py \
    --state_path path_to_netbeans.json \
    --reports_path dir_path_for_saving_reports \
    --events_path path_to_events.csv
```
This script produces a directory with reports and a CSV file with events.
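If needed, the converter output can be sanity-checked with a small script such as the sketch below (not part of the repository; `check_converted_data` is an illustrative helper). It verifies that every `rid` in the events CSV has a matching report file:

```python
import csv
from pathlib import Path

def check_converted_data(events_path, reports_dir):
    """Warn about events whose report id has no matching <id>.json file."""
    reports_dir = Path(reports_dir)
    with open(events_path) as f:
        for row in csv.DictReader(f):
            if not (reports_dir / f"{row['rid']}.json").exists():
                print("Missing report file for event:", row)

check_converted_data("path_to_events.csv", "dir_path_for_saving_reports")
```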
Example of running baseline similarity methods on the NetBeans dataset:
```bash
python src/similarity/scripts/similarity.py \
    --method lerch \
    --data_name netbeans \
    --actions_file path_to_events.csv \
    --reports_path reports_dir \
    --train_start 350 \
    --train_longitude 3850 \
    --val_start 4200 \
    --val_longitude 140 \
    --test_start 4340 \
    --test_longitude 700 \
    --model_save_path path_to_model.pkl
    # --forget_days 62 # only for our dataset
    # --hyp_top_issues 100 # only for our dataset
    # --hyp_top_reports 100 # only for our dataset
```
The script will produce state files in the `event_state` directory, located at the same level as `src`. These precomputed states will speed up data reading in subsequent runs.
To collect data for aggregation training and testing, we need to run the following script:
```bash
python src/similarity/scripts/similarity.py \
    --data_name netbeans \
    --actions_file path_to_events.csv \
    --reports_path reports_dir \
    --data_start 4340 \
    --data_longitude 700 \
    --model_path path_to_model.pkl \
    --dump_save_path test_dump_path.json
    # --forget_days 62 # only for our dataset
```
The example above shows how to collect test data for aggregation model evaluation.
Once the train and test data for the aggregation model have been collected, we can train it by running the following script:
```bash
python src/aggregation/scripts/aggregation.py \
    --train_path train_dump.json \
    --test_path test_dump.json \
    --model_save_path path_to_aggregation_model.pkl
```
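The trained model is saved to the path given by `--model_save_path`. Assuming the `.pkl` file is a standard Python pickle (an assumption, not confirmed by the scripts above), it can be loaded back like this:

```python
import pickle

# Load the trained aggregation model back into memory.
# Assumption: the .pkl file was written with Python's pickle module.
with open("path_to_aggregation_model.pkl", "rb") as f:
    aggregation_model = pickle.load(f)

print(type(aggregation_model))
```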