
Testing the model with new data #8

Open · magicalvoice opened this issue Apr 12, 2024 · 29 comments

@magicalvoice

Hi,

@SaoYear Thank you for the great work. I am new to the problem of SED. I have fine-tuned the model with my own data, and now I just want to test the final fine-tuned model on test audio files for which I have no ground truth. Is there any script to do that without preparing the .tsv files with onset, offset, event label, etc. in the DESED data format?

Basically, how do I use the model on completely unknown input audio?
If you could tell me the steps, it would be really helpful.

Thank you so much in advance!

@SaoYear
Member

SaoYear commented Apr 12, 2024

Hi,

Thanks for your interest!

If I understand you correctly, what you want is to run inference with the model on some unlabeled audio clips.

This is the same process we use to submit evaluation results for the DCASE challenge, and it is integrated in the DCASE baseline code as well as in the ATST-SED code, supported by PyTorch Lightning.

To do this:

  1. You might notice that in the config.yaml file (e.g., \train\configs\stage1.yaml in this repo), there are only eval_folder and eval_folder_44k and no eval_tsv.
    So first, modify these two entries to your own paths. If your audio is already 16 kHz, just change eval_folder to your own path. Otherwise, change eval_folder_44k and the script will automatically resample the audio to 16 kHz.

  2. Run the evaluation. For example, if you want to use the fine-tuned model, run:

python train_stage2.py --gpus 0, --eval_from_checkpoint YOUR_PRETRAINED_CKPT_PATH

The system will run inference on the data automatically and the predictions will be stored in your exp folder.
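For reference, a quick sanity check of the clips you put in eval_folder could look like this (a minimal sketch, not part of the repo; it assumes torchaudio is installed and that eval_16k_dir is a placeholder for your own path):

from pathlib import Path
import torchaudio

eval_16k_dir = Path("/path/to/your/eval_folder")  # placeholder: your own 16 kHz eval folder

for wav_path in sorted(eval_16k_dir.glob("*.wav")):
    info = torchaudio.info(str(wav_path))
    duration = info.num_frames / info.sample_rate
    # the DESED-style pipeline expects 16 kHz clips of about 10 s
    if info.sample_rate != 16000 or abs(duration - 10.0) > 0.1:
        print(f"{wav_path.name}: {info.sample_rate} Hz, {duration:.2f} s -> check this file")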

Hope these help : )

@magicalvoice
Author

Thank you so much @SaoYear. This really helped, much appreciated!!

@magicalvoice magicalvoice changed the title Testing the model with new data, having no groundtruth Testing the model with new data Apr 15, 2024
@magicalvoice magicalvoice reopened this Apr 29, 2024
@magicalvoice
Author

Hi @SaoYear,

I might be asking silly doubts, but I am getting this error when I tried testing in the way you described. Although I have ensured that all input files are 10 s in duration and the file sizes are the same, I don't know why I am getting this error or how to fix it.

Please help me @SaoYear

(dcase2023) empuser@server:~/ATST-SED-Scripts/ATST-SED/train$ python train_stage2.py --gpus 1 --eval_from_checkpoint exp/stage2/version_0/epoch=209-step=23100.ckpt
/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train

/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train/train_stage2.py(505)<module>()
-> configs, args, test_model_state_dict, evaluation = prepare_run()
(Pdb) c
loaded model: exp/stage2/version_0/epoch=209-step=23100.ckpt
at epoch: 209
Global seed set to 42
32
Loading ATST from: /nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/atst_as2M.ckpt
Loading student from: /nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train/exp/stage1/version_0/epoch=39-step=4400.ckpt
Model loaded
Loading ATST from: /nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/atst_as2M.ckpt
Loading teacher from: /nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train/exp/stage1/version_0/epoch=39-step=4400.ckpt
Model loaded
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used..
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used..
Trainer(limit_test_batches=1.0) was configured so 100% of the batches will be used..
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=gloo
All distributed processes registered. Starting with 1 processes

Testing DataLoader 0: 0%| | 0/32 [00:00<?, ?it/s]torch.Size([1, 1, 2505, 128])
torch.Size([1, 128, 626, 1])
shape of x original: torch.Size([1, 1001, 768])
shape of pos original torch.Size([1, 250, 768])
Traceback (most recent call last):
File "train_stage2.py", line 505, in
configs, args, test_model_state_dict, evaluation = prepare_run()
File "train_stage2.py", line 374, in single_run
trainer.test(desed_training)
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 794, in test
return call._call_and_handle_interrupt(
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
return function(*args, **kwargs)
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in _test_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1188, in _run_stage
return self._run_evaluate()
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1228, in _run_evaluate
eval_loop_results = self._evaluation_loop.run()
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
output = self._evaluation_step(**kwargs)
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 368, in test_step
return self.model.test_step(*args, **kwargs)
File "/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train/local/ultra_sed_trainer.py", line 529, in test_step
strong_preds_student, weak_preds_student = self.detect(sed_feats, atst_feats, self.sed_student)
File "/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train/local/ultra_sed_trainer.py", line 241, in detect
return model(self.scaler(self.take_log(mel_feats)), pretrained_feats)
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in call_impl
return forward_call(*args, **kwargs)
File "/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/desed_task/nnet/CRNN_e2e.py", line 94, in forward
embeddings = self.atst_frame(pretrain_x)
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in wrapped_call_impl
return self.call_impl(*args, **kwargs)
File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in call_impl
return forward_call(*args, **kwargs)
File "/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/desed_task/nnet/atst/atst_model.py", line 18, in forward
atst_x = self.atst.get_intermediate_layers(
File "/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/desed_task/nnet/atst/audio_transformer.py", line 212, in get_intermediate_layers
x, _, _, _, _, patch_length = self.prepare_tokens(x, mask_index=None, length=length, mask=False)
File "/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/desed_task/nnet/atst/audio_transformer.py", line 146, in prepare_tokens
x = x + pos
RuntimeError: The size of tensor a (1001) must match the size of tensor b (250) at non-singleton dimension 1
Testing DataLoader 0: 0%| | 0/32 [00:00<?, ?it/s]

@SaoYear
Member

SaoYear commented May 5, 2024

Hi, it seems like you made some modifications to audio_transformer.py (since your line 146 of audio_transformer.py is different from the line in this repo), and now the lengths of your positional embeddings and patch embeddings are not aligned (one is 1001 and the other is 250).

I will attempt to explain what happens in the prepare_tokens function, which might help you debug:

  1. We apply linear patching to transform every 4 consecutive frames into one patch embedding, which reduces the temporal resolution from 1001 to 1001 // 4 = 250;
  2. We use cut mode for the positional embeddings: we take 250-length trainable positional embeddings and add them to the patch embeddings (lines 143-144 here).

I would recommend printing the shapes of x and pos before the addition so you can see which length is wrong; both should be 250.
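For example, a minimal debugging sketch (the variable names follow your traceback; this is not code from the repo) would be to add, right before the failing line in prepare_tokens:

# in audio_transformer.py, prepare_tokens, just before the failing addition
print("patch embeddings x:", x.shape)    # expected to be roughly [batch, 250, 768]
print("positional emb pos:", pos.shape)  # expected to be roughly [1, 250, 768] (broadcastable)
x = x + pos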

@Angelalilyer

Angelalilyer commented Jun 16, 2024

I have also encountered this problem. May I ask how long (in seconds) the audio clips in eval_folder should be? My test clips are 10 s, but it seems the shapes cannot match.


audio, atst_feats, labels, padded_indxs, filenames = batch
print(audio.shape) #[1, 441882]
sed_feats = self.mel_spec(audio) #should be [1, 128, 624]
atst_feats = self.atst_norm(atst_feats) #should be [1, 64, 500]
print(sed_feats.shape) # torch.Size([1, 128, 626])
print(atst_feats.shape) # torch.Size([1, 64, 1001])

@SaoYear
Member

SaoYear commented Jun 17, 2024

The shape of your waveforms is incorrect: you should resample them to 16 kHz.

To do so, you could refer to the resample_data_generate_durations function (actually the resample_folder function in local.resample_folder) in the DESED baseline code. It will automatically create a resampled folder for you; then change eval_folder to the path of that folder and the evaluation should work.
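If you prefer not to go through the baseline script, a minimal resampling sketch with torchaudio would be something like the following (in_dir and out_dir are placeholders for your own paths; this is an illustration, not the repo's resample_folder):

from pathlib import Path
import torchaudio

in_dir = Path("/path/to/your/original_audio")    # placeholder: your 44.1 kHz (or other) clips
out_dir = Path("/path/to/your/eval_folder_16k")  # placeholder: point eval_folder here afterwards
out_dir.mkdir(parents=True, exist_ok=True)

for wav_path in sorted(in_dir.glob("*.wav")):
    wav, sr = torchaudio.load(str(wav_path))
    if wav.shape[0] > 1:                          # downmix to mono if needed
        wav = wav.mean(dim=0, keepdim=True)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    torchaudio.save(str(out_dir / wav_path.name), wav, 16000)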

@Angelalilyer

Hello! Thank you very much for your help! But I still have some questions:

  1. I checked my log output and found that the predicted categories only include the following types:
    "Alarm_bell_ringing, Blender, Cat, Dishes, Dog, Electric_shaver_toothbrush, Frying, Running_water, Speech, Vacuum_cleaner".
    The model I used is stage2_wo_external.ckpt. Is this a normal result?
  2. When I use stage2_w_external.ckpt, I get the error "RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory". It seems the checkpoint file is corrupted and I am unable to load it.

@SaoYear
Member

SaoYear commented Jun 18, 2024

  1. Yes, this ATST-SED model is designed for the DESED dataset. These 10 classes are exactly the classes defined by the DESED dataset (DCASE challenge task 4). If you want to recognize more classes, such as those in AudioSet Strong, you should refer to the ATST repo here instead of ATST-SED.
  2. The _w_external.ckpt is broken, but the _wo_external.ckpt performs very similarly, so you could use that one. Actually, we did not pay much attention to fine-tuning the _w_external checkpoint; it is only there for fair comparisons with other methods in our paper.

@Angelalilyer

Hello! Your reply was very helpful to me. I carefully reviewed the code for "ATST-Frame". If I want to obtain frame-level sound event detection for my test dataset, should I change and run this part of the code?

########################
Frame-level downstream tasks

DESED
please see shell/downstream/finetune_dcase
Strongly labelled AudioSet
please see shell/downstream/finetune_as_strong
#########################

The inference code is "audiossl/audiossl/methods/atsframe/downstream/train_as_strong.py"
The inference model is "atstframe_base.ckpt"
May I ask if my guess is correct? Thanks~~!!

@martineghiazaryan

martineghiazaryan commented Jul 3, 2024

Is there a way to run model inference on audio of different durations? Maybe by cutting the audio into chunks before feeding it to the model, or by changing something within the model to support longer audio?

@SaoYear
Member

SaoYear commented Jul 3, 2024

Yeah, you could refer to what we've done in the ATST-RCT system, last paragraph of section 3.

Quick summary:

  1. use a fixed-length window (say, W seconds) to slide through the long-duration audio (L seconds);
  2. keep a hop length (K seconds);
  3. this process gives you (L - W) / K + 1 (denoted N) W-second audio clips;
  4. run inference on these N audio clips;
  5. aggregate the results (you might average or take a logical OR over the overlapped frames) and pass them to the median filter; a rough splitting sketch is given below.
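A minimal sketch of the splitting step (not the exact ATST-RCT code; the 16 kHz sample rate, 10 s window and 8 s hop are example values you would adapt):

import math
import torch

def split_into_windows(wav, sr=16000, win_s=10.0, hop_s=8.0):
    # wav: 1D mono waveform tensor of L seconds
    win, hop = int(win_s * sr), int(hop_s * sr)
    n_windows = 1 + math.ceil(max(wav.numel() - win, 0) / hop)
    padded_len = (n_windows - 1) * hop + win
    wav = torch.nn.functional.pad(wav, (0, padded_len - wav.numel()))  # pad the tail
    starts = [i * hop for i in range(n_windows)]
    clips = torch.stack([wav[s:s + win] for s in starts])              # [N, win] clips
    return clips, starts

Each of the N clips can then be passed through the model like a normal 10 s clip, and the per-clip predictions aggregated back onto the full timeline.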

@martineghiazaryan

Okay, thank you! Is there an implementation of this already?

@SaoYear
Member

SaoYear commented Jul 3, 2024

You could refer to the ATST-RCT repo; I just uploaded a necessary file.

Please see test_step in the trainer file for this part of the implementation.

@martineghiazaryan

I guess I should change batched_decode_preds in the utils file? I do not understand all the changes I should make.

@SaoYear
Member

SaoYear commented Jul 4, 2024

Yeah, there are three steps:

  1. split the audio into shorter clips, as in test_step, lines 651-673;
  2. give them to the model (the forward function);
  3. unify the predictions (the pull_back_preds function in utils.py); a rough sketch of this pull-back step is given below.
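A minimal sketch of step 3 (my own illustration of the idea, assuming per-clip strong predictions of shape [N, n_classes, T]; the actual pull_back_preds in the RCT repo may differ):

import torch

def pull_back_frame_preds(clip_preds, starts, win_samples):
    # clip_preds: [N, C, T] frame-level predictions for N overlapping clips
    # starts: sample index where each clip begins in the (padded) waveform
    n, c, t = clip_preds.shape
    frames_per_sample = t / win_samples
    total_frames = int(round(starts[-1] * frames_per_sample)) + t
    out = clip_preds.new_zeros(c, total_frames)
    counts = clip_preds.new_zeros(total_frames)
    for i, s in enumerate(starts):
        f0 = int(round(s * frames_per_sample))
        out[:, f0:f0 + t] += clip_preds[i]
        counts[f0:f0 + t] += 1
    return out / counts.clamp(min=1)  # average over the overlapped frames

The unified [C, total_frames] predictions can then go through the usual thresholding and median filtering.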

@martineghiazaryan

martineghiazaryan commented Jul 8, 2024

I am only trying to run inference with the student model using stage_2_wo_external. I modified the train_stage2 file and was able to get inferences on 10-second audios. Now I am trying to change the duration following your previous steps; the audio-splitting step seems to have been successful, but I did not get which forward function I should give the split clips to (crnn_e2e?).

here are my logs

mel_spec = desed_training.mel_spec(audio)
atstfrnorm = desed_training.atst_norm(atstft)

print(mel_spec.shape)
print(atstfrnorm.shape)

strong_preds_student, _ = desed_training.detect(torch.unsqueeze(mel_spec, 0), torch.unsqueeze(atstfrnorm, 0), desed_training.sed_student)
    
Audio type: <class 'torch.Tensor'>
Audio shape: torch.Size([284672])
Audio in seconds: 17.792
Padd in seconds: 1
Padding length: 3328
Audio length (Padding): 288000
Starting positions: tensor([     0., 128000.])
Ending positions: tensor([160000., 288000.])
Decomposed audio: torch.Size([2, 160000])
mel specs torch.Size([2, 128, 626])
atst feats torch.Size([64, 1780])

Here, when running detection on the features, I get an error:

File "/train/train_stage2.py", line 536, in <module>
   single_run(
 File "/train/train_stage2.py", line 264, in single_run
   strong_preds_student, _ = desed_training.detect(torch.unsqueeze(mel_spec, 1), torch.unsqueeze(atstfrnorm, 0), desed_training.sed_student)
 File "train/local/ultra_sed_trainer.py", line 239, in detect
   return model(self.scaler(self.take_log(mel_feats)), pretrained_feats)
 File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
   return forward_call(*args, **kwargs)
 File "ATST-SED/desed_task/nnet/CRNN_e2e.py", line 88, in forward
   x = self.cnn(x)
 File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
   return forward_call(*args, **kwargs)
 File "ATST-SED/desed_task/nnet/CNN.py", line 113, in forward
   x = self.cnn(x)
 File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
   return forward_call(*args, **kwargs)
 File "lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
   input = module(input)
 File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
   return forward_call(*args, **kwargs)
 File "lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
   return self._conv_forward(input, self.weight, self.bias)
 File "lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
   return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [2, 1, 128, 1, 626]

For the last step, I added pull_back_preds to the current repo so I could call it in the batched_decode_preds function, as presented in the RCT repo.

Thank you for the help

@magicalvoice
Author

Hi, has anyone written separate code for just inference, i.e. loading the model and trained weights and running it on 10 s audio files to get per-file predictions (maybe with some post-processing too)?

If anyone has done this, please help me with how to do it.

Thank you.

@SaoYear
Member

SaoYear commented Jul 9, 2024

@martineghiazaryan @magicalvoice
I will write a quick inference script

@martineghiazaryan

@SaoYear hey, any news on the script?

@SaoYear
Member

SaoYear commented Jul 11, 2024 via email

@SaoYear
Member

SaoYear commented Jul 17, 2024

Hey guys, sorry for the delay, but I have added an inference file in the latest commit.

You can use the inference file by running:
python -m inference
The path of the waveform to run inference on can be changed inside the code.

If you have any other problem, please let me know.

@magicalvoice
Author

@SaoYear First of all, thank you so much for your kind help!!

I have another question: how do I interpret the result in inference_result.png? It shows True/False for each chunk based on a threshold, but which class does each row belong to?

Also, I want to clarify a doubt. I read your paper "Fine-tune the pretrained ATST model for sound event detection"; it basically trains and fine-tunes the ATST-Frame model with the help of a CRNN, because DESED is quite small for fine-tuning ATST-Frame. But then who is the teacher and who is the student in this case? I am confused because the results for both stages have a student and a teacher.

I am stuck with a lot of questions. Please help, thanks in advance!!

@SaoYear
Member

SaoYear commented Jul 17, 2024

  1. To decode the class names, I have imported the class_dict dictionary in the inference.py file, and each class name is assigned to a row index of the sed_results matrix.

  2. Yes, this work focuses on fine-tuning the pretrained model for the small-scale DESED dataset.

  3. As for your confusion about the student and the teacher:
    a. The student and teacher in this work are concepts from the MeanTeacher method, a semi-supervised method, so both refer to the two models in that method. You could refer to the original MeanTeacher work, but basically, the teacher model is just the exponential moving average (EMA) of the student model (a minimal sketch of the EMA update is given below).

    b. Therefore, whenever we use the MeanTeacher semi-supervised method, there is a student and a teacher model. In stage 1 we use MeanTeacher, so there is a student and a teacher, and the teacher in stage 1 is the EMA of the student in stage 1. In stage 2 we also use MeanTeacher, so there are again a student and a teacher, and the teacher in stage 2 is the EMA of the student in stage 2.

    c. Using MeanTeacher for SED stems from JiaKai's work, a winning system of the 2018 DCASE challenge. Since then, student and teacher models have appeared in SED systems again and again, because they all use the MeanTeacher method.

    d. Specifically, in this work we find that the previous semi-supervised methods are not only useful for small models but also helpful for fine-tuning pretrained models.
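For intuition, the teacher update in MeanTeacher is essentially the following (a minimal sketch, not the exact trainer code; the EMA factor, e.g. 0.999, is a hyperparameter):

import torch

@torch.no_grad()
def update_teacher(student, teacher, ema_factor=0.999):
    # the teacher's weights are an exponential moving average of the student's
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.data.mul_(ema_factor).add_(s_param.data, alpha=1.0 - ema_factor)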

@HeChengHui

@SaoYear
Thank you for your inference code. I tried using my own audio with stage_2_wo_external and I got the following results:
image

Is it supposed to do this? The audio has parts with people talking.

@magicalvoice
Author

Hi @SaoYear, thank you, I understood. After stage 2 training, i.e. fine-tuning both the CRNN and ATST-Frame, can I use only the CRNN weights separately? Is there a way, and if yes, how?

@SaoYear
Member

SaoYear commented Jul 19, 2024

@magicalvoice I never tried that, but I suppose that using only the CRNN part of ATST-SED would not be better than a CRNN trained from scratch.

If you want to do that, you could just comment out the ATST features and the merge-layer MLP, and feed the CNN output to the RNN directly.

The CNN trained in ATST-SED is regarded as compensation for some local features that are ignored by FrameATST, and the RNN in ATST-SED is trained to learn the fused features from both FrameATST and the CNN. If you use just the CRNN part of the entire model, both the CNN and the RNN would be weakened, and therefore the overall performance would be weakened.

@SaoYear
Member

SaoYear commented Jul 19, 2024

Hi @HeChengHui, would you mind posting the wav file? There could be some problem with the inference process.

@HeChengHui

@SaoYear
mixed.zip

The audio is >10 s, but the code seems to handle it by splitting and overlapping.

@SaoYear
Member

SaoYear commented Jul 23, 2024

Hi @HeChengHui , thanks for sharing the wav.

The splitting and overlapping are intentional in the inference. I have fixed some problems in the original inference code:

  1. the model is set to evaluation mode (model.eval()) after loading;
  2. the visualization quality of the SED results is improved, and the class labels are added to the plot, as requested by @magicalvoice.

Now the inference looks fine.

According to the audio clip you provided, the SED results look like:
image
