KR20210064018A

KR20210064018A - Acoustic event detection method based on deep learning

Info

Publication number: KR20210064018A
Application number: KR1020200035181A
Authority: KR
Inventors: 김홍국; 박인영
Original assignee: 광주과학기술원
Priority date: 2019-11-25
Filing date: 2020-03-23
Publication date: 2021-06-02
Also published as: KR102314824B1

Abstract

Disclosed is a method that extracts a composite feature value from sound data containing at least one sound source, classifies each of the at least one sound source included in the composite feature value using an artificial intelligence model based on a Fast R-CNN-LSTM, and detects an event using the classified sound sources. The present invention can thereby provide improved real-time detection based on sound information, in addition to situation detection using image information.

Description

Acoustic event detection method in a deep learning-based detection situation {ACOUSTIC EVENT DETECTION METHOD BASED ON DEEP LEARNING}

The present disclosure relates to a method and system for detecting acoustic events in a deep learning-based detection situation. By detecting the frequency regions of acoustic signals with different characteristics using a fast regional convolution-based network (F-R-CNN), it can detect composite sounds in real time regardless of distance.

Technology for detecting acoustic events and classifying their types has been studied continuously so that, combined with context-aware technology, it can be applied to judging a user's surrounding environment.

Conventional single-sound detection models resulting from this research detect only clearly prominent sounds, so information is lost when they are applied to composite sound sources that occur simultaneously. In other words, their accuracy in detecting composite sound sources is low.

In addition, conventional acoustic event detection systems lose accuracy when, depending on the environment, noise louder than the target sound source is picked up, and the accuracy of sound source detection also decreases with distance from the point where the sound source occurs.

An object of the present disclosure is to train an artificial intelligence model using a fast regional convolution-based network (F-R-CNN) and to detect acoustic events in real time in a detection situation.

Another object of the present disclosure is composite acoustic event detection, which extracts acoustic features regardless of distance and detects the occurrence of multiple sounds in an acoustic detection situation.

The present disclosure provides an event detection method using sound data, the method comprising: receiving sound data including at least one sound source; extracting a composite acoustic feature included in the sound data; classifying each of the at least one sound source included in the composite acoustic feature using an artificial intelligence model; and detecting an event based on a combination of the classified sound sources.

In addition, in the present disclosure, classifying the sound source included in the composite acoustic feature using the artificial intelligence model may include: obtaining a convolutional feature map of the static feature and classifying the sound source by extracting a feature vector of a preset region of interest from the convolutional feature map; and obtaining a feature map of the differential feature, extracting a feature vector from the feature map of the differential feature, and classifying the sound source included in the feature vector.

Using a fast regional convolution-based network (F-R-CNN), the present disclosure can provide improved real-time detection based on sound information, in addition to situation detection using image information.

In addition, because the present disclosure captures acoustic features regardless of distance, it compensates for the shortcomings of existing sound-based detection systems and can accurately detect acoustic events by detecting the occurrence of multiple sounds.

FIG. 1 shows a flowchart according to an embodiment of the present disclosure.
FIG. 2 illustrates the algorithm flow according to an embodiment of the present disclosure.
FIG. 3 illustrates the algorithm flow according to an embodiment of the present disclosure.
FIG. 4 shows an Attention-LSTM according to an embodiment of the present disclosure.
FIG. 5 illustrates connectionist temporal classification (CTC) according to an embodiment of the present disclosure.
FIG. 6 shows performance according to test results of the present disclosure.
FIG. 7 shows performance according to test results of the present disclosure.
FIG. 8 shows performance according to test results of the present disclosure.

Hereinafter, embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. Identical or similar elements are given the same reference numerals regardless of the figure, and redundant descriptions are omitted.

The accompanying drawings are intended only to aid understanding of the embodiments disclosed in this specification. They do not limit the technical idea disclosed herein, which should be understood to cover all changes, equivalents, and substitutes falling within the spirit and technical scope of the present disclosure.

Terms including ordinal numbers, such as first and second, may be used to describe various elements, but the elements are not limited by these terms. The terms are used only to distinguish one element from another.

The terms used in the present disclosure are defined as follows.

Acoustic event detection (AED), regional convolutional neural networks (R-CNN), long short-term memory (LSTM), and connectionist temporal classification (CTC).

The present disclosure proposes an acoustic event detection system that operates across various noise environments and distances. The method uses a regional convolution-based network (R-CNN) that detects the frequency regions of time-series acoustic events with different characteristics in noisy environments. By using a fast regional convolution-based network (F-R-CNN), the present disclosure can be implemented in real time.

This enables semi-supervised learning without manually labeling the region-of-interest information in each frequency band.

In addition, according to the present disclosure, the input to the artificial intelligence model may be designed to use a composite feature value (multi-feature) in order to solve the problem of event detection performance depending on signal strength, which varies with distance.

Additionally, a long short-term memory network is applied to capture the time-series characteristics of the signal, and an attention algorithm reflects the duration characteristics of sound sources to improve detection performance. Finally, connectionist temporal classification (CTC) is applied to detect a single event containing composite sound sources.

FIG. 1 is described below.

FIG. 1 shows a flowchart of the present disclosure.

Referring to FIG. 1, the acoustic event detection system of the present disclosure may include a step (S10) of receiving sound data including at least one sound source.

Specifically, a sound source refers to the source of a sound, and the sound data handled by the acoustic event detection system may include at least one sound source.

For example, a single piece of sound data may include any sound originating from a specific object or living thing, such as conversation, objects colliding, or a gunshot. The sound data may also include acoustic feature data extracted from the sound source.

To obtain accurate sound data, noise removal may be performed during preprocessing. When the spectral frequency band of the noise differs from that of the event, or when each event has its own characteristic frequency band, an R-CNN (region-convolutional neural network) may be used.

The acoustic event detection system of the present disclosure may include a step (S20) of extracting a composite feature value included in the sound data.

Here, the composite feature value may include a static feature, which is a spectrogram of the received sound data, and a differential feature based on the difference between the sound data and the sound data from a preset time earlier.

Specifically, the static feature may refer to a log mel-band energy-based image containing the features of the sound data.

Log mel-band energy represents the characteristics of an acoustic signal well, but it has limitations in analyzing composite acoustic signals. Sound data is generally normalized by its power, but feature information may be damaged when the intensity weakens with distance or when noise is mixed in.

Therefore, considering that sound data has time-series characteristics, the present disclosure can measure the change in log mel-band energy as the difference between the current sound data and the preceding data.

Because the differential feature generated in this way is based on the amount of change between the spectra, expressed as images, of the current sound data and the sound data from a preset time earlier, it can reflect the characteristics of the sound data regardless of the distance from the point where the sound originated.
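The distance argument above can be checked with a small numerical sketch. It assumes, as a simplification not stated in the disclosure, that distance scales every band energy by the same constant gain; the function names and the toy energy values are hypothetical. Under that assumption the log energies (static feature) shift by a constant, while the frame-to-frame difference (differential feature) is unchanged.

```python
import math

def log_energy(frames):
    """Log band energies per frame (a stand-in for log mel-band energy)."""
    return [[math.log(e) for e in frame] for frame in frames]

def differential(static, lag=1):
    """Difference between each frame and the frame `lag` steps earlier."""
    return [
        [cur - prev for cur, prev in zip(static[t], static[t - lag])]
        for t in range(lag, len(static))
    ]

# Toy band energies: 3 frames x 2 bands (hypothetical values).
near = [[4.0, 1.0], [8.0, 2.0], [2.0, 2.0]]
# The same sound heard from farther away: every energy scaled by 0.01.
far = [[e * 0.01 for e in frame] for frame in near]

static_near, static_far = log_energy(near), log_energy(far)
delta_near, delta_far = differential(static_near), differential(static_far)

# The static features differ by a constant offset; the differential ones agree.
max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(delta_near, delta_far)
    for a, b in zip(row_a, row_b)
)
```

Additive noise and frequency-dependent attenuation are not cancelled by this subtraction, so the sketch illustrates only the uniform-gain case.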

After the composite feature value is obtained, the present disclosure may classify each of the at least one sound source included in the composite feature value using an artificial intelligence model (S30), and may detect an event based on a combination of the classified sound sources (S40).

The algorithm of the present disclosure is described in detail below with reference to FIGS. 2 to 5.

The present disclosure discloses a Fast R-CNN-LSTM-based real-time detection system that uses region information for various types of deep learning output layers.

Specifically, according to the present disclosure, a fast regional convolution long short-term memory network (Fast R-CNN-Attention LSTM) may be trained with training data in which sound data is expressed as images, generating the FFast R-CNN-LSTM model (31).

The present disclosure may select training data for which the decoded output of the generated FFast R-CNN-LSTM model (31) has a score at or above a certain reference value, and may use the selected training data to generate the SFast R-CNN-LSTM model (32) through semi-supervised region labeling. Because the SFast R-CNN-LSTM model (32) generated in this way can itself extract the regions of the spectrum that characterize a sound source, the cost of hand labeling can be minimized.
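The selection step above can be sketched as a simple confidence filter. The record fields (`clip`, `score`, `predicted_region`) and the threshold value are hypothetical; the disclosure does not specify how the score is computed or what reference value is used.

```python
def select_for_region_labeling(decoded, threshold):
    """Keep clips whose decoding score reaches the threshold.

    The region predicted by the first-stage model is kept as a pseudo-label,
    so no hand-drawn region annotation is needed for the retained clips.
    """
    return [
        {"clip": d["clip"], "region_label": d["predicted_region"]}
        for d in decoded
        if d["score"] >= threshold
    ]

# Hypothetical first-stage outputs: scores and (low_band, high_band) regions.
decoded = [
    {"clip": "a.wav", "score": 0.92, "predicted_region": (10, 42)},
    {"clip": "b.wav", "score": 0.40, "predicted_region": (0, 63)},
    {"clip": "c.wav", "score": 0.75, "predicted_region": (5, 20)},
]
selected = select_for_region_labeling(decoded, threshold=0.75)
```

Only the clips that pass the filter would be used to train the second-stage model; the low-scoring clip is left out rather than hand-labeled.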

This is described in detail below with reference to FIG. 2.

In general, a convolutional neural network (CNN) is an artificial intelligence model suited to image processing. To use sound data as CNN input, the present disclosure may collect multiple time points and assemble multiple frames of sound data into a single image for image processing.
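One way to assemble frames into images is a sliding window over the per-frame feature vectors. This is a minimal sketch; the window length, hop size, and function name are assumptions, since the disclosure does not give these values.

```python
def frames_to_images(frames, frames_per_image, hop):
    """Group consecutive feature frames into fixed-size 2-D 'images'.

    Each image is a list of `frames_per_image` frames (time x band).
    A trailing remainder shorter than one window is dropped.
    """
    last_start = len(frames) - frames_per_image
    return [
        frames[start:start + frames_per_image]
        for start in range(0, last_start + 1, hop)
    ]

# Five frames with two band energies each (toy values).
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
images = frames_to_images(frames, frames_per_image=3, hop=2)
```

With a hop smaller than the window, neighbouring images overlap, which keeps events that straddle a window boundary visible in at least one image.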

The artificial intelligence model of the present disclosure may include a sound-based event detection algorithm with a structure that connects a CNN to an RNN (LSTM).

Referring to FIG. 2, the algorithm of the present disclosure may include the FFast R-CNN-LSTM model (31), the SFast R-CNN-LSTM model (32), and an artificial intelligence model (33) that reflects the differential feature.

Here, the FFast R-CNN-LSTM (31) may be a model obtained by pre-training an existing fast regional convolutional network (Fast R-CNN-LSTM) with unlabeled training data at full-band height.

Specifically, the fast regional convolutional neural network (Fast R-CNN) refers to a method proposed to remedy the bottleneck of R-CNN, in which bounding box proposals are generated at every location of the image where feature data may exist. Unlike the earlier R-CNN, the individual bounding box proposals are not each passed through the CNN; instead, the entire image is passed through the CNN once, and object detection is performed on the resulting feature map.
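The key saving described above is that the feature map is computed once and each region of interest is then pooled from it. The sketch below shows RoI max pooling on a small, hand-written feature map; the map values, the RoI coordinates, and the function names are illustrative assumptions, and a real Fast R-CNN would obtain the map from convolutional layers.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one region of a shared feature map to a fixed out_h x out_w grid.

    roi = (top, left, bottom, right), with bottom and right exclusive.
    """
    top, left, bottom, right = roi
    height, width = bottom - top, right - left
    pooled = []
    for i in range(out_h):
        r0 = top + (i * height) // out_h
        r1 = top + ceil_div((i + 1) * height, out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * width) // out_w
            c1 = left + ceil_div((j + 1) * width, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# One feature map, computed once (toy values), reused for two proposals.
shared_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
full_band = roi_max_pool(shared_map, (0, 0, 4, 4), 2, 2)
sub_band = roi_max_pool(shared_map, (1, 1, 3, 4), 2, 2)
```

Both proposals produce a fixed-size output regardless of their original size, which is what lets later layers treat regions of different extent uniformly.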

The present disclosure may also include the SFast R-CNN-LSTM model (32).

Here, the SFast R-CNN-LSTM model (32) can itself extract the regions of the spectrum that characterize a sound source.

Specifically, to train the SFast R-CNN-LSTM model (32), the present disclosure may select suitable training data that obtained a score at or above a preset value when tested with the FFast R-CNN-LSTM. The SFast R-CNN-LSTM model (32) may include a Fast R-CNN-LSTM that uses semi-supervised region labeling. Because the SFast R-CNN-LSTM model (32) uses the selected training data, the cost of hand labeling can be minimized.

The detection system of the present disclosure may also include an artificial intelligence model (33) that reflects the differential feature.

Here, the artificial intelligence model (33) that reflects the differential feature may be trained in parallel with the SFast R-CNN-LSTM model (32). It may also share some layers used for sound detection with the SFast R-CNN-LSTM model (32) so that a single event detection is performed.

The shared layers may include a CTC algorithm layer.

The artificial intelligence model (33) that reflects the differential feature includes a layer in which a CNN and an LSTM are connected; when the differential feature is provided to this CNN-RNN (LSTM) algorithm, it can generate a feature vector reflecting the differential feature. Here, the differential feature includes the amount of change obtained by subtracting the image information of the sound data at an earlier time (T-1) from the image information of the sound data at a given time (T) extracted from the sound source, and the artificial intelligence model (33) may be trained using this differential feature value.

FIG. 3 is described below.

FIG. 3 shows the processing flow of the sound detection system of the present disclosure.

The sound detection system of the present disclosure may receive sound data including at least one sound source and extract a composite feature value included in the sound data (S20). The static feature among the extracted composite feature values is input to the FFast R-CNN-LSTM model (31), and, based on the output of the FFast R-CNN-LSTM model (31), the preset region of interest (RoI) set in step S20 may be reset.

The detection system of the present disclosure may provide the static feature of the sound data to the SFast R-CNN-LSTM model (32) and classify the sound source included in the composite feature value by extracting a feature vector of the reset region of interest from the convolutional feature map of the static feature.

In parallel with the SFast R-CNN-LSTM model (32), the artificial intelligence model (33) that reflects the differential feature may use the differential feature of the sound data to extract a feature vector from a convolutional feature map and classify the sound source included in the composite feature value.

Here, the artificial intelligence model (33) reflecting the differential feature and the SFast R-CNN-LSTM model (32) reflecting the static feature may share specific layers to perform the step (S30) of classifying each of the at least one sound source included in the composite feature value and the step (S40) of detecting an event based on a combination of the classified sound sources.

Specifically, the shared layers may include an Attention-LSTM algorithm, which reflects time-series and regional features, and a connectionist temporal classification (CTC) algorithm.

The Attention-LSTM algorithm and the CTC algorithm may be included in each of the models (31, 32, 33) shown in FIG. 2 to perform sound detection.

The Attention-LSTM and CTC algorithms are described below with reference to FIGS. 4 and 5.

FIG. 4 shows the attention algorithm of the present disclosure.

First, event detection in the present disclosure must be able to detect events of various lengths from the acquired sound data.

The present disclosure may extract regional features of the sound data from the feature map and apply an attention algorithm that assigns specific weights to those regional features, increasing sound detection accuracy. The attention algorithm can be trained by unsupervised learning, which requires no labeling.
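The weighting of regional features can be illustrated with softmax attention pooling, a common form of attention. This is a sketch under assumptions: the disclosure does not say how the attention scores are produced, so they are supplied here directly as hypothetical inputs, and the feature vectors are toy values.

```python
import math

def softmax(scores):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features, scores):
    """Weight each regional feature vector by its attention weight and sum them."""
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, features))
        for d in range(dim)
    ]
    return weights, pooled

# Two regional feature vectors; the first region is scored as more relevant.
regional_features = [[1.0, 0.0], [0.0, 1.0]]
weights, pooled = attention_pool(regional_features, scores=[2.0, 0.0])
```

In a trained model the scores would come from learned parameters rather than being fixed by hand; the pooling step itself stays the same.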

FIG. 5 shows the CTC (connectionist temporal classification)-based post-processing procedure.

The detection system of the present disclosure may use CTC to label the sound data at each time step, or frame by frame, and thereby detect the final event.

For example, suppose three frames of sound data are received and the per-frame sound source classification results are (i) car tire noise, (ii) vehicle impact noise, and (iii) a human voice. The detection system can then use the three frames to output 'traffic accident' as the final event. CTC may be used as the method for determining the final event.
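The traffic-accident example can be sketched with the collapse rule used in CTC best-path decoding: consecutive repeated labels are merged and blank labels are dropped. The label names, the blank symbol, and the rule table that maps a sound sequence to an event are hypothetical; the disclosure does not specify how the collapsed sequence is mapped to a final event.

```python
BLANK = "-"  # hypothetical symbol for the CTC blank label

def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeats, then drop blanks (CTC best-path collapse)."""
    collapsed = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed

# Hypothetical lookup from a sequence of sound sources to a final event.
EVENT_RULES = {
    ("tire_noise", "impact_noise", "human_voice"): "traffic_accident",
}

def detect_event(frame_labels):
    """Return the event matching the collapsed sequence, or None if no rule fits."""
    return EVENT_RULES.get(tuple(ctc_collapse(frame_labels)))

# Per-frame labels (already the most likely label for each frame).
frame_labels = ["tire_noise", "tire_noise", "-", "impact_noise",
                "impact_noise", "human_voice"]
event = detect_event(frame_labels)
```

A blank between two identical labels keeps them as separate occurrences, which is how CTC distinguishes one long sound from the same sound occurring twice.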

FIGS. 6 to 8 show performance results from tests of the present disclosure.

FIG. 6 shows a comparison of results with and without post-processing. Referring to FIG. 6, the F1 score is higher when post-processing is applied.

The F1 score is highest when the composite feature value and the attention algorithm are both used.

FIGS. 7 and 8 show statistics indicating that the algorithm of the present disclosure performs best when it uses the composite feature value, which includes both the static feature and the differential feature.

In particular, the model using the differential feature performs far better in noisy environments than the one using the static feature.

The present disclosure described above can be implemented as computer-readable code on a medium on which a program is recorded. Computer-readable media include all kinds of recording devices that store data readable by a computer system. Examples include a hard disk drive (HDD), solid state disk (SSD), silicon disk drive (SDD), ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage device. The computer may also include the processor of a terminal.

Claims

An event detection method using sound data, comprising:
receiving sound data including at least one sound source;
extracting a composite feature value included in the sound data;
classifying each of the at least one sound source included in the composite feature value using an artificial intelligence model; and
detecting an event based on a combination of the classified sound sources.

The method of claim 1,
wherein the composite feature value includes a static feature, which is a spectrogram of the received sound data, and a differential feature based on a difference between the sound data and sound data from a preset time earlier.

The method of claim 2,
wherein the artificial intelligence model includes: an FFast R-CNN-LSTM, obtained by pre-training a fast regional convolution long short-term memory network (Fast R-CNN-LSTM) with unlabeled training data at full-band height; an SFast R-CNN-LSTM, trained on training data selected using the FFast R-CNN-LSTM; and an artificial intelligence model that reflects the differential feature.

The method of claim 3,
wherein classifying the sound source included in the composite feature value using the artificial intelligence model includes:
obtaining a convolutional feature map of the static feature using the FFast R-CNN-LSTM, and classifying the sound source by extracting a feature vector of a preset region of interest from the convolutional feature map.

The method of claim 4,
wherein classifying the sound source by extracting a feature vector of a preset region of interest from the convolutional feature map further includes:
resetting the region of interest based on the feature vector of the preset region of interest output by the FFast R-CNN-LSTM; and
extracting, by the SFast R-CNN-LSTM, a feature vector of the reset region of interest from the convolutional feature map of the static feature to classify the sound source included in the composite feature value.

The method of claim 5,
wherein detecting an event based on a combination of the classified sound sources includes:
detecting a final event by labeling each time step or frame of the sound data using connectionist temporal classification (CTC).

The method of claim 3,
wherein classifying the sound source included in the composite feature value using the artificial intelligence model includes:
obtaining, by the artificial intelligence model that reflects the differential feature, a feature map of the differential feature, extracting a feature vector from the feature map of the differential feature, and classifying the sound source included in the feature vector.

The method of claim 7,
wherein detecting an event based on a combination of the classified sound sources includes:
detecting a final event by labeling each time step or frame of the sound data using connectionist temporal classification (CTC).

The method of claim 3,
wherein classifying the sound source included in the composite feature value using the artificial intelligence model includes:
obtaining, by the FFast R-CNN-LSTM, a convolutional feature map of the static feature, extracting a feature vector of a preset region of interest from the convolutional feature map, and resetting the preset region of interest;
obtaining, by the SFast R-CNN-LSTM, a convolutional feature map of the static feature and extracting a feature vector of the reset region of interest;
obtaining, by the artificial intelligence model that reflects the differential feature, a feature map of the differential feature and extracting a feature vector from the feature map of the differential feature;
classifying the sound source included in the feature vector; and
detecting a final event by labeling each time step or frame of the sound data using connectionist temporal classification (CTC).