KR102079359B1

KR102079359B1 - Process Monitoring Device and Method using RTC method with improved SAX method

Info

Publication number: KR102079359B1
Application number: KR1020180081423A
Authority: KR
Inventors: 이인석; 백준걸
Original assignee: 고려대학교 산학협력단
Priority date: 2017-07-13
Filing date: 2018-07-13
Publication date: 2020-02-20
Also published as: KR20190008515A

Abstract

A process monitoring method is disclosed. The method is performed by a process monitoring device and comprises: (a) collecting time series data measured through a process; (b) symbolizing the time series data with respect to arbitrary break points, whose number is determined by a preset alphabet size, to generate symbolized data; (c) calculating the mean square error (MSE) between the symbolized data and the time series data; (d) determining, as the optimal break points, the break points of the symbolized data for which the calculated MSE is smallest; and (e) detecting whether the process is abnormal by applying a real-time contrast (RTC) technique to the symbolized data obtained at the optimal break points.

Description

Process monitoring device and method using RTC method with improved SAX method

The present invention relates to a process monitoring apparatus and method, and more particularly to a process monitoring apparatus and method that preprocess the input data of the RTC technique so that the RTC technique determines more efficient decision boundaries.

As advanced technologies develop, manufacturing processes become more complex and diverse, which makes process monitoring an increasingly important task. Statistical process control (SPC), the conventional fault detection method, can be inefficient or produce unreliable results. Machine learning has been used to classify normal and abnormal states in order to address these problems, but such models cannot incorporate new data that better reflects the current state of the manufacturing process, so their performance in detecting abnormal observations may degrade.

The real-time contrasts (RTC) technique was proposed to overcome this limitation of supervised learning models: whenever new data is collected, it trains a new classifier and computes a statistic from the training result. The conventional RTC control chart uses random forests as the classifier. This classifier provides easily interpretable information about the criteria that separate normal from abnormal states, and it supports cause analysis through variable importance. However, a random forest classifier produces a discrete statistic that depends on the number of decision trees, and a discrete statistic can take the same value even when the manufacturing process shows a larger anomaly, which can degrade performance.

Meanwhile, the decision trees that make up a random forest classifier classify by means of decision boundaries on each variable. The performance of a random forest based RTC control chart is therefore affected more by the classes separated by the decision boundaries than by the values of the individual variables. Accordingly, unlike existing RTC control charts that improve the classifier, there is a need to improve the class prediction performance of each decision tree through pattern classification of the data.

Korean Registered Patent No. 1872345
Korean Registered Patent No. 1562623

An object of the present invention is to provide a process monitoring apparatus and method that process the input data of the RTC technique so that the data and its patterns become clearer, thereby improving the class prediction performance of each decision tree and distinguishing the data within the moving window more accurately.

A process monitoring method according to an embodiment of the present invention is performed by a process monitoring device and may comprise: (a) collecting time series data measured through a process; (b) symbolizing the time series data with respect to arbitrary break points, whose number is determined by a preset alphabet size, to generate symbolized data; (c) calculating the mean square error (MSE) between the symbolized data and the time series data; (d) determining, as the optimal break points, the break points of the symbolized data for which the calculated MSE is smallest; and (e) detecting whether the process is abnormal by applying a real-time contrast (RTC) technique to the symbolized data obtained at the optimal break points.

A process monitoring apparatus according to an embodiment of the present invention may comprise: a data processing unit that collects time series data measured through a process, symbolizes the time series data with respect to arbitrary break points, whose number is determined by a preset alphabet size, to generate symbolized data, calculates the MSE between the symbolized data and the time series data, and determines, as the optimal break points, the break points of the symbolized data for which the calculated MSE is smallest; and an abnormality detection unit that detects whether the process is abnormal by applying the RTC technique to the symbolized data obtained at the optimal break points.

The process monitoring apparatus and method according to an embodiment of the present invention can improve the class prediction performance of each decision tree by processing the input data, and can distinguish the data within the moving window more accurately.

In addition, the cause of a fault can be analyzed accurately and an abnormal state can be detected more quickly.

FIG. 1 is a block diagram of a process monitoring apparatus according to an embodiment of the present invention.
FIG. 2 shows parallel coordinate plots of time series data and symbolized data by number of dimensions.
FIG. 3 shows the MSE calculated between the time series data and the symbolized data.
FIG. 4 is a flowchart of a process monitoring method according to an embodiment of the present invention.
FIG. 5 is a flowchart of an abnormality detection method according to an embodiment of the present invention.
FIG. 6A shows the monitoring statistics of data classified by a conventional classification technique and according to an embodiment of the present invention.
FIG. 6B shows the values of the original data and of data transformed according to an embodiment of the present invention.
FIG. 6C shows the variable importance of data calculated according to an embodiment of the present invention.
FIG. 7 shows classification results for the original data and for data transformed according to an embodiment of the present invention.

The specific structural or functional descriptions of the embodiments according to the inventive concept disclosed herein are given only to describe those embodiments. The embodiments according to the inventive concept may be implemented in various forms and are not limited to the embodiments described herein.

Because the embodiments according to the inventive concept may be modified in various ways and may take various forms, they are illustrated in the drawings and described in detail herein. This is not intended to limit the embodiments to the specific forms disclosed; they include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms such as "first" and "second" may be used to describe various components, but the components are not limited by these terms. The terms serve only to distinguish one component from another. For example, without departing from the scope of rights according to the inventive concept, a first component may be called a second component, and similarly a second component may be called a first component.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that component, but intervening components may also be present. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, no intervening components are present. Other expressions describing relationships between components, such as "between" and "immediately between" or "directly adjacent to", are to be interpreted in the same way.

The terminology used herein is only for describing particular embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. Terms such as "comprise" or "have" specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are to be interpreted as having meanings consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined herein. Hereinafter, embodiments are described in detail with reference to the accompanying drawings.

Various data representation techniques exist for reducing the size and noise of process data. Examples include the discrete Fourier transform (DFT), the discrete wavelet transform (DWT), singular value decomposition (SVD), piecewise aggregate approximation (PAA), and symbolic aggregate approximation (SAX).

The SAX technique is a data representation technique that symbolizes time series data as a character string. It combines a string-based symbolization algorithm with the PAA technique, which represents data by the means of subsequences.

The PAA technique uses Equation 1 below to represent a sequence of length n as a sequence of length M.

Here, n and M are natural numbers satisfying n ≥ M.

That is, to reduce an n-dimensional time series to M dimensions, the PAA technique divides the time series data into frames of equal size and represents the data by the mean of the values in each frame.
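The frame averaging described above can be sketched in a few lines of Python. This is only an illustration of the general PAA idea, not the patent's Equation 1 verbatim: the function name `paa` is hypothetical, and the sketch assumes for simplicity that n is an exact multiple of M.

```python
from statistics import fmean


def paa(series, m):
    """Reduce a length-n series to m frame means.

    Assumption for this sketch: n is a positive multiple of m, so every
    frame has the same number of points.
    """
    n = len(series)
    if m <= 0 or n % m != 0:
        raise ValueError("this sketch assumes n is a positive multiple of m")
    frame = n // m
    # Each output value is the mean of one equally sized frame.
    return [fmean(series[i * frame:(i + 1) * frame]) for i in range(m)]


# Six points reduced to three frame means.
print(paa([1.0, 3.0, 2.0, 4.0, 10.0, 12.0], 3))  # [2.0, 3.0, 11.0]
```

With m equal to n every frame holds one point, so the series is returned unchanged; that is the n = M case in which no time aggregation takes place.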

The SAX technique symbolizes each segment of the data transformed by the PAA technique as a character string. The string may contain, for example, letters or digits.

The conventional SAX technique aggregates the time axis, and the resulting loss of time information can be critical when faults must be detected in real time.

To address this, the process monitoring apparatus and method according to an embodiment of the present invention improve the performance of the RTC control chart by processing the input data with Adaptive Decision Boundary SAX (hereinafter ADB-SAX), a SAX technique improved over the conventional one, and then monitor the process by applying the RTC technique to the processed input data.

FIG. 1 is a block diagram of a process monitoring apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the process monitoring apparatus 10 includes a data processing unit 100 and an abnormality detection unit 200.

The data processing unit 100 collects time series data measured through a process and processes the collected data before the RTC technique is applied. More specifically, the data processing unit 100 symbolizes the time series data to generate symbolized data, calculates the mean square error (MSE) between the symbolized data and the time series data, and determines, as the optimal break points, the break points of the symbolized data for which the calculated MSE is smallest.

To this end, the data processing unit 100 includes a symbolization unit 110 and an optimal break point determination unit 130.

The symbolization unit 110 symbolizes the collected time series data with respect to arbitrary break points to generate symbolized data. Here, the break points are boundary points that divide the time series data into regions of equal size.

The break points according to one embodiment may be the set of points that divide the N(0, 1) Gaussian curve into regions of equal size.
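As an illustration of the Gaussian embodiment, the sketch below computes break points that split the standard normal curve into equally probable regions, which is how conventional SAX interprets "regions of equal size", and then maps z-normalized values to letters. The function names and the choice of lowercase letters as symbols are assumptions made for this example.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def gaussian_breakpoints(alphabet_size):
    """Return the alphabet_size - 1 cut points that split N(0, 1) into equal-probability regions."""
    std_normal = NormalDist(0.0, 1.0)
    return [std_normal.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def to_symbols(series, alphabet_size):
    """Z-normalize the series and map every value to a letter (assumes non-constant data)."""
    mu, sigma = fmean(series), pstdev(series)
    cuts = gaussian_breakpoints(alphabet_size)
    letters = "abcdefghijklmnopqrstuvwxyz"
    # bisect_right gives the index of the region that the z-score falls into.
    return "".join(letters[bisect_right(cuts, (x - mu) / sigma)] for x in series)


print(gaussian_breakpoints(4))
print(to_symbols([1, 2, 3, 4, 5, 6, 7, 8], 4))
```

An alphabet of size a always yields a - 1 break points, which matches the later example in which an alphabet size of 7 gives 6 break points.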

The break points according to another embodiment are a set of points defined by the quantiles q_main and q_others for a preset alphabet size. The preset alphabet size may be a value determined by the alphabet size determination unit 150 described later.

The arbitrary break points are ordered according to the distribution of the data, and break points closer to the mean can be regarded as relatively less effective for classification. Accordingly, the arbitrary break points can be divided into two sets: the set of break points located farther from the mean and the set of break points located closer to the mean. Here, q_main and q_others denote the quantiles for the former set and the latter set, respectively. The break points belonging to each set are determined by q_main and q_others, which can be set arbitrarily.

For example, when the preset alphabet size is 7, there are 6 break points. For arbitrarily chosen q_main and q_others, these form three symmetric pairs: the first pair is q_main plus one standard deviation and 1-q_main minus one standard deviation; the second pair is q_main and 1-q_main; and the third pair is q_others and 1-q_others.

For each time series data region delimited by the break points described above, the symbolization unit 110 generates the symbolized data by calculating the mean value of each region of the collected time series data, as in Equation 2 below.

Here, E(area_i) denotes the mean value of the i-th region of the time series data delimited by the break points; the other quantity in the equation is the standard deviation of the time series data.

Unlike the conventional SAX technique, which aggregates the time axis before symbolization, the symbolization unit 110 according to one embodiment may symbolize only the data values without aggregating the time axis. That is, the symbolization unit 110 omits the PAA step of representing the data by subsequence means (n = M in Equation 1) and symbolizes only the data values, which minimizes the loss of time information.

In this way, the symbolization unit 110 converts the time series data into categorical data.
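A minimal sketch of value-only symbolization is given below. It keeps one output per time point (no time-axis aggregation) and replaces each value by the mean of the region it falls into. Because Equation 2 is not reproduced in the text, the sketch leaves out any scaling by the standard deviation; that omission, the function name, and the sample break points are all assumptions.

```python
from bisect import bisect_right
from statistics import fmean


def symbolize_values(series, breakpoints):
    """Replace each value by the mean of its region; the series length is preserved."""
    cuts = sorted(breakpoints)
    regions = {}
    for x in series:
        # Region index = number of break points that are <= x.
        regions.setdefault(bisect_right(cuts, x), []).append(x)
    level = {idx: fmean(vals) for idx, vals in regions.items()}
    return [level[bisect_right(cuts, x)] for x in series]


raw = [0.1, 0.3, 2.0, 2.2, -1.5, 0.2]
print(symbolize_values(raw, [-1.0, 1.0]))  # six outputs taking only three distinct levels
```

The output has as many entries as the input but only as many distinct levels as there are occupied regions, which is the sense in which the data becomes categorical.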

Referring to FIG. 2, which shows parallel coordinate plots of the time series data and the symbolized data by number of dimensions, the x-axis is the number of variables and the y-axis is the value of each variable scaled between its minimum and maximum. After symbolization, the values of each variable become categorical data according to the alphabet size, while the pattern of each data series does not change significantly.

Converting to categorical data also affects how the random forest determines its decision boundaries. The decision trees in a random forest determine decision boundaries by comparing the values of each variable, and the larger the distance between values, the more likely a decision boundary is to form between them. If a decision boundary exists among normal data, the monitoring statistic rises. In contrast, when the data is converted into categorical data according to an embodiment of the present invention so that the spacing between values is minimized, the statistic within the normal region can be lowered.

The optimal break point determination unit 130 determines the optimal break points among the arbitrary break points. More specifically, for each set of arbitrary break points, the optimal break point determination unit 130 calculates the MSE between the data symbolized by the symbolization unit 110 and the time series data, and determines, as the optimal break points, the break points of the symbolized data for which the calculated MSE is smallest.

Here, the MSE is the mean square error and represents the information loss caused by symbolizing the time series data. Among the MSE values calculated for the various break point combinations, the smallest one corresponds to the least information loss from symbolization, so the optimal break point determination unit 130 determines the break points at which the MSE is minimal as the optimal break points.

Referring to FIG. 3, which shows the MSE calculated between the time series data and the symbolized data, the x-axis is the index of the break point combination and the y-axis is the MSE at each index. For example, when q_main varies from 81 to 99 in steps of one percentile and q_others varies from 60 to 80 in steps of one percentile, the break point combination changes accordingly and so does the MSE. The MSE is minimal at an index of approximately 230 (q_main = 92, q_others = 68), so the break points at that index can be determined as the optimal break points.
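The search over break point combinations can be sketched as a plain grid search. The percentile ranges follow the example in the text; everything else is an assumption made for illustration: the helper names, the use of empirical percentiles of the data, the simplified layout of four break points (two symmetric pairs, i.e. an alphabet size of 5), and the synthetic Gaussian input.

```python
import random
from bisect import bisect_right
from statistics import fmean, quantiles


def symbolize_values(series, cuts):
    """Replace each value by the mean of the region (between sorted cuts) it falls into."""
    regions = {}
    for x in series:
        regions.setdefault(bisect_right(cuts, x), []).append(x)
    level = {idx: fmean(vals) for idx, vals in regions.items()}
    return [level[bisect_right(cuts, x)] for x in series]


def mse(original, symbolized):
    """Mean square error between the raw and the symbolized series."""
    return fmean([(x - y) ** 2 for x, y in zip(original, symbolized)])


def cuts_for(pct, q_main, q_others):
    """Hypothetical layout: percentiles 100-q_main, 100-q_others, q_others, q_main (ascending)."""
    # pct[p - 1] holds the p-th percentile, so percentile 100 - q sits at index 99 - q.
    return [pct[99 - q_main], pct[99 - q_others], pct[q_others - 1], pct[q_main - 1]]


def best_breakpoints(series, main_range=range(81, 100), others_range=range(60, 81)):
    """Return (mse, q_main, q_others) for the combination with the smallest MSE."""
    pct = quantiles(series, n=100, method="inclusive")
    candidates = (
        (mse(series, symbolize_values(series, cuts_for(pct, m, o))), m, o)
        for m in main_range
        for o in others_range
    )
    return min(candidates)


rng = random.Random(0)  # synthetic stand-in for measured process data
data = [rng.gauss(0.0, 1.0) for _ in range(200)]
print(best_breakpoints(data))
```

With 19 values of q_main and 21 values of q_others the grid has 399 combinations, which is consistent with an index axis that reaches about 230 at q_main = 92 and q_others = 68 when q_main is the outer loop.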

In addition, the data processing unit 100 may further include an alphabet size determination unit 150.

The alphabet size determination unit 150 determines the alphabet size, which in turn determines the number of break points. In the SAX technique, the alphabet size is the parameter that determines the number of symbols; when the time axis is not aggregated by the PAA technique (n = M in Equation 1), it determines the number of categories used to represent the data at each time point. The alphabet size is preferably set small when the data follows a simple symmetric distribution, and larger the more skewed and asymmetric the distribution is.

In a statistical approach, skewness is related to the relative positions of the mean and the mode. For example, for a left-skewed distribution the mean lies to the left of the mode, and for a right-skewed distribution the mean lies to the right of the mode. The difference between the mode and the mean can therefore serve as a useful index for determining the alphabet size.
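The relationship between skewness and the mean-mode gap can be checked with a small sketch. It does not implement Equation 3, which is not reproduced in the text; it only computes the gap that the text names as a useful index. Estimating the mode by binning, the bin width, and the function name are assumptions.

```python
from collections import Counter
from statistics import fmean


def mode_mean_gap(window, bin_width=0.5):
    """Return mean minus (binned) mode for the values in a moving window.

    Positive values suggest right skew, negative values suggest left skew.
    The mode of continuous data is approximated by the most populated bin.
    """
    bins = Counter(round(x / bin_width) for x in window)
    mode_bin, _count = bins.most_common(1)[0]
    return fmean(window) - mode_bin * bin_width


print(mode_mean_gap([1, 1, 1, 1.1, 2, 3, 8]))         # right-skewed window: gap > 0
print(mode_mean_gap([-8, -3, -2, -1.1, -1, -1, -1]))  # left-skewed window: gap < 0
```

A larger absolute gap would then argue for a larger alphabet size within whatever range S is chosen.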

Accordingly, the alphabet size determination unit 150 according to one embodiment may determine the alphabet size as in Equation 3 below.

Here, S is the range of the alphabet size and can be set arbitrarily. The equation also uses the maximum value of the break points when the time series data follows a normal distribution, together with mode and mean, which are respectively the mode and the mean of the time series data in the moving window. The range of the alphabet size may be set differently depending on the skewness of the data.

The abnormality detection unit 200 detects whether the manufacturing process is abnormal by applying the real-time contrast (RTC) technique to the data processed by the data processing unit 100.

To this end, the abnormality detection unit 200 includes a learning unit 210, a classification probability calculation unit 230, a monitoring statistic calculation unit 250, and a detection unit 270.

The learning unit 210 trains a classifier using, as reference data, the portion of the data transformed by the data processing unit 100 that was collected in the normal state and, as contrast data, the data subsequently collected in real time. A moving window is applied to the contrast data each time a new measurement arrives at time t, so that it contains only the N_w most recently collected observations.
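The moving window over the contrast data can be sketched with a bounded deque. The class and function names are hypothetical; the labels 0 for reference data and 1 for contrast data follow the class convention used for the classifier in this description.

```python
from collections import deque


class ContrastWindow:
    """Keeps only the N_w most recent observations."""

    def __init__(self, size):
        self._buf = deque(maxlen=size)  # old observations drop out automatically

    def push(self, observation):
        """Add the newest observation and return the current window contents."""
        self._buf.append(observation)
        return list(self._buf)


def labelled_training_set(reference, window):
    """Label reference data as class 0 and contrast (window) data as class 1."""
    features = list(reference) + list(window)
    labels = [0] * len(reference) + [1] * len(window)
    return features, labels


window = ContrastWindow(size=3)
for t in range(5):
    current = window.push(t)
print(current)  # [2, 3, 4]
print(labelled_training_set([10, 11], current))
```

A new classifier would be trained on such a labelled set at every time step, which is what distinguishes RTC from a model trained once in advance.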

The classifier is a random forest classifier comprising at least one decision tree. It learns the reference data and the contrast data to generate a decision boundary at each time t, assigning the reference data to class 0 and the contrast data to class 1. Through training, the classifier produces predictions, where x_i is the transformed data and t_j denotes the j-th classifier.

Because increasing the number of decision trees in a random forest classifier does not cause overfitting, the number of trees is preferably set as large as possible.

The classification probability calculation unit 230 uses the predictions generated by the learning unit 210 to calculate the classification probability as in Equation 4 below.

The random forest classifier grows each individual decision tree fully to reduce bias, and uses bagging to reduce the correlation between individual trees. Here, out-of-bag (OOB) data refers to the data that is not used when a given decision tree is trained under bagging.

The indicator function in Equation 4 returns 1 if the prediction matches the actual class k and 0 otherwise.
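One common way to turn out-of-bag predictions into a class probability is to average the indicator over the trees that did not see the observation during training. The sketch below follows that reading; since Equation 4 itself is not reproduced in the text, the exact form, the function name, and the data layout are assumptions.

```python
def oob_class_probability(tree_predictions, in_bag, i, k):
    """Estimate p(class k | observation i) from out-of-bag votes.

    tree_predictions[j][i] is tree j's predicted class for observation i;
    in_bag[j] is the set of observation indices used to train tree j.
    """
    votes = [
        preds[i] == k  # indicator: 1 if the prediction equals class k, else 0
        for preds, bag in zip(tree_predictions, in_bag)
        if i not in bag  # only trees for which observation i is out-of-bag
    ]
    if not votes:
        return None  # observation i was in every bootstrap sample
    return sum(votes) / len(votes)


predictions = [[0, 1, 1], [0, 0, 1], [1, 1, 1]]  # three trees, three observations
bags = [{0, 1}, {1, 2}, {0, 2}]
print(oob_class_probability(predictions, bags, i=0, k=0))  # only tree 1 is OOB for obs 0
```

Because the estimate is a ratio of vote counts, it can take only a limited set of values for a given number of trees, which is the discreteness discussed in the background.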

The monitoring statistic calculation unit 250 calculates the monitoring statistic from the classification probabilities calculated by the classification probability calculation unit 230. In general, the monitoring statistic for the reference data is much larger than that for the contrast data, and is therefore more stable and gives better detection performance. Accordingly, the monitoring statistic calculation unit 250 preferably calculates the monitoring statistic for the reference data, as in Equation 5 below.

Here, N₀ denotes the size of the reference data (S₀).

The detection unit 270 detects an abnormal state when the monitoring statistic calculated by the monitoring statistic calculation unit 250 falls outside a preset control limit. The process state then changes from in-control to out-of-control.
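The last two steps can be sketched as an average followed by a threshold check. Two things here are assumptions rather than statements from the text: that the reference-based statistic is the mean class-0 probability over the N₀ reference observations, and that "outside the control limit" means exceeding an upper limit. The function names and sample numbers are illustrative only.

```python
from statistics import fmean


def monitoring_statistic(p_class0_reference):
    """Average the estimated class-0 probabilities over the reference observations."""
    return fmean(p_class0_reference)


def first_alarm(statistics_over_time, control_limit):
    """Return the first time index whose statistic exceeds the limit, or None."""
    for t, stat in enumerate(statistics_over_time):
        if stat > control_limit:  # assumed upper-sided control limit
            return t
    return None


print(monitoring_statistic([0.5, 0.7, 0.9]))
print(first_alarm([0.50, 0.55, 0.80, 0.90], control_limit=0.75))  # 2
```

The intuition behind the upper-sided check is that when the process shifts, the window data becomes easier to separate from the reference data, so reference observations are classified as class 0 with higher probability.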

The preset control limit is a value that may be set differently depending on the type or environment of the process, the type of product, or the abnormality to be detected. For example, it may be set using a measure that evaluates control-chart performance, such as the average run length (ARL).
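The detection rule and the run-length measure can be sketched as follows. This is only an illustration under stated assumptions: the control limit is treated as a one-sided upper limit, and a run that never signals is counted at its full length. Both choices, and the names `first_signal` and `average_run_length`, are hypothetical rather than taken from the patent.

```python
from typing import Optional, Sequence


def first_signal(statistics: Sequence[float], control_limit: float) -> Optional[int]:
    """Return the 1-based time index of the first statistic above the limit."""
    for t, value in enumerate(statistics, start=1):
        if value > control_limit:  # assumption: upper control limit only
            return t
    return None


def average_run_length(runs: Sequence[Sequence[float]], control_limit: float) -> float:
    """Mean number of points observed until a signal, averaged over runs."""
    lengths = []
    for run in runs:
        t = first_signal(run, control_limit)
        # Assumption: a run without a signal contributes its full length.
        lengths.append(t if t is not None else len(run))
    return sum(lengths) / len(lengths)


# Three hypothetical runs of the monitoring statistic.
runs = [[0.5, 0.6, 0.9], [0.95, 0.5], [0.4, 0.4, 0.4, 0.4]]
arl = average_run_length(runs, control_limit=0.8)
```

With a limit of 0.8 the run lengths are 3, 1 and 4, so `arl` is 8/3. A candidate control limit could be tuned until the in-control ARL reaches a target value.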

In addition, the abnormality detection unit 200 may further include a cause diagnosis unit 290.

When the detection unit 270 detects that the process is in an abnormal state, the cause diagnosis unit 290 calculates variable importance and uses it to diagnose the cause of the abnormality. Variable importance is the degree to which a variable contributes to detecting the abnormality, and it can be calculated from an impurity score. The impurity score may be, for example, the Gini index or the Shannon entropy.

According to an embodiment in which the Gini index is used, the Gini index in the random forest is given by Equation 6 below.

Here, v is a node of a decision tree, c is the number of classes, and r_i is the proportion of class i at each node. From the calculated Gini index, the variable importance is calculated as in Equation 7 below.
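Equation 6 is not reproduced in this text. The sketch below assumes the standard form of the Gini index, 1 − Σ r_i² over the c classes at node v. The function name `gini_index` is hypothetical.

```python
from typing import Sequence


def gini_index(class_proportions: Sequence[float]) -> float:
    """Gini impurity of one node from its class proportions r_1..r_c.

    Assumption: the standard definition 1 - sum(r_i ** 2); the patent's
    Equation 6 is expected to be equivalent but is not shown here.
    """
    if abs(sum(class_proportions) - 1.0) > 1e-9:
        raise ValueError("class proportions must sum to 1")
    return 1.0 - sum(r * r for r in class_proportions)


mixed = gini_index([0.5, 0.5])  # evenly mixed two-class node
pure = gini_index([1.0, 0.0])   # node containing a single class
```

An evenly mixed two-class node gives 0.5, and a pure node gives 0.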

Here, ntree is the total number of trees, D_t is an individual classifier, and the remaining term is the decrease in impurity. The decrease in impurity reflects the weights w_L and w_R, which represent the proportions in which the data at node v are divided among its child nodes, and is given by Equation 8 below.
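As a hedged illustration of the weighted impurity decrease, the sketch below assumes the usual form Gini(v) − w_L·Gini(v_L) − w_R·Gini(v_R), with w_L and w_R taken as the fractions of the node's observations sent to the left and right child. Equation 8 itself is not reproduced here, and all function names are hypothetical.

```python
from typing import Sequence


def gini_from_counts(counts: Sequence[int]) -> float:
    """Gini impurity computed from per-class observation counts."""
    total = sum(counts)
    return 1.0 - sum((n / total) ** 2 for n in counts)


def impurity_decrease(left_counts: Sequence[int], right_counts: Sequence[int]) -> float:
    """Decrease in Gini impurity when node v is split into two children."""
    parent_counts = [l + r for l, r in zip(left_counts, right_counts)]
    n_left, n_right = sum(left_counts), sum(right_counts)
    n_total = n_left + n_right
    # Weights: share of the node's observations that go to each child.
    w_left, w_right = n_left / n_total, n_right / n_total
    return (
        gini_from_counts(parent_counts)
        - w_left * gini_from_counts(left_counts)
        - w_right * gini_from_counts(right_counts)
    )


# Hypothetical split: left child holds 4 of class 0; right holds 1 and 5.
decrease = impurity_decrease([4, 0], [1, 5])
```

The parent impurity is 0.5, the left child is pure, and the right child has impurity 10/36 with weight 0.6, so `decrease` is 1/3.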

The cause diagnosis unit 290 calculates the variable importance as described above and can diagnose that the cause of the abnormality lies in the variables with high variable importance.
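The diagnosis step can be sketched as below, assuming that the variable importance of Equation 7 is the sum of impurity decreases at the nodes that split on a variable, averaged over the ntree trees. The data layout (one list of `(variable, decrease)` pairs per tree) and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def variable_importance(
    trees: Sequence[Sequence[Tuple[str, float]]],
) -> Dict[str, float]:
    """Average, over all trees, of the impurity decreases credited to each variable."""
    ntree = len(trees)
    totals: Dict[str, float] = defaultdict(float)
    for splits in trees:
        for variable, decrease in splits:
            totals[variable] += decrease
    return {variable: total / ntree for variable, total in totals.items()}


# Three hypothetical trees with the impurity decrease of each split.
trees: List[List[Tuple[str, float]]] = [
    [("X1", 0.3), ("X2", 0.1)],
    [("X1", 0.2)],
    [("X3", 0.05), ("X1", 0.1)],
]
importance = variable_importance(trees)
suspect = max(importance, key=importance.get)  # variable with the highest importance
```

In this example X1 accumulates 0.6 over three trees, giving an importance of 0.2, and is reported as the most likely cause.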

In addition, the process monitoring device 10 may further include a separate database for storing the collected time series data and the results calculated by the data processing unit 100 and the abnormality detection unit 200.

FIG. 4 is a flowchart of a process monitoring method according to an embodiment of the present invention. Detailed descriptions of parts that overlap with the description above are omitted below.

Referring to FIG. 4, step S310 collects time series data measured through a process.

Step S320 symbolizes the time series data with respect to arbitrary breakpoints to generate symbolized data. In step S320, for each time series data region divided by the arbitrary breakpoints, the mean value of the region is calculated by Equation 2 above, and the time series data are symbolized accordingly.
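The symbolization step can be sketched as follows. This is a minimal illustration, not the patent's formula: it replaces each value with the empirical mean of the values falling in the same breakpoint region, whereas the patent computes the region mean by its own equation (not reproduced here). Each time point keeps its own symbol, so the time axis is not aggregated. The function name `symbolize` is hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def symbolize(series: Sequence[float], breakpoints: Sequence[float]) -> List[float]:
    """Replace every value by the mean of its breakpoint region."""
    cuts = sorted(breakpoints)
    # Region index = number of breakpoints less than or equal to the value.
    regions = [bisect_right(cuts, x) for x in series]
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for x, r in zip(series, regions):
        sums[r] = sums.get(r, 0.0) + x
        counts[r] = counts.get(r, 0) + 1
    # Assumption: empirical region mean stands in for the patent's equation.
    means = {r: sums[r] / counts[r] for r in sums}
    return [means[r] for r in regions]


symbols = symbolize([1.0, 2.0, 3.0, 10.0, 11.0], breakpoints=[5.0])
```

With a single breakpoint at 5.0, the first three values map to 2.0 and the last two to 10.5.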

Step S330 calculates the MSE between the data symbolized in step S320 and the collected time series data. Step S330 may be repeated for every combination of arbitrary breakpoints.

Step S340 determines, as the optimal breakpoint, the breakpoint of the symbolized data for which the MSE calculated in step S330 is smallest.
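Steps S320 to S340 can be sketched as an exhaustive search. This is only an illustration under assumptions: the candidate breakpoints are supplied by the caller, an alphabet of size a uses a − 1 breakpoints, and the empirical region mean stands in for the patent's region-mean equation. All names are hypothetical.

```python
from bisect import bisect_right
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def symbolize(series: Sequence[float], breakpoints: Sequence[float]) -> List[float]:
    """Replace every value by the empirical mean of its breakpoint region."""
    cuts = sorted(breakpoints)
    regions = [bisect_right(cuts, x) for x in series]
    groups: Dict[int, List[float]] = {}
    for x, r in zip(series, regions):
        groups.setdefault(r, []).append(x)
    means = {r: sum(vals) / len(vals) for r, vals in groups.items()}
    return [means[r] for r in regions]


def mse(original: Sequence[float], symbolized: Sequence[float]) -> float:
    """Mean square error between the original and the symbolized series."""
    return sum((x - y) ** 2 for x, y in zip(original, symbolized)) / len(original)


def optimal_breakpoints(
    series: Sequence[float], candidates: Sequence[float], alphabet_size: int
) -> Tuple[Tuple[float, ...], float]:
    """Try every combination of candidate breakpoints and keep the lowest MSE."""
    best_cuts: Tuple[float, ...] = ()
    best_err = float("inf")
    for cuts in combinations(sorted(candidates), alphabet_size - 1):
        err = mse(series, symbolize(series, cuts))
        if err < best_err:
            best_cuts, best_err = cuts, err
    return best_cuts, best_err


series = [1.0, 2.0, 3.0, 10.0, 11.0]
cuts, err = optimal_breakpoints(series, candidates=[2.5, 5.0, 10.5], alphabet_size=2)
```

For this series the candidate 5.0 separates the two clusters and gives an MSE of 0.5, compared with 7.7 for 2.5 and 10.0 for 10.5, so it is selected.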

Step S350 detects whether the process is abnormal by applying the RTC technique to the symbolized data at the optimal breakpoint determined in step S340, from among the symbolized data.

FIG. 5 is a flowchart of an abnormality detection method according to an embodiment of the present invention.

Referring to FIG. 5, step S351 trains a random forest classifier on the symbolized data and the contrast data and generates the predictions of the decision trees.

Once the predictions are generated in step S351, step S352 uses them to calculate the classification probability as in Equation 4 above.

Step S353 calculates the monitoring statistic as in Equation 5 above, using the classification probability calculated in step S352.

Step S354 detects an abnormal state when the monitoring statistic calculated in step S353 falls outside the preset control limit.

When an abnormal state is detected in step S354, step S355 may additionally be performed: to diagnose the cause of the abnormality, the variable importance is calculated as in Equation 7 above, and the variables causing the abnormality are extracted from the calculated variable importance.

FIG. 6A shows the monitoring statistics of data classified by a conventional classification technique and by an embodiment of the present invention, FIG. 6B shows the values of the original data and of the data converted according to an embodiment of the present invention, and FIG. 6C shows the variable importance of the data calculated according to an embodiment of the present invention.

Referring to FIG. 6A, the processed data fall into the normal category while they are within the normal range and are assigned to an outer category once they leave it. Within the normal range, the converted data have a lower monitoring statistic than unconverted data, because no decision boundary can arise between data that share the same category value. Data outside the normal range, by contrast, are assigned to an outer category and therefore have a larger value than the conventional monitoring statistic. That is, according to the present invention, the gap between the monitoring statistic of the normal category and that of the abnormal category widens, which can improve detection performance. It can also be confirmed that the statistic does not exceed the control limit (CL).

Meanwhile, referring to FIG. 6B, when data are converted according to an embodiment of the present invention, there may be time points, such as time point 22, at which the converted data value rises from 2.699 to 3.977. However, the moving window treats the misclassification of a single point as only one misclassification out of the window size, so the phenomenon shown in FIG. 6B is not a problem when the RTC technique is employed.

In addition, referring to FIG. 6C, the variable importance used for cause analysis also shows that X1 has a value similar to those of the other variables.

FIG. 7 shows the classification results for the original data and for the data converted according to an embodiment of the present invention.

Referring to FIG. 7, the decision boundaries for the original data (a) and the processed data (b) are shown together. Each point is marked with the label (0 or 1) of its assigned class. Unlike with the original data, classification using the data processed according to the present invention shows the classes clearly separated by the decision boundary.

As described above, the process monitoring device and method according to an embodiment of the present invention can improve process monitoring performance by processing the data, before the RTC technique is applied, in a way that minimizes the loss of information.

The device described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications executed on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A process monitoring method performed by a process monitoring device, the method comprising:
(a) collecting time series data measured through a process;
(b) symbolizing the time series data with respect to arbitrary breakpoints, for a preset alphabet size, to generate symbolized data;
(c) calculating a mean square error (MSE) between the symbolized data and the time series data;
(d) determining, as an optimal breakpoint, the breakpoint of the symbolized data for which the calculated MSE is smallest; and
(e) detecting whether the process is abnormal by applying a real-time contrast (RTC) technique to the symbolized data at the optimal breakpoint, from among the symbolized data.

The method of claim 1,
wherein step (b) symbolizes the data values of the time series data without aggregation along the time axis.

The method of claim 1,
wherein step (b) symbolizes the time series data by calculating, for each time series data region divided by the arbitrary breakpoints, the mean value of that region.

The method of claim 3,
wherein the mean value of each time series data region is calculated by Equation 1 below.
[Equation 1]

where E(area_i) is the mean value for the i-th region, and the other symbol in Equation 1 denotes the standard deviation of the time series data.

The method of claim 1,
wherein the preset alphabet size is defined by Equation 2 below.
[Equation 2]

where S is the range of alphabet sizes, the other symbol in Equation 2 denotes the maximum value of the breakpoint when the time series data follow a normal distribution, and mode and mean are the mode and the mean value, respectively, of the time series data in the moving window.

The method of claim 1,
wherein step (e) comprises:
(e-1) training a random forest classifier on the symbolized data and contrast data and generating predictions of the decision trees;
(e-2) calculating a classification probability using the predictions;
(e-3) calculating a monitoring statistic using the classification probability; and
(e-4) detecting an abnormal state when the monitoring statistic falls outside a preset control limit.

The method of claim 6,
wherein step (e) further comprises:
(e-5) diagnosing the cause of the abnormality using variable importance when an abnormal state is detected.

A process monitoring device comprising:
a data processing unit which collects time series data measured through a process, symbolizes the time series data with respect to arbitrary breakpoints, for a preset alphabet size, to generate symbolized data, calculates a mean square error (MSE) between the symbolized data and the time series data, and determines, as an optimal breakpoint, the breakpoint of the symbolized data for which the calculated MSE is smallest; and
an abnormality detection unit which detects whether the process is abnormal by applying a real-time contrast (RTC) technique to the symbolized data at the optimal breakpoint, from among the symbolized data.

The device of claim 8,
wherein the data processing unit symbolizes the data values of the time series data without aggregation along the time axis,
symbolizes the time series data by calculating, for each time series data region divided by the arbitrary breakpoints, the mean value of that region, and
calculates the mean value of each time series data region by Equation 1 below.
[Equation 1]

where E(area_i) is the mean value for the i-th region, and the other symbol in Equation 1 denotes the standard deviation of the time series data.

The device of claim 8,
wherein the data processing unit
includes an alphabet size determiner configured to determine the preset alphabet size by Equation 2 below.
[Equation 2]

Where S is a range of alphabetic sizes,