KR20230150274A - Machine learning-based flow determination for video coding

Info

Publication number: KR20230150274A
Application number: KR1020237027621A
Authority: KR
Inventors: Ankitesh Kumar Singh; Hilmi Enes Egilmez; Muhammed Zeyd Coban; Marta Karczewicz
Original assignee: Qualcomm Incorporated
Priority date: 2021-02-25
Filing date: 2022-02-22
Publication date: 2023-10-30
Also published as: JP2024508772A; BR112023016294A2; EP4298795A1; WO2022182651A1

Abstract

Systems and techniques are described herein for processing video data. In some aspects, a method can include obtaining, by a machine learning system, input video data. The input video data includes one or more luminance components for a current frame. The method can include determining, by the machine learning system, motion information for the luminance component(s) of the current frame and motion information for one or more chrominance components of the current frame using the luminance component(s) for the current frame. In some cases, the method can include determining the motion information for the luminance component(s) based on the luminance component(s) of the current frame and at least one reconstructed luma component of a previous frame. In some cases, the method can further include determining the motion information for the chrominance component(s) of the current frame using the motion information determined for the luminance component(s) of the current frame.

Description

Machine learning-based flow determination for video coding

This disclosure generally relates to image and video coding, including encoding (or compression) and decoding (decompression) of images and/or video. For example, aspects of the present disclosure relate to techniques for determining flow information for luma and chroma components of one or more image frames or pictures (e.g., video frames/pictures).

Many devices and systems allow video data to be processed and output for consumption. Digital video data includes large amounts of data to meet the demands of consumers and video providers. For example, consumers of video data desire high-quality video with high fidelity, resolution, frame rates, and the like. As a result, the large amount of video data required to meet these demands places a burden on the communication networks and devices that process and store the video data.

Various coding techniques may be used to compress video data. A goal of video coding is to compress video data into a form that uses a lower bit rate while avoiding or minimizing degradation of video quality. As ever-evolving video services become available, encoding techniques with better coding efficiency are needed.

Systems and techniques are described for coding (e.g., encoding and/or decoding) image and/or video content using one or more machine learning systems. According to at least one example, a method for processing video is provided. The method includes: obtaining, by a machine learning system, input video data, the input video data including at least one luminance component for a current frame; and determining, by the machine learning system, motion information for the at least one luminance component of the current frame and motion information for one or more chrominance components of the current frame using the at least one luminance component for the current frame.

In another example, an apparatus for processing video data is provided that includes at least one memory (e.g., configured to store data, such as virtual content data, one or more images, etc.) and one or more processors (e.g., implemented in circuitry) coupled to the at least one memory. The one or more processors are configured to and can: obtain, using a machine learning system, input video data, the input video data including at least one luminance component for a current frame; and determine, using the machine learning system, motion information for the at least one luminance component of the current frame and motion information for one or more chrominance components of the current frame using the at least one luminance component for the current frame.

In another example, a non-transitory computer-readable medium is provided that has stored thereon instructions that, when executed by one or more processors, cause the one or more processors to: obtain, using a machine learning system, input video data, the input video data including at least one luminance component for a current frame; and determine, using the machine learning system, motion information for the at least one luminance component of the current frame and motion information for one or more chrominance components of the current frame using the at least one luminance component for the current frame.

In another example, an apparatus for processing video data is provided. The apparatus includes: means for obtaining input video data, the input video data including at least one luminance component for a current frame; and means for determining, using the at least one luminance component for the current frame, motion information for the at least one luminance component of the current frame and motion information for one or more chrominance components of the current frame.

In some aspects, one or more of the methods, apparatuses, and computer-readable media described above further comprise: determining, by the machine learning system, a warping parameter for the at least one luminance component of the current frame and one or more warping parameters for the one or more chrominance components of the current frame using the motion information for the at least one luminance component of the current frame and the motion information for the one or more chrominance components of the current frame; and determining one or more inter-frame predictions for the current frame using the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame.

In some aspects, the one or more inter-frame predictions are determined at least in part by applying an interpolation operation using the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame.

In some aspects, the interpolation operation includes a trilinear interpolation operation.

In some aspects, the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame include space-scale flow (SSF) warping parameters.

In some aspects, the SSF warping parameters include learned scale-flow vectors.
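
As a rough illustration of how a trilinear interpolation driven by scale-flow vectors could produce an inter-frame prediction, the following Python sketch samples a small volume at a continuous (x, y, scale) position. The volume layout (a stack of progressively blurred copies of the reference frame, as is common in scale-space flow approaches), the clamping at the borders, and the names `trilinear_sample` and `warp` are assumptions made for this sketch; the disclosure does not specify them.

```python
import math


def trilinear_sample(volume, x, y, s):
    """Sample a volume at continuous (x, y, s) using trilinear interpolation.

    volume[k][row][col] holds level k; level 0 is assumed to be the sharpest
    reference frame and higher levels progressively blurred copies.
    Coordinates are clamped to the volume bounds (an assumption of this sketch).
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])

    def split(coord, size):
        # Clamp, then return the two neighbouring indices and the blend weight.
        coord = min(max(coord, 0.0), size - 1.0)
        lo = int(math.floor(coord))
        hi = min(lo + 1, size - 1)
        return lo, hi, coord - lo

    x0, x1, wx = split(x, width)
    y0, y1, wy = split(y, height)
    s0, s1, ws = split(s, depth)

    def bilinear(level):
        # Interpolate along x on two rows, then along y between the rows.
        top = (1 - wx) * level[y0][x0] + wx * level[y0][x1]
        bottom = (1 - wx) * level[y1][x0] + wx * level[y1][x1]
        return (1 - wy) * top + wy * bottom

    # Finally interpolate along the scale axis between two levels.
    return (1 - ws) * bilinear(volume[s0]) + ws * bilinear(volume[s1])


def warp(volume, flow):
    """Predict a plane; flow[row][col] is a hypothetical (dx, dy, scale) vector."""
    return [
        [trilinear_sample(volume, col + dx, row + dy, sc)
         for col, (dx, dy, sc) in enumerate(flow_row)]
        for row, flow_row in enumerate(flow)
    ]
```

With an all-zero flow field, `warp` simply returns level 0 of the volume; a non-zero scale component blends in the blurred levels, which is one way a learned vector could soften the prediction where motion is uncertain.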

In some aspects, to determine the motion information for the at least one luminance component of the current frame and the motion information for the one or more chrominance components of the current frame using the at least one luminance component for the current frame, one or more of the methods, apparatuses, and computer-readable media described above further comprise: determining the motion information for the at least one luminance component of the current frame based on the at least one luminance component of the current frame and at least one reconstructed luma component of a previous frame; and determining the motion information for the one or more chrominance components of the current frame using the motion information determined for the at least one luminance component of the current frame.

In some aspects, the motion information for the one or more chrominance components of the current frame is determined using a convolutional layer of the machine learning system.

In some aspects, to determine the motion information for the one or more chrominance components of the current frame, one or more of the methods, apparatuses, and computer-readable media described above further comprise sampling the motion information determined for the at least one luminance component of the current frame.
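
A minimal Python sketch of sampling luma motion information to obtain chroma motion information is given below, assuming a YUV 4:2:0 frame in which each chroma plane has half the luma resolution in both dimensions. The function name `subsample_luma_flow` is hypothetical, and halving the displacement vectors so that they are expressed in chroma-sample units is an assumption of this sketch rather than something stated in the text.

```python
def subsample_luma_flow(flow_y, scale_vectors=True):
    """Derive a 4:2:0 chroma flow field from a luma flow field.

    flow_y[row][col] is a (dx, dy) displacement in luma samples. Every second
    sample is kept in each dimension. Halving the displacements so they are in
    chroma-sample units is an assumption of this sketch (scale_vectors).
    """
    factor = 0.5 if scale_vectors else 1.0
    return [
        [(dx * factor, dy * factor) for dx, dy in row[::2]]
        for row in flow_y[::2]
    ]
```

For a 4x4 luma flow field this returns a 2x2 chroma flow field, so no separate chroma flow needs to be estimated or signaled in this sketch.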

In some aspects, the current frame includes a video frame.

In some aspects, the one or more chrominance components include at least one chrominance-blue component and at least one chrominance-red component.

In some aspects, the current frame has a luminance-chrominance (YUV) format. In some cases, the YUV format is a YUV 4:2:0 format.

In some aspects, the apparatuses described herein can include or be part of a mobile device (e.g., a mobile telephone or so-called "smartphone," a tablet computer, or other type of mobile device), a wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a video server, a television, a vehicle (or a computing device of a vehicle), or other device. In some aspects, the apparatus includes at least one camera for capturing one or more images or video frames. For example, the apparatus can include a camera (e.g., an RGB camera) or multiple cameras for capturing one or more images and/or one or more videos including video frames. In some aspects, the apparatus includes a display for displaying one or more images, videos, notifications, or other displayable data. In some aspects, the apparatus includes a transmitter configured to transmit one or more video frames and/or syntax data over a transmission medium to at least one device. In some aspects, the processor includes a neural processing unit (NPU), a central processing unit (CPU), a graphics processing unit (GPU), or other processing device or component.

This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification of this patent, any or all drawings, and each claim.

The foregoing, together with other features and embodiments, will become more apparent upon reference to the following specification, claims, and accompanying drawings.

Illustrative embodiments of the present application are described in detail below with reference to the following figures:
FIG. 1 illustrates an example implementation of a system-on-chip (SOC).
FIG. 2A illustrates an example of a fully connected neural network.
FIG. 2B illustrates an example of a locally connected neural network.
FIG. 2C illustrates an example of a convolutional neural network.
FIG. 2D illustrates a detailed example of a deep convolutional network (DCN) designed to recognize visual features from an image.
FIG. 3 is a block diagram illustrating a deep convolutional network (DCN).
FIG. 4 is a diagram illustrating an example of a system including a device operable to perform image and/or video coding (encoding and decoding) using a neural network-based system, according to some examples.
FIG. 5 is a diagram illustrating an example of an end-to-end neural network-based image and video coding system for input having a red-green-blue (RGB) format, according to some examples.
FIG. 6 is a diagram illustrating an example of a space-scale flow (SSF) neural network architecture configured to process one or more luminance-chrominance (YUV) input formats (e.g., a 4:2:0 YUV input format), which can be part of an end-to-end neural network-based image and video coding system, according to some examples.
FIG. 7A is a diagram illustrating an example of a machine learning-based flow engine operating with luma input, according to some examples.
FIG. 7B is a diagram illustrating an example of subsampling of luma motion information to obtain chroma motion information, according to some examples.
FIG. 8A is a diagram illustrating an example of a machine learning-based architecture with YUV (e.g., YUV 4:2:0) residuals, according to some examples.
FIG. 8B is a diagram illustrating an example operation of a 1x1 convolutional layer, according to some examples.
FIG. 9 is a diagram illustrating an example of a machine learning-based architecture (e.g., of an end-to-end neural network-based image and video coding system) that works directly with YUV input (Y, U, and V), such as YUV 4:2:0 input, according to some examples.
FIG. 10 is a diagram illustrating another example of a machine learning-based architecture (e.g., of an end-to-end neural network-based image and video coding system) that works directly with YUV input (Y, U, and V), such as YUV 4:2:0 input, according to some examples.
FIG. 11 is a flow diagram illustrating an example of a process for processing video data, according to some examples.
FIG. 12 illustrates an example computing device architecture of an example computing device that can implement the various techniques described herein.

Certain aspects and embodiments of this disclosure are provided below. Some of these aspects and embodiments may be applied independently, and some of them may be applied in combination, as would be apparent to those skilled in the art. In the following description, for purposes of explanation, specific details are set forth in order to provide a thorough understanding of embodiments of the application. However, it will be apparent that various embodiments may be practiced without these specific details. The figures and description are not intended to be restrictive.

The following description provides example embodiments only and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the description of the example embodiments will provide those skilled in the art with an enabling description for implementing an example embodiment. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the application as set forth in the appended claims.

Digital video data can include large amounts of data, particularly as the demand for high-quality video data continues to grow. For example, consumers of video data typically desire video of increasingly high quality, with high fidelity, resolution, frame rates, and the like. However, the large amount of video data required to meet such demands can place a significant burden on communication networks as well as on the devices that process and store the video data.

Various techniques can be used to code video data. Video coding can be performed according to a particular video coding standard. Example video coding standards include High Efficiency Video Coding (HEVC), Advanced Video Coding (AVC), Moving Picture Experts Group (MPEG) coding, and Versatile Video Coding (VVC). Video coding often uses prediction methods, such as inter-prediction or intra-prediction, that take advantage of redundancies present in video images or sequences. A common goal of video coding techniques is to compress video data into a form that uses a lower bit rate while avoiding or minimizing degradation of video quality. As the demand for video services grows and new video services become available, coding techniques with better coding efficiency, performance, and rate control are needed.

Machine learning (ML) based systems can be used to perform image and/or video coding. In general, ML is a subset of artificial intelligence (AI). ML systems can include algorithms and statistical models that computer systems can use to perform various tasks by relying on patterns and inference, without the use of explicit instructions. One example of an ML system is a neural network (also referred to as an artificial neural network), which may include an interconnected group of artificial neurons (e.g., neuron models). Neural networks may be used for various applications and/or devices, such as image and/or video coding, image analysis and/or computer vision applications, Internet Protocol (IP) cameras, Internet of Things (IoT) devices, autonomous vehicles, and service robots, among others.

Individual nodes in a neural network may emulate biological neurons by taking input data and performing simple operations on the data. The results of the simple operations performed on the input data are selectively passed on to other neurons. Weight values are associated with each vector and node in the network, and these values constrain how input data is related to output data. For example, the input data of each node may be multiplied by a corresponding weight value, and the products may be summed. The sum of the products may be adjusted by an optional bias, and an activation function may be applied to the result, yielding the node's output signal or "output activation" (sometimes referred to as an activation map or feature map). The weight values may initially be determined by an iterative flow of training data through the network (e.g., weight values are established during a training phase in which the network learns how to identify particular classes by their typical input data characteristics).
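
The per-node computation described above (multiply inputs by weights, sum the products, add an optional bias, apply an activation function) can be sketched in a few lines of Python. The function names and the choice of a sigmoid as the default activation are illustrative assumptions; the text does not prescribe a particular activation function.

```python
import math


def sigmoid(z):
    """Logistic activation, used here only as an example activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias=0.0, activation=sigmoid):
    """Weighted sum of inputs, plus an optional bias, passed through an activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum + bias)
```

For example, `neuron_output([1.0, 2.0], [0.5, -0.25])` forms the weighted sum 0.5 - 0.5 = 0 and applies the sigmoid to it. Training would adjust `weights` and `bias`; that step is outside this sketch.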

Different types of neural networks exist, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GANs), and multilayer perceptron (MLP) neural networks, among others. For example, a convolutional neural network (CNN) is a type of feed-forward artificial neural network. Convolutional neural networks may include collections of artificial neurons that each have a receptive field (e.g., a spatially localized region of an input space) and that collectively tile the input space. RNNs work on the principle of saving the output of a layer and feeding this output back to the input to help in predicting an outcome of the layer. A GAN is a form of generative neural network that can learn patterns in input data so that the neural network model can generate new synthetic outputs that reasonably could have come from the original dataset. A GAN can include two neural networks that operate together, including a generative neural network that generates a synthesized output and a discriminative neural network that evaluates the output for authenticity. In MLP neural networks, data may be fed into an input layer, and one or more hidden layers provide levels of abstraction of the data. Predictions may then be made on an output layer based on the abstracted data.

In layered neural network architectures (referred to as deep neural networks when multiple hidden layers are present), the output of a first layer of artificial neurons becomes an input to a second layer of artificial neurons, the output of the second layer of artificial neurons becomes an input to a third layer of artificial neurons, and so on. CNNs, for example, can be trained to recognize a hierarchy of features. Computation in CNN architectures may be distributed over a population of processing nodes, which may be configured in one or more computational chains. These multi-layered architectures may be trained one layer at a time and may be fine-tuned using back propagation.

In many cases, deep learning-based systems are designed as a combination of an autoencoder sub-network (the encoder sub-network) and a second sub-network (also referred to as a hyperprior network in some cases) responsible for learning a probabilistic model over the quantized latents used for entropy coding (a decoder sub-network). In some cases, there can be other sub-networks of the decoder. Such a deep learning-based system architecture can be viewed as a combination of a transform plus quantization module (or encoder sub-network) and an entropy modeling sub-network module.

Most existing deep learning-based architectures for video compression are designed to operate on non-subsampled input formats, such as RGB or YUV 4:4:4. However, video coding standards such as HEVC and VVC are designed to support the YUV 4:2:0 color format in their respective main profiles. To support the YUV 4:2:0 format, deep learning-based architectures designed to operate on non-subsampled input formats must be modified.

Systems, apparatuses, processes (also referred to as methods), and computer-readable media (collectively referred to as "systems and techniques") are described herein that provide an ML-based system (e.g., a deep learning-based system) that can use one color component of one or more frames (e.g., video frames) to estimate information for that color component and for other color component(s) of the frame(s). In some aspects, the ML-based system can be designed to process input data having luminance-chrominance (YUV) input formats. In such aspects, the ML-based system can use the luma component of both a current frame and a previously reconstructed frame (e.g., reconstructed by the ML-based system) to estimate motion information (e.g., flow information, such as optical flow information) for both the luma component and one or more chroma components. In some cases, after the motion information for the luma component is learned, a convolutional layer with down-sampling can be used to learn the motion information (e.g., flow information) for the one or more chroma components. In some cases, the motion information for the one or more chroma components can be obtained by directly subsampling the motion information for the luma component (e.g., without using a convolutional layer). Such techniques can be performed for all components of a frame. Using such techniques, the ML-based system can determine chroma motion information (e.g., flow information) without needing chroma information coded as part of the latent data or bitstream (e.g., reducing the need to transmit side information along with the chroma information).
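
To make the "convolutional layer with down-sampling" option more concrete, the sketch below applies a single stride-2 convolution to one channel of a luma flow field, producing a field at half resolution in each dimension. It is only a sketch under stated assumptions: in a trained system the kernel weights would be learned and the layer would typically mix several channels, whereas here a fixed 2x2 averaging kernel stands in for learned weights, and the function name `strided_conv2d` is hypothetical.

```python
def strided_conv2d(plane, kernel, stride=2):
    """Valid 2-D cross-correlation of one plane with one square kernel.

    plane is a list of rows; kernel is a k x k list of weights. With stride=2
    the output has roughly half the resolution of the input in each dimension.
    """
    k = len(kernel)
    out_h = (len(plane) - k) // stride + 1
    out_w = (len(plane[0]) - k) // stride + 1
    return [
        [
            sum(kernel[i][j] * plane[r * stride + i][c * stride + j]
                for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Placeholder weights standing in for a learned kernel (assumption of this sketch).
AVERAGING_KERNEL = [[0.25, 0.25], [0.25, 0.25]]
```

Applying `strided_conv2d(luma_dx, AVERAGING_KERNEL)` to a 4x4 horizontal-displacement plane gives a 2x2 plane; the same call would be repeated for the vertical-displacement plane. Unlike direct subsampling, each output value here depends on a neighbourhood of luma flow values.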

As described above, the ML-based system can be designed to process input data having a YUV input format. The YUV format includes a luminance channel (Y) and a pair of chrominance channels (U and V). The U channel may be referred to as the chrominance (or chroma)-blue channel, and the V channel may be referred to as the chrominance (or chroma)-red channel. In some cases, the luminance (Y) channel or component may also be referred to as a luma channel or component. In some cases, the chrominance (U and V) channels or components may also be referred to as chroma channels or components. YUV input formats may include YUV 4:2:0, YUV 4:4:4, and YUV 4:2:2, among others. In some cases, the systems and techniques described herein can be designed to handle other input formats, such as data having a Y-chroma blue (Cb)-chroma red (Cr) (YCbCr) format, a red-green-blue (RGB) format, and/or another format. The ML-based system described herein can encode and/or decode stand-alone frames (also referred to as images) and/or video data that includes multiple frames.
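As a brief illustration of how the listed YUV formats differ, the sketch below computes the size of each chroma plane from the luma plane size using the conventional subsampling ratios (4:4:4 no subsampling, 4:2:2 horizontal only, 4:2:0 horizontal and vertical). The function name `chroma_plane_size` is hypothetical.

```python
# Sketch only: chroma plane dimensions for common YUV sampling formats.

def chroma_plane_size(width, height, fmt):
    # (horizontal divisor, vertical divisor) applied to the luma dimensions
    divisors = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}
    dw, dh = divisors[fmt]
    # Round up so odd luma dimensions still get a covering chroma sample.
    return ((width + dw - 1) // dw, (height + dh - 1) // dh)


size_420 = chroma_plane_size(1920, 1080, "4:2:0")
size_422 = chroma_plane_size(1920, 1080, "4:2:2")
size_444 = chroma_plane_size(1920, 1080, "4:4:4")
```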

Additional details and additional aspects of the present disclosure are described with reference to the figures.

FIG. 1 illustrates an example implementation of a system-on-a-chip (SOC) 100, which may include a central processing unit (CPU) 102 or a multi-core CPU configured to perform one or more of the functions described herein. Parameters or variables (e.g., neural signals and synaptic weights), system parameters associated with a computational device (e.g., a neural network with weights), delays, frequency bin information, and task information, among other information, may be stored in a memory block associated with a neural processing unit (NPU) 108, in a memory block associated with the CPU 102, in a memory block associated with a graphics processing unit (GPU) 104, in a memory block associated with a digital signal processor (DSP) 106, in a memory block 118, and/or may be distributed across multiple blocks. Instructions executed at the CPU 102 may be loaded from a program memory associated with the CPU 102 or may be loaded from the memory block 118.

The SOC 100 may also include additional processing blocks tailored to specific functions, such as the GPU 104, the DSP 106, a connectivity block 110, which may include fifth generation (5G) connectivity, fourth generation long term evolution (4G LTE) connectivity, Wi-Fi connectivity, USB connectivity, Bluetooth connectivity, and the like, and a multimedia processor 112 that may, for example, detect and recognize gestures. In one implementation, the NPU is implemented in the CPU 102, the DSP 106, and/or the GPU 104. The SOC 100 may also include a sensor processor 114, image signal processors (ISPs) 116, and/or a navigation module 120, which may include a global positioning system.

The SOC 100 may be based on an ARM instruction set. In an aspect of the present disclosure, the instructions loaded into the CPU 102 may include code to search a lookup table (LUT) for a stored multiplication result corresponding to the product of an input value and a filter weight. The instructions loaded into the CPU 102 may also include code to disable a multiplier during the multiplication operation when a lookup table hit for the product is detected. In addition, the instructions loaded into the CPU 102 may include code to store a computed product of the input value and the filter weight when a lookup table miss for the product is detected.
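The lookup-table behavior described above (search on each multiply, skip the multiplier on a hit, compute and store on a miss) can be sketched in software as follows. The class name `LutMultiplier`, the dictionary-backed table, and the hit/miss counters are illustrative assumptions; the disclosure does not specify how the table is organized.

```python
# Sketch only: a software analogue of LUT-assisted multiplication.

class LutMultiplier:
    def __init__(self):
        self.table = {}   # (input_value, filter_weight) -> stored product
        self.hits = 0
        self.misses = 0

    def multiply(self, input_value, filter_weight):
        key = (input_value, filter_weight)
        if key in self.table:
            # Hit: return the stored result; the multiplier is not used.
            self.hits += 1
            return self.table[key]
        # Miss: compute the product and store it for later lookups.
        self.misses += 1
        product = input_value * filter_weight
        self.table[key] = product
        return product


lut = LutMultiplier()
first = lut.multiply(3, 5)    # miss, computed and stored
second = lut.multiply(3, 5)   # hit, read back from the table
```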

The SOC 100 and/or components thereof may be configured to perform video compression and/or decompression (also referred to as video encoding and/or decoding, collectively referred to as video coding) using machine learning techniques according to aspects of the present disclosure discussed herein. By using deep learning architectures to perform video compression and/or decompression, aspects of the present disclosure can increase the efficiency of video compression and/or decompression on a device. For example, a device using the video coding techniques described can compress video more efficiently using the machine learning-based techniques and transmit the compressed video to another device, and the other device can decompress the compressed video more efficiently using the machine learning-based techniques described herein.

As noted above, a neural network is an example of a machine learning system, and can include an input layer, one or more hidden layers, and an output layer. Data is provided from input nodes of the input layer, processing is performed by hidden nodes of the one or more hidden layers, and an output is produced through output nodes of the output layer. Deep learning networks typically include multiple hidden layers. Each layer of the neural network can include feature maps or activation maps that can include artificial neurons (or nodes). A feature map can include a filter, a kernel, or the like. The nodes can include one or more weights used to indicate an importance of the nodes of one or more of the layers. In some cases, a deep learning network can have a series of many hidden layers, with early layers being used to determine simple and low-level characteristics of an input, and later layers building up a hierarchy of more complex and abstract characteristics.

A deep learning architecture may learn a hierarchy of features. If presented with visual data, for example, the first layer may learn to recognize relatively simple features, such as edges, in the input stream. In another example, if presented with auditory data, the first layer may learn to recognize spectral power in specific frequencies. The second layer, taking the output of the first layer as input, may learn to recognize combinations of features, such as simple shapes for visual data or combinations of sounds for auditory data. For instance, higher layers may learn to represent complex shapes in visual data or words in auditory data. Still higher layers may learn to recognize common visual objects or spoken phrases.

Deep learning architectures may perform especially well when applied to problems that have a natural hierarchical structure. For example, the classification of motorized vehicles may benefit from first learning to recognize wheels, windshields, and other features. These features may be combined at higher layers in different ways to recognize cars, trucks, and airplanes.

Neural networks may be designed with a variety of connectivity patterns. In feed-forward networks, information is passed from lower to higher layers, with each neuron in a given layer communicating to neurons in higher layers. A hierarchical representation may be built up in successive layers of a feed-forward network, as described above. Neural networks may also have recurrent or feedback (also called top-down) connections. In a recurrent connection, the output from a neuron in a given layer may be communicated to another neuron in the same layer. A recurrent architecture may be helpful in recognizing patterns that span more than one of the input data chunks that are delivered to the neural network in a sequence. A connection from a neuron in a given layer to a neuron in a lower layer is called a feedback (or top-down) connection. A network with many feedback connections may be helpful when the recognition of a high-level concept may aid in discriminating the particular low-level features of an input.

The connections between layers of a neural network may be fully connected or locally connected. FIG. 2A illustrates an example of a fully connected neural network 202. In a fully connected neural network 202, a neuron in a first layer may communicate its output to every neuron in a second layer, so that each neuron in the second layer will receive input from every neuron in the first layer. FIG. 2B illustrates an example of a locally connected neural network 204. In a locally connected neural network 204, a neuron in a first layer may be connected to a limited number of neurons in the second layer. More generally, a locally connected layer of the locally connected neural network 204 may be configured so that each neuron in a layer will have the same or a similar connectivity pattern, but with connection strengths that may have different values (e.g., 210, 212, 214, and 216). The locally connected connectivity pattern may give rise to spatially distinct receptive fields in a higher layer, because the higher layer neurons in a given region may receive inputs that are tuned through training to the properties of a restricted portion of the total input to the network.

One example of a locally connected neural network is a convolutional neural network. FIG. 2C illustrates an example of a convolutional neural network 206. The convolutional neural network 206 may be configured such that the connection strengths associated with the inputs for each neuron in the second layer are shared (e.g., 208). Convolutional neural networks may be well suited to problems in which the spatial location of inputs is meaningful. The convolutional neural network 206 may be used to perform one or more aspects of video compression and/or decompression, according to aspects of the present disclosure.

One type of convolutional neural network is a deep convolutional network (DCN). FIG. 2D illustrates a detailed example of a DCN 200 designed to recognize visual features from an image 226 input from an image capturing device 230, such as a car-mounted camera. The DCN 200 of the current example may be trained to identify traffic signs and a number provided on the traffic sign. Of course, the DCN 200 may be trained for other tasks, such as identifying lane markings or identifying traffic lights.

The DCN 200 may be trained with supervised learning. During training, the DCN 200 may be presented with an image, such as the image 226 of a speed limit sign, and a forward pass may then be computed to produce an output 222. The DCN 200 may include a feature extraction section and a classification section. Upon receiving the image 226, a convolutional layer 232 may apply convolutional kernels (not shown) to the image 226 to generate a first set of feature maps 218. As an example, the convolutional kernel for the convolutional layer 232 may be a 5x5 kernel that generates 28x28 feature maps. In the present example, because four different feature maps are generated in the first set of feature maps 218, four different convolutional kernels were applied to the image 226 at the convolutional layer 232. The convolutional kernels may also be referred to as filters or convolutional filters.
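To make the 5x5-kernel, 28x28-feature-map example concrete, the sketch below applies one kernel without padding and with a stride of one, so each output dimension is the input dimension minus 5 plus 1. The 32x32 input size, the averaging kernel, and the function name `conv2d_valid` are assumptions chosen only so that the output comes out 28x28; the disclosure does not state the input size, padding, or stride.

```python
# Sketch only: a single-channel "valid" 2D convolution in the CNN sense
# (cross-correlation, i.e. the kernel is not flipped), stride 1, no padding.

def conv2d_valid(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


image = [[1.0] * 32 for _ in range(32)]        # assumed 32x32 input
kernel = [[1.0 / 25] * 5 for _ in range(5)]    # 5x5 averaging kernel
feature_map = conv2d_valid(image, kernel)      # 28 rows by 28 columns
```

Applying four different kernels in this way would yield four feature maps, matching the count in the example above.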

The first set of feature maps 218 may be subsampled by a max pooling layer (not shown) to generate a second set of feature maps 220. The max pooling layer reduces the size of the first set of feature maps 218. That is, the size of the second set of feature maps 220, such as 14x14, is less than the size of the first set of feature maps 218, such as 28x28. The reduced size provides similar information to a subsequent layer while reducing memory consumption. The second set of feature maps 220 may be further convolved via one or more subsequent convolutional layers (not shown) to generate one or more subsequent sets of feature maps (not shown).
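The 28x28 to 14x14 reduction corresponds to max pooling over non-overlapping 2x2 windows, which can be sketched as follows. The window size and stride of two are inferred from the stated sizes rather than given explicitly, and the function name `max_pool_2x2` is hypothetical.

```python
# Sketch only: 2x2 max pooling with stride 2 on a single feature map.

def max_pool_2x2(fmap):
    return [
        [
            max(fmap[y][x], fmap[y][x + 1], fmap[y + 1][x], fmap[y + 1][x + 1])
            for x in range(0, len(fmap[0]) - 1, 2)
        ]
        for y in range(0, len(fmap) - 1, 2)
    ]


# A 28x28 map whose value at (row, col) is row * 28 + col.
fmap = [[row * 28 + col for col in range(28)] for row in range(28)]
pooled = max_pool_2x2(fmap)   # 14 rows by 14 columns
```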

In the example of FIG. 2D, the second set of feature maps 220 is convolved to generate a first feature vector 224. Furthermore, the first feature vector 224 is further convolved to generate a second feature vector 228. Each feature of the second feature vector 228 may include a number that corresponds to a possible feature of the image 226, such as "sign," "60," and "100." A softmax function (not shown) may convert the numbers in the second feature vector 228 to a probability. As such, an output 222 of the DCN 200 is a probability of the image 226 including one or more features.
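The softmax step can be sketched as below. The input scores are made-up values standing in for entries of the second feature vector 228, and subtracting the maximum before exponentiating is a common numerical-stability choice rather than something the disclosure specifies.

```python
# Sketch only: softmax converting raw scores into probabilities.
import math


def softmax(scores):
    m = max(scores)                            # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate features.
probs = softmax([2.0, 1.0, 0.1])
```

The resulting values are positive, sum to one, and preserve the ordering of the input scores.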

In the present example, the probabilities in the output 222 for "sign" and "60" are higher than the probabilities of the others of the output 222, such as "30," "40," "50," "70," "80," "90," and "100." Before training, the output 222 produced by the DCN 200 is likely to be incorrect. Thus, an error may be calculated between the output 222 and a target output. The target output is the ground truth of the image 226 (e.g., "sign" and "60"). The weights of the DCN 200 may then be adjusted so the output 222 of the DCN 200 is more closely aligned with the target output.

To adjust the weights, a learning algorithm may compute a gradient vector for the weights. The gradient may indicate an amount by which the error would increase or decrease if a weight were adjusted. At the top layer, the gradient may correspond directly to the value of a weight connecting an activated neuron in the penultimate layer and a neuron in the output layer. In lower layers, the gradient may depend on the value of the weights and on the computed error gradients of the higher layers. The weights may then be adjusted to reduce the error. This manner of adjusting the weights may be referred to as "back propagation" because it involves a "backward pass" through the neural network.

In practice, the error gradient of the weights may be calculated over a small number of examples, so that the calculated gradient approximates the true error gradient. This approximation method may be referred to as stochastic gradient descent. Stochastic gradient descent may be repeated until the achievable error rate of the entire system has stopped decreasing or until the error rate has reached a target level. After learning, the DCN may be presented with new images, and a forward pass through the network may yield an output 222 that may be considered an inference or a prediction of the DCN.
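The idea of approximating the true gradient from a small number of examples can be illustrated on a toy problem. The sketch below fits a single weight `w` in the model `y = w * x` by mini-batch stochastic gradient descent on a squared-error loss. The model, learning rate, batch size, step count, and the name `sgd_fit` are all assumptions for illustration and are unrelated to how the DCN 200 itself would be trained.

```python
# Sketch only: mini-batch stochastic gradient descent on a one-weight model.
import random


def sgd_fit(samples, lr=0.05, batch_size=4, steps=500, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(samples, batch_size)   # small random subset
        # Gradient of 0.5 * (w*x - y)^2 with respect to w, averaged over
        # the batch; this approximates the gradient over all samples.
        grad = sum((w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad                            # step against the gradient
    return w


# Noise-free data generated with a true weight of 3.0.
samples = [(k / 10, 3.0 * k / 10) for k in range(1, 21)]
w = sgd_fit(samples)
```

A fixed step count is used here for simplicity; the stopping rules described above (error no longer decreasing, or a target error reached) would replace it in practice.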

Deep belief networks (DBNs) are probabilistic models comprising multiple layers of hidden nodes. DBNs may be used to extract a hierarchical representation of training data sets. A DBN may be obtained by stacking up layers of Restricted Boltzmann Machines (RBMs). An RBM is a type of artificial neural network that can learn a probability distribution over a set of inputs. Because RBMs can learn a probability distribution in the absence of information about the class to which each input should be categorized, RBMs are often used in unsupervised learning. Using a hybrid unsupervised and supervised paradigm, the bottom RBMs of a DBN may be trained in an unsupervised manner and may serve as feature extractors, and the top RBM may be trained in a supervised manner (on a joint distribution of inputs from the previous layer and target classes) and may serve as a classifier.

Deep convolutional networks (DCNs) are networks of convolutional networks, configured with additional pooling and normalization layers. DCNs have achieved state-of-the-art performance on many tasks. DCNs can be trained using supervised learning, in which both the input and output targets are known for many exemplars and are used to modify the weights of the network by use of gradient descent methods.

DCNs may be feed-forward networks. In addition, as described above, the connections from a neuron in a first layer of a DCN to a group of neurons in the next higher layer are shared across the neurons in the first layer. The feed-forward and shared connections of DCNs may be exploited for fast processing. The computational burden of a DCN may be much less, for example, than that of a similarly sized neural network that comprises recurrent or feedback connections.

The processing of each layer of a convolutional network may be considered a spatially invariant template or basis projection. If the input is first decomposed into multiple channels, such as the red, green, and blue channels of a color image, then the convolutional network trained on that input may be considered three-dimensional, with two spatial dimensions along the axes of the image and a third dimension capturing color information. The outputs of the convolutional connections may be considered to form a feature map in the subsequent layer, with each element of the feature map (e.g., 220) receiving input from a range of neurons in the previous layer (e.g., feature maps 218) and from each of the multiple channels. The values in the feature map may be further processed with a non-linearity, such as a rectification, max(0,x). Values from adjacent neurons may be further pooled, which corresponds to down-sampling, and may provide additional local invariance and dimensionality reduction.
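The rectification max(0,x) mentioned above is applied element-wise to a feature map, as in this minimal sketch (the function name `relu` and the sample values are illustrative):

```python
# Sketch only: element-wise rectification, max(0, x), over a feature map.

def relu(feature_map):
    return [[max(0.0, v) for v in row] for row in feature_map]


rectified = relu([[-1.5, 2.0], [0.0, -0.1]])
```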

FIG. 3 is a block diagram illustrating an example of a deep convolutional network 350. The deep convolutional network 350 may include multiple different types of layers based on connectivity and weight sharing. As shown in FIG. 3, the deep convolutional network 350 includes the convolution blocks 354A, 354B. Each of the convolution blocks 354A, 354B may be configured with a convolution layer (CONV) 356, a normalization layer (LNorm) 358, and a max pooling layer (MAX POOL) 360.

The convolution layers 356 may include one or more convolutional filters, which may be applied to the input data 352 to generate a feature map. Although only two convolution blocks 354A, 354B are shown, the present disclosure is not so limiting, and instead, any number of convolution blocks (e.g., blocks 354A, 354B) may be included in the deep convolutional network 350 according to design preference. The normalization layer 358 may normalize the output of the convolution filters. For example, the normalization layer 358 may provide whitening or lateral inhibition. The max pooling layer 360 may provide down-sampling aggregation over space for local invariance and dimensionality reduction.

The parallel filter banks, for example, of a deep convolutional network may be loaded on the CPU 102 or GPU 104 of the SOC 100 to achieve high performance and low power consumption. In alternative embodiments, the parallel filter banks may be loaded on the DSP 106 or an ISP 116 of the SOC 100. In addition, the deep convolutional network 350 may access other processing blocks that may be present on the SOC 100, such as the sensor processor 114 and the navigation module 120, dedicated, respectively, to sensors and navigation.

The deep convolutional network 350 may also include one or more fully connected layers, such as layer 362A (labeled "FC1") and layer 362B (labeled "FC2"). The deep convolutional network 350 may further include a logistic regression (LR) layer 364. Between each layer 356, 358, 360, 362A, 362B, 364 of the deep convolutional network 350 are weights (not shown) that are to be updated. The output of each of the layers (e.g., 356, 358, 360, 362A, 362B, 364) may serve as an input of a succeeding one of the layers (e.g., 356, 358, 360, 362A, 362B, 364) in the deep convolutional network 350 to learn hierarchical feature representations from the input data 352 (e.g., images, audio, video, sensor data, and/or other input data) supplied at the first of the convolution blocks 354A. The output of the deep convolutional network 350 is a classification score 366 for the input data 352. The classification score 366 may be a set of probabilities, where each probability is the probability of the input data including a feature from a set of features.

As noted above, digital video data can include large amounts of data, which can place a significant burden on communication networks as well as on devices that process and store the video data. For instance, recording uncompressed video content generally results in large file sizes that greatly increase as the resolution of the recorded video content increases. In one illustrative example, uncompressed 16-bit per channel video recorded in 1080p/24 (e.g., a resolution of 1920 pixels in width and 1080 pixels in height, with 24 frames per second captured) may occupy 12.4 megabytes per frame, or 297.6 megabytes per second. Uncompressed 16-bit per channel video recorded in 4K resolution at 24 frames per second may occupy 49.8 megabytes per frame, or 1195.2 megabytes per second.
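The per-frame figures above can be reproduced with a short calculation, under assumptions the text does not state explicitly: three color channels stored at full resolution (no chroma subsampling), 16 bits (2 bytes) per channel, a 3840x2160 frame for "4K", and 1 megabyte taken as 10^6 bytes. The function name `frame_bytes` is hypothetical. Under these assumptions, the quoted per-second figures match the rounded per-frame values multiplied by 24 (12.4 x 24 and 49.8 x 24).

```python
# Sketch only: uncompressed frame size under the stated assumptions.

def frame_bytes(width, height, channels=3, bits_per_channel=16):
    return width * height * channels * bits_per_channel // 8


hd_bytes = frame_bytes(1920, 1080)     # 1080p frame
uhd_bytes = frame_bytes(3840, 2160)    # assumed 4K frame size

hd_mb = round(hd_bytes / 1e6, 1)       # megabytes per frame, rounded
uhd_mb = round(uhd_bytes / 1e6, 1)

hd_mb_per_s = hd_mb * 24               # rounded per-frame value times 24 fps
uhd_mb_per_s = uhd_mb * 24
```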

Network bandwidth is another constraint for which large video files can become problematic. For example, video content is oftentimes delivered over wireless networks (e.g., via LTE, LTE-Advanced, New Radio (NR), WiFi™, Bluetooth™, or other wireless networks), and can make up a large portion of consumer internet traffic. Despite advances in the amount of available bandwidth in wireless networks, it may still be desirable to reduce the amount of bandwidth used to deliver video content in these networks.

Because uncompressed video content can result in large files that may involve sizable memory for physical storage and considerable bandwidth for transmission, video coding techniques can be utilized to compress and then decompress such video content.

To reduce the size of video content (and thus the amount of storage involved in storing it) and the amount of bandwidth involved in delivering it, various video coding techniques can be performed according to a particular video coding standard, such as HEVC, AVC, MPEG, or VVC, among others. Video coding often uses prediction methods such as inter-prediction or intra-prediction, which take advantage of redundancies present in video images or sequences. A common goal of video coding techniques is to compress video data into a form that uses a lower bit rate, while avoiding or minimizing degradations in the video quality. As the demand for video services grows and new video services become available, coding techniques with better coding efficiency, performance, and rate control are needed.

In general, an encoding device encodes video data according to a video coding standard to generate an encoded video bitstream. In some examples, an encoded video bitstream (or "video bitstream" or "bitstream") is a series of one or more coded video sequences. The encoding device can generate coded representations of pictures by partitioning each picture into multiple slices. A slice is independent of other slices, so that information in the slice is coded without dependency on data from other slices within the same picture. A slice includes one or more slice segments, including an independent slice segment and, if present, one or more dependent slice segments that depend on previous slice segments. In HEVC, the slices are partitioned into coding tree blocks (CTBs) of luma samples and chroma samples. A CTB of luma samples and one or more CTBs of chroma samples, along with syntax for the samples, are referred to as a coding tree unit (CTU). A CTU may also be referred to as a "tree block" or a "largest coding unit" (LCU). A CTU is the basic processing unit for HEVC encoding. A CTU can be split into multiple coding units (CUs) of varying sizes. A CU contains luma and chroma sample arrays that are referred to as coding blocks (CBs).
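The splitting of a CTU into CUs of varying sizes can be sketched as a recursive quadtree. The sketch below is a simplified illustration only: the function name, the 64 x 64 CTU size, the minimum CU size, and the split rule are hypothetical, whereas a real encoder would typically choose splits with a rate-distortion decision.

```python
def split_ctu(x, y, size, should_split, min_size=8):
    """Return a list of (x, y, size) coding units covering a square block.

    should_split is a caller-supplied (hypothetical) decision function.
    """
    if size > min_size and should_split(x, y, size):
        half = size // 2
        cus = []
        for dy in (0, half):
            for dx in (0, half):
                cus.extend(split_ctu(x + dx, y + dy, half, should_split, min_size))
        return cus
    return [(x, y, size)]


# Hypothetical rule: keep splitting the block at the top-left corner down to 16x16.
cus = split_ctu(0, 0, 64, lambda x, y, s: x == 0 and y == 0 and s > 16)
for cu in cus:
    print(cu)
```

With this rule the 64 x 64 CTU yields four 16 x 16 CUs in its top-left quadrant and three 32 x 32 CUs elsewhere, and the CUs together cover the CTU exactly once.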

The luma and chroma CBs can be further split into prediction blocks (PBs). A PB is a block of samples of the luma component or a chroma component that uses the same motion parameters for inter-prediction or intra-block copy (IBC) prediction (when available or enabled for use). The luma PB and one or more chroma PBs, together with associated syntax, form a prediction unit (PU). For inter-prediction, a set of motion parameters (e.g., one or more motion vectors, reference indices, or the like) is signaled in the bitstream for each PU and is used for inter-prediction of the luma PB and the one or more chroma PBs. The motion parameters can also be referred to as motion information. A CB can also be partitioned into one or more transform blocks (TBs). A TB represents a square block of samples of a color component on which a residual transform (e.g., the same two-dimensional transform in some cases) is applied for coding a prediction residual signal. A transform unit (TU) represents the TBs of luma and chroma samples and corresponding syntax elements. Transform coding is described in more detail below.

According to the HEVC standard, transforms may be performed using TUs. The TUs may be sized based on the size of PUs within a given CU. The TUs are typically the same size as or smaller than the PUs. In some examples, residual samples corresponding to a CU may be subdivided into smaller units using a quadtree structure known as a residual quad tree (RQT). Leaf nodes of the RQT may correspond to TUs. Pixel difference values associated with the TUs may be transformed to produce transform coefficients. The transform coefficients may then be quantized by the encoding device.

Once the pictures of the video data are partitioned into CUs, the encoding device predicts each PU using a prediction mode. The prediction unit or prediction block is then subtracted from the original video data to obtain residuals (described below). For each CU, a prediction mode may be signaled inside the bitstream using syntax data. A prediction mode may include intra-prediction (or intra-picture prediction) or inter-prediction (or inter-picture prediction). Intra-prediction uses the correlation between spatially neighboring samples within a picture. For example, using intra-prediction, each PU is predicted from neighboring image data in the same picture using, for example, DC prediction to find an average value for the PU, planar prediction to fit a planar surface to the PU, direction prediction to extrapolate from neighboring data, or any other suitable type of prediction. Inter-prediction uses the temporal correlation between pictures to derive a motion-compensated prediction for a block of image samples. For example, using inter-prediction, each PU is predicted using motion-compensated prediction from image data in one or more reference pictures (before or after the current picture in output order). The decision whether to code a picture area using inter-picture or intra-picture prediction may be made, for example, at the CU level.
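The DC prediction mentioned above can be illustrated with a minimal sketch that fills a block with the rounded average of its reconstructed neighbors. The function name and sample values are hypothetical, and the sketch omits details that a standard such as HEVC specifies (for example, boundary smoothing and handling of unavailable neighbors).

```python
def dc_predict(top, left, size):
    """Fill a size x size block with the rounded mean of neighboring samples.

    top and left are the reconstructed samples above and to the left of the
    block (simplified; availability checks are omitted).
    """
    neighbors = list(top) + list(left)
    dc = (sum(neighbors) + len(neighbors) // 2) // len(neighbors)
    return [[dc] * size for _ in range(size)]


# Hypothetical neighboring samples for a 4x4 block.
pred = dc_predict(top=[100, 102, 104, 106], left=[98, 100, 102, 104], size=4)
print(pred[0])
```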

After performing prediction using intra- and/or inter-prediction, the encoding device can perform transformation and quantization. For example, following prediction, the encoding device may calculate residual values corresponding to the PU. Residual values may comprise pixel difference values between the current block of pixels being coded (the PU) and the prediction block used to predict the current block (e.g., the predicted version of the current block). For example, after generating a prediction block (e.g., by performing inter-prediction or intra-prediction), the encoding device can generate a residual block by subtracting the prediction block from the current block. The residual block includes a set of pixel difference values that quantify differences between pixel values of the current block and pixel values of the prediction block. In some examples, the residual block may be represented in a two-dimensional block format (e.g., a two-dimensional matrix or array of pixel values). In such examples, the residual block is a two-dimensional representation of the pixel values.
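The residual computation described above amounts to an element-wise subtraction. The following minimal sketch uses hypothetical 2 x 2 blocks represented as nested lists; the function name is illustrative.

```python
def residual_block(current, prediction):
    """Element-wise difference between the current block and its prediction."""
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(current, prediction)]


# Hypothetical sample values.
current = [[104, 101], [99, 103]]
prediction = [[102, 102], [102, 102]]
residual = residual_block(current, prediction)
print(residual)
```

A good prediction leaves residual values close to zero, which are cheaper to code after transformation and quantization.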

Any residual data that may be remaining after prediction is performed is transformed using a block transform, which may be based on a discrete cosine transform, a discrete sine transform, an integer transform, a wavelet transform, another suitable transform function, or any combination thereof. In some cases, one or more block transforms (e.g., of size 32 x 32, 16 x 16, 8 x 8, 4 x 4, or another suitable size) may be applied to the residual data in each CU. In some embodiments, a TU may be used for the transform and quantization processes implemented by the encoding device. A given CU having one or more PUs may also include one or more TUs. As described in further detail below, the residual values may be transformed into transform coefficients using the block transforms, and may then be quantized and scanned using TUs to produce serialized transform coefficients for entropy coding.
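As one example of a block transform, the sketch below applies an orthonormal floating-point DCT-II separably to the rows and columns of a block. This is an assumption made for illustration: practical codecs typically use integer approximations, and the function names are hypothetical.

```python
import math


def dct_1d(v):
    """Orthonormal DCT-II of a one-dimensional sequence."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out


def dct_2d(block):
    """Separable 2D DCT: transform each row, then each column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


# A flat 4x4 block: all of the energy lands in the DC coefficient.
coeffs = dct_2d([[10] * 4 for _ in range(4)])
print(round(coeffs[0][0], 6))
```

For smooth content most coefficients are near zero, which is what makes the subsequent quantization and entropy coding effective.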

The encoding device may perform quantization of the transform coefficients. Quantization provides further compression by reducing the amount of data used to represent the coefficients. For example, quantization may reduce the bit depth associated with some or all of the coefficients. In one example, a coefficient with an n-bit value may be rounded down to an m-bit value during quantization, where n is greater than m.
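The n-bit to m-bit example above can be sketched as dropping the n - m least significant bits. The function name is hypothetical, and the sketch assumes non-negative integer values; real quantizers typically also involve a quantization step size and signed coefficients.

```python
def round_down_bits(value, n, m):
    """Round an n-bit non-negative value down to an m-bit value (n > m)."""
    if not n > m > 0:
        raise ValueError("expected n > m > 0")
    if not 0 <= value < (1 << n):
        raise ValueError("value does not fit in n bits")
    return value >> (n - m)


# Hypothetical example: a 10-bit value reduced to 8 bits.
q = round_down_bits(1000, n=10, m=8)
print(q)
```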

Once quantization is performed, the coded video bitstream includes quantized transform coefficients, prediction information (e.g., prediction modes, motion vectors, block vectors, or the like), partitioning information, and any other suitable data, such as other syntax data. The different elements of the coded video bitstream may then be entropy encoded by the encoding device. In some examples, the encoding device may use a predefined scan order to scan the quantized transform coefficients to produce a serialized vector that can be entropy encoded. In some examples, the encoding device may perform an adaptive scan. After scanning the quantized transform coefficients to form a vector (e.g., a one-dimensional vector), the encoding device may entropy encode the vector. For example, the encoding device may use context adaptive variable length coding, context adaptive binary arithmetic coding, syntax-based context adaptive binary arithmetic coding, probability interval partitioning entropy coding, or another suitable entropy encoding technique.
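A predefined scan order that serializes a two-dimensional coefficient block can be sketched as follows. The zigzag order used here is an assumption chosen because it is widely known (for example, from JPEG); the disclosure does not specify a particular order, and HEVC defines its own diagonal, horizontal, and vertical scans. The function name is hypothetical.

```python
def zigzag_scan(block):
    """Serialize a square block along anti-diagonals, alternating direction."""
    n = len(block)
    order = sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Odd diagonals run top to bottom (row ascending), even ones the reverse.
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )
    return [block[r][c] for r, c in order]


# The entries are numbered in the order the scan is expected to visit them.
block = [[1, 2, 6],
         [3, 5, 7],
         [4, 8, 9]]
vector = zigzag_scan(block)
print(vector)
```

Such scans tend to group the low-frequency coefficients first and the (often zero) high-frequency coefficients last, which suits entropy coding.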

The encoding device can store the encoded video bitstream and/or can send the encoded video bitstream data over a communications link to a receiving device, which can include a decoding device. The decoding device may decode the encoded video bitstream data by entropy decoding (e.g., using an entropy decoder) and extracting the elements of one or more coded video sequences making up the encoded video data. The decoding device may then rescale and perform an inverse transform on the encoded video bitstream data. Residual data is then passed to a prediction stage of the decoding device. The decoding device then predicts a block of pixels (e.g., a PU) using intra-prediction, inter-prediction, IBC, and/or another type of prediction. In some examples, the prediction is added to the output of the inverse transform (the residual data). The decoding device may output the decoded video to a video destination device, which may include a display or other output device for displaying the decoded video data to a consumer of the content.
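The final decoder step, adding the prediction to the residual data, can be sketched as below. Clipping to the valid sample range is an assumption that reflects common practice and is not stated in the passage; the function name and values are hypothetical.

```python
def reconstruct_block(prediction, residual, bit_depth=8):
    """Add the residual to the prediction and clip to the sample range."""
    max_val = (1 << bit_depth) - 1
    return [[min(max(p + r, 0), max_val) for p, r in zip(pred_row, res_row)]
            for pred_row, res_row in zip(prediction, residual)]


# Hypothetical values, including one sum that exceeds the 8-bit range.
recon = reconstruct_block([[102, 250], [5, 102]], [[2, 10], [-9, 1]])
print(recon)
```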

Video coding systems and techniques defined by the various video coding standards (e.g., the HEVC video coding techniques described above) may be able to retain much of the information in raw video content and may be defined a priori based on signal processing and information theory concepts. However, in some cases, a machine learning (ML)-based image and/or video system, such as a deep learning-based end-to-end video coding (DLEC) system, can provide advantages over non-ML-based image and video coding systems. As noted above, many deep learning-based systems are designed as a combination of an autoencoder sub-network (the encoder sub-network) and a second sub-network responsible for learning a probabilistic model over the quantized latents used for entropy coding. Such an architecture can be viewed as a combination of a transform plus quantization module (the encoder sub-network) and an entropy modeling sub-network module.

FIG. 4 depicts a system 400 that includes a device 402 configured to perform video encoding and decoding using a deep learning-based system 410. The device 402 is coupled to a camera 407 and a storage medium 414 (e.g., a data storage device). In some implementations, the camera 407 is configured to provide image data 408 (e.g., a video data stream) to the processor 404 for encoding by the deep learning-based system 410. In some implementations, the device 402 can be coupled to and/or can include multiple cameras (e.g., a dual-camera system, three cameras, or another number of cameras). In some cases, the device 402 can be coupled to a microphone and/or another input device (e.g., a keyboard, a mouse, a touch input device such as a touchscreen and/or touchpad, and/or another input device). In some examples, the camera 407, the storage medium 414, the microphone, and/or the other input device can be part of the device 402.

The device 402 is also coupled to a second device 490 via a transmission medium 418, such as one or more wireless networks, one or more wired networks, or a combination thereof. For example, the transmission medium 418 can include a channel provided by a wireless network, a wired network, or a combination of a wired and wireless network. The transmission medium 418 may form part of a packet-based network, such as a local area network, a wide-area network, or a global network such as the Internet. The transmission medium 418 may include routers, switches, base stations, or any other equipment that may be useful to facilitate communication from the source device to the receiving device. A wireless network may include any wireless interface or combination of wireless interfaces and may include any suitable wireless network (e.g., the Internet or other wide area network, a packet-based network, WiFi™, radio frequency (RF), UWB, WiFi-Direct, cellular, Long-Term Evolution (LTE), WiMax™, or the like). A wired network may include any wired interface (e.g., fiber, Ethernet, powerline Ethernet, Ethernet over coaxial cable, digital subscriber line (DSL), or the like). The wired and/or wireless networks may be implemented using various equipment, such as base stations, routers, access points, bridges, gateways, switches, or the like. The encoded video data may be modulated according to a communication standard, such as a wireless communication protocol, and transmitted to the receiving device.

The device 402 includes one or more processors (referred to herein as "processor") 404 coupled to a memory 406, a first interface ("I/F 1") 412, and a second interface ("I/F 2") 416. The processor 404 is configured to receive image data 408 from the camera 407, from the memory 406, and/or from the storage medium 414. The processor 404 is coupled to the storage medium 414 via the first interface 412 (e.g., via a memory bus) and is coupled to the transmission medium 418 via the second interface 416 (e.g., a network interface device, a wireless transceiver and antenna, one or more other network interface devices, or a combination thereof).

The processor 404 includes the deep learning-based system 410. The deep learning-based system 410 includes an encoder portion 462 and a decoder portion 466. In some implementations, the deep learning-based system 410 can include one or more autoencoders. The encoder portion 462 is configured to receive input data 470 and to process the input data 470 to generate output data 474 at least partially based on the input data 470.

In some implementations, the encoder portion 462 of the deep learning-based system 410 is configured to perform lossy compression on the input data 470 to generate the output data 474, so that the output data 474 has fewer bits than the input data 470. The encoder portion 462 can be trained to compress the input data 470 (e.g., images or video frames) without using motion compensation based on any previous representations (e.g., one or more previously reconstructed frames). For example, the encoder portion 462 can compress a video frame using video data only from that video frame, without using any data of previously reconstructed frames. Video frames processed by the encoder portion 462 in this way can be referred to herein as intra-predicted frames (I-frames). In some examples, I-frames can be generated using traditional video coding techniques (e.g., according to HEVC, VVC, MPEG-4, or another video coding standard). In such examples, the processor 404 may include or be coupled with a video coding device (e.g., an encoding device) configured to perform block-based intra-prediction, such as that described above with respect to the HEVC standard. In such examples, the deep learning-based system 410 may be excluded from the processor 404.

In some implementations, the encoder portion 462 of the deep learning-based system 410 can be trained to compress the input data 470 (e.g., video frames) using motion compensation based on previous representations (e.g., one or more previously reconstructed frames). For example, the encoder portion 462 can compress a video frame using video data from that video frame and using data of previously reconstructed frames. Video frames processed by the encoder portion 462 in this way can be referred to herein as inter-predicted frames (P-frames). Motion compensation can be used to determine the data of a current frame by describing how the pixels from a previously reconstructed frame move into new positions in the current frame, together with residual information.
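The idea of motion compensation plus residual information can be illustrated with a minimal sketch that fetches each pixel of the current frame from a displaced position in the previously reconstructed frame and then adds a residual. The integer-pixel displacement, the border clamping, and the function names are assumptions made for illustration; learned video codecs commonly use sub-pixel (for example, bilinear) warping with a dense flow field.

```python
def motion_compensate(prev, flow):
    """Predict a frame by sampling prev at (x + dx, y + dy) for each pixel.

    flow[y][x] holds an integer (dx, dy) displacement; out-of-range
    positions are clamped to the frame border (an assumed convention).
    """
    h, w = len(prev), len(prev[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(x + dx, 0), w - 1)
            sy = min(max(y + dy, 0), h - 1)
            row.append(prev[sy][sx])
        out.append(row)
    return out


def reconstruct_p_frame(prev, flow, residual):
    """Motion-compensated prediction plus residual."""
    pred = motion_compensate(prev, flow)
    return [[p + r for p, r in zip(pr, rr)] for pr, rr in zip(pred, residual)]


# Hypothetical 2x3 frame whose content shifts one pixel to the right.
prev = [[1, 2, 3], [4, 5, 6]]
flow = [[(-1, 0)] * 3 for _ in range(2)]
residual = [[0, 1, 0], [0, 0, 1]]
frame = reconstruct_p_frame(prev, flow, residual)
print(frame)
```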

As shown, the encoder portion 462 of the deep learning-based system 410 can include a neural network 463 and a quantizer 464. The neural network 463 can include one or more convolutional neural networks (CNNs), one or more fully-connected neural networks, one or more gated recurrent units (GRUs), one or more long short-term memory (LSTM) networks, one or more ConvRNNs, one or more ConvGRUs, one or more ConvLSTMs, one or more GANs, any combination thereof, and/or other types of neural network architectures that generate intermediate data 472. The intermediate data 472 is input to the quantizer 464. Examples of components that may be included in the encoder portion 462 are illustrated in FIGS. 6-10.

The quantizer 464 is configured to perform quantization and, in some cases, entropy coding of the intermediate data 472 to produce the output data 474. The output data 474 can include the quantized (and in some cases entropy coded) data. The quantization operations performed by the quantizer 464 can result in the generation of quantized codes (or data representing quantized codes generated by the deep learning-based system 410) from the intermediate data 472. The quantized codes (or data representing the quantized codes) can also be referred to as latent codes or as a latent (denoted as z). The entropy model that is applied to a latent can be referred to herein as a "prior". In some examples, the quantization and/or entropy coding operations can be performed using existing quantization and entropy coding operations that are performed when encoding and/or decoding video data according to existing video coding standards. In some examples, the quantization and/or entropy coding operations can be performed by the deep learning-based system 410. In one illustrative example, the deep learning-based system 410 can be trained using supervised training, with residual data being used as input and quantized codes and entropy codes being used as known outputs (labels) during the training.
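The relationship between quantized latents and a prior can be sketched as follows: intermediate values are rounded to integers, and the prior assigns each integer a probability from which an ideal code length of -log2(p) bits follows. Rounding to the nearest integer, the function names, and the probability table are assumptions for illustration; the disclosure does not specify the quantizer or the form of the prior.

```python
import math


def quantize_latents(y):
    """Quantize intermediate values to integers (assumed: nearest-integer rounding)."""
    return [round(v) for v in y]


def estimated_bits(z, prior):
    """Ideal code length of the latent z under a prior over integer symbols."""
    return sum(-math.log2(prior[s]) for s in z)


# Hypothetical intermediate data and a hypothetical prior (probabilities sum to 1).
y = [0.2, -1.4, 2.6, 0.1]
prior = {0: 0.5, -1: 0.125, 1: 0.125, 3: 0.25}
z = quantize_latents(y)
bits = estimated_bits(z, prior)
print(z, bits)
```

The better the prior matches the actual distribution of the latents, the fewer bits an entropy coder needs, which is why the entropy model is learned jointly with the encoder in many such systems.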

The decoder portion 466 of the deep learning-based system 410 is configured to receive the output data 474 (e.g., directly from the quantizer 464 and/or from the storage medium 414). The decoder portion 466 can process the output data 474 to generate a representation 476 of the input data 470 at least partially based on the output data 474. In some examples, the decoder portion 466 of the deep learning-based system 410 includes a neural network 468 that may include one or more CNNs, one or more fully-connected neural networks, one or more GRUs, one or more long short-term memory (LSTM) networks, one or more ConvRNNs, one or more ConvGRUs, one or more ConvLSTMs, one or more GANs, any combination thereof, and/or other types of neural network architectures. Examples of components that may be included in the decoder portion 466 are illustrated in FIGS. 6-10.

The processor 404 is configured to send the output data 474 to at least one of the transmission medium 418 or the storage medium 414. For example, the output data 474 may be stored at the storage medium 414 for later retrieval and decoding (or decompression) by the decoder portion 466 to generate the representation 476 of the input data 470 as reconstructed data. The reconstructed data can be used for various purposes, such as playback of the video data that was encoded/compressed to generate the output data 474. In some implementations, the output data 474 may be decoded at another decoder device that matches the decoder portion 466 (e.g., in the device 402, in the second device 490, or in another device) to generate the representation 476 of the input data 470 as reconstructed data. For instance, the second device 490 may include a decoder that matches (or substantially matches) the decoder portion 466, and the output data 474 may be transmitted via the transmission medium 418 to the second device 490. The second device 490 can process the output data 474 to generate the representation 476 of the input data 470 as reconstructed data.

The components of the system 400 can include and/or can be implemented using electronic circuits or other electronic hardware, which can include one or more programmable electronic circuits (e.g., microprocessors, graphics processing units (GPUs), digital signal processors (DSPs), central processing units (CPUs), and/or other suitable electronic circuits), and/or can include and/or be implemented using computer software, firmware, or any combination thereof, to perform the various operations described herein.

While the system 400 is shown to include certain components, one of ordinary skill will appreciate that the system 400 can include more or fewer components than those shown in FIG. 4. For example, the system 400 can also include, or can be part of a computing device that includes, an input device and an output device (not shown). In some implementations, the system 400 may also include, or can be part of a computing device that includes, one or more memory devices (e.g., one or more random access memory (RAM) components, read-only memory (ROM) components, cache memory components, buffer components, database components, and/or other memory devices), one or more processing devices (e.g., one or more CPUs, GPUs, and/or other processing devices) in communication with and/or electrically connected to the one or more memory devices, one or more wireless interfaces (e.g., including one or more transceivers and a baseband processor for each wireless interface) for performing wireless communications, one or more wired interfaces (e.g., a serial interface such as a universal serial bus (USB) input, a Lightning connector, and/or another wired interface) for performing communications over one or more hardwired connections, and/or other components that are not shown in FIG. 4.

In some implementations, system 400 can be implemented locally by and/or included in a computing device. For example, the computing device can include a mobile device, a personal computer, a tablet computer, a virtual reality (VR) device (e.g., a head-mounted display (HMD) or other VR device), an augmented reality (AR) device (e.g., an HMD, AR glasses, or other AR device), a wearable device, a server (e.g., a software as a service (SaaS) system or other server-based system), a television, and/or any other computing device with the resource capabilities to perform the techniques described herein.

In one example, the deep learning-based system 410 can be integrated into a portable electronic device that includes a memory 406 coupled to the processor 404 and configured to store instructions executable by the processor 404, and a wireless transceiver coupled to an antenna and to the processor 404 and operable to transmit the output data 474 to a remote device.

As noted above, deep learning-based systems are typically designed to process non-subsampled input formats, such as RGB or YUV 4:4:4. Examples of image and video coding schemes targeting RGB input are described in J. Balle, D. Minnen, S. Singh, S. J. Hwang, N. Johnston, "Variational image compression with a scale hyperprior", ICLR, 2018 (referred to as the "J. Balle Paper") and D. Minnen, J. Balle, G. Toderici, "Joint Autoregressive and Hierarchical Priors for Learned Image Compression", CVPR 2018 (referred to as the "D. Minnen Paper"), which are hereby incorporated by reference in their entirety and for all purposes.

FIG. 5 is a diagram illustrating an example of a deep learning-based system 500. The g_a and g_s sub-networks in the deep learning-based system of FIG. 5 correspond to an encoder sub-network (e.g., encoder portion 462) and a decoder sub-network (e.g., decoder portion 466), respectively. The g_a and g_s sub-networks of FIG. 5 are designed for three-channel RGB input, where all three R, G, and B input channels pass through and are processed by the same neural network layers (convolutional layers and generalized divisive normalization (GDN) layers). The neural network layers can include convolutional layers (including convolutional layer 510) that perform convolution operations, and GDN and/or inverse-GDN (IGDN) nonlinearity layers that implement local divisive normalization. Local divisive normalization is a type of transformation that has been shown to be particularly suitable for density modeling and compression of images. Deep learning-based systems (such as that shown in FIG. 5) target input channels with similar statistical characteristics, such as RGB data (where the statistical properties of the different R, G, and B channels are similar).

Although many deep learning-based systems are designed to process RGB input, most image and video coding systems use YUV input formats (e.g., in many cases the YUV 4:2:0 input format). The chrominance (U and V) channels of data in a YUV format can be subsampled with respect to the luminance (Y) channel. The subsampling has minimal impact on visual quality (e.g., no significant or noticeable impact on visual quality). Subsampled formats include the YUV 4:2:0 format, the YUV 4:2:2 format, and/or other YUV formats. The correlation across channels is reduced in the YUV format, which may not be the case for other color formats (e.g., the RGB format). Further, the statistics of the luminance (Y) and chrominance (U and V) channels are different. For example, the U and V channels have smaller variance than the luminance channel, whereas in RGB formats, for example, the statistical properties of the different R, G, and B channels are more similar. Video coders-decoders (or codecs) are designed according to the input characteristics of the data (e.g., a codec can encode and/or decode data according to the input format of the data). For example, if the chrominance channels of a frame are subsampled (e.g., the chrominance channels are half the resolution of the luminance channel), then when a codec predicts a block of the frame for motion compensation, the luminance block will be twice as large as the chrominance blocks in both width and height. In another example, the codec can determine, among other things, how many pixels will be encoded or decoded for chrominance and for luminance.

To support YUV formats (e.g., the YUV 4:2:0 format), deep learning-based architectures must be modified. For example, if RGB input data (which, as noted above, most deep learning-based systems are designed to process) is replaced with YUV 4:4:4 input data (where all channels have the same dimensions), the performance of the deep learning-based system processing the input data is reduced due to the different statistical characteristics of the luminance (Y) and chrominance (U and V) channels. As described above, the chrominance (U and V) channels are subsampled in some YUV formats, such as YUV 4:2:0. For example, for content in the YUV 4:2:0 format, the U and V channel resolution is half of the Y channel resolution (the U and V channels have one quarter of the size of the Y channel, because both the width and the height are halved). Such subsampling can cause the input data to be incompatible with the input of deep learning-based systems. The input data is the information that the deep learning-based system is attempting to encode and/or decode (e.g., a YUV frame that includes three channels, including the luminance (Y) channel and the chrominance (U and V) channels).
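The plane dimensions implied by the subsampling described above can be illustrated with a minimal sketch. The names `SUBSAMPLING` and `plane_shapes` are hypothetical and introduced here only for illustration, and the sketch assumes even frame dimensions:

```python
# Illustrative only: plane sizes for common YUV chroma formats.
# SUBSAMPLING and plane_shapes are hypothetical names, not part of the described system.
SUBSAMPLING = {
    "4:4:4": (1, 1),  # no chroma subsampling
    "4:2:2": (2, 1),  # chroma width halved
    "4:2:0": (2, 2),  # chroma width and height halved
}


def plane_shapes(width, height, chroma_format="4:2:0"):
    """Return the (width, height) of the Y, U and V planes (assumes even dimensions)."""
    fx, fy = SUBSAMPLING[chroma_format]
    chroma = (width // fx, height // fy)
    return {"Y": (width, height), "U": chroma, "V": chroma}


shapes = plane_shapes(1920, 1080, "4:2:0")
luma_samples = shapes["Y"][0] * shapes["Y"][1]
chroma_samples = shapes["U"][0] * shapes["U"][1]
print(shapes, luma_samples // chroma_samples)
```

For 4:2:0 content, each chroma plane holds one quarter of the luma samples, which is why a network expecting three equally sized input channels cannot consume such a frame directly.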

In some end-to-end video coding deep learning-based systems, autoencoders are used to code intra frames, motion vectors (e.g., dense optical flow), and the residual of the motion-compensated frames relative to the original frames. In one example, a flow autoencoder can be used to jointly learn to code the optical flow as well as a scale-space, and a residual autoencoder codes the residual between the warped prediction frame and the original frame, all in the RGB domain.

As noted above, systems and techniques are described herein that provide ML-based systems (e.g., including one or more deep learning-based architectures) that efficiently support one or more YUV formats (e.g., the YUV 4:2:0 format). The deep learning-based architecture(s) can encode and/or decode stand-alone frames (or images) and/or video data that includes multiple frames. For example, an ML-based system can obtain, as input, a luma component of a current frame and a luma component of a previously reconstructed frame, which may have been reconstructed by a previous instance of the ML-based system. The ML-based system can process the luma components of the current and previous frames to estimate motion information (e.g., flow information, such as optical flow information) for the luma component of the current frame. Using the luma component of the current frame, the ML-based system can determine (e.g., estimate) motion information (e.g., flow information, such as optical flow information) for one or more chroma components of the current frame. Such techniques can be performed for all components of a frame. Further details are described below.

FIG. 6 is a diagram illustrating an example of a neural network architecture of a deep learning-based system 600 configured to perform video coding. The neural network architecture of FIG. 6 includes an intra-prediction engine 602 and an inter-prediction engine 610. The intra-prediction engine 602 and the inter-prediction engine 610 can include autoencoders (e.g., variational autoencoders (VAEs)), as shown in FIG. 6, but can include other types of neural network architectures in other implementations. As shown, the intra-prediction engine 602 processes the pixel information of an input frame 604 to generate a latent representation of the input frame 604. The input frame 604 includes a luma component and two chroma components for each pixel of the input frame 604. The latent representation can also be referred to as a bitstream that includes a number of bits that are a coded version of the input frame 604. Based on the latent representation (or a latent representation/bitstream received from another device), a decoder sub-network of the intra-prediction engine 602 can generate a reconstructed frame 606, which is a reconstructed version of the input frame 604 (in FIG. 6, a "hat" over a component indicates a reconstructed value).

The inter-prediction engine 610 includes a flow engine 618, a residual engine 620, and a warping engine 622. As shown, the flow engine 618 obtains, as input, the luma component of a current frame 614 (at time t) and the reconstructed luma component of a previous frame 615 (at a previous time t-1). Using these two luma components, the flow engine 618 generates a latent representation of motion information (e.g., flow information) for the luma component of the current frame 614. The motion information can include optical flow information (e.g., a plurality of motion or displacement vectors and, in some cases, a scale component per pixel or sample) indicating the movement of pixels of the current frame 614 (at time t) relative to the previous frame 615 (at time t-1). The latent representation can also be referred to as a bitstream, and can include a number of bits representing a coded version of the luma component of the current frame 614. Because the flow engine 618 processes the luma component of the current frame 614, and not the chroma components, the latent representation (bitstream) is reduced in size compared to using all components of the current frame 614 to determine the motion information.

Using the latent representation of the luma component (or a bitstream received from another device representing a component of a frame), the flow engine 618 determines motion information (denoted as f^L) for the luma component of the current frame 614, and also determines motion information (denoted as f^C) for the chroma components of the current frame 614. Details of determining or estimating the motion information for the chroma components based on the motion information determined for the luma component are described below with reference to FIG. 7A and FIG. 7B.

The warping engine 622 is configured to perform warping using the motion information (f^L and f^C) determined for the luma component and the chroma components of the current frame 614 (at time t). For example, the warping engine 622 can warp pixels of the current frame 614 (at time t) by the amount indicated by the motion information (f^L and f^C) for the luma component and the chroma components of the current frame 614. In some aspects, the warping engine 622 can perform scale-space flow (SSF) warping. For example, SSF warping can apply trilinear interpolation to generate inter-frame predictions from learned scale-flow vectors, where the predictors can be formulated as:

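[predictor expressions for the luma and chroma components not reproduced in this text], and

Thus:

[expression not reproduced in this text] Equation (1)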

The trilinear interpolation above can be performed on a per-component basis (e.g., for the luma component and for each individual U and V chroma component) based on one or more warping parameters determined from the motion information (f^L and f^C) of the luma component and the chroma components. For example, the warping parameters can include v_x, which represents the horizontal component (in the x-direction) of a motion or displacement vector, v_y, which represents the vertical component (in the y-direction) of the motion or displacement vector, and a scale field, which represents a progressively smoothed version of the reconstructed frames that is combined with the spatial motion/displacement information (v_x and v_y).
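A minimal sketch of warping one component by trilinear interpolation is given below for illustration. It is not the formulation of Equation (1). It assumes that a scale-space volume has already been built (level 0 is the sharp reconstructed plane, higher levels are progressively smoother), that each output sample is read at (x + v_x, y + v_y, scale), and that coordinates are clamped at the borders. The function names, the sign convention, and the clamping are all assumptions introduced here.

```python
import math


def trilinear_sample(volume, x, y, z):
    """Sample volume[z][y][x] at fractional coordinates, clamping at the borders."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    z = min(max(z, 0.0), depth - 1.0)
    x0, y0, z0 = int(math.floor(x)), int(math.floor(y)), int(math.floor(z))
    x1 = min(x0 + 1, width - 1)
    y1 = min(y0 + 1, height - 1)
    z1 = min(z0 + 1, depth - 1)
    ax, ay, az = x - x0, y - y0, z - z0

    def bilinear(plane):
        # Interpolate along x on two rows, then along y.
        top = plane[y0][x0] * (1 - ax) + plane[y0][x1] * ax
        bottom = plane[y1][x0] * (1 - ax) + plane[y1][x1] * ax
        return top * (1 - ay) + bottom * ay

    # Interpolate along the scale axis between two neighbouring levels.
    return bilinear(volume[z0]) * (1 - az) + bilinear(volume[z1]) * az


def ssf_warp(volume, vx, vy, scale):
    """Warp one component; vx, vy and scale are per-sample 2D lists (assumed layout)."""
    height, width = len(vx), len(vx[0])
    return [
        [
            trilinear_sample(volume, col + vx[row][col], row + vy[row][col], scale[row][col])
            for col in range(width)
        ]
        for row in range(height)
    ]


# Two-level toy volume: the sharp plane and a fully smoothed plane.
volume = [[[0.0, 10.0], [20.0, 30.0]], [[15.0, 15.0], [15.0, 15.0]]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
half_x = [[0.5, 0.0], [0.0, 0.0]]
half_scale = [[0.0, 0.0], [0.0, 0.5]]
print(ssf_warp(volume, half_x, zeros, zeros))
print(ssf_warp(volume, zeros, zeros, half_scale))
```

Running the same routine once per component, with the luma parameters for the Y plane and the chroma parameters for the U and V planes, corresponds to the per-component operation described above.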

The output from the warping engine 622 (after warping is performed by the warping engine 622) includes the predictions shown in FIG. 6: a prediction for the luma component of the current frame 614 and a prediction for each of the two chroma components of the current frame 614.

The deep learning-based system 600 can then subtract the predictions from the corresponding luma component and chroma components of the current frame 614 to obtain residual signals, including a residual signal for the luma component and a residual signal for each of the two chroma components. The residual engine 620 can generate a latent representation of the residual. Using the latent representation of the residual (or a latent representation of a residual received from another device), the residual engine 620 can generate a reconstructed residual for the current frame, including a reconstructed residual signal for the luma component and a reconstructed residual signal for each of the two chroma components. The deep learning-based system 600 can add the predictions to the reconstructed residuals to generate a reconstructed frame 616.
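The residual path described above can be sketched as follows. The function `code_residual` is a hypothetical stand-in for the residual engine 620: it replaces the learned autoencoder with plain uniform quantization, purely to make the subtract, code, and add-back steps concrete. The sample values are invented for illustration.

```python
def subtract(a, b):
    """Element-wise a - b for two planes stored as lists of rows."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def add(a, b):
    """Element-wise a + b for two planes stored as lists of rows."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def code_residual(residual, step=2.0):
    """Hypothetical stand-in for the residual engine: uniform quantization only."""
    return [[step * round(v / step) for v in row] for row in residual]


# Toy 4:2:0-style frame: a 2x2 luma plane and 1x1 chroma planes (invented values).
current = {"Y": [[100.0, 104.0], [98.0, 97.0]], "U": [[60.0]], "V": [[70.0]]}
prediction = {"Y": [[99.0, 101.0], [98.0, 99.0]], "U": [[61.0]], "V": [[66.0]]}

# Residual = current component minus its prediction, per component.
residual = {c: subtract(current[c], prediction[c]) for c in current}
# Reconstructed residual after (lossy) coding.
residual_hat = {c: code_residual(residual[c]) for c in residual}
# Reconstructed frame = prediction plus reconstructed residual.
reconstructed = {c: add(prediction[c], residual_hat[c]) for c in current}
print(reconstructed)
```

With this stand-in, each reconstructed sample differs from the original by at most half the quantization step; a learned residual engine trades rate against distortion in a less direct way.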

FIG. 7A is a diagram illustrating an example of a flow engine 718 that operates on the luma component of a current frame (at time t) and the reconstructed luma component of a previous frame (at time t-1), shown collectively as luma components 722. As noted above, in some cases the flow engine 718 can be implemented as an autoencoder (VAE_flow). In some cases, a combined deep learning-based architecture can be designed, as shown in FIG. 7A, such that the flow engine 718 uses the luma components of both the current frame and the previously reconstructed frame to estimate luma motion information (e.g., SSF f^L) and chroma motion information (e.g., SSF f^C). For example, as described herein, the chroma motion information (e.g., f^C) can be derived based on the luma motion information (e.g., f^L).

As shown in FIG. 7A, to determine the motion information (f^L) for the luma component of the current frame, the luma component of the current frame and the reconstructed luma component of the previous frame are processed by several convolutional layers and activation layers (shown collectively as forward pass 723). The "↓2" and "↑2" notations in FIG. 7A refer to stride values, where ↓2 refers to a stride of 2 for downsampling (as indicated by the "↓") and ↑2 refers to a stride of 2 for upsampling (as indicated by the "↑"). For example, convolutional layer 724 downsamples the input luma components by a factor of 4 (a factor of 2 in each dimension) by applying a 5x5 convolutional filter in the horizontal and vertical dimensions with a stride value of 2. The resulting output of the convolutional layer 724 is N arrays of feature values (corresponding to N channels) representing the luma motion information (f^L) for the luma component of the current frame. The notation "2/N" indicates two input channels and N output channels. A nonlinear layer following the convolutional layer 724 can process the feature values output by the convolutional layer 724. Each of the successive convolutional layers and nonlinear layers can process the features output by the previous layer, until a final convolutional layer 725 of the forward pass 723 outputs features to a bottleneck portion 726 of the flow engine 718.
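The effect of a 5x5 convolution with a stride of 2 on the spatial size of its input can be checked with the standard output-size formula. The padding of 2 and the channel count N = 128 used below are assumptions chosen for illustration; the text does not state either value.

```python
def conv_output_size(size, kernel=5, stride=2, padding=2):
    """Output length along one dimension for a strided convolution.

    padding=2 is an assumption (it keeps the output at half the input size).
    """
    return (size + 2 * padding - kernel) // stride + 1


def feature_shape(height, width, out_channels):
    """Shape (channels, height, width) after one 5x5, stride-2 convolution."""
    return (out_channels, conv_output_size(height), conv_output_size(width))


# N = 128 is a hypothetical channel count; the "2/N" notation leaves N open.
shape = feature_shape(64, 64, out_channels=128)
reduction = (64 * 64) // (shape[1] * shape[2])
print(shape, reduction)
```

Halving both the width and the height is what yields the overall factor of 4 in the number of spatial samples per channel.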

The output of the forward pass 723 is processed by the bottleneck portion 726 of the flow engine 718 to generate a bitstream or latent representing the luma motion information (f^L) for the luma component of the current frame. The bottleneck portion 726 can include a quantization engine and an entropy encoding engine in the forward pass 723, and an entropy decoding engine and a dequantization engine in a backward pass 728 of the flow engine 718. For example, the quantization engine can perform quantization on the features output by the final convolutional layer 725 of the forward pass 723 to generate a quantized output. The entropy encoding engine can entropy encode the quantized output from the quantization engine to generate the bitstream. In some cases, the entropy encoding engine can use a prior generated by a hyperprior network to perform the entropy encoding. The neural network system can output the bitstream for storage, for transmission to another device, to a server device or system, and/or can otherwise output the bitstream.
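The bottleneck steps can be illustrated with a small sketch. It is not an entropy coder: it rounds features to integers and then computes the ideal code length, in bits, that an entropy coder would approach under a given probability model. The fixed `prior` table is a hypothetical stand-in for a prior produced by a hyperprior network, and all names and values are introduced here for illustration.

```python
import math


def quantize(features):
    """Round each feature to the nearest integer symbol."""
    return [round(v) for v in features]


def dequantize(symbols):
    """Map integer symbols back to floating-point features."""
    return [float(s) for s in symbols]


def ideal_code_length_bits(symbols, prior):
    """Sum of -log2 p(symbol): the length an ideal entropy coder would approach."""
    return sum(-math.log2(prior[s]) for s in symbols)


features = [0.2, 0.9, -0.1, -1.2]            # invented bottleneck features
symbols = quantize(features)
prior = {-1: 0.25, 0: 0.5, 1: 0.25}          # hypothetical prior over symbols
bits = ideal_code_length_bits(symbols, prior)
recovered = dequantize(symbols)
print(symbols, bits, recovered)
```

A sharper prior (one that assigns higher probability to the symbols that actually occur) lowers the code length, which is the motivation for conditioning the prior on side information.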

The backward pass 728 can, in some cases, be a decoder sub-network of the neural network system of the flow engine 718, or a decoder sub-network of the neural network system of another flow engine (of another device). The entropy decoding engine of the flow engine 718 can entropy decode the bitstream output by the entropy encoding engine of the bottleneck 726 (or by an entropy encoding engine of the bottleneck of another flow engine), and can output the entropy decoded data to the dequantization engine of the backward pass 728. The entropy decoding engine can use a prior generated by a hyperprior network to perform the entropy decoding. The dequantization engine can dequantize the data.

The convolutional layers and inverse activation layers of the backward pass 728 can then process the dequantized data from the bottleneck 726 to generate motion information 729 (f^L) for the luma component of the current frame. The motion information 729 (f^L) can include motion vectors (e.g., having a magnitude in the horizontal or x-direction and a magnitude in the vertical or y-direction), such as a motion vector for each sample of the luma component of the current frame. In some cases, the motion information 729 (f^L) can further include a scale component. For example, as shown in FIG. 7A for illustrative purposes, the motion information 729 includes three components, corresponding to the horizontal component, the vertical component, and the scale component. As noted above, these components can be used in Equation (1) by the warping engine 622 to warp pixels of the current frame 614 (at time t) to generate the predictions.

After learning the motion information 729 (f^L) for the luma component of the current frame, the flow engine 718 can determine or predict motion information 731 (f^C) for the chroma components of the current frame. For example, the flow engine 718 can subsample the motion information 729 (f^L) for the luma component to obtain the motion information 731 (f^C) for the chroma components. The motion information 731 (f^C) for the chroma components can include motion vectors (e.g., having a magnitude in the horizontal or x-direction and a magnitude in the vertical or y-direction), such as a motion vector for each sample of the chroma components of the current frame. In some cases, the motion information 731 (f^C) can further include a scale component. For example, as shown in FIG. 7A, the motion information 731 (f^C) for the chroma components of the current frame includes three components, corresponding to the horizontal component, the vertical component, and the scale component. Similar to the motion information 729 (f^L) for the luma component, the components of the chroma motion information 731 (f^C) can be used in Equation (1) by the warping engine 622 to warp pixels of the current frame 614 (at time t) to generate the predictions.

In some aspects, a convolutional layer 730 with downsampling can be trained (e.g., using unsupervised learning or training) to learn the motion information 731 (f^C) for the chroma components of the current frame based on the motion information 729 (f^L) for the luma component of the current frame. In one illustrative example, a training set that can be used to train the flow engine 718 can include luma and chroma motion information (as ground truth). The luma motion information can be input to the neural network of the flow engine 718, and the difference between the resulting chroma motion information output from the flow engine 718 and the ground truth chroma motion information can be minimized using a loss function (e.g., L1 or sum of absolute differences, L2 norm or sum of squared differences, or another loss function).
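The two loss functions named above can be written out for a single channel of chroma motion information. This is a sketch of the loss computation only, not of the training procedure; the function names and the sample values are invented for illustration.

```python
def l1_loss(predicted, target):
    """Sum of absolute differences between two 2D fields."""
    return sum(
        abs(p - t)
        for row_p, row_t in zip(predicted, target)
        for p, t in zip(row_p, row_t)
    )


def l2_loss(predicted, target):
    """Sum of squared differences between two 2D fields."""
    return sum(
        (p - t) ** 2
        for row_p, row_t in zip(predicted, target)
        for p, t in zip(row_p, row_t)
    )


# Invented 2x2 example: predicted versus ground-truth horizontal chroma motion.
predicted_vx = [[1.0, 2.0], [0.0, -1.0]]
ground_truth_vx = [[0.0, 4.0], [0.0, 1.0]]
print(l1_loss(predicted_vx, ground_truth_vx), l2_loss(predicted_vx, ground_truth_vx))
```

The L2 form penalizes large individual errors more heavily than the L1 form, which can matter when a few motion vectors are badly mispredicted.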

The convolutional layer 730 is denoted in FIG. 7A as |3/3|5x5 conv↓2|. The notation "3/3" indicates that there are three input channels resulting in three output channels. As noted above, the "↓2" and "↑2" notations refer to stride values, where ↓2 refers to a stride of 2 for downsampling (as indicated by the "↓") and ↑2 refers to a stride of 2 for upsampling (as indicated by the "↑"). For example, the convolutional layer 730 downsamples the motion information 729 (f^L) for the luma component by a factor of 4 (a factor of 2 in each dimension, e.g., for the YUV 4:2:0 format) by applying a 5x5 convolutional filter in the horizontal and vertical dimensions with a stride value of 2. In some examples, the convolutional layer 730 can be trained to downsample by other factors for other formats (e.g., the YUV 4:2:2 format, etc.). The resulting output of the convolutional layer 730 is a 3x3 array of feature values (corresponding to three channels), which is a downsampled version of the motion information 729 (f^L) for the luma component.

In other aspects (not shown in FIG. 7A), the motion information 731 (f^C) for the chroma components of the current frame can be obtained by directly subsampling the motion information 729 (f^L) for the luma component. For example, the flow engine 718 can determine the chroma flow without using the convolutional layer 730 to process the luma flow(s). In one illustrative example, instead of the convolutional layer 730, the flow engine 718 can include a subsampler (which can be separate from the neural network of the flow engine 718) that can directly subsample the luma motion information 729 (f^L) to obtain the chroma motion information 731 (f^C).

FIG. 7B is a diagram illustrating an example of a subsampling engine 735 for subsampling luma motion information determined for a current frame (e.g., using the flow engine 718 of FIG. 7A) to obtain chroma motion information for the current frame. For illustrative purposes, the simplified example shows each of the N channels (N = 2) of luma motion information 732 with a resolution of 4x4 (four rows and four columns), for a total of 16 flow motion or displacement vectors. The subsampling engine 735 subsamples or downsamples the luma motion information 732 to generate or obtain chroma motion information 738, which is a subsampled/downsampled version of the luma motion information 732.
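
Direct subsampling of a 4x4 luma flow field into a 2x2 chroma flow field can be sketched as below. The helper name `subsample_flow`, the choice of keeping every second vector, and the sample values are assumptions for illustration. The text does not state whether the displacement magnitudes are rescaled for the chroma resolution, so this sketch leaves them unchanged.

```python
def subsample_flow(flow, step=2):
    """Keep every `step`-th vector in each row and column of a flow field."""
    return [row[::step] for row in flow[::step]]


# Hypothetical 4x4 luma flow field; each entry is a (vx, vy) displacement,
# i.e. the two channels (N = 2) stored together, giving 16 vectors in total.
luma_flow = [[(x, y) for x in range(4)] for y in range(4)]

# 2x2 chroma flow field, one quarter of the luma field's size (4:2:0 case).
chroma_flow = subsample_flow(luma_flow)
```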

The illustrative example of FIG. 7B shows chroma motion information 738 that is one quarter the size of the luma motion information 732. For example, as previously described, for content having the YUV 4:2:0 format, the U and V channel resolution is half the Y channel resolution in each dimension (because the width and height are both halved, the U and V channels are one quarter the size of the Y channel). The subsampling engine 735 can be trained or otherwise configured to handle formats other than the 4:2:0 format, in which case the subsampling can include generating chroma information with resolutions different from that shown in FIG. 7A.

In some aspects, as described above, the subsampling engine 735 can include the convolutional layer 730 of FIG. 7A, which can be trained (e.g., using unsupervised learning or training) to determine the chroma motion information 738 from the luma motion information 732. In other aspects, the subsampling engine 735 can include a subsampler that directly subsamples the luma motion information 732 to obtain the chroma motion information 738.

The number of channels (denoted N in FIG. 7A) in the convolutional or transform layers of the forward pass 723 and the backward pass 728, as well as the number of channels at the bottleneck (M), can be set to any suitable value. In one illustrative example, the channel counts can be selected as N = 192 and M = 128. Successively smoothed versions of the reconstructed frames (associated with the scale field s) can be obtained by using a filtering or smoothing operator. In one example, Gaussian blurring filters with different widths can be used. In another example, a Gaussian pyramid with successive filtering and interpolation can be used to generate the smoothed versions of the reconstructed frames. In addition, an arbitrarily large number of scales (S) can be used. In one example, the number of scales can be set to S = 3, and the scale levels can be selected in terms of a Gaussian filter width.
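
A stack of successively smoothed versions of a signal can be sketched with Gaussian filters of increasing width. This is a one-dimensional illustration under stated assumptions: the filter widths in `widths`, the kernel radius of three standard deviations, and the edge-replication rule are all hypothetical choices, and a real system would filter two-dimensional reconstructed frames.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_row(row, sigma):
    """Blur a 1-D row with a Gaussian of width sigma (sigma == 0 means no blur)."""
    if sigma == 0:
        return list(row)
    radius = max(1, int(math.ceil(3 * sigma)))
    kernel = gaussian_kernel(sigma, radius)
    out = []
    for i in range(len(row)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), len(row) - 1)  # replicate edge samples
            acc += w * row[j]
        out.append(acc)
    return out


row = [0.0, 0.0, 10.0, 0.0, 0.0]   # hypothetical row of a reconstructed frame
widths = [0.0, 1.0, 2.0]           # S = 3 scale levels; the widths are assumptions
stack = [blur_row(row, sigma) for sigma in widths]
```

Each entry of `stack` is a progressively smoother copy of the input, so a scale value can later select how much blur to apply at a given position.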

The nonlinear activation layers in FIG. 7A are shown as PReLUs for illustrative purposes, but other types of nonlinear activation layers can be used, such as generalized divisive normalization (GDN) layers or a combination of PReLU and GDN layers.

In some examples, to effectively support one or more YUV formats (e.g., YUV 4:2:0), the intra prediction engine 602 and the residual engine 620 of FIG. 6 can be designed based on the general neural network architectures shown in FIG. 8A, FIG. 9, or FIG. 10. For example, the architectures shown in FIG. 8A, FIG. 9, and FIG. 10 can be configured to process input data having the YUV 4:2:0 format. In some examples, a neural network architecture similar to those shown in FIG. 8A, FIG. 9, or FIG. 10 can be used to encode and/or decode other types of YUV content (e.g., content having the YUV 4:4:4 format, the YUV 4:2:2 format, etc.) and/or content having other input formats. In some cases, each of the architectures shown in FIG. 8A, FIG. 9, and FIG. 10 includes a residual autoencoder that operates on a YUV (e.g., 4:2:0) residual.

FIG. 8A is a diagram illustrating an example of a front-end neural network system 800 that can be configured to operate directly on 4:2:0 input (Y, U, and V) data. As shown in FIG. 8A, in the encoder sub-network (also referred to as the forward pass) of the neural network system, the branched luma and chroma channels (luma Y channel 802 and U and V chroma channels 804) are combined using a 1x1 convolutional layer 806, after which a nonlinear layer 808 (also referred to as a nonlinear operator) is applied. Similar operations are performed in the decoder sub-network (also referred to as the backward pass) of the neural network system, but in reverse order. For example, as shown in FIG. 8A, an inverse nonlinear layer 809 (also referred to as an inverse nonlinear operator) is applied, the Y and U, V channels are separated using a 1x1 convolutional layer 813, and the separate Y and U, V channels are processed using respective inverse nonlinear layers 815, 816 and convolutional layers 817, 818.

The first two neural network layers of the encoder sub-network of the neural network system 800 of FIG. 8A include a first convolutional layer 811 (denoted Nconv |3x3|↓1), a second convolutional layer 810 (denoted Nconv |5x5|↓2), a first nonlinear layer 814, and a second nonlinear layer 812. The last two neural network layers of the decoder sub-network of the front-end neural network architecture of FIG. 8A include a first inverse nonlinear layer 816, a second inverse nonlinear layer 815, a first convolutional layer 818 (denoted 2conv |3x3|↑1) for generating the reconstructed chrominance (U and V) components of the frame, and a second convolutional layer 817 (denoted 1conv |5x5|↑2) for generating the reconstructed luminance (Y) component of the frame. The "Nconv" notation refers to the number of output channels (N) of a given convolutional layer, which corresponds to the number of output features (the value of N defines the number of output channels). The 3x3 and 5x5 notations indicate the sizes of the respective convolutional kernels (e.g., a 3x3 kernel and a 5x5 kernel). The "↓1" and "↓2" notations refer to stride values, where ↓1 refers to a stride of 1 (for downsampling, as indicated by "↓") and ↓2 refers to a stride of 2 (for downsampling). The "↑1" and "↑2" notations refer to stride values, where ↑1 refers to a stride of 1 (for upsampling, as indicated by "↑") and ↑2 refers to a stride of 2 (for upsampling).

For example, the convolutional layer 810 downsamples the input luma channel 802 by a factor of 4 by applying a 5x5 convolutional filter with a stride of 2 in the horizontal and vertical dimensions. The resulting output of the convolutional layer 810 is N arrays of feature values (corresponding to N channels). The convolutional layer 811 processes the input chroma (U and V) channels 804 by applying a 3x3 convolutional filter with a stride of 1 in the horizontal and vertical dimensions. The resulting output of the convolutional layer 811 is N arrays of feature values (corresponding to N channels). The arrays of feature values output by the convolutional layer 810 have the same dimensions as the arrays of feature values output by the convolutional layer 811. The nonlinear layer 812 can then process the feature values output by the convolutional layer 810, and the nonlinear layer 814 can process the feature values output by the convolutional layer 811.
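
The claim that both branches produce arrays of the same dimensions can be checked with the standard convolution output-size formula. The plane sizes and the padding values below are assumptions chosen for the example ("same"-style padding is not stated in the text).

```python
def conv_output_size(size, kernel, stride, padding):
    """Output length of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical 4:2:0 frame: 16x16 luma plane, 8x8 chroma planes.
luma_shape = (16, 16)
chroma_shape = (8, 8)

# Luma branch: 5x5 kernel, stride 2 (padding of 2 is an assumption).
luma_out = tuple(conv_output_size(n, kernel=5, stride=2, padding=2)
                 for n in luma_shape)

# Chroma branch: 3x3 kernel, stride 1 (padding of 1 is an assumption).
chroma_out = tuple(conv_output_size(n, kernel=3, stride=1, padding=1)
                   for n in chroma_shape)
```

Under these assumptions both branches yield 8x8 feature arrays, which is what allows the later 1x1 convolutional layer to mix luma and chroma features value by value.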

The 1x1 convolutional layer 806 can then process the feature values output by the nonlinear layers 812, 814. The 1x1 convolutional layer 806 can generate a linear combination of the features associated with the luma channel 802 and the chroma channels 804. The linear combination operation acts as a per-value cross-channel mixing of the Y and UV components, resulting in cross-component (e.g., cross-luminance and chrominance component) prediction that improves coding performance. Each 1x1 convolutional filter of the 1x1 convolutional layer 806 can include a respective scaling factor that is applied to the corresponding Nth channel of the luma channel 802 and the corresponding Nth channel of the chroma channels 804.

FIG. 8B is a diagram illustrating an example operation of a 1x1 convolutional layer 838. As described above, N represents the number of output channels. As shown in FIG. 8B, 2N channels, including an N-channel chroma (combined U and V) output 832 and an N-channel luma (Y) output 834, are provided as input to the 1x1 convolutional layer 838. In the example of FIG. 8B, the value of N is equal to 2, indicating two channels of values for the N-channel chroma output 832 and two channels of values for the N-channel luma output 834. Referring to FIG. 8A, the N-channel chroma output 832 can be the output from the nonlinear layer 814, and the N-channel luma output 834 can be the output from the nonlinear layer 812.

The 1x1 convolutional layer 838 processes the 2N channels, performs a feature-wise linear combination of the 2N channels, and then outputs an N-channel set of features or coefficients. The 1x1 convolutional layer 838 includes two 1x1 convolutional filters (based on N = 2). The first 1x1 convolutional filter is shown with a value s1, and the second 1x1 convolutional filter is shown with a value s2. The s1 value represents a first scaling factor and the s2 value represents a second scaling factor. In one illustrative example, s1 is equal to 3 and s2 is equal to 4. Each of the 1x1 convolutional filters of the 1x1 convolutional layer 838 has a stride of 1, indicating that the scaling factors s1 and s2 are applied to every value in the UV output 832 and the Y output 834.

For example, the scaling factor s1 of the first 1x1 convolutional filter is applied to each value in the first channel (C1) of the UV output 832 and to each value in the first channel (C1) of the Y output 834. Once each value of the first channel (C1) of the UV output 832 and each value of the first channel (C1) of the Y output 834 has been scaled by the scaling factor s1 of the first 1x1 convolutional filter, the scaled values are combined into the first channel (C1) of the output values 839. The scaling factor s2 of the second 1x1 convolutional filter is applied to each value in the second channel (C2) of the UV output 832 and to each value in the second channel (C2) of the Y output 834. After each value of the second channel (C2) of the UV output 832 and each value of the second channel (C2) of the Y output 834 has been scaled by the scaling factor s2 of the second 1x1 convolutional filter, the scaled values are combined into the second channel (C2) of the output values 839. As a result, the four Y and UV channels (two Y channels and two combined UV channels) are mixed or combined into two output channels (C1 and C2).
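
The channel mixing of FIG. 8B can be sketched with the example factors s1 = 3 and s2 = 4. The function name `mix_channels` and the 2x2 feature values are hypothetical, and the sketch assumes that "combined" means a per-value sum of the scaled UV and Y values.

```python
def mix_channels(uv, y, scales):
    """Scale matching UV and Y channels by one factor each and sum them per value."""
    mixed = []
    for uv_ch, y_ch, s in zip(uv, y, scales):
        mixed.append([[s * u + s * v for u, v in zip(uv_row, y_row)]
                      for uv_row, y_row in zip(uv_ch, y_ch)])
    return mixed


# Hypothetical 2x2 feature maps with N = 2 channels per branch.
uv_out = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]   # stands in for UV output 832
y_out = [[[5, 6], [7, 8]], [[2, 2], [2, 2]]]    # stands in for Y output 834

# s1 = 3 acts on the C1 channels and s2 = 4 on the C2 channels.
out = mix_channels(uv_out, y_out, scales=[3, 4])
```

Four input channels (two UV, two Y) enter the function and two mixed channels leave it, mirroring the 2N-to-N reduction described above.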

Returning to FIG. 8A, the output of the 1x1 convolutional layer 806 is processed by additional nonlinear layers and additional convolutional layers of the encoder sub-network. The bottleneck 820 can include a quantization engine and an entropy encoding engine on the encoder sub-network (or forward pass) side, and an entropy decoding engine and a dequantization engine on the decoder sub-network (or backward pass) side. The quantization engine can perform quantization on the features output by the final neural network layer 819 of the encoder sub-network to generate a quantized output. The entropy encoding engine can entropy encode the quantized output from the quantization engine to generate a bitstream. In some cases, the entropy encoding engine can use a prior generated by a hyperprior network to perform the entropy encoding. The neural network system can output the bitstream for storage, for transmission to another device, to a server device or system, and/or can otherwise output the bitstream.

The decoder sub-network of the neural network system, or a decoder sub-network of another neural network system (of another device), can decode the bitstream. The entropy decoding engine of the bottleneck 820 (of the decoder sub-network) can entropy decode the bitstream and output the entropy-decoded data to the dequantization engine of the decoder sub-network. The entropy decoding engine can use a prior generated by a hyperprior network to perform the entropy decoding. The dequantization engine can dequantize the data. The dequantized data can be processed by multiple convolutional layers and multiple inverse nonlinear layers of the decoder sub-network.
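
The quantization and dequantization steps at the bottleneck can be illustrated with a uniform scalar quantizer. The text does not specify the quantizer, so the round-to-nearest rule, the `step` parameter, and the sample feature values are assumptions for this sketch. Entropy coding is omitted.

```python
import math


def quantize(features, step=1.0):
    """Map each feature to the index of the nearest multiple of `step`."""
    return [int(math.floor(f / step + 0.5)) for f in features]


def dequantize(indices, step=1.0):
    """Map quantization indices back to reconstructed feature values."""
    return [i * step for i in indices]


latent = [0.2, 1.6, -2.4]         # hypothetical encoder output features
indices = quantize(latent)        # integers that would be entropy encoded
restored = dequantize(indices)    # values passed on to the decoder layers
```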

After processing by several convolutional and nonlinear layers, the 1x1 convolutional layer 813 can process the data output by the final inverse nonlinear layer 809. The 1x1 convolutional layer 813 can include 2N convolutional filters that can split the data into Y channel features and combined UV channel features. For example, each of the N channels output by the inverse nonlinear layer 809 can be processed using the 2N 1x1 convolutions of the 1x1 convolutional layer 813 (resulting in scaling). For each scaling factor n_i corresponding to an output channel (out of a total of 2N output channels) and applied to the N input channels, the decoder sub-network can perform a summation across the N input channels, resulting in 2N outputs. In one illustrative example, for a scaling factor n1, the decoder sub-network can apply the scaling factor n1 to the N input channels and sum the results, which yields one output channel. The decoder sub-network can perform this operation for 2N different scaling factors (e.g., scaling factor n1, scaling factor n2, through scaling factor n2N).
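
The scale-and-sum step that expands N channels into 2N channels can be sketched as follows. The function name `split_channels`, the sample values, and the convention that the first N outputs carry the Y features while the last N carry the combined UV features are assumptions for illustration.

```python
def split_channels(inputs, factors):
    """Produce one output channel per factor: scale every input channel and sum."""
    h, w = len(inputs[0]), len(inputs[0][0])
    outputs = []
    for n_i in factors:
        outputs.append([[sum(n_i * ch[r][c] for ch in inputs) for c in range(w)]
                        for r in range(h)])
    return outputs


# Hypothetical N = 2 input channels, each a 1x2 feature map.
inputs = [[[1, 2]], [[3, 4]]]

# 2N = 4 hypothetical scaling factors n1..n4.
factors = [1, 2, 3, 4]

outputs = split_channels(inputs, factors)
n = len(inputs)
y_features, uv_features = outputs[:n], outputs[n:]   # assumed channel ordering
```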

The Y channel features output by the 1x1 convolutional layer 813 can be processed by the inverse nonlinear layer 815. The combined UV channel features output by the 1x1 convolutional layer 813 can be processed by the inverse nonlinear layer 816. The convolutional layer 817 can process the Y channel features and output a reconstructed Y channel per pixel or sample of the reconstructed frame (e.g., luminance samples or pixels), shown as the reconstructed Y component 824. The convolutional layer 818 can process the combined UV channel features and output a reconstructed U channel per pixel or sample of the reconstructed frame (e.g., chrominance-blue samples or pixels) and a reconstructed V channel per pixel or sample of the reconstructed frame (e.g., chrominance-red samples or pixels), shown as the reconstructed U and V components 825.

In some examples, different variants of the architecture of FIG. 8A with different nonlinear operators can be used as the intra prediction engine 602 and the residual engine 620. For example, FIG. 9 and FIG. 10 are diagrams illustrating variants of the front-end architecture of FIG. 8A configured to process data having a YUV format (e.g., YUV 4:2:0 input data with Y, U, and V components). In the neural network system 900 of FIG. 9, on the encoder side, the branched luma and chroma channels are combined using a 1x1 convolutional layer (similar to that of FIG. 8A), after which a GDN nonlinear operator is applied. In the neural network system 1000 of FIG. 10, on the encoder side, the branched luma and chroma channels are combined using a 1x1 convolutional layer (similar to that of FIG. 8A), after which a PReLU nonlinear operator is applied. In one example, both VAE_res and VAE_intra can use the variant shown in FIG. 9. In another example, both VAE_res and VAE_intra can use the variant of FIG. 10. In another example, VAE_res can use the variant of FIG. 9 and VAE_intra can use the variant of FIG. 10. In another example, VAE_intra can use the variant of FIG. 9 and VAE_res can use the variant of FIG. 10.

FIG. 11 is a flow diagram illustrating an example of a process 1100 for processing video data. At block 1102, the process 1100 includes obtaining, by a machine learning system, input video data. The input video data includes at least one luminance component for a current frame (e.g., the luma component of the current frame (at time t) in FIG. 7A). In some cases, the input video data includes at least one luminance component for a previously reconstructed frame (e.g., the reconstructed luma component of the previous frame (at time t-1) in FIG. 7A), which can be referred to as at least one reconstructed luminance component. In some aspects, the current frame includes a video frame. In some cases, the one or more chrominance components include at least one chrominance-blue component and at least one chrominance-red component. In some aspects, the current frame has a luminance-chrominance (YUV) format. In some cases, the YUV format is the YUV 4:2:0 format.

At block 1104, the process 1100 includes determining, by the machine learning system, motion information for the at least one luminance component of the current frame and motion information for one or more chrominance components of the current frame using the at least one luminance component for the current frame. In some aspects, the process 1100 can include determining the motion information for the at least one luminance component of the current frame based on the at least one luminance component of the current frame and at least one reconstructed luma component of a previous frame. In some cases, the process 1100 can further include determining the motion information for the one or more chrominance components of the current frame using the motion information determined for the at least one luminance component of the current frame. In some cases, the motion information for the one or more chrominance components of the current frame is determined using a convolutional layer of the machine learning system. For example, referring to FIG. 7A as an illustrative example, the flow engine 718 can estimate luma motion information (e.g., SSF f^L) and chroma motion information (e.g., SSF f^C) for the current frame using the luma components of both the current frame and the previously reconstructed frame. As noted above, the chroma motion information (e.g., f^C) 731 can be derived based on the luma motion information (e.g., f^L) 729 using the convolutional layer 730. In some cases, the motion information for the one or more chrominance components of the current frame is determined at least in part by sampling the motion information determined for the at least one luminance component of the current frame.

In some aspects, the process 1100 includes determining, by the machine learning system, a warping parameter for the at least one luminance component of the current frame and one or more warping parameters for the one or more chrominance components of the current frame, using the motion information for the at least one luminance component of the current frame and the motion information for the one or more chrominance components of the current frame. In some aspects, the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame include space-scale flow (SSF) warping parameters. In some cases, the space-scale flow (SSF) warping parameters include learned scale-flow vectors. Referring to FIG. 6 as an illustrative example, the warping parameters can include vx, which represents the horizontal component (in the x-direction) of a motion or displacement vector; vy, which represents the vertical component (in the y-direction) of the motion or displacement vector; and s (referred to as the scale field), which represents progressively smoothed versions of the reconstructed frames that are combined with the spatial motion/displacement information (vx and vy).

The process 1100 can further include determining one or more inter-frame predictions for the current frame (e.g., the predictors of FIG. 6) using the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame. In some cases, the one or more inter-frame predictions are determined at least in part by applying an interpolation operation using the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame. In one illustrative example, the interpolation operation includes a trilinear interpolation operation.
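
Trilinear interpolation over a stack of progressively smoothed frames can be sketched as follows. The names `trilinear_sample` and `warp_pixel`, the volume layout `volume[scale][row][column]`, the edge clamping, the sign convention of the displacement, and the sample values are all assumptions made for the illustration rather than details taken from the described system.

```python
import math


def trilinear_sample(volume, x, y, z):
    """Sample volume[z][y][x] at fractional coordinates, clamping to the edges."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])

    def split(coord, size):
        # Return the two neighbouring indices and the fractional weight.
        coord = min(max(coord, 0.0), size - 1.0)
        lo = min(int(math.floor(coord)), size - 2) if size > 1 else 0
        return lo, min(lo + 1, size - 1), coord - lo

    def lerp(a, b, t):
        return a + (b - a) * t

    x0, x1, fx = split(x, width)
    y0, y1, fy = split(y, height)
    z0, z1, fz = split(z, depth)

    def plane(z_idx):
        # Bilinear interpolation inside one scale level.
        top = lerp(volume[z_idx][y0][x0], volume[z_idx][y0][x1], fx)
        bottom = lerp(volume[z_idx][y1][x0], volume[z_idx][y1][x1], fx)
        return lerp(top, bottom, fy)

    # Linear interpolation between the two neighbouring scale levels.
    return lerp(plane(z0), plane(z1), fz)


def warp_pixel(volume, px, py, vx, vy, s):
    """Predict one sample by displacing (px, py) by (vx, vy) at scale s."""
    return trilinear_sample(volume, px + vx, py + vy, s)


# Hypothetical two-level scale stack of a 2x2 reconstructed plane.
volume = [
    [[0.0, 10.0], [20.0, 30.0]],    # scale level 0: unsmoothed
    [[15.0, 15.0], [15.0, 15.0]],   # scale level 1: heavily smoothed
]
half_pixel = warp_pixel(volume, 0, 0, vx=0.5, vy=0.5, s=0.0)
half_scale = warp_pixel(volume, 0, 0, vx=0.0, vy=0.0, s=0.5)
```

The first call interpolates spatially between four neighbouring samples, and the second blends the sharp and smoothed versions of the same sample, which is how a scale value can soften a prediction where the motion is uncertain.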

In some examples, the processes described herein can be performed by a computing device or apparatus, such as a computing device having the computing device architecture 1200 shown in FIG. 12. In one example, the process or processes can be performed by a computing device having the computing device architecture 1200 that implements the neural network architecture shown in FIG. 6 and/or any one or more of the neural network architectures shown in FIG. 7A, FIG. 7B, FIG. 8A, FIG. 9, and/or FIG. 10. In some examples, the computing device can include or be part of a mobile device (e.g., a mobile phone, a tablet computing device, etc.), a wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a video server, a television, a vehicle (or a computing device of a vehicle), a robotic device, and/or any other computing device with the resource capabilities to perform the processes described herein.

In some cases, the computing device or apparatus can include various components, such as one or more input devices, one or more output devices, one or more processors, one or more microprocessors, one or more microcomputers, one or more transmitters, receivers, or combined transmitter-receivers (e.g., referred to as transceivers), one or more cameras, one or more sensors, and/or other component(s) that are configured to carry out the steps of the processes described herein. In some examples, the computing device can include a display, a network interface configured to communicate and/or receive data, any combination thereof, and/or other component(s). The network interface can be configured to communicate and/or receive Internet Protocol (IP) based data or other types of data.

The components of the computing device can be implemented in circuitry. For example, the components can include and/or be implemented using electronic circuits or other electronic hardware, which can include one or more programmable electronic circuits (e.g., microprocessors, graphics processing units (GPUs), digital signal processors (DSPs), central processing units (CPUs), neural processing units (NPUs), and/or other suitable electronic circuits), and/or can include and/or be implemented using computer software, firmware, or any combination thereof, to perform the various operations described herein.

The processes described herein are illustrated as logical flow diagrams, the operations of which represent a sequence of operations that can be implemented in hardware, computer instructions, or a combination thereof. In the context of computer instructions, the operations represent computer-executable instructions stored on one or more computer-readable storage media that, when executed by one or more processors, perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes.

Additionally, the processes described herein may be performed under the control of one or more computer systems configured with executable instructions and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or by combinations thereof. As noted above, the code may be stored on a computer-readable or machine-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. The computer-readable or machine-readable storage medium may be non-transitory.

FIG. 12 illustrates an example computing device architecture 1200 of an example computing device that can implement the various techniques described herein. In some examples, the computing device can include a mobile device, a wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a video server, a vehicle (or a computing device of a vehicle), or another device. For example, the computing device architecture 1200 can implement the system of FIG. 6. The components of the computing device architecture 1200 are shown in electrical communication with each other using a connection 1205, such as a bus. The example computing device architecture 1200 includes a processing unit (CPU or processor) 1210 and the computing device connection 1205, which couples various computing device components, including computing device memory 1215 such as read-only memory (ROM) 1220 and random access memory (RAM) 1225, to the processor 1210.

The computing device architecture 1200 can include a cache 1212 of high-speed memory connected directly with, in close proximity to, or integrated as part of the processor 1210. The computing device architecture 1200 can copy data from the memory 1215 and/or the storage device 1230 to the cache 1212 for quick access by the processor 1210. In this way, the cache can provide a performance boost that avoids delays while the processor 1210 waits for data. These and other modules can control or be configured to control the processor 1210 to perform various actions. Other computing device memory 1215 may be available for use as well. The memory 1215 can include multiple different types of memory with different performance characteristics. The processor 1210 can include any general-purpose processor and a hardware or software service, such as service 1 1232, service 2 1234, and service 3 1236 stored in the storage device 1230, configured to control the processor 1210, as well as a special-purpose processor in which software instructions are incorporated into the processor design. The processor 1210 may be a self-contained system containing multiple cores or processors, a bus, a memory controller, a cache, and so on. A multi-core processor may be symmetric or asymmetric.

To enable user interaction with the computing device architecture 1200, an input device 1245 can represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, a keyboard, a mouse, motion input, speech, and so forth. An output device 1235 can also be one or more of a number of output mechanisms known to those of skill in the art, such as a display, a projector, a television, a speaker device, etc. In some instances, multimodal computing devices can enable a user to provide multiple types of input to communicate with the computing device architecture 1200. A communication interface 1240 can generally govern and manage the user input and the computing device output. There is no restriction on operating on any particular hardware arrangement, and therefore the basic features described here may easily be substituted with improved hardware or firmware arrangements as they are developed.

The storage device 1230 is a non-volatile memory and can be a hard disk or another type of computer-readable medium that can store data accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, cartridges, random access memories (RAMs) 1225, read-only memory (ROM) 1220, and hybrids thereof. The storage device 1230 can include services 1232, 1234, and 1236 for controlling the processor 1210. Other hardware or software modules are contemplated. The storage device 1230 can be connected to the computing device connection 1205. In one aspect, a hardware module that performs a particular function can include a software component stored in a computer-readable medium in connection with the necessary hardware components, such as the processor 1210, the connection 1205, the output device 1235, and so forth, to carry out the function.

Aspects of the present disclosure are applicable to any suitable electronic device (such as security systems, smartphones, tablets, laptop computers, vehicles, drones, or other devices) including or coupled to one or more active depth sensing systems. While described below with respect to a device having or coupled to one light projector, aspects of the present disclosure are applicable to devices having any number of light projectors and are therefore not limited to specific devices.

The term "device" is not limited to one or a specific number of physical objects (such as one smartphone, one controller, one processing system, and so on). As used herein, a device may be any electronic device with one or more parts that may implement at least some portions of this disclosure. While the description and examples below use the term "device" to describe various aspects of this disclosure, the term "device" is not limited to a specific configuration, type, or number of objects. Additionally, the term "system" is not limited to multiple components or specific embodiments. For example, a system may be implemented on one or more printed circuit boards or other substrates and may have movable or static components. While the description and examples below use the term "system" to describe various aspects of this disclosure, the term "system" is not limited to a specific configuration, type, or number of objects.

Specific details are provided in the description above to provide a thorough understanding of the embodiments and examples provided herein. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For clarity of explanation, in some instances the present technology may be presented as including individual functional blocks comprising devices, device components, steps or routines in a method embodied in software, or combinations of hardware and software. Additional components may be used other than those shown in the figures and/or described herein. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments.

Individual embodiments may be described above as a process or method that is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged. A process is terminated when its operations are completed, but it could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination can correspond to a return of the function to the calling function or the main function.

Processes and methods according to the above-described examples can be implemented using computer-executable instructions that are stored in or otherwise available from computer-readable media. Such instructions can include, for example, instructions and data that cause or otherwise configure a general-purpose computer, a special-purpose computer, or a processing device to perform a certain function or group of functions. Portions of the computer resources used can be accessible over a network. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, source code, etc.

The term "computer-readable medium" includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other media capable of storing, containing, or carrying instruction(s) and/or data. A computer-readable medium may include a non-transitory medium in which data can be stored and that does not include carrier waves and/or transitory electronic signals propagating wirelessly or over wired connections. Examples of a non-transitory medium may include, but are not limited to, a magnetic disk or tape, optical storage media such as a compact disk (CD) or digital versatile disk (DVD), flash memory, memory or memory devices, USB devices provided with non-volatile memory, networked storage devices, or any suitable combination thereof. A computer-readable medium may have stored thereon code and/or machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means, including memory sharing, message passing, token passing, network transmission, or the like.

In some embodiments, the computer-readable storage devices, media, and memories can include a cable or wireless signal containing a bit stream and the like. However, when mentioned, non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.

Devices implementing processes and methods according to these disclosures can include hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and can take any of a variety of form factors. When implemented in software, firmware, middleware, or microcode, the program code or code segments to perform the necessary tasks (e.g., a computer program product) may be stored in a computer-readable or machine-readable medium. A processor or processors may perform the necessary tasks. Typical examples of form factors include laptops, smart phones, mobile phones, tablet devices or other small form factor personal computers, personal digital assistants, rackmount devices, standalone devices, and so on. The functionality described herein can also be embodied in peripherals or add-in cards. By way of further example, such functionality can also be implemented on a circuit board among different chips or among different processes executing in a single device.

The instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are example means for providing the functions described in this disclosure.

In the foregoing description, aspects of the application are described with reference to specific embodiments thereof, but those skilled in the art will recognize that the application is not limited thereto. Thus, while illustrative embodiments of the application have been described in detail herein, it is to be understood that the inventive concepts may be otherwise variously embodied and employed, and that the appended claims are intended to be construed to include such variations, except as limited by the prior art. Various features and aspects of the above-described application may be used individually or jointly. Further, embodiments can be utilized in any number of environments and applications beyond those described herein without departing from the broader spirit and scope of the specification. The specification and drawings are, accordingly, to be regarded as illustrative rather than restrictive. For the purposes of illustration, methods were described in a particular order. It should be appreciated that in alternate embodiments, the methods may be performed in a different order than that described.

One of ordinary skill will appreciate that the less than ("<") and greater than (">") symbols or terminology used herein can be replaced with less than or equal to ("≤") and greater than or equal to ("≥") symbols, respectively, without departing from the scope of this disclosure.

Where components are described as being "configured to" perform certain operations, such configuration can be accomplished, for example, by designing electronic circuits or other hardware to perform the operation, by programming programmable electronic circuits (e.g., microprocessors or other suitable electronic circuits) to perform the operation, or by any combination thereof.

The phrase "coupled to" refers to any component that is physically connected to another component either directly or indirectly, and/or any component that is in communication with another component (e.g., connected to the other component over a wired or wireless connection and/or other suitable communication interface) either directly or indirectly.

Claim language or other language reciting "at least one of" a set and/or "one or more" of a set indicates that one member of the set or multiple members of the set (in any combination) satisfy the claim. For example, claim language reciting "at least one of A and B" or "at least one of A or B" means A, B, or A and B. In another example, claim language reciting "at least one of A, B, and C" or "at least one of A, B, or C" means A, B, C, or A and B, or A and C, or B and C, or A and B and C. The language "at least one of" a set and/or "one or more" of a set does not limit the set to the items listed in the set. For example, claim language reciting "at least one of A and B" or "at least one of A or B" can mean A, B, or A and B, and can additionally include items not listed in the set of A and B.
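As an illustration only, the combinations of listed members covered by "at least one of" language can be enumerated as the non-empty subsets of the set. The helper name `satisfying_combinations` is hypothetical and is not part of the disclosure. The sketch covers only the listed members and does not model the additional, unlisted items that the paragraph above also permits.

```python
from itertools import combinations


def satisfying_combinations(members):
    """Return every non-empty combination of the listed set members."""
    result = []
    for size in range(1, len(members) + 1):
        # combinations() yields tuples in the order the members are listed.
        result.extend(combinations(members, size))
    return result


# "At least one of A and B": A, B, or A and B.
combos_ab = satisfying_combinations(["A", "B"])

# "At least one of A, B, and C": seven combinations in total.
combos_abc = satisfying_combinations(["A", "B", "C"])
```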

The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, firmware, or combinations thereof. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The techniques described herein may also be implemented in electronic hardware, computer software, firmware, or any combination thereof. Such techniques may be implemented in any of a variety of devices, such as general-purpose computers, wireless communication device handsets, or integrated circuit devices having multiple uses, including application in wireless communication device handsets and other devices. Any features described as modules or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a computer-readable data storage medium comprising program code including instructions that, when executed, perform one or more of the methods described above. The computer-readable data storage medium may form part of a computer program product, which may include packaging materials. The computer-readable medium may comprise memory or data storage media, such as random access memory (RAM) such as synchronous dynamic random access memory (SDRAM), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, magnetic or optical data storage media, and the like. The techniques additionally, or alternatively, may be realized at least in part by a computer-readable communication medium, such as propagated signals or waves, that carries or communicates program code in the form of instructions or data structures and that can be accessed, read, and/or executed by a computer.

The program code may be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Such a processor may be configured to perform any of the techniques described in this disclosure. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Accordingly, the term "processor," as used herein, may refer to any of the foregoing structures, any combination of the foregoing structures, or any other structure or apparatus suitable for implementation of the techniques described herein.

Illustrative examples of the present disclosure include:

Aspect 1:

A method of processing video data, the method comprising: obtaining, by a machine learning system, input video data, the input video data including at least one luminance component for a current frame; and determining, by the machine learning system, motion information for the at least one luminance component of the current frame and motion information for one or more chrominance components of the current frame using the at least one luminance component for the current frame.

Aspect 2:

The method of aspect 1, further comprising: determining, by the machine learning system, a warping parameter for the at least one luminance component of the current frame and one or more warping parameters for the one or more chrominance components of the current frame using the motion information for the at least one luminance component of the current frame and the motion information for the one or more chrominance components of the current frame; and determining one or more inter-frame predictions for the current frame using the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame.

Aspect 3:

The method of aspect 2, wherein the one or more inter-frame predictions are determined at least in part by applying an interpolation operation using the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame.

Aspect 4:

The method of aspect 3, wherein the interpolation operation comprises a trilinear interpolation operation.

Aspect 5:

The method of any of aspects 2 to 4, wherein the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame comprise space-scale flow (SSF) warping parameters.

Aspect 6:

The method of aspect 5, wherein the SSF warping parameters comprise learned scale-flow vectors.
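
The warping described in aspects 2 to 6 can be illustrated with a minimal sketch, not the implementation of this disclosure. It assumes a scale-space volume (a stack of progressively blurred copies of a reference plane) is already available, and that each output sample carries a hypothetical scale-flow vector `(dx, dy, scale)`. The names `trilinear_sample` and `warp`, the volume layout `volume[s][y][x]`, and the border clamping are all assumptions made for illustration.

```python
import math


def trilinear_sample(volume, x, y, s):
    """Sample volume[s][y][x] at fractional (x, y, s), clamping to the borders."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    # Clamp coordinates so every neighbour index stays inside the volume.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    s = min(max(s, 0.0), depth - 1.0)
    x0, y0, s0 = int(math.floor(x)), int(math.floor(y)), int(math.floor(s))
    x1 = min(x0 + 1, width - 1)
    y1 = min(y0 + 1, height - 1)
    s1 = min(s0 + 1, depth - 1)
    fx, fy, fs = x - x0, y - y0, s - s0

    def bilinear(plane):
        # Interpolate along x on two rows, then along y between them.
        top = plane[y0][x0] * (1 - fx) + plane[y0][x1] * fx
        bottom = plane[y1][x0] * (1 - fx) + plane[y1][x1] * fx
        return top * (1 - fy) + bottom * fy

    # Interpolate along the scale axis between two adjacent blur levels.
    return bilinear(volume[s0]) * (1 - fs) + bilinear(volume[s1]) * fs


def warp(volume, flow):
    """Predict one plane; flow[y][x] is an assumed (dx, dy, scale) vector."""
    height, width = len(flow), len(flow[0])
    return [
        [
            trilinear_sample(
                volume, x + flow[y][x][0], y + flow[y][x][1], flow[y][x][2]
            )
            for x in range(width)
        ]
        for y in range(height)
    ]
```

In this sketch the same `warp` routine would be called once per plane: once with the luminance warping parameters and once for each chrominance plane with its own parameters.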

Aspect 7:

The method of any of aspects 1 to 6, wherein determining, using the at least one luminance component for the current frame, the motion information for the at least one luminance component of the current frame and the motion information for the one or more chrominance components of the current frame comprises: determining the motion information for the at least one luminance component of the current frame based on the at least one luminance component of the current frame and at least one reconstructed luma component of a previous frame; and determining the motion information for the one or more chrominance components of the current frame using the motion information determined for the at least one luminance component of the current frame.

Aspect 8:

The method of aspect 7, wherein the motion information for the one or more chrominance components of the current frame is determined using a convolutional layer of the machine learning system.

Aspect 9:

The method of aspect 7, wherein the motion information for the one or more chrominance components of the current frame is determined at least in part by sampling the motion information determined for the at least one luminance component of the current frame.
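
One possible reading of the sampling in aspect 9 is sketched below for a frame whose chrominance planes have half the luminance resolution in each dimension. This is an illustrative assumption, not the method defined here: the function name `chroma_flow_from_luma`, the choice of the top-left sample of each 2x2 block, and the halving of displacement vectors to match the smaller chrominance grid are all hypothetical.

```python
def chroma_flow_from_luma(luma_flow, scale_vectors=True):
    """Derive a half-resolution flow field from a luma flow field.

    luma_flow[y][x] is a (dx, dy) displacement in luma samples.
    Assumption: keep the top-left vector of every 2x2 block and, when
    scale_vectors is True, halve it so it is expressed in chroma samples.
    """
    chroma_flow = []
    for y in range(0, len(luma_flow), 2):
        row = []
        for x in range(0, len(luma_flow[0]), 2):
            dx, dy = luma_flow[y][x]
            row.append((dx / 2, dy / 2) if scale_vectors else (dx, dy))
        chroma_flow.append(row)
    return chroma_flow
```

A learned alternative, such as the convolutional layer of aspect 8, would replace the fixed subsampling rule with trained weights.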

Aspect 10:

The method of any of aspects 1 to 9, wherein the current frame comprises a video frame.

Aspect 11:

The method of any of aspects 1 to 10, wherein the one or more chrominance components include at least one chrominance-blue component and at least one chrominance-red component.

Aspect 12:

The method of any of aspects 1 to 11, wherein the current frame has a luminance-chrominance (YUV) format.

Aspect 13:

The method of aspect 12, wherein the YUV format is a YUV 4:2:0 format.
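
For context on aspect 13, in YUV 4:2:0 each chrominance plane is subsampled by two horizontally and vertically relative to the luminance plane. The short sketch below computes the resulting plane sizes. The function name `yuv420_plane_shapes` and the rounding up for odd dimensions are assumptions made for illustration.

```python
def yuv420_plane_shapes(width, height):
    """Return (rows, columns) of the Y, U and V planes of a 4:2:0 frame."""
    # Chroma planes are half size in each dimension; odd sizes round up.
    chroma_width = (width + 1) // 2
    chroma_height = (height + 1) // 2
    return {
        "Y": (height, width),
        "U": (chroma_height, chroma_width),
        "V": (chroma_height, chroma_width),
    }
```

For a 1920x1080 frame this gives a 1080x1920 luminance plane and two 540x960 chrominance planes, which is why motion information estimated on the luminance grid has to be mapped to a coarser grid before it is applied to the chrominance components.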

Aspect 14:

An apparatus for processing video data, comprising: at least one memory; and one or more processors coupled to the at least one memory, the one or more processors being configured to: obtain, using a machine learning system, input video data, the input video data including at least one luminance component for a current frame; and determine, using the machine learning system and the at least one luminance component for the current frame, motion information for the at least one luminance component of the current frame and motion information for one or more chrominance components of the current frame.

Aspect 15:

The apparatus of aspect 14, wherein the one or more processors are configured to: determine, using the machine learning system, the motion information for the at least one luminance component of the current frame, and the motion information for the one or more chrominance components of the current frame, a warping parameter for the at least one luminance component of the current frame and one or more warping parameters for the one or more chrominance components of the current frame; and determine one or more inter-frame predictions for the current frame using the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame.

Aspect 16:

The apparatus of aspect 15, wherein the one or more inter-frame predictions are determined at least in part by applying an interpolation operation using the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame.

Aspect 17:

The apparatus of aspect 16, wherein the interpolation operation comprises a trilinear interpolation operation.

Aspect 18:

The apparatus of any of aspects 15 to 17, wherein the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame comprise space-scale flow (SSF) warping parameters.

Aspect 19:

The apparatus of aspect 18, wherein the SSF warping parameters comprise learned scale-flow vectors.

Aspect 20:

The apparatus of any of aspects 14 to 19, wherein, to determine, using the at least one luminance component for the current frame, the motion information for the at least one luminance component of the current frame and the motion information for the one or more chrominance components of the current frame, the one or more processors are configured to: determine the motion information for the at least one luminance component of the current frame based on the at least one luminance component of the current frame and at least one reconstructed luma component of a previous frame; and determine the motion information for the one or more chrominance components of the current frame using the motion information determined for the at least one luminance component of the current frame.

Aspect 21:

The apparatus of aspect 20, wherein the motion information for the one or more chrominance components of the current frame is determined using a convolutional layer of the machine learning system.

Aspect 22:

The apparatus of aspect 20, wherein, to determine the motion information for the one or more chrominance components of the current frame, the one or more processors are configured to sample the motion information determined for the at least one luminance component of the current frame.

Aspect 23:

The apparatus of any of aspects 14 to 22, wherein the current frame comprises a video frame.

Aspect 24:

The apparatus of any of aspects 14 to 23, wherein the one or more chrominance components include at least one chrominance-blue component and at least one chrominance-red component.

Aspect 25:

The apparatus of any of aspects 14 to 24, wherein the current frame has a luminance-chrominance (YUV) format.

Aspect 26:

The apparatus of aspect 25, wherein the YUV format is a YUV 4:2:0 format.

Aspect 27:

The apparatus of any of aspects 14 to 26, further comprising at least one camera configured to capture one or more frames.

Aspect 28:

The apparatus of any of aspects 14 to 27, further comprising at least one display configured to display one or more frames.

Aspect 29:

The apparatus of any of aspects 14 to 28, wherein the apparatus comprises a mobile device.

Aspect 30: A computer-readable storage medium storing instructions that, when executed, cause one or more processors to perform any of the operations of aspects 1 to 29.

Aspect 31: An apparatus comprising means for performing any of the operations of aspects 1 to 29.

Claims

1. A method of processing video data, comprising:
obtaining, by a machine learning system, input video data including at least one luminance component for a current frame; and
determining, by the machine learning system, using the at least one luminance component for the current frame, motion information for the at least one luminance component of the current frame and motion information for one or more chrominance components of the current frame.

2. The method of claim 1, further comprising:
determining, by the machine learning system, using the motion information for the at least one luminance component of the current frame and the motion information for the one or more chrominance components of the current frame, a warping parameter for the at least one luminance component of the current frame and one or more warping parameters for the one or more chrominance components of the current frame; and
determining one or more inter-frame predictions for the current frame using the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame.

3. The method of claim 2, wherein the one or more inter-frame predictions are determined at least in part by applying an interpolation operation using the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame.

4. The method of claim 3, wherein the interpolation operation comprises a trilinear interpolation operation.

5. The method of claim 2, wherein the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame include space-scale flow (SSF) warping parameters.

6. The method of claim 5, wherein the SSF warping parameters include learned scale-flow vectors.

7. The method of claim 1, wherein determining, using the at least one luminance component for the current frame, the motion information for the at least one luminance component of the current frame and the motion information for the one or more chrominance components of the current frame comprises:
determining the motion information for the at least one luminance component of the current frame based on the at least one luminance component of the current frame and at least one reconstructed luma component of a previous frame; and
determining the motion information for the one or more chrominance components of the current frame using the motion information determined for the at least one luminance component of the current frame.

8. The method of claim 7, wherein the motion information for the one or more chrominance components of the current frame is determined using a convolutional layer of the machine learning system.

9. The method of claim 7, wherein the motion information for the one or more chrominance components of the current frame is determined at least in part by sampling the motion information determined for the at least one luminance component of the current frame.

10. The method of claim 1, wherein the current frame comprises a video frame.

11. The method of claim 1, wherein the one or more chrominance components include at least one chrominance-blue component and at least one chrominance-red component.

12. The method of claim 1, wherein the current frame has a luminance-chrominance (YUV) format.

13. The method of claim 12, wherein the YUV format is YUV 4:2:0 format.

14. An apparatus for processing video data, comprising:
at least one memory; and
one or more processors coupled to the at least one memory, the one or more processors being configured to:
obtain, using a machine learning system, input video data including at least one luminance component for a current frame; and
determine, using the machine learning system and the at least one luminance component for the current frame, motion information for the at least one luminance component of the current frame and motion information for one or more chrominance components of the current frame.

15. The apparatus of claim 14, wherein the one or more processors are configured to:
determine, using the machine learning system, the motion information for the at least one luminance component of the current frame, and the motion information for the one or more chrominance components of the current frame, a warping parameter for the at least one luminance component of the current frame and one or more warping parameters for the one or more chrominance components of the current frame; and
determine one or more inter-frame predictions for the current frame using the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame.

16. The apparatus of claim 15, wherein the one or more inter-frame predictions are determined at least in part by applying an interpolation operation using the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame.

17. The apparatus of claim 16, wherein the interpolation operation comprises a trilinear interpolation operation.

18. The apparatus of claim 15, wherein the warping parameter for the at least one luminance component of the current frame and the one or more warping parameters for the one or more chrominance components of the current frame include space-scale flow (SSF) warping parameters.

19. The apparatus of claim 18, wherein the SSF warping parameters include learned scale-flow vectors.

20. The apparatus of claim 14, wherein, to determine, using the at least one luminance component for the current frame, the motion information for the at least one luminance component of the current frame and the motion information for the one or more chrominance components of the current frame, the one or more processors are configured to:
determine the motion information for the at least one luminance component of the current frame based on the at least one luminance component of the current frame and at least one reconstructed luma component of a previous frame; and
determine the motion information for the one or more chrominance components of the current frame using the motion information determined for the at least one luminance component of the current frame.

21. The apparatus of claim 20, wherein the motion information for the one or more chrominance components of the current frame is determined using a convolutional layer of the machine learning system.

22. The apparatus of claim 20, wherein, to determine the motion information for the one or more chrominance components of the current frame, the one or more processors are configured to sample the motion information determined for the at least one luminance component of the current frame.

23. The apparatus of claim 14, wherein the current frame comprises a video frame.

24. The apparatus of claim 14, wherein the one or more chrominance components include at least one chrominance-blue component and at least one chrominance-red component.

25. The apparatus of claim 14, wherein the current frame has a luminance-chrominance (YUV) format.

26. The apparatus of claim 25, wherein the YUV format is a YUV 4:2:0 format.

27. The apparatus of claim 14, further comprising at least one camera configured to capture one or more frames.

28. The apparatus of claim 14, further comprising at least one display configured to display one or more frames.

29. The apparatus of claim 14, wherein the apparatus comprises a mobile device.

30. A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to:
obtain, using a machine learning system, input video data including at least one luminance component for a current frame; and
determine, using the machine learning system and the at least one luminance component for the current frame, motion information for the at least one luminance component of the current frame and motion information for one or more chrominance components of the current frame.