rayjoy/A-Warm-Welcome-to-ASN.1-and-DER---Chinese

A Warm Welcome to ASN.1 and DER


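Translation note

This document is a Chinese translation of the article A Warm Welcome to ASN.1 and DER. Each English paragraph is followed by its Chinese rendering.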


This document provides a gentle introduction to the data structures and formats that define the certificates used in HTTPS. It should be accessible to anyone with a little bit of computer science experience and a bit of familiarity with certificates.

本文对定义 HTTPS 中使用的证书的数据结构和格式进行了简要介绍。只要有一些计算机科学经验并且对证书稍稍熟悉,任何人都可以使用它。

An HTTPS certificate is a type of file, like any other file. Its contents follow a format defined by RFC 5280. The definitions are expressed in ASN.1, which is a language used to define file formats or (equivalently) data structures. For instance, in C you might write:

与其他文件一样,HTTPS 证书是文件的一种类型。其内容遵循 RFC 5280 定义的格式。这些定义用 ASN.1 表示,ASN.1 是一种用于定义文件格式或(等效)数据结构的语言。例如,在 C 语言中,您可以写为:

struct point {
  int x, y;
  char label[10];
};

In Go you would write:

在 Go 语言中,你会写做:

type point struct {
  x, y int
  label string
}

And in ASN.1 you would write:

而在 ASN.1 中,你可以写为:

Point ::= SEQUENCE {
  x INTEGER,
  y INTEGER,
  label UTF8String
}

The advantage of writing ASN.1 definitions instead of Go or C definitions is that they are language-independent. You can implement the ASN.1 definition of Point in any language, or (preferably) you can use a tool that takes the ASN.1 definition and automatically generates code implementing it in your favorite language. A set of ASN.1 definitions is called a “module”.

编写 ASN.1 定义而不是 Go 或 C 定义的好处是,它们是不依赖于语言的。您可以用任何语言实现 Point 的 ASN.1 定义,或者(最好)使用一个接受 ASN.1 定义,并自动生成用您喜欢的语言实现它的代码的工具。一组 ASN.1 定义被称为“模块”。

The other important thing about ASN.1 is that it comes with a variety of serialization formats: ways to turn an in-memory data structure into a series of bytes (or a file) and back again. This allows a certificate generated by one machine to be read by a different machine, even if that machine is using a different CPU and operating system.

关于 ASN.1 的另一个重要的方面是它提供了多种序列化格式——将内存中的数据结构转换为一系列字节(或文件)并再次返回的方法。这样由一台计算机生成的证书可以被另一台计算机读取,即使该计算机使用不同的 CPU 和操作系统。

There are some other languages that do the same things as ASN.1. For instance, Protocol Buffers offer both a language for defining types and a serialization format for encoding objects of the types you’ve defined. Thrift also has both a language and a serialization format. Either Protocol Buffers or Thrift could have just as easily been used to define the format for HTTPS certificates, but ASN.1 (1984) had the significant advantage of already existing when certificates (1988) and HTTPS (1994) were invented.

还有一些其他语言与 ASN.1 有相同的功能。例如,Protocol Buffers 既提供了定义类型的语言,也提供了对已定义类型的对象进行编码的序列化格式。Thrift 也具有语言和序列化格式。Protocol Buffers 或 Thrift 都可以很容易地用于定义 HTTPS 证书的格式,但是 ASN.1(1984)具有在证书(1988)和 HTTPS(1994)发明时就已经存在的显著优势。

ASN.1 has been revised multiple times through the years, with editions usually identified by the year they were published. This document aims to teach enough ASN.1 to clearly understand RFC 5280 and other standards related to HTTPS certificates, so we’ll mainly talk about the 1988 edition, with a few notes on features that were added in later editions. You can download the various editions directly from ITU, with the caveat that some are only available to ITU members. The relevant standards are X.680 (defining the ASN.1 language) and X.690 (defining the serialization formats DER and BER). Earlier versions of those standards were X.208 and X.209, respectively.

多年来,ASN.1 已经被多次修订,版本通常以出版年份来确定。本文档旨在教会足够的 ASN.1 知识,以便能够清楚地理解 RFC 5280 和其他与 HTTPS 证书相关的标准,因此我们将主要讨论1988版,并对以后版本中添加的功能做一些注释。您可以直接从ITU下载各种版本,但需要注意的是,有些版本只对ITU成员可用。相关的标准是 X.680(定义 ASN.1 语言)和 X.690(定义序列化格式 DER 和 BER)。这些标准的早期版本分别是 X.208 和 X.209。

ASN.1’s main serialization format is “Distinguished Encoding Rules” (DER). They are a variant of “Basic Encoding Rules” (BER) with canonicalization added. For instance, if a type includes a SET OF, the members must be sorted for DER serialization.

ASN.1 的主要序列化格式是“可分辨编码规则” - “Distinguished Encoding Rules”(DER)。它们是添加了规范化的“基本编码规则” - “Basic Encoding Rules”(BER)的变体。例如,如果一个 type 包含一个 SET OF,则必须对成员排序以进行 DER 序列化。

A certificate represented in DER is often further encoded into PEM, which uses base64 to encode arbitrary bytes as alphanumeric characters (plus ‘+’ and ‘/’) and adds separator lines (“-----BEGIN CERTIFICATE-----” and “-----END CERTIFICATE-----”). PEM is useful because it’s easier to copy-paste.

以 DER 表示的证书通常会进一步编码为 PEM,PEM 使用 base64 将任意字节编码为字母数字字符(以及“+”和“/”),并添加分隔线(“-----BEGIN CERTIFICATE-----”和“-----END CERTIFICATE-----”)。PEM 很有用,因为它更容易复制粘贴。
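The PEM wrapping described above can be sketched in a few lines of Python. This is a minimal illustration, not a full PEM implementation: the helper names `der_to_pem` and `pem_to_der` are hypothetical, the 64-character line width is the conventional choice, and the five input bytes are a tiny DER structure rather than a real certificate.

```python
import base64


def der_to_pem(der: bytes, label: str = "CERTIFICATE") -> str:
    """Wrap DER bytes in PEM armor: a base64 body between BEGIN/END lines."""
    b64 = base64.b64encode(der).decode("ascii")
    # PEM bodies are conventionally split into lines of 64 characters.
    lines = [b64[i:i + 64] for i in range(0, len(b64), 64)]
    body = "\n".join(lines)
    return f"-----BEGIN {label}-----\n{body}\n-----END {label}-----\n"


def pem_to_der(pem: str) -> bytes:
    """Drop the separator lines and base64-decode what is left."""
    lines = [ln.strip() for ln in pem.strip().splitlines()]
    body = "".join(ln for ln in lines if ln and not ln.startswith("-----"))
    return base64.b64decode(body)


# Five DER bytes used only as sample input (not a real certificate).
sample_der = bytes.fromhex("30 03 02 01 09")
sample_pem = der_to_pem(sample_der)
```

Round-tripping `sample_pem` through `pem_to_der` gives back the original bytes, which is the whole point of the format: the armor changes how the bytes are transported, not what they are.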

This document will first describe the types and notation used by ASN.1, and will then describe how objects defined using ASN.1 are encoded. Feel free to flip back and forth between the sections, particularly since some features of the ASN.1 language directly specify encoding details. This document prefers more familiar terms, and so uses “byte” in place of “octet,” and “value” in place of “contents.” It uses “serialization” and “encoding” interchangeably.

本文档将首先描述 ASN.1 使用的类型和符号,然后描述如何对使用 ASN.1 定义的对象进行编码。请随意在各个部分之间来回切换,特别是因为 ASN.1 语言的某些特性直接指定了编码细节。本文档更喜欢使用更熟悉的术语,因此使用“byte”代替“octet”,用“value”代替“contents”,它交替使用“serialization”和“encoding”。

The Types

INTEGER

Good old familiar INTEGER. These can be positive or negative. What’s really unusual about ASN.1 INTEGERs is that they can be arbitrarily big. Not enough room in an int64? No problem. This is particularly handy for representing things like an RSA modulus, which is much bigger than an int64 (like 2**2048 big). Technically there is a maximum integer in DER but it’s extraordinarily large: The length of any DER field can be expressed as a series of up to 126 bytes. So the biggest INTEGER you can represent in DER is 256**(2**1008) - 1. For a truly unbounded INTEGER you’d have to encode in BER, which allows indefinitely-long fields.

好的旧的常用的整数。它们可以是正值或者负值。ASN.1 整数的真正不寻常之处在于它们可以任意大。int64 没有足够的空间?没问题。这对于表示 RSA 模之类的东西特别方便,RSA 模比 int64 大得多(比如 2**2048 那么大)。从技术上讲,DER 中有一个最大整数,但它非常大:任何 DER 字段的长度都可以表示为一系列最多 126 个字节。因此,在 DER 中可以表示的最大整数是 256**(2**1008) - 1。对于真正的无界整数,您必须用 BER 编码,它允许无限长的字段。

Strings

ASN.1 has a lot of string types: BMPString, GeneralString, GraphicString, IA5String, ISO646String, NumericString, PrintableString, TeletexString, T61String, UniversalString, UTF8String, VideotexString, and VisibleString. For the purposes of HTTPS certificates you mostly have to care about PrintableString, UTF8String, and IA5String. The string type for a given field is defined by the ASN.1 module that defines the field. For instance:

ASN.1 有很多字符串类型:BMPString、GeneralString、GraphicString、IA5String、ISO646String、NumericString、PrintableString、TeletexString、T61String、UniversalString、UTF8String、VideotexString 和 VisibleString。对于 HTTPS 证书,您主要需关心的是 PrintableString、UTF8String 和 IA5String。给定字段的字符串类型由定义字段的 ASN.1 模块定义。例如:

CPSuri ::= IA5String

PrintableString is a restricted subset of ASCII, allowing alphanumerics, spaces, and a specific handful of punctuation: ' () + , - . / : = ?. Notably it doesn’t include * or @. There are no storage-size benefits to more restrictive string types.

PrintableString 是 ASCII 的一个受限子集,允许字母数字、空格和少数特定的标点符号:' () + , - . / : = ?。值得注意的是,它不包括 * 或 @。限制性更强的字符串类型在存储大小上并没有优势。

Some fields, like DirectoryString in RFC 5280, allow the serialization code to choose among multiple string types. Since DER encoding includes the type of string you’re using, make sure that when you encode something as PrintableString it really meets the PrintableString requirements.

有些字段(如 RFC 5280 中的 DirectoryString),允许序列化代码在多个字符串类型中进行选择。由于DER编码包含您正在使用的字符串类型,请确保当您将某些内容编码为PrintableString时,它确实可以满足 PrintableString 的要求。
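A check for the PrintableString character set described above might look like the following sketch. The names `PRINTABLE_STRING_CHARS` and `is_printable_string` are made up for illustration. The character set is taken from the list given in this section.

```python
import string

# Letters, digits, space, and the punctuation listed above: ' ( ) + , - . / : = ?
PRINTABLE_STRING_CHARS = frozenset(
    string.ascii_letters + string.digits + " '()+,-./:=?"
)


def is_printable_string(text: str) -> bool:
    """Return True if every character is allowed in a PrintableString."""
    return all(ch in PRINTABLE_STRING_CHARS for ch in text)
```

An encoder that may choose between string types could run a check like this first and fall back to UTF8String when it fails, for example for a wildcard name containing `*` or an email address containing `@`.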

IA5String, based on International Alphabet No. 5, is more permissive: It allows nearly any ASCII character, and is used for email addresses, DNS names, and URLs in certificates. Note that there are a few byte values where the IA5 meaning of the byte value is different from the US-ASCII meaning of that same value.

IA5String 基于国际 5 号字母表(International Alphabet No. 5),限制更为宽松:它允许几乎任何 ASCII 字符,并用于证书中的电子邮件地址、DNS 名称和 URL。注意,有少数字节值的 IA5 含义与同一值的 US-ASCII 含义不同。

TeletexString, BMPString, and UniversalString are deprecated for use in HTTPS certificates, but you may see them when parsing older CA certificates, which are long-lived and may predate the deprecation.

TeletexString、BMPString 和 UniversalString 已不推荐在 HTTPS 证书中使用,但是在解析较旧的 CA 证书时可能会看到它们,这些证书有效期很长,并且可能早于弃用的时间。

Strings in ASN.1 are not null-terminated like strings in C and C++. In fact, it’s perfectly legal to have embedded null bytes. This can cause vulnerabilities when two systems interpret the same ASN.1 string differently. For instance, some CAs used to be able to be tricked into issuing a certificate for “example.com\0.evil.com” on the basis of ownership of evil.com. Certificate validation libraries at the time treated the result as valid for “example.com”. Be very careful handling ASN.1 strings in C and C++ to avoid creating vulnerabilities.

ASN.1 中的字符串不像 C 和 C++ 中的字符串那样以 null 结尾。事实上,嵌入空字节是完全合法的。当两个系统以不同的方式解释同一个 ASN.1 字符串时,这可能会导致漏洞。例如,一些 CA 过去可能被诱骗,基于申请者对 evil.com 的所有权而为 “example.com\0.evil.com” 签发证书。当时的证书验证库认为该证书对 “example.com” 有效。在 C 和 C++ 中处理 ASN.1 字符串时请务必小心,以免造成漏洞。
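The mismatch can be imitated in Python, where byte strings carry their own length. In this sketch, `c_style_view` is a hypothetical stand-in for code that stops at the first NUL byte, and `has_embedded_nul` is the kind of check a careful parser might apply before treating a value as a host name.

```python
def c_style_view(value: bytes) -> bytes:
    """What code that stops reading at the first NUL byte would see."""
    return value.split(b"\x00", 1)[0]


def has_embedded_nul(value: bytes) -> bool:
    """True if the length-prefixed value contains a NUL byte anywhere."""
    return b"\x00" in value


# The full value as carried in the ASN.1 string (21 bytes).
name = b"example.com\x00.evil.com"
```

Here `c_style_view(name)` is `b"example.com"` even though the encoded value is 21 bytes long, so rejecting any name for which `has_embedded_nul` is true closes the gap between the two interpretations.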

Dates and Times

Again, lots of time types: UTCTime, GeneralizedTime, DATE, TIME-OF-DAY, DATE-TIME and DURATION. For HTTPS certificates you only have to care about UTCTime and GeneralizedTime.

同样,有很多时间类型:UTCTime、GeneralizedTime、DATE、TIME-OF-DAY、DATE-TIME 和 DURATION。对于 HTTPS 证书,您只需关心 UTCTime 和 GeneralizedTime。

UTCTime represents a date and time as YYMMDDhhmm[ss], with an optional timezone offset or “Z” to represent Zulu (aka UTC aka 0 timezone offset). For instance the UTCTimes 820102120000Z and 820102070000-0500 both represent the same time: January 2nd, 1982, at 7am in New York City (UTC-5) and at 12pm in UTC.

UTCTime 将日期和时间表示为 YYMMDDhhmm[ss],带有可选的时区偏移量或 “Z” 来表示Zulu(也就是 UTC 或者称为 0时区偏移量)。例如,UTCTimes 820102120000Z 和820102070000-0500 都代表同一时间:1982年1月2日,纽约时间(UTC-5)上午7点和 UTC 时间下午12点。

Since UTCTime is ambiguous as to whether it’s the 1900’s or 2000’s, RFC 5280 clarifies that it represents dates from 1950 through 2049: two-digit years 50 to 99 mean 19YY, and 00 to 49 mean 20YY. RFC 5280 also requires that the “Z” timezone must be used and seconds must be included.

由于 UTCTime 在年份属于 1900 年代还是 2000 年代上存在歧义,RFC 5280 明确规定它表示 1950 年至 2049 年的日期:两位年份 50 至 99 表示 19YY,00 至 49 表示 20YY。RFC 5280 还要求必须使用“Z”时区,并且必须包括秒。
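A parser for the restricted UTCTime form that RFC 5280 allows (YYMMDDhhmmssZ) could be sketched as follows. The function name `parse_rfc5280_utctime` is hypothetical. The century rule used here is the RFC 5280 one: a two-digit year of 50 or more means 19YY, anything lower means 20YY. Python's own `%y` format applies a different pivot, so the year is handled by hand.

```python
from datetime import datetime, timezone


def parse_rfc5280_utctime(text: str) -> datetime:
    """Parse YYMMDDhhmmssZ, with seconds and the "Z" suffix both required."""
    digits = text[:12]
    if len(text) != 13 or not text.endswith("Z"):
        raise ValueError("expected YYMMDDhhmmssZ")
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected twelve ASCII digits before the Z")
    yy = int(digits[0:2])
    # RFC 5280 pivot: 50..99 -> 1950..1999, 00..49 -> 2000..2049.
    year = 1900 + yy if yy >= 50 else 2000 + yy
    return datetime(
        year, int(digits[2:4]), int(digits[4:6]),
        int(digits[6:8]), int(digits[8:10]), int(digits[10:12]),
        tzinfo=timezone.utc,
    )
```

The offset form shown earlier (820102070000-0500) is rejected on purpose, since this sketch only accepts what RFC 5280 permits in certificates.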

GeneralizedTime supports dates after 2050 through the simple expedient of representing the year with four digits. It also allows fractional seconds (weirdly, with either a comma or a full stop as the decimal separator). RFC 5280 forbids fractional seconds and requires the “Z.”

GeneralizedTime 支持2050年以后的日期,方法是用四位数表示年份。它还允许小数秒(奇怪的是,小数点分隔符可以是逗号或句号)。RFC 5280禁止分数秒,并要求使用“Z”。

OBJECT IDENTIFIER

Object identifiers are globally unique, hierarchical identifiers made of a sequence of integers. They can refer to any kind of “thing,” but are commonly used to identify standards, algorithms, certificate extensions, organizations, or policy documents. As an example: 1.2.840.113549 identifies RSA Security LLC. RSA can then assign OIDs starting with that prefix, like 1.2.840.113549.1.1.11, which identifies sha256WithRSAEncryption, as defined in RFC 8017.

对象标识符是由一系列整数组成的全局唯一的分层标识符。它们可以引用任何类型的“东西”,但通常用于标识标准、算法、证书扩展、组织或策略文档。例如:1.2.840.113549 标识 RSA Security LLC。然后RSA可以分配以该前缀开头的OID,如 RFC 8017 中定义的1.2.840.113549.1.1.11,它标识使用 RSA 加密的 SHA256。

Similarly, 1.3.6.1.4.1.11129 identifies Google, Inc. Google assigned 1.3.6.1.4.1.11129.2.4.2 to identify the SCT list extension used in Certificate Transparency (which was initially developed at Google), as defined in RFC 6962.

类似地,1.3.6.1.4.1.11129 标识了 Google,Inc. 如RFC 6962中所定义的,Google分配了1.3.6.1.4.1.11129.2.4.2以识别证书透明性中使用的 SCT 列表扩展(最初由Google开发)。

The set of child OIDs that can exist under a given prefix is called an “OID arc.” Since the representation of shorter OIDs is smaller, OID assignments under shorter arcs are considered more valuable, particularly for formats where that OID will have to be sent a lot. The OID arc 2.5 is assigned to “Directory Services,” the series of specifications that includes X.509, which HTTPS certificates are based on. A lot of fields in certificates begin with that conveniently short arc. For instance, 2.5.4.6 means “countryName,” while 2.5.4.10 means “organizationName.” Since most certificates have to encode each of those OIDs at least once, it’s handy that they are short.

在给定前缀下可以存在的子 OID 集称为 “OID arc”。由于较短 OID 的表示更小,因此在较短的arc下分配OID被认为更有价值,特别是对于那些需要大量发送OID的格式。OID arc 2.5 被分配给“目录服务”,这是一系列规范,包括HTTPS证书所基于的X.509。证书中的许多字段都是以这个方便的短 arc 开头的。例如,2.5.4.6 表示 “countryName”,而 2.5.4.10 表示 “organizationName”。由于大多数证书必须对每个OID至少编码一次,所以它们很短很方便。
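To make "shorter OIDs are smaller" concrete, here is a sketch of how the value bytes of an OID are produced. The byte-level rule is not spelled out in this part of the document, so treat it as an assumption taken from X.690: the first two arcs are packed into one number (40 times the first, plus the second), and each number is then written in base 128 with the high bit set on every byte except the last. The function name `encode_oid_value` is hypothetical, and the result excludes the tag and length bytes.

```python
def encode_oid_value(dotted: str) -> bytes:
    """Encode a dotted OID string into DER value bytes (no tag or length)."""
    arcs = [int(part) for part in dotted.split(".")]
    if len(arcs) < 2 or arcs[0] > 2 or any(a < 0 for a in arcs):
        raise ValueError("need at least two non-negative arcs, first arc 0..2")
    if arcs[0] < 2 and arcs[1] > 39:
        raise ValueError("second arc must be 0..39 when the first is 0 or 1")
    # The first two arcs share a single number.
    numbers = [40 * arcs[0] + arcs[1]] + arcs[2:]
    out = bytearray()
    for n in numbers:
        # Build the base-128 digits least significant first, then reverse.
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(0x80 | (n & 0x7F))
            n >>= 7
        out.extend(reversed(chunk))
    return bytes(out)
```

Under these assumptions 2.5.4.6 (countryName) takes three value bytes, while 1.2.840.113549.1.1.11 (sha256WithRSAEncryption) takes nine, which is the size difference the paragraph above is describing.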

OIDs in specifications are commonly represented with a human-readable name for convenience, and may be specified by concatenation with another OID. For instance from RFC 8017:

为了方便起见,规范中的OID通常用人类可读的名称表示,并且可以通过与另一个OID连接来指定。例如RFC 8017:

   pkcs-1    OBJECT IDENTIFIER ::= {
       iso(1) member-body(2) us(840) rsadsi(113549) pkcs(1) 1
   }
   ...

   sha256WithRSAEncryption      OBJECT IDENTIFIER ::= { pkcs-1 11 }

NULL

NULL is just NULL, ya know?

NULL 就是 NULL,你知道吗?

SEQUENCE and SEQUENCE OF

Don’t let the names fool you: These are two very different types. A SEQUENCE is equivalent to “struct” in most programming languages. It holds a fixed number of fields of different types. For instance, see the Certificate example below.

别让名字骗了你:这是两种截然不同的类型。SEQUENCE在大多数编程语言中等同于 “struct”。它包含固定数量的不同类型的字段。例如,请参阅下面的证书示例。

A SEQUENCE OF, on the other hand, holds an arbitrary number of fields of a single type. This is analogous to an array or a list in a programming language. For instance:

从另一方面来讲,一个 SEQUENCE OF 有任意数量的单一类型的字段。这类似于编程语言中的数组或列表。例如:

   RDNSequence ::= SEQUENCE OF RelativeDistinguishedName

That could be 0, 1, or 7,000 RelativeDistinguishedNames, in a specific order.

这可能是0、1或7000个按特定顺序排列的 RelativeDistinguishedNames。

It turns out SEQUENCE and SEQUENCE OF do have one similarity - they are both encoded the same way! More on that in the Encoding section.

结果发现SEQUENCE和SEQUENCE OF确实有一个相似之处-它们的编码方式都是一样的!在编码部分有更多信息。

SET and SET OF

These are pretty much the same as SEQUENCE and SEQUENCE OF, except that there are intentionally no semantics attached to the ordering of elements in them. However, in DER-encoded form their elements must be sorted. An example:

它们与 SEQUENCE 和 SEQUENCE OF 基本相同,只是有意不为其中元素的顺序附加任何语义。然而,在 DER 编码形式中,其元素必须排序。例如:

RelativeDistinguishedName ::=
  SET SIZE (1..MAX) OF AttributeTypeAndValue

Note: This example uses the SIZE keyword to additionally specify that RelativeDistinguishedName must have at least one member, but in general a SET or SET OF is allowed to have a size of zero.

注意:此示例使用 SIZE 关键字额外指定 RelativeDistinguishedName 必须至少有一个成员,但通常一个 SET 或 SET OF 允许其大小为零。
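The DER sorting requirement for SET OF can be sketched as below. It assumes the X.690 rule that members are ordered by comparing their complete encodings byte by byte. Python's ordering of `bytes` objects gives the same result for distinct well-formed elements. The helper name `der_set_of` is hypothetical, and the sketch only handles bodies short enough for a one-byte length.

```python
from typing import Iterable


def der_set_of(encoded_elements: Iterable[bytes]) -> bytes:
    """Build a DER SET OF from already-encoded member elements."""
    # Sort the members by their encoded bytes, then concatenate them.
    body = b"".join(sorted(encoded_elements))
    if len(body) > 127:
        raise ValueError("this sketch only handles short-form lengths")
    # 0x31 is the SET / SET OF tag with the constructed bit set.
    return bytes([0x31, len(body)]) + body


int_nine = bytes.fromhex("02 01 09")  # INTEGER 9
int_five = bytes.fromhex("02 01 05")  # INTEGER 5
```

Because of the sort, `der_set_of([int_nine, int_five])` and `der_set_of([int_five, int_nine])` produce identical bytes, so two encoders that hold the same set in different in-memory orders still agree on the output.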

BIT STRING and OCTET STRING

These contain arbitrary bits or bytes respectively. These can be used to hold unstructured data, like nonces or hash function output. They can also be used like a void pointer in C or the empty interface type (interface{}) in Go: A way to hold data that does have a structure, but where that structure is understood or defined separately from the type system. For instance, the signature on a certificate is defined as a BIT STRING:

它们分别包含任意位或字节。它们可用于保存非结构化数据,如 nonces 或 散列函数输出。它们也可以像 C 中的空指针或 Go 中的空接口类型(interface{})一样使用:一种保存具有结构的数据的方法,但该结构是与类型系统分开理解或定义的。例如,证书上的签名定义为一个 BIT STRING:

Certificate  ::=  SEQUENCE  {
     tbsCertificate       TBSCertificate,
     signatureAlgorithm   AlgorithmIdentifier,
     signature            BIT STRING  }

Later versions of the ASN.1 language allow more detailed specification of the contents inside the BIT STRING (and the same is true of OCTET STRINGs).

ASN.1语言的更高版本允许更详细地说明位字符串中的内容(OCTET STRING也是如此)。

CHOICE and ANY

CHOICE is a type that can contain exactly one of the types listed in its definition. For instance, Time can contain exactly one of a UTCTime or a GeneralizedTime:

CHOICE是一种类型,它可以正好包含其定义中列出的一种类型。例如,Time 可以正好包含 UTCTime 或 GeneralizedTime 中的一个:

Time ::= CHOICE {
     utcTime        UTCTime,
     generalTime    GeneralizedTime }

ANY indicates that a value can be of any type. This is particularly useful for extensions, where you want to leave room for additional fields to be defined separately after the main specification is published, so you have a way to register new types (object identifiers), and allow the definitions for those types to specify what the structure of the new fields should be.

ANY 表示一个值可以是任意类型。这对于扩展特别有用:您希望留出空间,以便在主规范发布之后再单独定义其他字段,因此需要一种注册新类型(对象标识符)的方法,并允许这些类型的定义来指定新字段的结构。

Note that ANY is a relic of the 1988 ASN.1 notation. In the 1994 edition, ANY was deprecated and replaced with Information Object Classes, which are a fancy, formalized way of specifying the kind of extension behavior people wanted from ANY. The change is so old by now that the latest ASN.1 specifications (from 2015) don’t even mention ANY. But if you look at the 1994 edition you can see some discussion of the switchover. I include the older syntax here because that’s still what RFC 5280 uses. RFC 5912 uses the 2002 ASN.1 syntax to express the same types from RFC 5280 and several related specifications.

请注意,ANY都是1988年ASN.1符号的遗物。在1994年的版本中,ANY被弃用,取而代之的是Information Object Class,这是一种奇特的、形式化的方式,用于指定人们希望从 ANY 获得的扩展行为。这种变化已经很老了,最新的ASN.1规范(从2015年开始)甚至都没有提到。但如果你看看1994年的版本,你可以看到一些关于转换的讨论。我在这里包含了旧的语法,因为 RFC 5280 仍然使用这个语法。RFC 5912使用2002 ASN.1 语法来表示 RFC 5280和几个相关规范中的相同类型。

Other Notation

Comments begin with --. Fields of a SEQUENCE or SET can be marked OPTIONAL, or they can be marked DEFAULT foo, which means the same thing as OPTIONAL except that when the field is absent it should be considered to contain “foo.” Types with a length (strings, octet and bit strings, sets and sequences OF things) can be given a SIZE parameter that constrains their length, either to an exact length or to a range.

注释以--开头。序列或集合的字段可以标记为可选,也可以标记为 DEFAULT foo,这与可选的含义相同,只是当字段不存在时,它应该被视为包含“foo”。具有长度的类型(strings, octet and bit strings, sets and sequences OF things)可以给定一个 SIZE 参数来约束它们的长度,可以是精确的长度,也可以是一个范围。

Types can be constrained to have certain values by using curly braces after the type definition. This example defines that the Version field can have three values, and assigns meaningful names to those values:

通过在类型定义之后使用大括号,可以将类型约束为具有特定值。此示例定义 Version 字段可以有三个值,并为这些值指定有意义的名称:

Version ::= INTEGER { v1(0), v2(1), v3(2) }

This is also often used in assigning names to specific OIDs (note this is a single value, with no commas indicating alternate values). Example from RFC 5280:

这也经常用于为特定 OID 分配名称(注意,这是一个单一值,没有逗号表示替代值)。RFC 5280中的示例:

id-pkix  OBJECT IDENTIFIER  ::=
         { iso(1) identified-organization(3) dod(6) internet(1)
                    security(5) mechanisms(5) pkix(7) }

You’ll also see [number], IMPLICIT, EXPLICIT, UNIVERSAL, and APPLICATION. These define details of how a value should be encoded, which we’ll talk about below.

您还将看到 [number]、IMPLICIT、EXPLICIT、UNIVERSAL 和 APPLICATION。这些定义了一个值应该如何编码的细节,我们将在下面讨论。

The Encoding

ASN.1 is associated with many encodings: BER, DER, PER, XER, and more. Basic Encoding Rules (BER) are fairly flexible. Distinguished Encoding Rules (DER) are a subset of BER with canonicalization rules so there is only one way to express a given structure. Packed Encoding Rules (PER) use fewer bytes to encode things, so they are useful when space or transmission time is at a premium. XML Encoding Rules (XER) are useful when for some reason you want to use XML.

ASN.1与许多编码相关:BER、DER、PER、XER等等。基本编码规则(BER)相当灵活。区分编码规则(DER)是BER的一个子集,具有规范化规则,因此只有一种方法来表示给定的结构。压缩编码规则(PER)使用较少的字节来编码内容,因此当空间或传输时间很昂贵时,它们非常有用。当出于某种原因需要使用XML时,XML编码规则(XER)非常有用。

HTTPS certificates are generally encoded in DER. It’s possible to encode them in BER, but since the signature value is calculated over the equivalent DER encoding, not the exact bytes in the certificate, encoding a certificate in BER invites unnecessary trouble. I’ll describe BER, and explain as I go the additional restrictions provided by DER.

HTTPS证书通常按 DER 编码。可以用BER编码它们,但由于签名值是通过等效的 DER 编码计算的,而不是证书中的确切字节,所以用 BER 编码证书会带来不必要的麻烦。我将描述 BER,并解释 DER 提供的附加限制。

I encourage you to read this section with this decoding of a real certificate open in another window.

我鼓励您阅读本节时在另一个窗口打开一个真正的证书的解码。

Type-Length-Value

BER is a type-length-value encoding, just like Protocol Buffers and Thrift. That means that, as you read bytes that are encoded with BER, first you encounter a type, called in ASN.1 a tag. This is a byte, or series of bytes, that tells you what type of thing is encoded: an INTEGER, or a UTF8String, or a structure, or whatever else.

BER 是一种类型长度值(type-length-value,TLV)编码,就像 Protocol Buffers 和 Thrift 一样。这意味着,当您读取用 BER 编码的字节时,首先会遇到一个类型,在 ASN.1 中称为标记(tag)。这是一个字节或一系列字节,它告诉您编码的是什么类型的东西:INTEGER、UTF8String、结构或其他任何东西。

type  length  value
02    03      01 00 01

Next you encounter a length: a number that tells you how many bytes of data you’re going to need to read in order to get the value. Then, of course, come the bytes containing the value itself. As an example, the hex bytes 02 03 01 00 01 would represent an INTEGER (tag 02 corresponds to the INTEGER type), with length 03, and a three-byte value consisting of 01 00 01.

接下来,您将遇到一个长度:一个数字,它告诉您需要读取多少字节的数据才能获得该值。当然,接下来是包含值本身的字节。例如,十六进制字节组 02 03 01 00 01将表示长度为03的 INTEGER(tag 02对应于 INTEGER 类型),以及由01 00 01组成的三字节值。
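The walk-through above can be reproduced with a very small reader. This sketch, with the hypothetical name `read_short_tlv`, handles only a one-byte tag and a one-byte (short form) length. Interpreting the value as a big-endian two's complement number is an assumption about INTEGER encoding that this section has not yet covered.

```python
def read_short_tlv(data: bytes):
    """Split one TLV that has a single-byte tag and a short-form length."""
    if len(data) < 2:
        raise ValueError("need at least a tag byte and a length byte")
    tag, length = data[0], data[1]
    if length > 127:
        raise ValueError("long-form or indefinite length not handled here")
    value = data[2:2 + length]
    if len(value) != length:
        raise ValueError("value is shorter than its stated length")
    return tag, length, value


tag, length, value = read_short_tlv(bytes.fromhex("02 03 01 00 01"))
# Assumption: INTEGER values are big-endian two's complement.
number = int.from_bytes(value, "big", signed=True)
```

With the example bytes, `tag` is 0x02, `length` is 3, and `number` is 65537, a value commonly used as an RSA public exponent.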

Type-length-value is distinguished from delimited encodings like JSON, CSV, or XML, where instead of knowing the length of a field up front, you read bytes until you hit the expected delimiter (e.g. } in JSON, or </some-tag> in XML).

类型长度值(Type-length-value)不同于 JSON、CSV 或 XML 等分隔符编码:在后者中,您无法预先知道字段的长度,而是一直读取字节,直到遇到预期的分隔符(例如 JSON 中的 },或 XML 中的 </some-tag>)。

Tag

The tag is usually one byte. There is a means to encode arbitrarily large tag numbers using multiple bytes (the “high tag number” form), but this is not typically necessary.

标签通常是一个字节。有一种方法,可以使用多个字节对任意大的标记号进行编码(“高标记号(high tag number)”形式),但通常情况下这是不必要的。

Here are some example tags:

以下是一些标记示例:

Tag (decimal)  Tag (hex)     Type
2              02            INTEGER
3              03            BIT STRING
4              04            OCTET STRING
5              05            NULL
6              06            OBJECT IDENTIFIER
12             0C            UTF8String
16             10 (and 30)*  SEQUENCE and SEQUENCE OF
17             11 (and 31)*  SET and SET OF
19             13            PrintableString
22             16            IA5String
23             17            UTCTime
24             18            GeneralizedTime

These, and a few others I’ve skipped for being boring, are the “universal” tags, because they are specified in the core ASN.1 specification and mean the same thing across all ASN.1 modules.

这些,还有一些我因为无聊而跳过的标签是“通用(universal)”标签,因为它们是在核心 ASN.1 规范中指定的,在所有 ASN.1 模块中都有相同的含义。

These tags all happen to be under 31 (0x1F), and that’s for a good reason: Bits 8, 7, and 6 (the high bits of the tag byte) are used to encode extra information, so any universal tag numbers higher than 31 would need to use the “high tag number” form, which takes extra bytes. There are a small handful of universal tags higher than 31, but they’re quite rare.

这些标记号恰好都小于 31(0x1F),这是有充分理由的:第 8、7 和 6 位(标记字节的高位)用于编码额外信息,因此任何大于 31 的通用标记号都需要使用“高标记号(high tag number)”形式,这需要额外的字节。有少数通用标记大于 31,但它们相当罕见。

The two tags marked with a * are always encoded as 0x30 or 0x31, because bit 6 is used to indicate whether a field is Constructed vs Primitive. These tags are always Constructed, so their encoding has bit 6 set to 1. See the Constructed vs Primitive section for details.

用 * 标记的两个标记总是编码为 0x30 或 0x31,因为 位6 用于指示字段是构造的(Constructed)还是原始的(Primitive)。这些标记总是构造的,因此它们的编码将位6设置为1。有关详细信息,请参见构造 vs 原始部分。

Tag Classes

Just because the universal class has used up all the “good” tag numbers, that doesn’t mean we’re out of luck for defining our own tags. There are also the “application”, “private”, and “context-specific” classes. These are distinguished by bits 8 and 7:

仅仅因为 universal 类已经用完了所有“好”的标签号,这并不意味着,在定义我们自己的标签时,就没有好运气了。这里还有“application”、“private”和“context-specific”类。它们使用第8和第7位来进行区分:

Class             Bit 8  Bit 7
Universal         0      0
Application       0      1
Context-specific  1      0
Private           1      1
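Putting the bit layout together, a single tag byte can be taken apart as in the sketch below. Bits are numbered as in the text, with bit 8 the most significant and bit 1 the least. The function name `describe_tag_byte` is hypothetical, and the multi-byte "high tag number" form is deliberately not handled.

```python
CLASS_NAMES = ("universal", "application", "context-specific", "private")


def describe_tag_byte(byte: int):
    """Return (class name, constructed flag, tag number) for one tag byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    tag_class = CLASS_NAMES[byte >> 6]   # bits 8 and 7
    constructed = bool(byte & 0x20)      # bit 6
    number = byte & 0x1F                 # bits 5 to 1
    if number == 0x1F:
        raise ValueError("high tag number form is not handled in this sketch")
    return tag_class, constructed, number
```

For example, `describe_tag_byte(0x30)` reports a universal, constructed tag with number 16, which is why SEQUENCE shows up as 30 rather than 10 in hex dumps.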

Specifications mostly use tags in the universal class, since they provide the most important building blocks. For instance, the serial number in a certificate is encoded in a plain ol’ INTEGER, tag number 0x02. But sometimes a specification needs to define tags in the context-specific class to disambiguate entries in a SET or SEQUENCE that defines optional entries, or to disambiguate a CHOICE with multiple entries that have the same type. For instance, take this definition:

规范主要使用通用类(universal class)中的标记,因为它们提供了最重要的构建块。例如,证书中的序列号编码为普通 INTEGER,标记号0x02。但有时规范需要在上下文特定的(context-specific)类中定义标记,以消除定义可选项的 SET 或 SEQUENCE 中的条目的歧义,或者对使用具有相同类型多个条目的 CHOICE 消除歧义。例如,以这个定义为例:

Point ::= SEQUENCE {
  x INTEGER OPTIONAL,
  y INTEGER OPTIONAL
}

Since OPTIONAL fields are omitted entirely from the encoding when they’re not present, it would be impossible to distinguish a Point with only an x coordinate from a Point with only a y coordinate. For instance you’d encode a Point with only an x coordinate of 9 like so (30 means SEQUENCE here):

因为当可选(OPTIONAL)字段不存在时,编码时会完全忽略这些字段,因此不可能区分只有 x 坐标的点(Point)和只有 y 坐标的点。例如,你可以用一个只有 9 的 x 坐标来编码一个点(这里 30 表示 SEQUENCE):

30 03 02 01 09

That’s a SEQUENCE of length 3 (bytes), containing an INTEGER of length 1, which has the value 9. But you’d also encode a Point with a y coordinate of 9 exactly the same way, so there is ambiguity.

这是一个长度为 3(字节)的序列,包含一个长度为 1 的整数,其值为 9。但是你也用同样的方式编码一个 y 坐标为 9 的点,所以存在歧义。

Encoding Instructions

To resolve this ambiguity, a specification needs to provide encoding instructions that assign a unique tag to each entry. And because we’re not allowed to stomp on the UNIVERSAL tags, we have to use one of the others, for instance APPLICATION:

为了解决这种歧义,规范需要提供为每个条目分配唯一标记的编码指令。并且由于我们不被允许乱用通用标记(UNIVERSAL tag),所以我们必须使用其他标记之一,例如 APPLICATION:

Point ::= SEQUENCE {
  x [APPLICATION 0] INTEGER OPTIONAL,
  y [APPLICATION 1] INTEGER OPTIONAL
}

Though for this use case, it’s actually much more common to use the context-specific class, which is represented by a number in brackets by itself:

尽管在这个用例中,使用特定于上下文(context-specific )的类实际上更常见,该类由括号中的数字表示:

Point ::= SEQUENCE {
  x [0] INTEGER OPTIONAL,
  y [1] INTEGER OPTIONAL
}

So now, to encode a Point with just an x coordinate of 9, instead of encoding x as a UNIVERSAL INTEGER, you’d set bits 8 and 7 of the encoded tag to (1, 0) to indicate the context-specific class, and set the low bits to 0, giving this encoding:

所以现在,要用一个 x 坐标 9 来编码一个点,而不是将 x 编码为一个通用整数(UNIVERSAL INTEGER),你需要将编码标记的第 8 位和第 7 位设置为(1,0),以指示上下文特定的类,并将低位设置为0,给出以下编码:

30 03 80 01 09

And to represent a Point with just a y coordinate of 9, you’d do the same thing, except you’d set the low bits to 1:

为了用一个 y 坐标 9 来表示一个点,你可以做同样的事情,只是你把低位设为1:

30 03 81 01 09

Or you could represent a Point with x and y coordinates both equal to 9:

或者,你也可以表示一个 x 和 y 坐标都等于 9 的点:

30 06 80 01 09 81 01 09
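The three encodings above can be generated by a short sketch. It follows the bytes shown in this section, where the context-specific tag replaces the INTEGER tag, and it only supports coordinates from 0 to 127 so that every value fits in one byte. The names `encode_small_int` and `encode_point` are hypothetical.

```python
from typing import Optional


def encode_small_int(tag: int, n: int) -> bytes:
    """One tag byte, a length of 1, and a one-byte value (0..127 only)."""
    if not 0 <= n <= 127:
        raise ValueError("this sketch only handles values 0..127")
    return bytes([tag, 0x01, n])


def encode_point(x: Optional[int] = None, y: Optional[int] = None) -> bytes:
    """Encode Point with x tagged [0] and y tagged [1], both OPTIONAL."""
    body = b""
    if x is not None:
        body += encode_small_int(0x80, x)  # context-specific class, number 0
    if y is not None:
        body += encode_small_int(0x81, y)  # context-specific class, number 1
    # 0x30 is SEQUENCE; absent fields contribute no bytes at all.
    return bytes([0x30, len(body)]) + body
```

A decoder can now tell the cases apart from the tag alone: 80 always means x and 81 always means y, whichever fields happen to be present.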

Length

The length in the tag-length-value tuple always represents the total number of bytes in the object including all sub-objects. So a SEQUENCE with one field doesn’t have a length of 1; it has a length of however many bytes the encoded form of that field takes up.

标记长度值(tag-length-value)元组中的长度始终表示对象(包括所有子对象)中的总字节数。因此,一个包含一个字段的序列的长度不是1;它的长度是该字段的编码形式占用的字节数。

The encoding of length can take two forms: short or long. The short form is a single byte, between 0 and 127.

长度的编码可以有两种形式:短格式或长格式。短格式是一个介于 0 到 127 之间的单字节。

The long form is at least two bytes long, and has bit 8 of the first byte set to 1. Bits 7-1 of the first byte indicate how many more bytes are in the length field itself. Then the remaining bytes specify the length itself, as a multi-byte integer.

长格式至少有两个字节长,第一个字节的第8位设置为1。第一个字节的1-7位表示长度字段本身有多少个字节。然后剩余的字节将以多字节整数的形式表示长度本身。

As you can imagine, this allows very long values. The longest possible length would start with the byte 254 (a length byte of 255 is reserved for future extensions), specifying that 126 more bytes would follow in the length field alone. If each of those 126 bytes was 255, that would indicate 2**1008 - 1 bytes to follow in the value field.

正如您可以想象的,这允许表示非常长的值。最长的可能长度将从字节 254 开始(长度字节 255 是为将来的扩展保留的),指定仅在长度字段后面还有 126 个字节。如果这 126 个字节中的每一个都是 255,这就表示在 value 字段中有 2**1008 - 1 个字节。

The long form allows you to encode the same length multiple ways - for instance by using two bytes to express a length that could fit in one, or by using long form to express a length that could fit in the short form. DER says to always use the smallest possible length representation.

长格式允许您以多种方式对相同长度进行编码—例如,使用两个字节来表示一个字节就可以容纳的长度,或者使用长格式来表示短格式即可以容纳的长度。DER总是使用尽可能小的长度表示法。
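The short and long forms, together with DER's "smallest possible representation" rule, can be sketched as a pair of functions. The names `encode_length` and `decode_der_length` are hypothetical. The decoder returns both the length and the number of bytes the length field itself occupied.

```python
def encode_length(n: int) -> bytes:
    """Encode a length in the shortest form, as DER requires."""
    if n < 0:
        raise ValueError("length cannot be negative")
    if n <= 127:
        return bytes([n])                      # short form: one byte
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    if len(body) > 126:
        raise ValueError("length does not fit in 126 bytes")
    return bytes([0x80 | len(body)]) + body    # long form


def decode_der_length(data: bytes):
    """Return (length, bytes consumed), rejecting non-minimal encodings."""
    if not data:
        raise ValueError("no length byte")
    first = data[0]
    if first < 0x80:
        return first, 1
    count = first & 0x7F
    if count == 0 or first == 0xFF:
        raise ValueError("indefinite or reserved length byte")
    body = data[1:1 + count]
    if len(body) != count:
        raise ValueError("length field is truncated")
    n = int.from_bytes(body, "big")
    if body[0] == 0 or n <= 127:
        raise ValueError("length is not in its shortest form")
    return n, 1 + count
```

A BER decoder would accept encodings such as 81 05 (long form for a length that fits the short form). The two checks on the last `if` are what make this a DER decoder instead.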

Safety warning: Don’t fully trust the length values that you decode! For instance, check that the encoded length does not exceed the amount of data available from the stream being decoded.

安全警告:不要完全相信你解码出的长度值!例如,检查编码的长度是否没有超过待解码流中可用的数据量。
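One way to apply this warning is to validate a decoded length against the buffer before slicing or allocating anything. The sketch below uses a hypothetical helper `checked_value`. Its `limit` parameter is an arbitrary illustrative cap, not something taken from any standard.

```python
def checked_value(data: bytes, offset: int, length: int,
                  limit: int = 1 << 20) -> bytes:
    """Return data[offset:offset+length] only if the claim is plausible."""
    if length < 0 or offset < 0:
        raise ValueError("negative offset or length")
    if length > limit:
        raise ValueError("declared length exceeds the configured limit")
    end = offset + length
    if end > len(data):
        raise ValueError("declared length runs past the end of the input")
    return data[offset:end]


buffer = bytes.fromhex("02 03 01 00 01")
```

Calling `checked_value(buffer, 2, 3)` returns the three value bytes, while a forged length such as 127 on the same five-byte buffer raises an error instead of silently returning a short slice.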

Indefinite length

It’s also possible, in BER, to encode a string, SEQUENCE, SEQUENCE OF, SET, or SET OF where you don’t know the length in advance (for instance when streaming output). To do this, you encode the length as a single byte with the value 80 (hex), and encode the value as a series of encoded objects concatenated together, with the end indicated by the two bytes 00 00 (which can be considered as a zero-length object with tag 0). So, for instance, the indefinite length encoding of a UTF8String would be the encoding of one or more UTF8Strings concatenated together, and concatenated finally with 00 00.

在 BER 中,也可以对 string, SEQUENCE, SEQUENCE OF, SET 或者 SET OF 进行编码,他们的长度是事先不知道的(例如,当流式输出时)。为此,将长度编码为一个值为80的单字节,并将该值编码为一系列串联在一起的编码对象,结尾由两个字节 00 00 表示(可以认为是带有标记0的零长度对象)。因此,例如,UTF8String 的不定长度编码是将一个或多个 UTF8String 串联在一起,最后 用00 00 连接起来的编码。

Indefinite-ness can be arbitrarily nested! So, for example, the UTF8Strings that you concatenate together to form an indefinite-length UTF8String can themselves be encoded either with definite length or indefinite length.

不定可以任意嵌套!因此,例如,连接在一起形成不定长 UTF8Strings 的 UTF8Strings 本身可以用定长或不定长进行编码。
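Reassembling such a string is naturally recursive, as the following sketch shows. The function name `read_string_content` is hypothetical, and the sketch is limited on purpose: it reads primitive segments with short-form lengths and constructed segments with indefinite length, and it rejects everything else. Using 2C for a constructed UTF8String (0C with bit 6 set) follows the tag layout described earlier.

```python
def read_string_content(data: bytes, pos: int = 0):
    """Return (content bytes, next position) for one BER string at pos."""
    if pos + 2 > len(data):
        raise ValueError("input ends before tag and length")
    tag, length = data[pos], data[pos + 1]
    constructed = bool(tag & 0x20)
    pos += 2
    if length == 0x80:
        if not constructed:
            raise ValueError("indefinite length requires constructed form")
        parts = []
        # Read nested segments until the 00 00 end marker.
        while data[pos:pos + 2] != b"\x00\x00":
            part, pos = read_string_content(data, pos)
            parts.append(part)
        return b"".join(parts), pos + 2
    if constructed or length > 0x7F:
        raise ValueError("form not handled in this sketch")
    if pos + length > len(data):
        raise ValueError("segment runs past the end of the input")
    return data[pos:pos + length], pos + length


# "hello" split into "hel" + "lo" inside an indefinite-length wrapper.
streamed = bytes.fromhex("2c 80 0c 03 68 65 6c 0c 02 6c 6f 00 00")
# "hi" wrapped in two nested indefinite-length layers.
nested = bytes.fromhex("2c 80 2c 80 0c 02 68 69 00 00 00 00")
```

Both inputs decode to the plain string bytes, and the returned position points just past the final 00 00 marker.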

A length byte of 80 (hex) is distinguishable because it’s not a valid short form or long form length. Since bit 8 is set to 1, this would normally be interpreted as the long form, but the remaining bits are supposed to indicate the number of additional bytes that make up the length. Since bits 7-1 are all 0, that would indicate a long-form encoding with zero bytes making up the length, which is not allowed.

长度字节80是有区别的,因为它不是有效的短格式或长格式长度。由于位8被设置为1,这通常被解释为长形式,但是剩余的位应该表示构成长度的额外字节的数量。由于位7-1都是0,这表示长度为零字节的长格式编码,这是不允许的。

DER forbids indefinite length encoding. You must use the definite length encoding (that is, with the length specified at the beginning).

DER 禁止无限长编码。必须使用定长编码(即在开头指定长度)。

Constructed vs Primitive

Bit 6 of the first tag byte is used to indicate whether the value is encoded in primitive form or constructed form. Primitive encoding represents the value directly - for instance, in a UTF8String the value would consist solely of the string itself, in UTF-8 bytes. Constructed encoding represents the value as a concatenation of other encoded values. For instance, as described in the “Indefinite length” section, a UTF8String in constructed encoding would consist of multiple encoded UTF8Strings (each with a tag and length), concatenated together. The length of the overall UTF8String would be the total length, in bytes, of all those concatenated encoded values. Constructed encoding can use either definite or indefinite length. Primitive encoding always uses definite length, because there’s no way to express indefinite length without using constructed encoding.

第一个标记(tag)字节的第6位用于指示值是以原始(primitive)形式还是构造(constructed)形式编码的。原始编码直接表示值 - 例如,在UTF8String中,值将仅由字符串本身组成,单位为UTF-8字节。构造编码将值表示为其他编码值的串联。例如,如“不确定长度”一节所述,构造编码中的 UTF8String 将由多个编码的 UTF8String(每个都有一个标记和长度),连接在一起。整个 UTF8String 的长度将是所有连接的编码值的总长度(以字节为单位)。构造编码可以使用定长或不定长。原始编码总是使用定长,因为不使用构造编码就无法表示不定长度。

INTEGER, OBJECT IDENTIFIER, and NULL must use primitive encoding. SEQUENCE, SEQUENCE OF, SET, and SET OF must use constructed encoding (because they are inherently concatenations of multiple values). BIT STRING, OCTET STRING, UTCTime, GeneralizedTime, and the various string types can use either primitive encoding or constructed encoding at the sender’s discretion, but only in BER. In DER, all types that have an encoding choice between primitive and constructed must use the primitive encoding.

INTEGER, OBJECT IDENTIFIER 和 NULL 必须使用原始编码。SEQUENCE、SEQUENCE OF、SET 和 SET OF 必须使用构造编码(因为它们本质上是多个值的串联)。BIT STRING, OCTET STRING, UTCTime, GeneralizedTime 和各种字符串类型可以使用原始编码或构造编码,在BER中具体由发送方决定。但是,在 DER 中 primitive 和 constructed 之间有编码类型可选择的所有类型都必须使用原始编码。
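To make the bit layout concrete, the following Python sketch splits a single tag byte into its class, its Constructed bit, and its tag number. The helper name `describe_tag_byte` is made up for this example, and the sketch assumes tag numbers below 31, which fit in one byte.

```python
CLASSES = ["universal", "application", "context-specific", "private"]


def describe_tag_byte(tag: int) -> dict:
    """Split a one-byte tag into class, Constructed flag, and tag number."""
    return {
        "class": CLASSES[tag >> 6],       # bits 8-7
        "constructed": bool(tag & 0x20),  # bit 6
        "number": tag & 0x1F,             # bits 5-1
    }


print(describe_tag_byte(0x0C))  # UTF8String: universal, primitive, number 12
print(describe_tag_byte(0x30))  # SEQUENCE: universal, constructed, number 16
```

A DER decoder could use the `constructed` flag to reject, for example, a UTF8String whose tag byte is 0x2C.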

EXPLICIT vs IMPLICIT 显式 vs 隐式

The encoding instructions described above, e.g. [1], or [APPLICATION 8], can also include the keyword EXPLICIT or IMPLICIT (example from RFC 5280):

上述编码指令,例如 [1] 或 [APPLICATION 8],也可以包括关键字 EXPLICIT 或 IMPLICIT(来自 RFC 5280的示例):

TBSCertificate  ::=  SEQUENCE  {
     version         [0]  Version DEFAULT v1,
     serialNumber         CertificateSerialNumber,
     signature            AlgorithmIdentifier,
     issuer               Name,
     validity             Validity,
     subject              Name,
     subjectPublicKeyInfo SubjectPublicKeyInfo,
     issuerUniqueID  [1]  IMPLICIT UniqueIdentifier OPTIONAL,
                          -- If present, version MUST be v2 or v3
     subjectUniqueID [2]  IMPLICIT UniqueIdentifier OPTIONAL,
                          -- If present, version MUST be v2 or v3
     extensions      [3]  Extensions OPTIONAL
                          -- If present, version MUST be v3 --  }

This defines how the tag should be encoded; it doesn’t have to do with whether the tag number is explicitly assigned or not (since both IMPLICIT and EXPLICIT always go alongside a specific tag number). IMPLICIT encodes the field just like the underlying type, but with the tag number and class provided in the ASN.1 module. EXPLICIT encodes the field as the underlying type, and then wraps that in an outer encoding. The outer encoding has the tag number and class from the ASN.1 module and additionally has the Constructed bit set.

这定义了标记应该如何编码;它与是否显式分配标记号无关(因为隐式和显式总是与特定的标记号一起出现)。隐式(IMPLICIT)对字段进行编码,就像底层类型一样,但是使用ASN.1模块中提供的标记号和类。显式(EXPLICIT)将字段编码为基础类型,然后将其包装为外部编码。外部编码有来自ASN.1模块的标签号和类,另外还有构造的位集。

Here’s an example ASN.1 encoding instruction using IMPLICIT:

下面是一个使用隐式的ASN.1编码指令的示例:

[5] IMPLICIT UTF8String

This would encode “hi” as:

这将把“hi”编码为:

85 02 68 69

Compare to this ASN.1 encoding instruction using EXPLICIT:

和显式使用此ASN.1编码指令相比较:

[5] EXPLICIT UTF8String

This would encode “hi” as:

这将把“hi”编码为:

A5 04 0C 02 68 69
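The two encodings above can be reproduced with a small Python sketch. The helpers `tlv`, `implicit`, and `explicit` are hypothetical names for this illustration, and the sketch only handles short-form lengths and a primitive underlying type. For a constructed underlying type such as SEQUENCE, IMPLICIT tagging would keep the Constructed bit set.

```python
CONTEXT_SPECIFIC = 0x80  # class bits (bits 8-7 = 10)
CONSTRUCTED = 0x20       # bit 6
UTF8STRING = 0x0C        # universal tag for UTF8String


def tlv(tag: int, value: bytes) -> bytes:
    """Concatenate tag, short-form length, and value."""
    if len(value) >= 0x80:
        raise ValueError("this sketch only handles short-form lengths")
    return bytes([tag, len(value)]) + value


def implicit(number: int, value: bytes) -> bytes:
    """Replace the underlying tag with a context-specific one."""
    return tlv(CONTEXT_SPECIFIC | number, value)


def explicit(number: int, inner: bytes) -> bytes:
    """Wrap the complete inner encoding in a constructed outer tag."""
    return tlv(CONTEXT_SPECIFIC | CONSTRUCTED | number, inner)


hi = "hi".encode("utf-8")
print(implicit(5, hi).hex(" "))                   # 85 02 68 69
print(explicit(5, tlv(UTF8STRING, hi)).hex(" "))  # a5 04 0c 02 68 69
```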

When the IMPLICIT or EXPLICIT keyword is not present, the default is EXPLICIT, unless the module sets a different default at the top with “EXPLICIT TAGS,” “IMPLICIT TAGS,” or “AUTOMATIC TAGS.” For instance, RFC 5280 defines two modules, one where EXPLICIT tags are the default, and a second one that imports the first, and has IMPLICIT tags as the default. Implicit encoding uses fewer bytes than explicit encoding.

当 IMPLICIT 或 EXPLICIT 关键字不存在时,默认值为 EXPLICIT,除非模块在顶部用“EXPLICIT TAGS”、“IMPLICIT TAGS”或“AUTOMATIC TAGS”设置了不同的默认值。例如,RFC 5280定义了两个模块,一个的默认值是 EXPLICIT 标记,而导入了第一个的第二个,则将 IMPLICIT 标记作为默认值。隐式编码比显式编码使用更少的字节。

AUTOMATIC TAGS is the same as IMPLICIT TAGS, but with the additional property that tag numbers ([0], [1], etc.) are automatically assigned in places that need them, like SEQUENCEs with optional fields.

自动标记(AUTOMATIC TAGS)与隐式标记(IMPLICIT TAGS)相同,但具有附加属性,即标记号([0]、[1]等)会自动分配到需要它们的位置,例如带有可选字段的序列(SEQUENCE)。

Encoding of specific types 特定类型的编码

In this section we’ll talk about how the value of each type is encoded, with examples.

在本节中,我们将通过示例讨论如何对每种类型的值进行编码。

INTEGER encoding 整数编码

Integers are encoded as one or more bytes, in two’s complement with the high bit (bit 8) of the leftmost byte as the sign bit. As the BER specification says:

整数(Integers)被编码为一个或多个字节,两个字节的补码是最左边字节的高位(第8位)作为符号位。正如BER规范所述:

The value of a two's complement binary number is derived by numbering the bits in the contents octets, starting with bit 1 of the last octet as bit zero and ending the numbering with bit 8 of the first octet. Each bit is assigned a numerical value of 2^N, where N is its position in the above numbering sequence. The value of the two's complement binary number is obtained by summing the numerical values assigned to each bit for those bits which are set to one, excluding bit 8 of the first octet, and then reducing this value by the numerical value assigned to bit 8 of the first octet if that bit is set to one.

两个补码二进制数的值是通过对内容八位字节中的位进行编号得到的,从最后一个八位字节的第1位开始作为位0,以第一个八位字节的位8结束编号。每一位被赋予一个 2^N 的数值,其中 N 是其在上述编号序列中的位置。两个补码二进制数的值是通过对那些被设置为1的位(不包括第一个八位字节的第8位)分配给每个位的数值求和,然后将该值减去分配给第一个八位字节的第8位的数值(如果该位设置为1)。

So for instance this one-byte value (represented in binary) encodes decimal 50:

例如,这个一字节值(用二进制表示)编码十进制数 50:

00110010 (== decimal 50)

This one-byte value (represented in binary) encodes decimal -100:

这个一字节值(以二进制表示)编码十进制数 -100:

10011100 (== decimal -100)

This five-byte value (represented in binary) encodes decimal -549755813887 (i.e. -2^39 + 1):

这个5字节值(以二进制表示)编码十进制数 -549755813887(即 -2^39 + 1):

10000000 00000000 00000000 00000000 00000001 (== decimal -549755813887)
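The BER wording quoted above translates almost word for word into code. The Python sketch below uses a hypothetical function name, `decode_integer`. It adds a power of two for every set bit except the sign bit, then subtracts the sign bit's weight if that bit is set. It is meant to mirror the specification text, not to be fast.

```python
def decode_integer(contents: bytes) -> int:
    """Decode INTEGER contents octets by following the BER wording literally."""
    if not contents:
        raise ValueError("an INTEGER needs at least one contents octet")
    total_bits = 8 * len(contents)
    bits = int.from_bytes(contents, "big")  # all octets as one unsigned number
    value = 0
    for n in range(total_bits - 1):         # every bit except the sign bit
        if (bits >> n) & 1:
            value += 2 ** n
    if (bits >> (total_bits - 1)) & 1:      # bit 8 of the first octet
        value -= 2 ** (total_bits - 1)
    return value


print(decode_integer(bytes([0b00110010])))          # 50
print(decode_integer(bytes([0b10011100])))          # -100
print(decode_integer(bytes.fromhex("8000000001")))  # -549755813887
```

In everyday Python the same result comes from `int.from_bytes(contents, "big", signed=True)`.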

BER and DER both require that integers be represented in the shortest form possible. That is enforced with this rule:

BER和DER都要求整数以尽可能短的形式表示。这是由以下规则强制执行的:

... the bits of the first octet and bit 8 of the second octet:

1.  shall not all be ones; and
2.  shall not all be zero.

Rule (2) roughly means: if there are leading zero bytes in the encoding you could just as well leave them off and have the same number. Bit 8 of the second byte is important here too because if you want to represent certain values, you must use a leading zero byte. For instance, decimal 255 is encoded as two bytes:

规则(2)大致意思是:如果在编码中有前导的零字节,你也可以不使用它们,并使用相同的数字。第二个字节的第8位在这里也很重要,因为如果要表示某些值,必须使用前导零字节。例如,十进制255编码为两个字节:

00000000 11111111

That’s because a single-byte encoding of 11111111 by itself means -1 (bit 8 is treated as the sign bit).

这是因为 11111111 的单字节编码本身意味着 -1(第8位被视为符号位)。

Rule (1) is best explained with an example. Decimal -128 is encoded as:

规则(1)最好用一个例子来解释。十进制 -128 编码为:

10000000 (== decimal -128)

However, that could also be encoded as:

但是,也可以编码为:

11111111 10000000 (== decimal -128, but an invalid encoding)

Expanding that out, it’s -2^15 + 2^14 + 2^13 + 2^12 + 2^11 + 2^10 + 2^9 + 2^8 + 2^7 == -2^7 == -128. Note that the 1 in “10000000” was a sign bit in the single-byte encoding, but means 2^7 in the two-byte encoding.

扩展到 -2^15 + 2^14 + 2^13 + 2^12 + 2^11 + 2^10 + 2^9 + 2^8 + 2^7 == -2^7 == -128。注意,“10000000”中的1在单字节编码中是符号位,但在双字节编码中是 2^7。

This is a generic transform: For any negative number encoded as BER (or DER) you could prefix it with 11111111 and get the same number. This is called sign extension. Or equivalently, if there’s a negative number where the encoding of the value begins with 11111111, you could remove that byte and still have the same number. So BER and DER require the shortest encoding.

这是一个通用的转换:对于任何编码为 BER(或 DER)的负数,您可以在其前面加上 11111111 并得到相同的数字。这叫做符号扩展。或者等效地说,如果有一个负数,值的编码以 11111111 开头,您可以删除该字节,但仍然具有相同的数字。因此 BER 和 DER 需要最短的编码。

The two’s complement encoding of INTEGERs has practical impact in certificate issuance: RFC 5280 requires that serial numbers be positive. Since the first bit is always a sign bit, that means serial numbers encoded in DER as 8 bytes can be at most 63 bits long. Encoding a 64-bit positive serial number requires a 9-byte encoded value (with the first byte being zero).

这两种整数(INTEGERs)的补码编码在证书颁发中有实际的影响:RFC 5280 要求序列号为正。由于第一位总是符号位,这意味着以8字节的顺序编码的序列号最多可以有63位长。编码64位正序列号需要9字节的编码值(第一个字节为零)。

Here’s the encoding of an INTEGER with the value 2^63 + 1 (which happens to be a 64-bit positive number):

下面是一个值为 2^63 + 1(正好是一个64位正数)的整数的编码:

02 09 00 80 00 00 00 00 00 00 01
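One way to satisfy the shortest-form rule is to look for the smallest byte count whose two's complement range contains the value. The following Python sketch does that. `encode_integer` is a hypothetical helper, and it assumes the contents fit in fewer than 128 bytes so that the short-form length applies.

```python
def encode_integer(value: int) -> bytes:
    """Encode an INTEGER with the minimal two's complement contents."""
    length = 1
    # Grow until value fits in the signed range of `length` bytes.
    while not -(1 << (8 * length - 1)) <= value < (1 << (8 * length - 1)):
        length += 1
    if length >= 0x80:
        raise ValueError("this sketch only handles short-form lengths")
    contents = value.to_bytes(length, "big", signed=True)
    return bytes([0x02, length]) + contents  # 0x02 is the INTEGER tag


print(encode_integer(255).hex(" "))        # 02 02 00 ff
print(encode_integer(-128).hex(" "))       # 02 01 80
print(encode_integer(2**63 + 1).hex(" "))  # 02 09 00 80 00 00 00 00 00 00 01
```

Because the loop stops at the first length that fits, the result never carries a redundant leading 00 or ff byte.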

String encoding 字符串编码

Strings are encoded as their literal bytes. Since IA5String and PrintableString just define different subsets of acceptable characters, their encodings differ only by tag.

字符串被编码为它们的文字字节。由于 IA5String 和 PrintableString 只是定义了可接受字符的不同子集,它们的编码只因标记而异。

A PrintableString containing “hi”:

包含“hi”的一个 PrintableString:

13 02 68 69

An IA5String containing “hi”:

包含“hi”的 IA5String:

16 02 68 69

UTF8Strings are the same, but can encode a wider variety of characters. For instance, this is the encoding of a UTF8String containing U+1F60E Smiling Face With Sunglasses (😎):

UTF8Strings 也一样,但可以编码更多的字符。例如,这是一个 UTF8Strings 的编码,包含带太阳镜的 U+1F60E 笑脸(😎):

0c 04 f0 9f 98 8e
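The three examples above can be reproduced with a few lines of Python. `encode_string` is a hypothetical helper. The sketch handles only short-form lengths, and it does not enforce PrintableString's restricted character set, which a real encoder would need to check.

```python
STRING_TAGS = {
    "UTF8String": 0x0C,
    "PrintableString": 0x13,
    "IA5String": 0x16,
}


def encode_string(kind: str, text: str) -> bytes:
    """Encode text as the given ASN.1 string type (primitive form)."""
    # IA5String and PrintableString are ASCII subsets; UTF8String is UTF-8.
    raw = text.encode("utf-8" if kind == "UTF8String" else "ascii")
    if len(raw) >= 0x80:
        raise ValueError("this sketch only handles short-form lengths")
    return bytes([STRING_TAGS[kind], len(raw)]) + raw


print(encode_string("PrintableString", "hi").hex(" "))   # 13 02 68 69
print(encode_string("IA5String", "hi").hex(" "))         # 16 02 68 69
print(encode_string("UTF8String", "\U0001F60E").hex(" "))  # 0c 04 f0 9f 98 8e
```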

Date and Time encoding 日期和时间编码

UTCTime and GeneralizedTime are actually encoded like strings, surprisingly! As described above in the “Types” section, UTCTime represents dates in the format YYMMDDhhmmss. GeneralizedTime uses a four-digit year YYYY in place of YY. Both have an optional timezone offset or “Z” (Zulu) to indicate no timezone offset from UTC.

令人惊讶的是,UTCTime 和 GeneralizedTime 实际上像字符串一样编码!如上文“类型(Typed)”部分所述,UTCTime 以 YYMMDDhhmmss 格式表示日期。GeneralizedTime 使用四位数的年份 YYYY 代替 YY。两者都有一个可选的时区偏移量或“Z”(Zulu),表示与 UTC 之间没有时区偏移。

For instance, December 15, 2019 at 19:02:10 in the PST time zone (UTC-8) is represented in a UTCTime as: 191215190210-0800. Encoded in BER, that’s:

例如,2019年12月15日19:02:10在 PST 时区(UTC-8)中,在 UTC 时间中表示为:191215190210-0800。使用 BER 编码,即:

17 11 31 39 31 32 31 35 31 39 30 32 31 30 2d 30 38 30 30

For BER encoding, seconds are optional in both UTCTime and GeneralizedTime, and timezone offsets are allowed. However, DER (along with RFC 5280) specifies that seconds must be present, fractional seconds must not be present, and the time must be expressed as UTC with the “Z” form.

对于 BER 编码,UTCTime 和 GeneralizedTime 中的秒都是可选的,并且允许时区偏移。但是,DER(与 RFC 5280 一起)指定秒必须存在,分数秒不能存在,并且时间必须用“Z”形式表示为 UTC 时间。

The above date would be encoded in DER as:

上述日期的 DER 编码如下:

17 0d 31 39 31 32 31 36 30 33 30 32 31 30 5a
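The conversion from the PST timestamp to the DER form can be sketched with Python's standard `datetime` module. `der_utctime` is a hypothetical helper. It expects a timezone-aware `datetime`, since `astimezone` would otherwise assume local time, and it does not check that the year falls in the range UTCTime can represent.

```python
from datetime import datetime, timedelta, timezone


def der_utctime(moment: datetime) -> bytes:
    """Encode a timezone-aware datetime as a DER UTCTime (YYMMDDhhmmssZ)."""
    utc = moment.astimezone(timezone.utc)      # normalize to UTC
    text = utc.strftime("%y%m%d%H%M%S") + "Z"  # seconds present, "Z" form
    raw = text.encode("ascii")
    return bytes([0x17, len(raw)]) + raw       # 0x17 is the UTCTime tag


pst = timezone(timedelta(hours=-8))
moment = datetime(2019, 12, 15, 19, 2, 10, tzinfo=pst)
print(der_utctime(moment).hex(" "))
# 17 0d 31 39 31 32 31 36 30 33 30 32 31 30 5a
```

Note how normalizing to UTC moves the date forward to December 16, which is why the DER bytes differ from the BER example in more than just the suffix.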

OBJECT IDENTIFIER encoding 对象标识符编码

As described above, OIDs are conceptually a series of integers. They are always at least two components long. The first component is always 0, 1, or 2. When the first component is 0 or 1, the second component is always less than 40. Because of this, the first two components are unambiguously represented as 40*X+Y, where X is the first component and Y is the second.

如上所述,OID 在概念上是一系列整数。它们总是至少有两个部件长。第一个组件总是0、1或2。当第一个分量为0或1时,第二个分量始终小于40。因此,前两个分量明确表示为40*X+Y,其中X是第一个分量,Y是第二个分量。

So, for instance, to encode 2.999.3, you would combine the first two components into 1079 decimal (40*2 + 999), which would give you “1079.3”.

例如,要对2.999.3进行编码,需要将前两个分量合并成十进制数1079(40*2+999),这将得到“1079.3”。

After applying that transform, each component is encoded in base 128, with the most significant byte first. Bit 8 is set to “1” in every byte except the last in a component; that’s how you know when one component is done and the next one begins. So the component “3” would be represented simply as the byte 0x03. The component “129” would be represented as the bytes 0x81 0x01. Once encoded, all the components of an OID are concatenated together to form the encoded value of the OID.

在应用该转换之后,每个组件都以128为基数进行编码,最重要的字节放在第一位。除组件中的最后一个字节外,每个字节的第8位都设置为“1”;这就是您知道一个组件何时完成,下一个组件何时开始的方式。因此,组件“3”将简单地表示为字节0x03。组件“129”将表示为字节0x81 0x01。编码后,OID的所有组件都连接在一起,形成OID的编码值。

OIDs must be represented in the fewest bytes possible, whether in BER or DER. So components cannot begin with the byte 0x80.

OID必须用尽可能少的字节表示,无论是BER还是DER。因此组件不能以字节0x80开头。

As an example, the OID 1.2.840.113549.1.1.11 (representing sha256WithRSAEncryption) is encoded like so:

例如,OID 1.2.840.113549.1.1.11(表示 sha256WithRSAEncryption)编码如下:

06 09 2a 86 48 86 f7 0d 01 01 0b
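The steps described in this section can be sketched in Python as follows. The names `base128` and `encode_oid` are hypothetical, and the sketch assumes the encoded contents fit in fewer than 128 bytes.

```python
def base128(component: int) -> bytes:
    """Encode one component in base 128, most significant byte first."""
    out = [component & 0x7F]                  # last byte: bit 8 clear
    component >>= 7
    while component:
        out.append(0x80 | (component & 0x7F))  # earlier bytes: bit 8 set
        component >>= 7
    return bytes(reversed(out))


def encode_oid(dotted: str) -> bytes:
    """Encode a dotted OID string such as "1.2.840.113549.1.1.11"."""
    parts = [int(p) for p in dotted.split(".")]
    if len(parts) < 2:
        raise ValueError("an OID has at least two components")
    first, second = parts[0], parts[1]
    if first not in (0, 1, 2) or (first < 2 and second >= 40):
        raise ValueError("invalid first or second component")
    merged = [40 * first + second] + parts[2:]
    contents = b"".join(base128(c) for c in merged)
    if len(contents) >= 0x80:
        raise ValueError("this sketch only handles short-form lengths")
    return bytes([0x06, len(contents)]) + contents


print(base128(129).hex(" "))                          # 81 01
print(encode_oid("1.2.840.113549.1.1.11").hex(" "))
# 06 09 2a 86 48 86 f7 0d 01 01 0b
```

Since `base128` starts from the low-order bits and stops as soon as nothing is left, it never emits a leading 0x80 byte.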

NULL encoding 空编码

The value of an object containing NULL is always zero-length, so the encoding of NULL is always just the tag and a length field of zero:

包含 NULL 的对象的值的长度始终为零,因此 NULL 的编码始终只是标记和一个长度字段0:

05 00

SEQUENCE encoding 序列编码

The first thing to know about SEQUENCE is that it always uses Constructed encoding because it contains other objects. In other words, the value bytes of a SEQUENCE contain the concatenation of the encoded fields of that SEQUENCE (in the order those fields were defined). This also means that bit 6 of a SEQUENCE’s tag (the Constructed vs Primitive bit) is always set to 1. So even though the tag number for SEQUENCE is technically 0x10, its tag byte, once encoded, is always 0x30.

关于 SEQUENCE 首先要知道的是它总是使用构造(Constructed)编码,因为它包含其他对象。换句话说,SEQUENCE 的值(value)字节包含该 SEQUENCE 的编码字段的串联(按照这些字段的定义顺序)。这也意味着 SEQUENCE 标记(tag)的位6(Constructed vs Primitive 位)总是设置为1。因此,尽管 SEQUENCE 的标签号在技术上讲是0x10,但它的标签字节一旦编码,总是0x30。

When there are fields in a SEQUENCE with the OPTIONAL annotation, they are simply omitted from the encoding if not present. As a decoder processes elements of the SEQUENCE, it can figure out which type is being decoded based on what’s been decoded so far, and the tag bytes it reads. If there is ambiguity, for instance when elements have the same type, the ASN.1 module must specify encoding instructions that assign distinct tag numbers to the elements.

当 SEQUENCE 中有带有可选注释的字段时,如果不存在,则从编码中简单地省略这些字段。当解码器处理 SEQUENCE 中的元素时,它可以根据到目前为止解码的内容和它读取的标记字节来判断正在解码的类型。如果存在歧义,例如当元素具有相同的类型时,ASN.1 模块必须指定为元素分配不同标记号的编码指令。

DEFAULT fields are similar to OPTIONAL ones. If a field’s value is the default, it may be omitted from the BER encoding. In the DER encoding, it MUST be omitted.

默认字段与可选字段类似。如果字段的值是默认值,则可以从 BER 编码中省略该值。在 DER 编码中,必须省略它。

As an example, RFC 5280 defines AlgorithmIdentifier as a SEQUENCE:

例如,RFC 5280 将 AlgorithmIdentifier 定义为一个 SEQUENCE:

   AlgorithmIdentifier  ::=  SEQUENCE  {
        algorithm               OBJECT IDENTIFIER,
        parameters              ANY DEFINED BY algorithm OPTIONAL  }

Here’s the encoding of the AlgorithmIdentifier containing 1.2.840.113549.1.1.11. RFC 8017 says “parameters” should have the type NULL for this algorithm.

这是包含1.2.840.113549.1.1.11的算法标识符的编码。RFC 8017表示,对于这个算法,“parameters”的类型应该为NULL。

30 0d 06 09 2a 86 48 86 f7 0d 01 01 0b 05 00
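Building that encoding by hand shows how a SEQUENCE is just its fields concatenated behind one tag and length. This is a minimal Python sketch with a hypothetical `tlv` helper that handles only short-form lengths. The OID contents bytes are copied from the example in the OBJECT IDENTIFIER section.

```python
SEQUENCE = 0x30           # tag number 0x10 with the Constructed bit (0x20) set
OBJECT_IDENTIFIER = 0x06
NULL = 0x05


def tlv(tag: int, value: bytes) -> bytes:
    """Concatenate tag, short-form length, and value."""
    if len(value) >= 0x80:
        raise ValueError("this sketch only handles short-form lengths")
    return bytes([tag, len(value)]) + value


# Contents octets of 1.2.840.113549.1.1.11 (sha256WithRSAEncryption).
oid_contents = bytes.fromhex("2a864886f70d01010b")

algorithm = tlv(OBJECT_IDENTIFIER, oid_contents)
parameters = tlv(NULL, b"")  # NULL always has zero-length contents
algorithm_identifier = tlv(SEQUENCE, algorithm + parameters)

print(algorithm_identifier.hex(" "))
# 30 0d 06 09 2a 86 48 86 f7 0d 01 01 0b 05 00
```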

SEQUENCE OF encoding

A SEQUENCE OF is encoded in exactly the same way as a SEQUENCE. It even uses the same tag! If you’re decoding, the only way you can tell the difference between a SEQUENCE and a SEQUENCE OF is by reference to the ASN.1 module.

SEQUENCE OF 与 SEQUENCE 的编码方式完全相同。它甚至使用相同的标签!如果您正在解码,唯一能区分 SEQUENCE 和 SEQUENCE OF 的方法是引用 ASN.1 模块。

Here is the encoding of a SEQUENCE OF INTEGER containing the numbers 7, 8, and 9:

以下是包含数字7、8和9的 SEQUENCE OF INTEGER 的编码:

30 09 02 01 07 02 01 08 02 01 09
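Going the other way, a decoder walks the contents of the SEQUENCE OF one tag-length-value at a time. The Python sketch below uses a hypothetical helper, `split_tlvs`, that only understands single-byte tags and short-form lengths, which is enough for this example.

```python
def split_tlvs(data: bytes) -> list:
    """Split concatenated encodings into (tag, value) pairs."""
    items = []
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated tag or length")
        tag, length = data[pos], data[pos + 1]
        if length >= 0x80:
            raise ValueError("this sketch only handles short-form lengths")
        value = data[pos + 2 : pos + 2 + length]
        if len(value) != length:
            raise ValueError("truncated value")
        items.append((tag, value))
        pos += 2 + length
    return items


encoded = bytes.fromhex("3009020107020108020109")
[(outer_tag, contents)] = split_tlvs(encoded)  # exactly one outer element
numbers = [int.from_bytes(v, "big", signed=True) for _, v in split_tlvs(contents)]
print(hex(outer_tag), numbers)  # 0x30 [7, 8, 9]
```

Nothing in the bytes says whether the outer 0x30 is a SEQUENCE or a SEQUENCE OF; the caller has to know that from the ASN.1 module.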

SET encoding 集合编码

Like SEQUENCE, a SET is Constructed, meaning that its value bytes are the concatenation of its encoded fields. Its tag number is 0x11. Since the Constructed vs Primitive bit (bit 6) is always set to 1, that means it’s encoded with a tag byte of 0x31.

与 SEQUENCE 一样,SET 也是构造的,这意味着它的值字节是其编码字段的串联。它的标签号是0x11。由于 Constructed vs Primitive 位(位6)始终设置为1,这意味着它使用标记字节0x31进行编码。

The encoding of a SET, like a SEQUENCE, omits OPTIONAL and DEFAULT fields if they are absent or have the default value. Any ambiguity that results due to fields with the same type must be resolved by the ASN.1 module, and DEFAULT fields MUST be omitted from DER encoding if they have the default value.

SET 的编码,就像 SEQUENCE 一样,如果可选字段和默认字段不存在或具有默认值,则会省略这些字段。由于具有相同类型的字段而导致的任何歧义都必须由 ASN.1 模块解决,并且如果默认字段具有默认值,则必须从 DER 编码中忽略。

In BER, a SET may be encoded in any order. In DER, a SET must be encoded in ascending order by tag.

在 BER 中,一个 SET 可以按任何顺序编码。在 DER 中,SET 必须按标签的升序对其进行编码。

SET OF encoding

A SET OF items is encoded the same way as a SET, including the tag byte of 0x31. For DER encoding, there is a similar requirement that the SET OF must be encoded in ascending order. Because all elements in the SET OF have the same type, ordering by tag is not sufficient. So the elements of a SET OF are sorted by their encoded values, with shorter values treated as if they were padded to the right with zeroes.

SET OF 编码方式与 SET 相同,包括标记字节 0x31。对于DER编码,有一个类似的要求,即 SET OF 必须按升序编码。因为 SET OF 中的所有元素都具有相同的类型,因此仅按标记排序是不够的。因此,SET OF 的元素按其编码值排序,较短的值被视为在右侧填充了0。
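The DER ordering rule for SET OF can be sketched in Python by sorting the already-encoded elements with a key that pads shorter encodings on the right with zero bytes. `der_set_of` is a hypothetical helper, and it assumes the combined contents fit in fewer than 128 bytes.

```python
def der_set_of(encoded_elements: list) -> bytes:
    """Wrap already-encoded elements in a SET OF, sorted as DER requires."""
    width = max((len(e) for e in encoded_elements), default=0)
    # Compare encodings as if shorter ones were right-padded with zeroes.
    ordered = sorted(encoded_elements, key=lambda e: e.ljust(width, b"\x00"))
    contents = b"".join(ordered)
    if len(contents) >= 0x80:
        raise ValueError("this sketch only handles short-form lengths")
    return bytes([0x31, len(contents)]) + contents


# INTEGER 9, INTEGER 256, and INTEGER 7, given in arbitrary order.
elements = [
    bytes.fromhex("020109"),
    bytes.fromhex("02020100"),
    bytes.fromhex("020107"),
]
print(der_set_of(elements).hex(" "))
# 31 0a 02 01 07 02 01 09 02 02 01 00
```

The sort works on the full encoding, including tag and length bytes, which is why the two-byte INTEGER 256 lands after both one-byte values.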

BIT STRING encoding 位串编码

A BIT STRING of N bits is encoded as N/8 bytes (rounded up), with a one-byte prefix that contains the “number of unused bits,” for clarity when the number of bits is not a multiple of 8. For instance, when encoding the bit string 011011100101110111 (18 bits), we need at least three bytes. But that’s somewhat more than we need: it gives us capacity for 24 bits total. Six of those bits will be unused. Those six bits are written at the rightmost end of the bit string, so this is encoded as:

N位的 BIT STRING 被编码为 N/8 字节(向上取整),一个字节前缀包含“未使用的位个数”,以便在位的个数不是8的倍数时更清楚。例如,在编码位串 011011100101110111(18位)时,我们至少需要三个字节。但这比我们需要的要多:它给了我们总共24位的容量。其中六个位将不被使用。这六个位写入位串的最右端,因此编码如下:

03 04 06 6e 5d c0

In BER, the unused bits can have any value, so the last byte of that encoding could just as well be c1, c2, c3, and so on. In DER, the unused bits must all be zero.

在 BER 中,未使用的位可以有任何值,因此该编码的最后一个字节也可以是c1、c2、c3等等。在 DER 中,未使用的位必须全部为零。
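The padding arithmetic can be checked with a short Python sketch. `encode_bit_string` is a hypothetical helper that takes the bits as a string of "0" and "1" characters, zero-fills the unused bits as DER requires, and handles only short-form lengths.

```python
def encode_bit_string(bits: str) -> bytes:
    """Encode a string of '0'/'1' characters as a DER BIT STRING."""
    unused = (-len(bits)) % 8            # bits needed to reach a byte boundary
    padded = bits + "0" * unused         # DER: unused bits must be zero
    body = int(padded, 2).to_bytes(len(padded) // 8, "big") if padded else b""
    contents = bytes([unused]) + body    # one-byte "unused bits" prefix
    if len(contents) >= 0x80:
        raise ValueError("this sketch only handles short-form lengths")
    return bytes([0x03, len(contents)]) + contents


print(encode_bit_string("011011100101110111").hex(" "))  # 03 04 06 6e 5d c0
```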

OCTET STRING encoding

An OCTET STRING is encoded as the bytes it contains. Here’s an example of an OCTET STRING containing the bytes 03, 02, 06, and A0:

OCTET STRING 编码为它所包含的字节。下面是一个包含字节03、02、06和A0的 OCTET STRING 的示例:

04 04 03 02 06 A0

CHOICE and ANY encoding

A CHOICE or ANY field is encoded as whatever type it actually holds, unless modified by encoding instructions. So if a CHOICE field in an ASN.1 specification allows an INTEGER or a UTCTime, and the specific object being encoded contains an INTEGER, then it is encoded as an INTEGER.

除非通过编码指令进行修改,否则 CHOICE or ANY 字段都将被编码为它实际持有的任何类型。因此,如果 ASN.1 规范中的 CHOICE 字段允许使用 INTEGER 或 UTCTime,并且正在编码的特定对象包含 INTEGER,则将其编码为 INTEGER。

In practice, CHOICE fields very often have encoding instructions. For instance, consider this example from RFC 5280, where the encoding instructions are necessary to distinguish rfc822Name from dNSName, since they both have the underlying type IA5String:

实际上,CHOICE 字段通常有编码指令。例如,参考 RFC 5280 中的这个示例,其中的编码指令是区分 rfc822Name 和 dNSName 所必需的,因为它们都有底层类型 IA5String:

   GeneralName ::= CHOICE {
        otherName                       [0]     OtherName,
        rfc822Name                      [1]     IA5String,
        dNSName                         [2]     IA5String,
        x400Address                     [3]     ORAddress,
        directoryName                   [4]     Name,
        ediPartyName                    [5]     EDIPartyName,
        uniformResourceIdentifier       [6]     IA5String,
        iPAddress                       [7]     OCTET STRING,
        registeredID                    [8]     OBJECT IDENTIFIER }

Here’s an example encoding of a GeneralName containing the rfc822Name a@example.com (recalling that [1] means to use tag number 1, in the tag class “context-specific” (bit 8 set to 1), with the IMPLICIT tag encoding method):

这里有一个示例,包含 rfc822Name a@example.com 的 GeneralName (回顾一下, [1] 意味着,在标记类 “context-specific”(位8置为1)中, 使用 IMPLICIT 标记编码方法,使用标记数1):

81 0d 61 40 65 78 61 6d 70 6c 65 2e 63 6f 6d

Here’s an example encoding of a GeneralName containing the dNSName “example.com”:

下面是一个包含 dNSName “example.com”的 GeneralName 的编码示例:

82 0b 65 78 61 6d 70 6c 65 2e 63 6f 6d
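Both GeneralName examples follow the same pattern: a context-specific, primitive tag byte carrying the CHOICE's tag number, followed by the raw IA5String bytes. Here is a minimal Python sketch with hypothetical names (`GENERAL_NAME_TAGS`, `encode_general_name`). It covers only the IA5String alternatives and short-form lengths. The first example's bytes, 61 40 65 ..., decode to the ASCII text a@example.com.

```python
# Tag numbers of the IA5String alternatives in GeneralName (RFC 5280).
GENERAL_NAME_TAGS = {
    "rfc822Name": 1,
    "dNSName": 2,
    "uniformResourceIdentifier": 6,
}


def encode_general_name(choice: str, text: str) -> bytes:
    """Encode an IA5String alternative of GeneralName with IMPLICIT tagging."""
    raw = text.encode("ascii")              # IA5String contents
    if len(raw) >= 0x80:
        raise ValueError("this sketch only handles short-form lengths")
    tag = 0x80 | GENERAL_NAME_TAGS[choice]  # context-specific, primitive
    return bytes([tag, len(raw)]) + raw


print(encode_general_name("rfc822Name", "a@example.com").hex(" "))
# 81 0d 61 40 65 78 61 6d 70 6c 65 2e 63 6f 6d
print(encode_general_name("dNSName", "example.com").hex(" "))
# 82 0b 65 78 61 6d 70 6c 65 2e 63 6f 6d
```

A decoder reverses this by reading the tag number to decide which alternative it is looking at, since the IA5String tag 0x16 never appears on the wire.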

Safety 安全

It’s important to be very careful decoding BER and DER, particularly in non-memory-safe languages like C and C++. There’s a long history of vulnerabilities in decoders. Parsing input in general is a common source of vulnerabilities, and the ASN.1 encoding formats in particular seem to be vulnerability magnets. They are complicated formats, with many variable-length fields. Even the lengths have variable lengths! Also, ASN.1 input is often attacker-controlled. If you have to parse a certificate in order to distinguish an authorized user from an unauthorized one, you have to assume that some of the time you will be parsing, not a certificate, but some bizarre input crafted to exploit bugs in your ASN.1 code.

对 BER 和 DER 进行仔细的解码非常重要,尤其是在如C和C++这些非内存安全语言中。解码器中的漏洞由来已久。一般来说,解析输入是漏洞的常见来源。尤其是 ASN.1 编码格式似乎是一个特别的漏洞磁铁。它们是有许多可变长度的字段的复杂的格式。甚至长度都是可变的!另外,ASN.1 输入通常由攻击者控制。如果为了区分授权用户和未授权用户而必须解析一个证书,那么您必须假设在某些时候解析的不是证书,而是一些奇怪的输入,这些输入是为了利用 ASN.1 代码中的错误而精心设计的。

To avoid these problems, it is best to use a memory-safe language whenever possible. And whether you can use a memory-safe language or not, it’s best to use an ASN.1 compiler to generate your parsing code rather than writing it from scratch.

为了避免这些问题,最好尽可能使用内存安全语言。不管您是否可以使用内存安全语言,最好使用ASN.1编译器来生成解析代码,而不是从头开始编写。

Acknowledgements 致谢

I owe a significant debt to A Layman’s Guide to a Subset of ASN.1, DER, and BER, which is a big part of how I learned these topics. I’d also like to thank the authors of A warm welcome to DNS, which is a great read and inspired the tone of this document.

我欠了 A Layman’s Guide to a Subset of ASN.1, DER, and BER 一个很大的债,这是我学习这些主题的一个重要部分。我还要感谢 A warm welcome to DNS 的作者,这是一个启发了本文基调的伟大篇章。

A Little Bonus 一点奖励

Have you ever noticed that a PEM-encoded certificate always starts with “MII”? For instance:

您是否注意到 PEM 编码的证书总是以“MII”开头?例如:

-----BEGIN CERTIFICATE-----

MIIFajCCBFKgAwIBAgISA6HJW9qjaoJoMn8iU8vTuiQ2MA0GCSqGSIb3DQEBCwUA
...

Now you know enough to explain why! A Certificate is a SEQUENCE, so it will start with the byte 0x30. The next bytes are the length field. Certificates are almost always more than 127 bytes, so the length field has to use the long form of the length. That means the first byte will be 0x80 + N, where N is the number of length bytes to follow. N is almost always 2, since that’s how many bytes it takes to encode lengths from 128 to 65535, and almost all certificates have lengths in that range.

现在你已经知道的足够多来解释为什么了!一个 Certificate 就是一个 SEQUENCE,因此它将以字节0x30开头。接下来的字节是长度字段。证书几乎总是超过127个字节,因此长度字段必须使用长度的长形式。这意味着第一个字节将是0x80 + N,其中 N 是后面的长度字节数。N 几乎总是2,因为这是编码128到65535长度所需的字节数,而且几乎所有证书的长度都在这个范围内。

So now we know that the first two bytes of the DER encoding of a certificate are 0x30 0x82. PEM encoding uses base64, which encodes 3 bytes of binary input into 4 ASCII characters of output. Or, to put it differently: base64 turns 24 bits of binary input into 4 ASCII characters of output, with 6 bits of the input assigned to each character. We know what the first 16 bits of every certificate will be. To prove that the first characters of (almost) every certificate will be “MII”, we need to look at the next 2 bits. Those will be the most significant bits of the most significant byte of the two length bytes. Will those bits ever be set to 1? Not unless the certificate is more than 16,383 bytes long! So we can predict that the first characters of a PEM certificate will always be the same. Try it yourself:

现在我们知道证书的 DER 编码的前两个字节是0x30 0x82。PEM 编码使用base64,它将3个字节的二进制输入编码为4个ASCII字符的输出。或者,换一种说法:base64将24位二进制输入转换成4个ASCII字符的输出,每个字符分配6位输入。我们知道每个证书的前16位是什么。为了证明(几乎)每个证书的前几个字符都是“MII”,我们需要两个字符来查看接下来的2个位。这些将是两个长度字节中最有效字节的最有效位。这些位会被设置为1吗?除非证书的长度超过16383字节!所以我们可以预测 PEM 证书的第一个字符总是相同的。你自己试试吧:

xxd -r -p <<<308200 | base64
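The same experiment can be run in Python with the standard `base64` module. `pem_prefix` is a hypothetical helper that builds the first four bytes of a certificate whose outer SEQUENCE has a two-byte length, then returns the first three base64 characters. The length 1386 (0x056a) is the one that the sample certificate above starts with.

```python
import base64


def pem_prefix(content_length: int) -> str:
    """First three base64 characters for a SEQUENCE with a 2-byte length."""
    if not 256 <= content_length <= 65535:
        raise ValueError("DER uses the 0x82 length form only for 256..65535")
    header = bytes([0x30, 0x82]) + content_length.to_bytes(2, "big")
    return base64.b64encode(header)[:3].decode("ascii")


print(pem_prefix(1386))   # MII
print(pem_prefix(16383))  # MII
print(pem_prefix(16384))  # MIJ
```

The third character only changes once one of the top two bits of the length is set, which matches the 16,383-byte threshold worked out above.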
