proposal: x/net/html: ParseOption to set maxBuf #68101

Jarcis-cy · 2024-06-21T07:42:11Z

Proposal Details

Abstract

This proposal suggests adding a ParseOption that sets the tokenizer's maxBuf limit, so that callers of html.Parse and html.ParseWithOptions can control memory usage when parsing large HTML documents.

Background

Currently, html.Parse in golang.org/x/net/html calls ParseWithOptions internally, leading to this chain of function calls: html.Parse -> ParseWithOptions -> p.parse() -> p.tokenizer.Next() -> readByte(). Within readByte(), there is the following check:

if z.maxBuf > 0 && z.raw.end-z.raw.start >= z.maxBuf {
    z.err = ErrBufferExceeded
    return 0
}

This check takes effect only when maxBuf is greater than zero. The Tokenizer has a SetMaxBuf method, but html.Parse and ParseWithOptions create the tokenizer internally, so there is no way to set maxBuf when using them.

Problem

When parsing very large HTML documents, such as this page, memory usage can increase significantly due to the inability to set MaxBuf.

Solution

To address this, I propose introducing a function similar to ParseOptionEnableScripting to allow users to set MaxBuf.

Implementation

A sample implementation using reflection is provided below. This implementation, though functional, uses unsafe methods and reflection, which are not ideal for production code:

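// ParseOptionSetMaxBuf builds an html.ParseOption through reflection, because
// the parser type is unexported and cannot be named outside the html package.
func ParseOptionSetMaxBuf(maxBuf int) html.ParseOption {
    funcValue := reflect.MakeFunc(
        // Recover the unexported *parser parameter type from ParseOption itself.
        reflect.FuncOf([]reflect.Type{reflect.TypeOf((*html.ParseOption)(nil)).Elem().In(0)}, nil, false),
        func(args []reflect.Value) (results []reflect.Value) {
            parserValue := args[0].Elem()
            tokenizerField := parserValue.FieldByName("tokenizer")
            // Re-create the value at the field's address to get around the
            // read-only restriction on unexported fields.
            tokenizerPtr := reflect.NewAt(tokenizerField.Type(), unsafe.Pointer(tokenizerField.UnsafeAddr())).Elem().Interface()
            if tokenizer, ok := tokenizerPtr.(interface { SetMaxBuf(int) }); ok {
                tokenizer.SetMaxBuf(maxBuf)
            }
            return nil
        },
    )
    // Store the generated func(*parser) in a ParseOption variable.
    var option html.ParseOption
    reflect.ValueOf(&option).Elem().Set(funcValue)
    return option
}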

This implementation can be used as follows:

html.ParseWithOptions(bytes.NewReader(data), util.ParseOptionSetMaxBuf(len(data)*3))

To properly address the issue, I propose adding the following function to golang.org/x/net/html:

func ParseOptionSetTokenizerMaxBuf(maxBuf int) ParseOption {
    return func(p *parser) {
        p.tokenizer.SetMaxBuf(maxBuf)
    }
}
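
As a rough illustration of how such an option would flow through the existing functional-options mechanism, the sketch below rebuilds the pattern with stand-in types. The tokenizer, parser, and newParser definitions here are hypothetical simplifications so that the example compiles on its own; they are not the real x/net/html internals.

```go
package main

import "fmt"

// tokenizer and parser are hypothetical stand-ins for the unexported types in
// x/net/html; only the fields needed for this sketch exist.
type tokenizer struct{ maxBuf int }

// SetMaxBuf records the buffer limit, like the method used in the proposal.
func (z *tokenizer) SetMaxBuf(n int) { z.maxBuf = n }

type parser struct {
	tokenizer *tokenizer
	scripting bool
}

// ParseOption has the same shape as html.ParseOption: a function that
// configures the parser before parsing starts.
type ParseOption func(p *parser)

// ParseOptionEnableScripting is modeled on the existing option of that name.
func ParseOptionEnableScripting(enable bool) ParseOption {
	return func(p *parser) { p.scripting = enable }
}

// ParseOptionSetTokenizerMaxBuf is the proposed option.
func ParseOptionSetTokenizerMaxBuf(maxBuf int) ParseOption {
	return func(p *parser) { p.tokenizer.SetMaxBuf(maxBuf) }
}

// newParser is a hypothetical helper that applies options in order, the way
// ParseWithOptions would before it starts parsing.
func newParser(opts ...ParseOption) *parser {
	p := &parser{tokenizer: &tokenizer{}, scripting: true}
	for _, opt := range opts {
		opt(p)
	}
	return p
}

func main() {
	p := newParser(ParseOptionSetTokenizerMaxBuf(1 << 20))
	fmt.Println(p.tokenizer.maxBuf, p.scripting) // 1048576 true
}
```

Because the option is an ordinary closure over the unexported parser type, it needs neither reflection nor unsafe, and it composes with ParseOptionEnableScripting in the same call.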

Testing has shown that setting maxBuf to at least 1.04 times the body length ensures normal operation.
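
For reference, a limit derived from that ratio can be computed with integer arithmetic. The helper name maxBufFor is hypothetical, and the 1.04 factor is only the empirical figure reported above, not a documented guarantee.

```go
package main

import "fmt"

// maxBufFor returns bodyLen * 1.04 rounded up, using integer math to avoid
// floating-point rounding. The 1.04 factor is the empirical value from the
// proposal text (assumption: it may not hold for every document).
func maxBufFor(bodyLen int) int {
	return (bodyLen*104 + 99) / 100
}

func main() {
	fmt.Println(maxBufFor(1000000)) // 1040000
}
```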

Feasibility

Adding a function similar to ParseOptionEnableScripting to allow users to set MaxBuf would provide a safe and efficient way to control memory usage when parsing large HTML documents, avoiding the use of unsafe methods and reflection.

Environment

Go version: 1.21
OS: Tested on Ubuntu 22.04 and Windows 11

seankhliao · 2024-06-21T11:37:27Z

Related: #63177 to set the entire Tokenizer

ianlancetaylor · 2024-06-21T18:47:16Z

CC @neild @bradfitz

Jarcis-cy added the Proposal label Jun 21, 2024

gopherbot added this to the Proposal milestone Jun 21, 2024

seankhliao changed the title ~~Proposal: x/net/html: Add Option to Set MaxBuf in Parse~~ proposal: x/net/html: ParseOption to set maxBuf Jun 21, 2024
