Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: x/net/html: ParseOption to set maxBuf #68101

Open
Jarcis-cy opened this issue Jun 21, 2024 · 2 comments
Open

proposal: x/net/html: ParseOption to set maxBuf #68101

Jarcis-cy opened this issue Jun 21, 2024 · 2 comments
Labels
Milestone

Comments

@Jarcis-cy
Copy link

Proposal Details

Abstract

This proposal suggests introducing an option to set the MaxBuf parameter in the html.Parse function to control memory usage when parsing large HTML documents.

Background

Currently, html.Parse in the Go standard library calls ParseWithOptions internally, leading to a chain of function calls: html.Parse -> ParseWithOptions -> p.parse() -> p.tokenizer.Next() -> readByte(). Within readByte(), there is a logic block:

if z.maxBuf > 0 && z.raw.end-z.raw.start >= z.maxBuf {
    z.err = ErrBufferExceeded
    return 0
}

This logic is activated only if maxBuf is set. However, there is no way to set MaxBuf when using html.Parse or ParseWithOptions.

Problem

When parsing very large HTML documents, such as this page, memory usage can increase significantly due to the inability to set MaxBuf.

Solution

To address this, I propose introducing a function similar to ParseOptionEnableScripting to allow users to set MaxBuf.

Implementation

A sample implementation using reflection is provided below. This implementation, though functional, uses unsafe methods and reflection, which are not ideal for production code:

func ParseOptionSetMaxBuf(maxBuf int) html.ParseOption {
    funcValue := reflect.MakeFunc(
        reflect.FuncOf([]reflect.Type{reflect.TypeOf((*html.ParseOption)(nil)).Elem().In(0)}, nil, false),
        func(args []reflect.Value) (results []reflect.Value) {
            parserValue := args[0].Elem()
            tokenizerField := parserValue.FieldByName("tokenizer")
            tokenizerPtr := reflect.NewAt(tokenizerField.Type(), unsafe.Pointer(tokenizerField.UnsafeAddr())).Elem().Interface()
            if tokenizer, ok := tokenizerPtr.(interface { SetMaxBuf(int) }); ok {
                tokenizer.SetMaxBuf(maxBuf)
            }
            return nil
        },
    )
    var option html.ParseOption
    reflect.ValueOf(&option).Elem().Set(funcValue)
    return option
}

This implementation can be used as follows:

html.ParseWithOptions(bytes.NewReader(data), util.ParseOptionSetMaxBuf(len(data)*3))

To properly address the issue, I propose the following function to be added to the standard library:

func ParseOptionSetTokenizerMaxBuf(maxBuf int) ParseOption {
    return func(p *parser) {
        p.tokenizer.SetMaxBuf(maxBuf)
    }
}

Testing has shown that setting maxBuf to at least 1.04 times the body length ensures normal operation.

Feasibility

Adding a function similar to ParseOptionEnableScripting to allow users to set MaxBuf would provide a safe and efficient way to control memory usage when parsing large HTML documents, avoiding the use of unsafe methods and reflection.

Environment

  • Go version: 1.21
  • OS: Tested on Ubuntu 22.04 and Windows 11
@gopherbot gopherbot added this to the Proposal milestone Jun 21, 2024
@seankhliao seankhliao changed the title Proposal: x/net/html: Add Option to Set MaxBuf in Parse proposal: x/net/html: ParseOption to set maxBuf Jun 21, 2024
@seankhliao
Copy link
Member

Related: #63177 to set the entire Tokenizer

@ianlancetaylor
Copy link
Contributor

CC @neild @bradfitz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Incoming
Development

No branches or pull requests

4 participants