
GoScrapy: Wiki

GoScrapy CLI

GoScrapy provides the goscrapy CLI tool to help you scaffold a GoScrapy project.

Usage

  • Install
go install github.com/tech-engine/goscrapy@latest
  • Verify installation
goscrapy -v
  • Create a project
goscrapy startproject scrapejsp
  • Create a custom pipeline
goscrapy pipeline export_2_DB

Base Concepts

GoScrapy is built around the following three concepts.

  • Job: Describes an input to your spider.
  • Record: Represents an output produced by your spider.
  • Spider: Contains the main logic of your scraper.

Job

A Job represents an input to a GoScrapy spider and must implement the core.IJob interface.

type IJob interface {
    Id() string
}

job.go

type Job struct {
    id string
    // add your own fields here
}

func (j *Job) Id() string {
    return j.id
}
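As a concrete illustration, a Job could carry the spider's input alongside its id. The keyword field and NewJob constructor below are hypothetical additions, not CLI-generated code; only Id() is required by core.IJob.

// a hypothetical Job for a keyword scraper
type Job struct {
	id      string
	keyword string // custom input consumed by the spider
}

func NewJob(id, keyword string) *Job {
	return &Job{id: id, keyword: keyword}
}

func (j *Job) Id() string {
	return j.id
}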

Record

A Record represents an output produced by a spider (via Yield) and must implement the core.IOutput interface.

type IOutput interface {
    Record() *Record
    RecordKeys() []string
    RecordFlat() []any
    Job() IJob
}

record.go

type Record struct {
    J    *Job   `json:"-" csv:"-"`
}

func (r *Record) Record() *Record {
    return r
}

func (r *Record) RecordKeys() []string {
    ....
    keys := make([]string, numFields)
    ....
    return keys
}

func (r *Record) RecordFlat() []any {
    ....
    return slice
}

func (r *Record) Job() core.IJob {
    return r.J
}
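The RecordKeys and RecordFlat bodies are elided above. As a rough sketch (not the CLI-generated code), a Record with a single Title field could simply return its csv column names and the corresponding values in the same order:

// sketch only: a Record with one data field; the generated code may differ
type Record struct {
	Title string `json:"title" csv:"title"`
	J     *Job   `json:"-" csv:"-"`
}

func (r *Record) RecordKeys() []string {
	// column names, typically taken from the csv tags
	return []string{"title"}
}

func (r *Record) RecordFlat() []any {
	// values in the same order as RecordKeys
	return []any{r.Title}
}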

Spider

A Spider encapsulates the main logic of a GoScrapy scraper. We embed gos.ICoreSpider in the Spider struct to wire it into the framework.

spider.go

type Spider struct {
  gos.ICoreSpider[*Record]
}

func New(ctx context.Context) (*Spider, <-chan error) {

  // use proxies
  // proxies := core.WithProxies("proxy_url1", "proxy_url2", ...)
  // core := gos.New[*Record]().WithClient(
  // 	gos.DefaultClient(proxies),
  // )

  core := gos.New[*Record]()

  // Add middlewares
  core.MiddlewareManager.Add(MIDDLEWARES...)
  // Add pipelines
  core.PipelineManager.Add(PIPELINES...)

  errCh := make(chan error)

  go func() {
    errCh <- core.Start(ctx)
  }()

  return &Spider{
    core,
  }, errCh
}

// This is the entrypoint to the spider
func (s *Spider) StartRequest(ctx context.Context, job *Job) {
  // for each request we must call NewRequest() and never reuse it
  req := s.NewRequest()

  var headers http.Header

  /* GET is the request method, method chaining possible
  req.Url("<URL_HERE>").
  Meta("MY_KEY1", "MY_VALUE").
  Meta("MY_KEY2", true).
  Header(headers)
  */
    
  /* POST
  req.Url(<URL_HERE>)
  req.Method("POST")
  req.Body(<BODY_HERE>)
  */
    
  // call the next parse method
  s.Request(req, s.parse)
}
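For instance, uncommenting the GET variant with a real URL and a piece of metadata (the URL and meta key below are placeholders) would look like this:

  // inside StartRequest, after req := s.NewRequest()
  req.Url("https://example.com/api/items"). // placeholder URL
    Meta("page", 1)                         // read back later via resp.Meta("page")

  s.Request(req, s.parse)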

// Close can be called when the spider exits
func (s *Spider) Close(ctx context.Context) {
}

func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
  // response.Body()
  // response.StatusCode()
  // response.Header()
  // response.Bytes()
  // response.Meta("MY_KEY1")
	
  // yielding pushes the output to the pipelines for processing; also check output.go for the fields
  var data Record

  err := json.Unmarshal(resp.Bytes(), &data)
  if err != nil {
    log.Panicln(err)
  }

  // s.Yield(&data)
}
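If a crawl needs to follow links (pagination, detail pages), the same NewRequest/Request pattern can be repeated inside a parse callback. A minimal sketch, with a hard-coded next-page URL standing in for whatever you would extract from the response:

func (s *Spider) parseListing(ctx context.Context, resp core.IResponseReader) {
	var data Record
	if err := json.Unmarshal(resp.Bytes(), &data); err != nil {
		log.Println(err)
		return
	}

	// push the record through the pipelines
	s.Yield(&data)

	// follow the next page; in a real spider this URL would come from the response
	req := s.NewRequest() // always a fresh request
	req.Url("https://example.com/api/items?page=2")
	s.Request(req, s.parseListing)
}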

Settings

In addition to the files discussed above, a project also contains settings.go, where we register all the middlewares and pipelines we want to use.

settings.go

...
// Middlewares here
var MIDDLEWARES = []middlewaremanager.Middleware{
	middlewares.Retry(),
	middlewares.MultiCookieJar,
	middlewares.DupeFilter,
}

var export2CSV = pipelines.Export2CSV[*Record](pipelines.Export2CSVOpts{
	Filename: "itstimeitsnowornever.csv",
})

// Pipelines here
var PIPELINES = []pm.IPipeline[*Record]{
	export2CSV,
	// export2Json,
}
...

Examples

More examples coming...

Usage

main.go

func main() {
  ctx, cancel := context.WithCancel(context.Background())

  var wg sync.WaitGroup
  wg.Add(1)

  spider, errCh := test1.New(ctx)
  go func() {
	defer wg.Done()

	err := <-errCh

	if err == nil || errors.Is(err, context.Canceled) {
		return
	}

	fmt.Printf("failed: %q\n", err)
  }()

  // start the scraper with a job, currently nil is passed but you can pass your job here
  spider.StartRequest(ctx, nil)

  OnTerminate(func() {
	fmt.Println("exit signal received: shutting down gracefully")
	cancel()
	wg.Wait()
  })

}
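OnTerminate above comes from the generated project. A possible implementation (the generated one may differ) simply blocks until an interrupt or termination signal arrives and then runs the callback:

// sketch of OnTerminate; requires the os, os/signal and syscall imports
func OnTerminate(fn func()) {
	// wait for SIGINT/SIGTERM, then invoke the shutdown callback
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
	<-sig
	fn()
}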

Customize the Default Client

Option | Description | Default
WithProxies | Accepts multiple proxy URL strings. | By default the client uses the proxy settings from the environment
WithTimeout | HTTP client timeout. | 10 seconds
WithMaxIdleConns | Controls the maximum number of idle (keep-alive) connections across all hosts; 0 means unlimited. | 100
WithMaxIdleConnsPerHost | Same as WithMaxIdleConns, but per host. | 100
WithMaxConnsPerHost | Limits the total number of connections per host; 0 means unlimited. | 100
WithProxyFn | Accepts a custom proxy function for the transport. | Round robin

spider.go

func New(ctx context.Context) (*Spider, <-chan error) {
    // default client options
    // proxies := gos.WithProxies("proxy_url1", "proxy_url2", ...)
     
    // core := gos.New[*Record]().WithClient(
    // 	  gos.DefaultClient(proxies),
    // )

    // we can also plug in our own custom HTTP client
    // core := gos.New[*Record]().WithClient(myCustomHTTPClient)
}
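If DefaultClient accepts several of the options from the table above at once (an assumption here; check the gos package for the exact signatures), tuning timeouts and connection limits could look like this:

    // sketch, assuming DefaultClient takes multiple options; requires the time import
    core := gos.New[*Record]().WithClient(
        gos.DefaultClient(
            gos.WithTimeout(30*time.Second), // option names taken from the table above
            gos.WithMaxConnsPerHost(50),
        ),
    )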

Pipelines

Pipelines help in managing, transforming, and fine-tuning the scraped data.

Built-in Pipelines

GoScrapy ships with ready-to-use export pipelines such as Export2CSV and Export2JSON, both of which are shown below.

Use Pipelines

We can add pipelines using coreSpider.PipelineManager.Add().

settings.go

// use export 2 csv pipeline
export2Csv := pipelines.Export2CSV[*scrapejsp.Record](pipelines.Export2CSVOpts{
	Filename: "itstimeitsnowornever.csv",
})

// use export 2 json pipeline
export2Json := pipelines.Export2JSON[*scrapejsp.Record](pipelines.Export2JSONOpts{
	Filename:  "itstimeitsnowornever.json",
	Immediate: true,
})

Pipeline Group

A Group allows us to execute multiple pipelines concurrently; all pipelines in a group behave as a single pipeline. This is useful when we want to export our data to multiple destinations: instead of exporting sequentially, we can bundle the exporters together in a group.

Pipelines in a group shouldn't be used for data transformation, but for independent tasks such as exporting data to a database.

settings.go

func myCustomPipelineGroup() *pm.Group[*Record] {
  // create a group
  pipelineGroup := pm.NewGroup[*Record]()

  pipelineGroup.Add(export2CSV)
  // pipelineGroup.Add(export2Json)
  return pipelineGroup
}

// Pipelines here
// Executed in the order they appear.
var PIPELINES = []pm.IPipeline[*Record]{
  export2CSV,
  // export2Json,
  // myCustomPipelineGroup(), // use group as if it were a single pipeline
}

Middlewares

GoScrapy also supports built-in and custom middlewares for manipulating outgoing requests.

Built-in Middlewares

  • MultiCookieJar - used for maintaining different cookie sessions while scraping.
  • DupeFilter - filters duplicate requests
  • Retry - retries a request with exponential back-off upon failure or when the response carries one of the HTTP status codes 500, 502, 503, 504, 522, 524, 408, 429. Its options are listed below.

Option | Description | Default
MaxRetries | Number of additional retries after a failure. | 3
Codes | HTTP status codes that trigger a retry. | 500, 502, 503, 504, 522, 524, 408, 429
BaseDelay | Exponential back-off multiplier. | 1 second
Cb | Callback executed after every retry; if it returns false, further retries are skipped. | nil

Use Middlewares

We can add middlewares using the core spider's MiddlewareManager.Add().

settings.go

var MIDDLEWARES = []middlewaremanager.Middleware{
	middlewares.Retry(),
	middlewares.MultiCookieJar,
	middlewares.DupeFilter,
}

Custom Pipelines

GoScrapy supports custom pipelines. To create one, you can use the goscrapy CLI.

abc\go\go-test-scrapy>scrapejsp> goscrapy pipeline export_2_DB

✔️  pipelines\export_2_DB.go

✨ Congrates, export_2_DB created successfully.

Custom Middlewares

To create one, you can use the goscrapy CLI. Custom middlewares must have the following function signature.

func MultiCookieJar(next http.RoundTripper) http.RoundTripper {
	return core.MiddlewareFunc(func(req *http.Request) (*http.Response, error) {
		// your custom middleware code here

		// hand the request on to the next round tripper in the chain
		return next.RoundTrip(req)
	})
}
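Once defined, the custom middleware is registered in settings.go alongside the built-ins; MyMiddleware below is a placeholder for whatever you name yours.

var MIDDLEWARES = []middlewaremanager.Middleware{
	middlewares.Retry(),
	middlewares.DupeFilter,
	MyMiddleware, // your custom middleware from above
}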

Contact

Discord