Getgo is a concurrent, simple and extensible web scraping framework written in Go.
Quick start
Get Getgo
go get -u github.com/h12w/getgo
Define a task
This example lives in the examples/goblog directory. To scrape structured data from a web page with Getgo, define the data as a Go struct (golangBlogEntry) and define a corresponding task (golangBlogIndexTask).
type golangBlogEntry struct {
	Title string
	URL   string
	Tags  *string
}

type golangBlogIndexTask struct {
	// Variables in task URL, e.g. page number
}

func (t golangBlogIndexTask) Request() *http.Request {
	return getReq(`http://blog.golang.org/index`)
}

func (t golangBlogIndexTask) Handle(root *query.Node, s getgo.Storer) (err error) {
	root.Div(_Id("content")).Children(_Class("blogtitle")).For(func(item *query.Node) {
		title := item.Ahref().Text()
		url := item.Ahref().Href()
		tags := item.Span(_Class("tags")).Text()
		if url != nil && title != nil {
			store(&golangBlogEntry{Title: *title, URL: *url, Tags: tags}, s, &err)
		}
	})
	return
}
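The task above calls two helpers, getReq and store, whose bodies are not shown here. The following is a minimal sketch of what such helpers might look like, not the code from the examples directory: the bodies, the first-error-wins behavior of store, and the toy memStorer type are all assumptions made for illustration. Storer is redeclared locally so the sketch compiles on its own.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Storer mirrors the getgo.Storer interface described in this README.
type Storer interface {
	Store(v interface{}) error
}

// getReq builds a GET request (assumed helper body; returns nil on a bad URL).
func getReq(rawurl string) *http.Request {
	req, err := http.NewRequest("GET", rawurl, nil)
	if err != nil {
		return nil
	}
	return req
}

// store saves v via s and keeps the first error in *err (assumed helper body).
// After an error has been recorded, further calls are skipped.
func store(v interface{}, s Storer, err *error) {
	if *err != nil {
		return
	}
	*err = s.Store(v)
}

// memStorer is a toy Storer used only for this demonstration.
type memStorer struct {
	items []interface{}
	max   int // maximum number of items accepted; 0 means unlimited
}

func (m *memStorer) Store(v interface{}) error {
	if m.max > 0 && len(m.items) >= m.max {
		return errors.New("memStorer: full")
	}
	m.items = append(m.items, v)
	return nil
}

func main() {
	req := getReq("http://blog.golang.org/index")
	fmt.Println(req.Method, req.URL)

	var err error
	s := &memStorer{max: 1}
	store("first", s, &err)  // stored
	store("second", s, &err) // rejected, sets err
	store("third", s, &err)  // skipped because err is already set
	fmt.Println(len(s.items), err)
}
```

Passing the error by pointer lets a callback with no return value, such as the one given to For, report a failure through the named return value of Handle.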
Run the task
Use util.Run to run the task and print all results to standard output.
util.Run(golangBlogIndexTask{})
To store the parsed results in a database, pass a storage backend that satisfies the getgo.Tx interface to getgo.Run.
Understand Getgo
getgo.Task is an interface that represents an HTTP crawler task: it provides an HTTP request and a method to handle the HTTP response.
type Task interface {
	Requester
	Handle(resp *http.Response) error
}

type Requester interface {
	Request() *http.Request
}
A getgo.Runner is responsible for running a getgo.Task. Two concrete runners are provided: SequentialRunner and ConcurrentRunner.
type Runner interface {
	Run(task Task) error // Run runs a task
	Close()              // Close closes the runner
}
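To make the difference between the two kinds of runner concrete, here is a rough sketch of a sequential and a pooled runner built on the interfaces above. It is not Getgo's SequentialRunner or ConcurrentRunner: seqRunner, poolRunner, pingTask, runDemo and the stub transport are hypothetical names invented for this illustration, and the interfaces are redeclared locally so the program compiles on its own without network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"sync"
	"sync/atomic"
)

// Interfaces copied from the definitions above.
type Requester interface{ Request() *http.Request }

type Task interface {
	Requester
	Handle(resp *http.Response) error
}

type Runner interface {
	Run(task Task) error
	Close()
}

// stubTransport answers every request with 200 OK, so the demo stays offline.
type stubTransport struct{}

func (stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	body := io.NopCloser(strings.NewReader("ok"))
	return &http.Response{StatusCode: http.StatusOK, Body: body, Request: req}, nil
}

// client is shared by both runners; use http.DefaultClient for real requests.
var client = &http.Client{Transport: stubTransport{}}

// fetch sends the task's request and hands the response to its Handle method.
func fetch(t Task) error {
	resp, err := client.Do(t.Request())
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return t.Handle(resp)
}

// seqRunner (hypothetical) completes each task before Run returns.
type seqRunner struct{}

func (seqRunner) Run(t Task) error { return fetch(t) }
func (seqRunner) Close()           {}

// poolRunner (hypothetical) hands tasks to a fixed pool of worker goroutines.
type poolRunner struct {
	tasks chan Task
	wg    sync.WaitGroup
}

func newPoolRunner(workers int) *poolRunner {
	r := &poolRunner{tasks: make(chan Task)}
	for i := 0; i < workers; i++ {
		r.wg.Add(1)
		go func() {
			defer r.wg.Done()
			for t := range r.tasks {
				if err := fetch(t); err != nil {
					fmt.Println("task failed:", err)
				}
			}
		}()
	}
	return r
}

// Run blocks until a worker accepts the task; Close waits for all workers.
func (r *poolRunner) Run(t Task) error { r.tasks <- t; return nil }
func (r *poolRunner) Close()           { close(r.tasks); r.wg.Wait() }

// pingTask is a toy Task that counts how many responses it handled.
type pingTask struct{ handled *int64 }

func (t pingTask) Request() *http.Request {
	req, _ := http.NewRequest("GET", "http://example.invalid/", nil)
	return req
}

func (t pingTask) Handle(*http.Response) error { atomic.AddInt64(t.handled, 1); return nil }

// runDemo runs n pingTasks on r, closes r, and returns the handled count.
func runDemo(r Runner, n int) int64 {
	var handled int64
	for i := 0; i < n; i++ {
		r.Run(pingTask{handled: &handled})
	}
	r.Close()
	return atomic.LoadInt64(&handled)
}

func main() {
	fmt.Println(runDemo(seqRunner{}, 3))
	fmt.Println(runDemo(newPoolRunner(2), 5))
}
```

In this sketch the pooled runner reports task errors only by printing them, because Run returns before the task has finished. A real concurrent runner would need some other channel for reporting failures.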
A task that stores data into a storage backend should satisfy the getgo.StorableTask interface.
type StorableTask interface {
	Requester
	Handle(resp *http.Response, s Storer) error
}
A storage backend is simply an object satisfying the getgo.Tx interface.
type Storer interface {
	Store(v interface{}) error
}

type Tx interface {
	Storer
	Commit() error
	Rollback() error
}
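As an illustration of how small a backend can be, the sketch below implements the Tx interface with an in-memory store. It is not part of Getgo: memDB, memTx and the Begin method are hypothetical names chosen for this example, and the type is not safe for concurrent use. Values passed to Store are buffered and become visible in the database only after Commit.

```go
package main

import (
	"errors"
	"fmt"
)

// Storer and Tx are copied from the interface definitions above.
type Storer interface {
	Store(v interface{}) error
}

type Tx interface {
	Storer
	Commit() error
	Rollback() error
}

// memDB (hypothetical) holds committed values in a slice.
type memDB struct{ rows []interface{} }

// Begin starts a new transaction against db.
func (db *memDB) Begin() Tx { return &memTx{db: db} }

// memTx (hypothetical) buffers values until Commit or Rollback.
type memTx struct {
	db      *memDB
	pending []interface{}
	done    bool
}

var errTxDone = errors.New("memTx: transaction already finished")

// Store buffers v; nothing reaches the database yet.
func (tx *memTx) Store(v interface{}) error {
	if tx.done {
		return errTxDone
	}
	tx.pending = append(tx.pending, v)
	return nil
}

// Commit appends all buffered values to the database.
func (tx *memTx) Commit() error {
	if tx.done {
		return errTxDone
	}
	tx.db.rows = append(tx.db.rows, tx.pending...)
	tx.pending, tx.done = nil, true
	return nil
}

// Rollback discards all buffered values.
func (tx *memTx) Rollback() error {
	if tx.done {
		return errTxDone
	}
	tx.pending, tx.done = nil, true
	return nil
}

func main() {
	db := &memDB{}

	tx := db.Begin()
	tx.Store("discarded")
	tx.Rollback()

	tx = db.Begin()
	tx.Store("kept 1")
	tx.Store("kept 2")
	tx.Commit()

	fmt.Println(len(db.rows), db.rows)
}
```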
See getgo.Run to understand how a StorableTask is combined with a storage backend and adapted into a plain Task that a Runner can run.
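One plausible shape for such an adapter is sketched below. This is an assumption about the general technique, not a copy of what getgo.Run does: the names txTask, adapt, recordTx, statusTask and demo are hypothetical, and the commit-on-success, rollback-on-error policy is only one reasonable choice. The interfaces are redeclared locally so the sketch compiles on its own.

```go
package main

import (
	"fmt"
	"net/http"
)

// Interfaces copied from the definitions above.
type Requester interface{ Request() *http.Request }

type Task interface {
	Requester
	Handle(resp *http.Response) error
}

type Storer interface{ Store(v interface{}) error }

type Tx interface {
	Storer
	Commit() error
	Rollback() error
}

type StorableTask interface {
	Requester
	Handle(resp *http.Response, s Storer) error
}

// txTask (hypothetical) binds a StorableTask to a transaction.
type txTask struct {
	Requester
	task StorableTask
	tx   Tx
}

// adapt wraps t so that it satisfies Task and stores through tx.
func adapt(t StorableTask, tx Tx) Task {
	return txTask{Requester: t, task: t, tx: tx}
}

// Handle commits when the wrapped task succeeds and rolls back otherwise.
func (a txTask) Handle(resp *http.Response) error {
	if err := a.task.Handle(resp, a.tx); err != nil {
		a.tx.Rollback() // best effort; the handling error is returned
		return err
	}
	return a.tx.Commit()
}

// recordTx is a toy Tx that counts the calls it receives.
type recordTx struct{ stored, commits, rollbacks int }

func (r *recordTx) Store(v interface{}) error { r.stored++; return nil }
func (r *recordTx) Commit() error             { r.commits++; return nil }
func (r *recordTx) Rollback() error           { r.rollbacks++; return nil }

// statusTask is a toy StorableTask that stores the status code of a 200 response.
type statusTask struct{}

func (statusTask) Request() *http.Request {
	req, _ := http.NewRequest("GET", "http://example.invalid/", nil)
	return req
}

func (statusTask) Handle(resp *http.Response, s Storer) error {
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return s.Store(resp.StatusCode)
}

// demo feeds a synthetic response with the given status through the adapter.
func demo(status int) (*recordTx, error) {
	tx := &recordTx{}
	err := adapt(statusTask{}, tx).Handle(&http.Response{StatusCode: status})
	return tx, err
}

func main() {
	tx, err := demo(200)
	fmt.Printf("%+v %v\n", *tx, err)
	tx, err = demo(500)
	fmt.Printf("%+v %v\n", *tx, err)
}
```

Because the adapter satisfies Task, a Runner needs no knowledge of storage at all. The transaction is captured when the task is wrapped.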
Getgo currently provides a PostgreSQL storage backend, and supporting more backends is not hard (see the getgo/db package for details).
The easier way to define a task for an HTML page is to satisfy getgo.HTMLTask rather than getgo.Task. Adapters internally convert an HTMLTask to a Task so that a Runner can run it. The Handle method of HTMLTask receives an already parsed HTML DOM object (parsed by the html-query package).
type HTMLTask interface {
	Requester
	Handle(root *query.Node, s Storer) error
}
Similarly, a task that retrieves a JSON page should satisfy the getgo.TextTask interface. Its Handle method receives an io.Reader that can be decoded with the encoding/json package.
type TextTask interface {
	Requester
	Handle(r io.Reader, s Storer) error
}
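A task shaped like TextTask might decode its input as in the sketch below. Everything specific here is an assumption for illustration: the article record, the articleListTask type, the example URL, the sliceStorer backend and the handleString helper are hypothetical. Only the Handle signature and the use of encoding/json follow the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Storer is copied from the interface definition above.
type Storer interface {
	Store(v interface{}) error
}

// article is a hypothetical record shape for the JSON page.
type article struct {
	Title string `json:"title"`
	URL   string `json:"url"`
}

// articleListTask (hypothetical) has the method set of a TextTask.
type articleListTask struct{}

func (articleListTask) Request() *http.Request {
	req, _ := http.NewRequest("GET", "http://example.invalid/articles.json", nil)
	return req
}

// Handle decodes a JSON array of articles from r and stores each element.
func (articleListTask) Handle(r io.Reader, s Storer) error {
	var items []article
	if err := json.NewDecoder(r).Decode(&items); err != nil {
		return err
	}
	for i := range items {
		if err := s.Store(&items[i]); err != nil {
			return err
		}
	}
	return nil
}

// sliceStorer is a toy Storer that keeps values in memory.
type sliceStorer struct{ items []interface{} }

func (s *sliceStorer) Store(v interface{}) error {
	s.items = append(s.items, v)
	return nil
}

// handleString runs the task's Handle method on an in-memory JSON document.
func handleString(doc string) ([]interface{}, error) {
	s := &sliceStorer{}
	err := articleListTask{}.Handle(strings.NewReader(doc), s)
	return s.items, err
}

func main() {
	items, err := handleString(`[{"title":"First","url":"/first"},{"title":"Second","url":"/second"}]`)
	if err != nil {
		fmt.Println("handle:", err)
		return
	}
	for _, v := range items {
		fmt.Printf("%+v\n", *v.(*article))
	}
}
```

Since Handle only sees an io.Reader, the task can be exercised with an in-memory string, as handleString does, without issuing any HTTP request.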