Getgo: a concurrent, simple and extensible web scraping framework

Getgo is a concurrent, simple and extensible web scraping framework written in Go.

Quick start

Get Getgo

    go get -u github.com/h12w/getgo

Define a task

This example is under the examples/goblog directory. To use Getgo to scrape structured data from a web page, define the structured data as a Go struct (golangBlogEntry) and define a corresponding task (golangBlogIndexTask).

    // golangBlogEntry is the structured data extracted for each blog post.
    type golangBlogEntry struct {
        Title string
        URL   string
        Tags  *string
    }

    type golangBlogIndexTask struct {
        // Variables in task URL, e.g. page number
    }

    // Request returns the HTTP request for the page to scrape.
    func (t golangBlogIndexTask) Request() *http.Request {
        return getReq(`http://blog.golang.org/index`)
    }

    // Handle walks the parsed page and stores one entry per blog title.
    func (t golangBlogIndexTask) Handle(root *query.Node, s getgo.Storer) (err error) {
        root.Div(_Id("content")).Children(_Class("blogtitle")).For(func(item *query.Node) {
            title := item.Ahref().Text()
            url := item.Ahref().Href()
            tags := item.Span(_Class("tags")).Text()
            // Store an entry only when both the link text and the href are present.
            if url != nil && title != nil {
                store(&golangBlogEntry{Title: *title, URL: *url, Tags: tags}, s, &err)
            }
        })
        return
    }

Run the task

Use util.Run to run the task and print all the results to standard output.

...
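The Handle method in the task above relies on two small patterns: optional values are carried as *string and checked for nil before being dereferenced, and a store helper receives a pointer to the named error result. The sketch below illustrates both using only the standard library. It is not Getgo code: the storer interface, memStorer, storeFirstErr, item, and the sample data are hypothetical stand-ins, and the real getgo.Storer interface and store helper may behave differently (in particular, keeping only the first error is an assumption).

```go
package main

import (
	"errors"
	"fmt"
)

// entry mirrors golangBlogEntry: Tags is a pointer because it may be absent.
type entry struct {
	Title string
	URL   string
	Tags  *string
}

// storer is a hypothetical stand-in for getgo.Storer; the real interface may differ.
type storer interface {
	Store(v interface{}) error
}

// memStorer keeps entries in memory and can simulate a failure for one title.
type memStorer struct {
	entries []entry
	failOn  string // when non-empty, storing an entry with this title fails
}

func (m *memStorer) Store(v interface{}) error {
	e, ok := v.(*entry)
	if !ok {
		return errors.New("memStorer: unsupported value type")
	}
	if m.failOn != "" && e.Title == m.failOn {
		return errors.New("memStorer: simulated failure for " + e.Title)
	}
	m.entries = append(m.entries, *e)
	return nil
}

// storeFirstErr is a hypothetical take on the store helper: once *err is set
// it does nothing, so the caller ends up with the first failure.
func storeFirstErr(v interface{}, s storer, err *error) {
	if *err != nil {
		return
	}
	*err = s.Store(v)
}

// item holds the optional values a query might yield for one list element.
type item struct{ title, url, tags *string }

func strPtr(s string) *string { return &s }

// handle stores every item that has both a title and a URL; tags may be nil.
func handle(items []item, s storer) (err error) {
	for _, it := range items {
		if it.url != nil && it.title != nil {
			storeFirstErr(&entry{Title: *it.title, URL: *it.url, Tags: it.tags}, s, &err)
		}
	}
	return
}

func main() {
	// Made-up sample data, not taken from any real page.
	items := []item{
		{title: strPtr("First post"), url: strPtr("/first"), tags: strPtr("intro")},
		{title: strPtr("No link")},                            // skipped: url is nil
		{title: strPtr("Untagged"), url: strPtr("/untagged")}, // kept: tags may be nil
	}
	s := &memStorer{}
	if err := handle(items, s); err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range s.entries {
		fmt.Println(e.Title, e.URL, e.Tags != nil)
	}
}
```

Passing &err into the helper keeps the loop body to a single call, while the named result still carries a failure back to the caller when the function returns.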

June 2, 2014