webBotparseR

Parse search engine results

David Schoch

3/31/23

webbotparseR allows to parse search engine results that where scraped with the WebBot browser extension. A similar python library is also available.

Installation

You can install the development version of webbotparseR like so:

remotes::install_github("schochastics/webbotparseR")

The package contains an example html from a google search on climate change.

library(webbotparseR)
ex_file <- system.file("www.google.com_climatechange_text_2023-03-16_08_16_11.html", package = "webbotparseR")

Such search results can be parsed via the function parse_search_results(). The parameter engine is used to specify the search engine and the search type.

output <- parse_search_results(path = ex_file,engine = "google text")
output
#> # A tibble: 10 × 10
#>    title link  text  image page  posit…¹ searc…² type  query date               
#>    <chr> <chr> <chr> <chr> <chr>   <int> <chr>   <chr> <chr> <dttm>             
#>  1 What… http… Clim… data… 1           1 www.go… text  clim… 2023-03-16 08:16:11
#>  2 Home… http… Vita… data… 1           2 www.go… text  clim… 2023-03-16 08:16:11
#>  3 Vita… http… “Cli… data… 1           3 www.go… text  clim… 2023-03-16 08:16:11
#>  4 Clim… http… In c… data… 1           4 www.go… text  clim… 2023-03-16 08:16:11
#>  5 IPCC… http… The … data… 1           5 www.go… text  clim… 2023-03-16 08:16:11
#>  6 Clim… http… Comp… data… 1           6 www.go… text  clim… 2023-03-16 08:16:11
#>  7 Clim… http… Clim… <NA>  1           7 www.go… text  clim… 2023-03-16 08:16:11
#>  8 UNFC… http… What… data… 1           8 www.go… text  clim… 2023-03-16 08:16:11
#>  9 Clim… http… Clim… data… 1           9 www.go… text  clim… 2023-03-16 08:16:11
#> 10 Caus… http… This… data… 1          10 www.go… text  clim… 2023-03-16 08:16:11
#> # … with abbreviated variable names ¹​position, ²​search_engine

Note that images are always returned base64 encoded.

output$image[1]
#> [1] "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAIAAACQkWg2AAAABnRSTlMAAAAAAABupgeRAAAAMklEQVR4AWMAgYYG4hEdNJAHGoCIABvBJayhgcYaIAwaakCwydUA52MKYeeSCgZh4gMAXrJ9ASggqqAAAAAASUVORK5CYII="

The function base64_to_img() can be used to decode the image and save it in an appropriate format.