Scraper API – Extract Data from Websites

Scraper API provides a developer-friendly API for extracting HTML from websites. Here is a simple example using its async API, which runs the scrape in the background so that you can retrieve the result later. I am using HTTPie as my API client.

Submit the scraping job:

https --verbose POST async.scraperapi.com/jobs  Content-Type:application/json apiKey=replace-with-your-secret-api-key url=https://example.com

POST /jobs HTTP/1.1
Accept: application/json, */*;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 76
Content-Type: application/json
Host: async.scraperapi.com
User-Agent: HTTPie/3.2.1

{
    "apiKey": "replace-with-your-secret-api-key",
    "url": "https://example.com"
}


HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 177
Content-Type: application/json; charset=utf-8
Date: Sat, 16 Jul 2022 12:08:35 GMT
ETag: W/"b1-MgeoBu7BLMfqgBHAIDkp5K6yohc"
x-request-id: 36fba713a01d6fcfb115a41d764ce814

{
    "id": "7915eb6f-f109-42f3-80cc-27ba0398e2a2",
    "status": "running",
    "statusUrl": "https://async.scraperapi.com/jobs/7915eb6f-f109-42f3-80cc-27ba0398e2a2",
    "url": "https://example.com"
}
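The same submission can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the endpoint and the apiKey and url fields are taken from the request above, while the function names build_job_request and submit_job are my own, hypothetical choices.

```python
import json
import urllib.request

# Endpoint taken from the HTTPie example above.
ASYNC_JOBS_URL = "https://async.scraperapi.com/jobs"


def build_job_request(api_key: str, target_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that submits a scraping job."""
    # Same two JSON fields that HTTPie sent: apiKey and url.
    payload = json.dumps({"apiKey": api_key, "url": target_url}).encode("utf-8")
    return urllib.request.Request(
        ASYNC_JOBS_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_job(api_key: str, target_url: str) -> dict:
    """Send the request and return the decoded JSON job description."""
    request = build_job_request(api_key, target_url)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)
```

Calling submit_job("replace-with-your-secret-api-key", "https://example.com") should return a dictionary shaped like the response above, including the statusUrl to check later. Building the request separately from sending it makes it easy to inspect without touching the network.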

Now let’s get the result and pipe it through jq to format it a bit:

https GET https://async.scraperapi.com/jobs/7915eb6f-f109-42f3-80cc-27ba0398e2a2 | jq .
{
  "id": "7915eb6f-f109-42f3-80cc-27ba0398e2a2",
  "status": "finished",
  "statusUrl": "https://async.scraperapi.com/jobs/7915eb6f-f109-42f3-80cc-27ba0398e2a2",
  "url": "https://example.com",
  "response": {
    "headers": {
      "date": "Sat, 16 Jul 2022 12:08:37 GMT",
      "content-type": "text/html; charset=utf-8",
      "content-length": "1256",
      "connection": "close",
      "x-powered-by": "Express",
      "access-control-allow-origin": "undefined",
      "access-control-allow-headers": "Origin, X-Requested-With, Content-Type, Accept",
      "access-control-allow-methods": "HEAD,GET,POST,DELETE,OPTIONS,PUT",
      "access-control-allow-credentials": "true",
      "x-robots-tag": "none",
      "sa-final-url": "https://example.com/",
      "sa-statuscode": "200",
      "etag": "W/\"4e8-Sjzo7hHgkd15I/TYxuW15B7HwEc\"",
      "vary": "Accept-Encoding"
    },
    "body": "<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset=\"utf-8\" />\n    <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\" />\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />\n    <style type=\"text/css\">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, \"Segoe UI\", \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <h1>Example Domain</h1>\n    <p>This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.</p>\n    <p><a href=\"https://www.iana.org/domains/example\">More information...</a></p>\n</div>\n</body>\n</html>\n",
    "statusCode": 200
  }
}
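Because the job runs in the background, a script usually has to poll the status URL until the status changes. Here is a rough Python sketch of that loop. The only statuses I have seen are "running" and "finished", so treating anything other than "running" as final is an assumption, and wait_for_job, fetch_json, interval and max_attempts are hypothetical names and defaults of my own.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def wait_for_job(
    status_url: str,
    fetch: Callable[[str], dict] = fetch_json,
    interval: float = 2.0,
    max_attempts: int = 30,
) -> dict:
    """Poll status_url until the job is no longer "running", then return it."""
    for _ in range(max_attempts):
        job = fetch(status_url)
        # Assumption: any status other than "running" means the job is done.
        if job.get("status") != "running":
            return job
        time.sleep(interval)
    raise TimeoutError(f"job still running after {max_attempts} polls")
```

Passing the statusUrl from the submit response to wait_for_job returns the finished job, and job["response"]["body"] then holds the scraped HTML as a normal Python string. The fetch parameter exists only so the loop can be exercised with a fake function instead of real HTTP calls.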

Additionally, we can parse the HTML with pup using CSS selectors. (Note: at the time of writing, on macOS, you can install pup with brew install pup.) Let’s extract the <p> tags. Note the -r flag on jq: it prints the body as a raw string rather than a JSON-encoded one, so pup receives real HTML instead of escaped quotes and literal \n sequences.

https GET https://async.scraperapi.com/jobs/7915eb6f-f109-42f3-80cc-27ba0398e2a2 | jq -r .response.body | pup --color 'body div p'
<p>
 This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
</p>
<p>
 <a href="https://www.iana.org/domains/example">
  More information...
 </a>
</p>
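If you would rather stay in a script than shell out to pup, the paragraph extraction can be approximated with Python's built-in html.parser module. This is a simplified sketch: it collects the text of every <p> element instead of evaluating a CSS selector such as 'body div p', and the ParagraphExtractor and extract_paragraphs names are my own.

```python
from html.parser import HTMLParser
from typing import List


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._in_p = False
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Collapse runs of whitespace, including newlines inside the paragraph.
            self.paragraphs.append(" ".join("".join(self._chunks).split()))
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <a> still arrives here.
        if self._in_p:
            self._chunks.append(data)


def extract_paragraphs(html: str) -> List[str]:
    """Return the text of each <p> element found in the given HTML."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Feed it the response.body string from the decoded job JSON. For the example.com page it should yield two entries: the introductory sentence and the "More information..." link text.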

Scraper API handles the complexities around proxies and CAPTCHAs, making it easy for developers to focus on extracting the required data. It also provides good documentation to help developers get started quickly, as well as a free trial.
