
Running Puppeteer Scenarios by REST API for Web scraping

This feature covers more advanced web scraping: it lets you interact with the webpage you want to scrape using the Puppeteer automation tool.

With this method you run full-featured Puppeteer scenarios written in JavaScript, wrapped into a single POST request that you can send from any language.

Open Scenario Builder - a convenient builder that helps create Puppeteer scripts and convert them to a POST request

What can be done with Puppeteer:

  • Automated form submission
  • Keyboard input
  • Authorization / Login
  • Mouse clicks
  • Custom JavaScript execution
  • Waiting for CSS elements to appear
  • Extracting data by CSS selector
  • Page scrolling
  • XHR/AJAX requests interception
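
Several of the capabilities listed above can be combined in one scenario. The following is a minimal sketch, not a reference implementation: the context fields url and selector are hypothetical names chosen for this example, and the function is written as a plain named function so it can be tried locally. When sending it to /scenario, declare it as export default async function, as in the examples on this page.

```javascript
// Sketch of a scenario that waits for an element, scrolls, and extracts text.
// `context.url` and `context.selector` are hypothetical inputs for this example.
async function scenario({ page, context }) {
  const { url, selector } = context;

  await page.goto(url, { waitUntil: 'domcontentloaded' });

  // Wait for the CSS element to appear.
  await page.waitForSelector(selector);

  // Page scrolling via custom JavaScript executed in the page.
  await page.evaluate(() => window.scrollBy(0, window.innerHeight));

  // Extract data by CSS selector.
  const texts = await page.$$eval(selector, (elements) =>
    elements.map((el) => el.textContent.trim())
  );

  return { data: texts, type: 'application/json' };
}
```

A context such as {"url": "https://en.wikipedia.org", "selector": "h2"} would return the trimmed text of every matching element.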

The /scenario endpoint takes data from the context variable and runs the code from the code variable.

Puppeteer Example

Script:

export default async function ({ page, context }) {
  const { url } = context; // Read the `url` from context

  await page.goto( // Docs: https://pptr.dev/api/puppeteer.page.goto
    url,
    {waitUntil: 'domcontentloaded'}
  );

  const data = await page.content();
  return {
    data,
    type: 'application/html',
  };
};

Context (variables passed to the script):

{
"url": "https://en.wikipedia.org"
}

JSON strings cannot contain raw line breaks, so the script above was minified into a single line (for example, with the online Babel REPL or jscompress.com) before being placed in the following curl call:

Final POST request

curl -X POST \
'https://chrome-v2.browsercloud.io/scenario?token=API_TOKEN' \
-H 'Content-Type: application/json' \
-d '{
"code": "export default async function ({ page, context }) { const { url } = context; await page.goto( url, {waitUntil: \"domcontentloaded\"} ); const data = await page.content(); return { data, type: \"application\/html\",};};",
"context": {
"url": "https://en.wikipedia.org/"
}
}'
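
Minifying by hand is not the only way to fit a script into JSON. As a minimal sketch, assuming Node.js 18 or later (for the built-in fetch), JSON.stringify can escape a multi-line script for you. The helper names buildScenarioRequest and runScenario are hypothetical; the endpoint URL and the code and context fields are the ones shown above. The response is read as plain text here because its exact shape may depend on the type returned by the script.

```javascript
// Multi-line Puppeteer script kept readable as a template literal.
const script = `
export default async function ({ page, context }) {
  const { url } = context;
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  const data = await page.content();
  return { data, type: 'application/html' };
}`;

// Hypothetical helper: builds the URL and fetch options for the /scenario endpoint.
function buildScenarioRequest(token, code, context = {}) {
  const endpoint = new URL('https://chrome-v2.browsercloud.io/scenario');
  endpoint.searchParams.set('token', token);
  return {
    url: endpoint.toString(),
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      // JSON.stringify escapes the newlines and quotes inside `code`.
      body: JSON.stringify({ code, context }),
    },
  };
}

// Hypothetical helper: sends the request and returns the raw response body.
async function runScenario(token, code, context) {
  const { url, options } = buildScenarioRequest(token, code, context);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Scenario request failed with HTTP ${response.status}`);
  }
  return response.text();
}

// Usage (performs a real network call, so it is left commented out):
// runScenario('API_TOKEN', script, { url: 'https://en.wikipedia.org' }).then(console.log);
```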

Example: Logging in and getting data as JSON

As an example, we use the bw-bank.de demo page to show how to log in to an account and parse its data. The 'context' variable is an empty object because the URL is hard-coded in the script.

Puppeteer scenario:

export default async function ({ page, context }) {
  await page.goto(
    'https://www.bw-bank.de/en/home/login-online-banking/demo-online-banking-pushtan.html',
    {waitUntil: 'domcontentloaded'}
  );
  // Fill in the inputs and click "Log in"
  await page.type('input[autocomplete="username"]', 'pushDEMO');
  await page.type('input[type=password]', '12345');
  await page.click('[title="Log in"]');

  // Wait for the account page to load
  await page.waitForSelector('.mkp-card-group');

  // Fetch data by executing JavaScript on the page
  let payments = await page.evaluate(() => {
    let result = [];
    let elements = document.querySelectorAll('span.offscreen'); // get elements by selector
    for (let i = 0; i < elements.length; i++) { // iterate over elements
      result.push(elements[i].innerText);
    }
    return result; // returned to the 'payments' variable
  });

  return {
    data: payments,
    type: 'application/json',
  };
};

Scenario wrapped into request

curl --request POST 'https://chrome-v2.browsercloud.io/scenario?token=API_TOKEN' \
--header 'Content-Type: application/json' \
--data-raw '{"code":"export default async function({page,context}){await page.goto(\"https://www.bw-bank.de/en/home/login-online-banking/demo-online-banking-pushtan.html\",{waitUntil:\"domcontentloaded\"});await page.type('\''input[autocomplete=\"username\"]'\'',\"pushDEMO\");await page.type(\"input[type=password]\",\"12345\");await page.click('\''[title=\"Log in\"]'\'');await page.waitForSelector(\".mkp-card-group\");let payments=await page.evaluate(()=>{let result=[];let elements=document.querySelectorAll(\"span.offscreen\");for(let i=0;i<elements.length;i++){result.push(elements[i].innerText)}return result});return{data:payments,type:\"application/json\"}}","context":{}}'

Result

{
"data": [
"23.825,53 EUR",
"1.000,00 EUR",
"-125,50 EUR",
"18.235,00 EUR",
"1.000,00 USD",
"-880,00 EUR",
"1.897,45 EUR",
"2.378,90 EUR",
"97.458,32 USD",
"558,91 EUR",
"26,12 EUR",
"23.825,53 EUR",
"52.000,00 EUR",
"2.000,00 EUR",
"15.000,00 EUR",
"35.000,00 EUR",
"52.000,00 EUR",
"145.550,80 EUR",
"47.473,85 EUR",
"98.076,95 EUR",
"145.550,80 EUR",
"-36.613,93 EUR",
"-36.613,93 EUR",
"-9.922,44 EUR",
"3.172,56 EUR",
"-13.095,00 EUR",
"-9.922,44 EUR",
"510.000,00 EUR",
"450.000,00 EUR",
"60.000,00 EUR",
"510.000,00 EUR",
"684.839,96 EUR"
],
"type": "application/json"
}
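
The values in data are returned as display strings. If you need numbers, they can be converted on your side. The sketch below assumes the amounts follow the format seen in the result above (dot as thousands separator, comma as decimal separator, then a three-letter currency code); parseAmount is a hypothetical helper, not part of the API.

```javascript
// Hypothetical helper: converts a string such as "23.825,53 EUR"
// into a numeric value plus currency code. Returns null for other formats.
function parseAmount(text) {
  const match = /^(-?[\d.]+,\d{2})\s+([A-Z]{3})$/.exec(text.trim());
  if (!match) return null;
  // Drop thousands separators, then switch the decimal comma to a dot.
  const value = Number(match[1].replace(/\./g, '').replace(',', '.'));
  return { value, currency: match[2] };
}

// A few entries copied from the result above.
const sample = ['23.825,53 EUR', '-125,50 EUR', '1.000,00 USD'];
const parsed = sample.map(parseAmount);
// parsed[0] is { value: 23825.53, currency: 'EUR' }
```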

Session Timeout

Puppeteer / Playwright

By default, the session timeout is 30 seconds. If your script needs more time, set your own value in milliseconds with the timeout query parameter (&timeout=<VALUE>).

// 60-second limit:
https://chrome-v2.browsercloud.io/scenario?token=API_TOKEN&timeout=60000
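
Because the value is expected in milliseconds, it can help to build the URL in code instead of by hand. A minimal sketch, where scenarioUrl is a hypothetical helper that takes the limit in seconds and converts it:

```javascript
// Hypothetical helper: builds the /scenario URL with an optional timeout.
// `timeoutSeconds` is converted to the milliseconds the endpoint expects.
function scenarioUrl(token, timeoutSeconds) {
  const url = new URL('https://chrome-v2.browsercloud.io/scenario');
  url.searchParams.set('token', token);
  if (timeoutSeconds !== undefined) {
    if (!Number.isFinite(timeoutSeconds) || timeoutSeconds <= 0) {
      throw new RangeError('timeoutSeconds must be a positive number');
    }
    url.searchParams.set('timeout', String(Math.round(timeoutSeconds * 1000)));
  }
  return url.toString();
}

const oneMinute = scenarioUrl('API_TOKEN', 60);
// oneMinute is 'https://chrome-v2.browsercloud.io/scenario?token=API_TOKEN&timeout=60000'
```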