
Running Puppeteer Scenarios for Web Scraping

This feature supports more advanced web scraping: you interact with the page you want to scrape through the Puppeteer automation tool.

It lets you run a full-featured JavaScript scenario with Puppeteer, wrapped into a single POST request that can be sent from any language.

What you can do with Puppeteer:

  • Automated form submission
  • Keyboard input
  • Authorization / Login
  • Mouse clicks
  • Custom JavaScript execution
  • Waiting for CSS elements to appear
  • Extracting data by CSS selector
  • Page scrolling
  • XHR/AJAX requests interception

The /scenario endpoint takes its input data from the context variable and runs the code passed in the code variable.
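Several of the capabilities listed above can be combined in one scenario. The following is a minimal sketch, not an official example: it assumes the context object carries a `url` and a `selector` (both hypothetical names chosen for illustration). It waits for the selector, scrolls to the bottom of the page, extracts the text of every match, and records the URLs of XHR/fetch responses seen along the way.

```javascript
// Sketch of a scenario combining waiting, scrolling, extraction and
// response monitoring. `url` and `selector` are assumed context fields.
const scenario = async ({ page, context }) => {
  const { url, selector } = context;

  // Record the URL of every XHR/fetch response the page receives.
  const xhrUrls = [];
  page.on('response', (response) => {
    const type = response.request().resourceType();
    if (type === 'xhr' || type === 'fetch') {
      xhrUrls.push(response.url());
    }
  });

  await page.goto(url, { waitUntil: 'domcontentloaded' });

  // Wait until at least one matching element is present.
  await page.waitForSelector(selector);

  // Scroll to the bottom, which may trigger lazy-loaded content.
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

  // Extract the trimmed text of every element matching the selector.
  const texts = await page.$$eval(selector, (nodes) =>
    nodes.map((node) => node.innerText.trim())
  );

  return {
    data: { texts, xhrUrls },
    type: 'application/json',
  };
};

module.exports = scenario;
```

A matching context for this sketch could be {"url": "https://en.wikipedia.org", "selector": "h2"}. The simpler examples on the rest of this page follow the same shape.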

Getting page content

context:

{
  "url": "https://en.wikipedia.org"
}

Puppeteer scenario:

module.exports = async function ({ page, context }) {
  const { url } = context; // Read the `url` from context

  // Docs: https://pptr.dev/api/puppeteer.page.goto
  await page.goto(url, { waitUntil: 'domcontentloaded' });

  const data = await page.content();
  return {
    data,
    type: 'application/html',
  };
};

JSON does not allow multi-line strings, so the code above has to be collapsed into a single line before it goes into the request body (for example with the online Babel REPL or jscompress.com). The single-line version can then be used in the following curl call:

cURL (with an API token)

curl -X POST \
'https://chrome.browsercloud.io/scenario?token=API_TOKEN' \
-H 'Content-Type: application/json' \
-d '{
"code": "module.exports = async ({ page, context }) => { const { url } = context; await page.goto( url, {waitUntil: \"domcontentloaded\"} ); const data = await page.content(); return { data, type: \"application\/html\",};};",
"context": {
"url": "https://en.wikipedia.org/"
}
}'
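When the request is sent from JavaScript rather than from the shell, `JSON.stringify` takes care of escaping quotes and newlines, so the scenario can stay readable in the source file. The sketch below assumes Node.js 18 or later (for the global `fetch`). `buildScenarioBody` and `runScenario` are hypothetical helper names, and the response is read as plain text because its exact shape is not specified on this page.

```javascript
// Hypothetical helper: serialize the scenario code and its context into
// the JSON body expected by the /scenario endpoint.
function buildScenarioBody(code, context = {}) {
  return JSON.stringify({ code, context });
}

// Hypothetical helper: POST the scenario and return the raw response text.
async function runScenario(token, code, context = {}) {
  const endpoint =
    'https://chrome.browsercloud.io/scenario?token=' + encodeURIComponent(token);
  const response = await fetch(endpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: buildScenarioBody(code, context),
  });
  if (!response.ok) {
    throw new Error('Scenario request failed with HTTP ' + response.status);
  }
  return response.text();
}

// The scenario is kept as a readable multi-line string; JSON.stringify
// turns its line breaks into \n escapes inside the request body.
const pageContentScenario = `
module.exports = async ({ page, context }) => {
  const { url } = context;
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  const data = await page.content();
  return { data, type: 'application/html' };
};
`;

// Usage (requires a real API token, so it is left commented out):
// runScenario('API_TOKEN', pageContentScenario, { url: 'https://en.wikipedia.org/' })
//   .then((html) => console.log(html.length));
```

This assumes the service evaluates the decoded `code` string as-is, line breaks included. If it does not, the minified single-line form shown above remains the safe choice.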

Example: Logging in and getting data as JSON

As an example, we use the bw-bank.de demo page to show how to log in to an account and extract data from it. The context variable is an empty object because the URL is hard-coded in the scenario.

Puppeteer scenario:

module.exports = async ({ page, context }) => {
  await page.goto(
    'https://www.bw-bank.de/en/home/login-online-banking/demo-online-banking-pushtan.html',
    { waitUntil: 'domcontentloaded' }
  );

  // Fill in the inputs and click Login
  await page.type('.block.login input', 'pushDEMO');
  await page.type('.block.login [type=password]', '12345');
  await page.click('[title="Login securely"]');

  // Wait for the account page to load
  await page.waitForSelector('.btableblock');

  // Fetch the data by executing JavaScript on the page
  const payments = await page.evaluate(() => {
    const result = [];
    const elements = document.querySelectorAll('.balance .offscreen'); // get elements by selector
    for (let i = 0; i < elements.length; i++) { // iterate over elements
      result.push(elements[i].innerText);
    }
    return result; // returned to the `payments` variable
  });

  return {
    data: payments,
    type: 'application/json',
  };
};

Scenario wrapped into a request

curl --location --request POST 'https://chrome.browsercloud.io/scenario?token=API_TOKEN' \
--header 'Content-Type: application/json' \
--data-raw '{"code":"module.exports=async({page:a,context:b})=>{await a.goto(\"https:\/\/www.bw-bank.de\/en\/home\/login-online-banking\/demo-online-banking-pushtan.html\",{waitUntil:\"domcontentloaded\"}),await a.type(\".block.login input\",\"pushDEMO\"),await a.type(\".block.login [type=password]\",\"12345\"),await a.click(\"[title=\\\"Login securely\\\"]\"),await a.waitForSelector(\".btableblock\");let c=await a.evaluate(()=>{let a=[],b=document.querySelectorAll(\".balance .offscreen\");for(let i=0;i<b.length;i++)a.push(b[i].innerText);return a});return{data:c,type:\"application\/json\"}};","context":{}}'

Result

["24.705,53 EUR","1.000,00 EUR","1.000,00 EUR","-125,50 EUR","-125,50 EUR","18.235,00 EUR","18.235,00 EUR","1.000,00 USD","1.000,00 USD","1.897,45 EUR","1.897,45 EUR","2.378,90 EUR","2.378,90 EUR","97.458,32 USD","97.458,32 USD","558,91 EUR","558,91 EUR","26,12 EUR","26,12 EUR","24.705,53 EUR","52.000,00 EUR","2.000,00 EUR","2.000,00 EUR","15.000,00 EUR","15.000,00 EUR","35.000,00 EUR","35.000,00 EUR","52.000,00 EUR","145.550,80 EUR","47.473,85 EUR","47.473,85 EUR","98.076,95 EUR","98.076,95 EUR","145.550,80 EUR","-36.613,93 EUR","-36.613,93 EUR","-36.613,93 EUR","-880,00 EUR","-880,00 EUR","-880,00 EUR","510.000,00 EUR","450.000,00 EUR","450.000,00 EUR","60.000,00 EUR","60.000,00 EUR","510.000,00 EUR","694.762,40 EUR"]
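The values in this result are strings in German number format (a dot as the thousands separator, a comma as the decimal separator) followed by a currency code. If numeric values are needed, they can be converted on the client side. The helper below is a sketch under that formatting assumption; `parseAmount` is a hypothetical name, not part of the API.

```javascript
// Hypothetical helper: convert a string such as "24.705,53 EUR" into
// { value: 24705.53, currency: 'EUR' }. Returns null if the text does
// not match the assumed "<number> <3-letter code>" format.
function parseAmount(text) {
  const match = /^(-?[\d.]+,\d{2})\s+([A-Z]{3})$/.exec(text.trim());
  if (!match) {
    return null;
  }
  const normalized = match[1]
    .replace(/\./g, '') // drop thousands separators
    .replace(',', '.'); // use a dot as the decimal separator
  return { value: Number(normalized), currency: match[2] };
}

console.log(parseAmount('24.705,53 EUR')); // { value: 24705.53, currency: 'EUR' }
console.log(parseAmount('-125,50 EUR'));   // { value: -125.5, currency: 'EUR' }
```

Applied to the whole array, `result.map(parseAmount)` yields one object per entry, with `null` for anything that does not match the assumed format.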

Session Timeout

Puppeteer / Playwright

By default, the session timeout is 30 seconds. If your script needs more time, set your own value by adding &timeout=<VALUE> (in milliseconds) to the endpoint URL.

// 60-second limit:
https://chrome.browsercloud.io/scenario?token=API_TOKEN&timeout=60000

Proxies

To use the built-in proxies, add the parameter &--proxy-server=browsercloud-proxies to the API endpoint URL:

https://chrome.browsercloud.io/scenario?token=API_TOKEN&--proxy-server=browsercloud-proxies
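The query parameters described above can also be assembled in code instead of by hand. The following sketch uses the standard `URL` class. `buildScenarioUrl` and its option names are hypothetical, and combining `timeout` with `--proxy-server` in one URL is an assumption, since this page only shows them separately.

```javascript
// Hypothetical helper: build the /scenario endpoint URL from options.
function buildScenarioUrl({ token, timeoutMs, useProxy = false }) {
  const url = new URL('https://chrome.browsercloud.io/scenario');
  url.searchParams.set('token', token);
  if (timeoutMs !== undefined) {
    // Session timeout in milliseconds.
    url.searchParams.set('timeout', String(timeoutMs));
  }
  if (useProxy) {
    // Enable the built-in proxies.
    url.searchParams.set('--proxy-server', 'browsercloud-proxies');
  }
  return url.toString();
}

console.log(buildScenarioUrl({ token: 'API_TOKEN', timeoutMs: 60000 }));
// https://chrome.browsercloud.io/scenario?token=API_TOKEN&timeout=60000
console.log(buildScenarioUrl({ token: 'API_TOKEN', useProxy: true }));
// https://chrome.browsercloud.io/scenario?token=API_TOKEN&--proxy-server=browsercloud-proxies
```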