5 posts tagged with "webscraping"

· 5 min read
Mark

Large language models, or LLMs, collect data from many sources and present it to the end user. Some projects provide initial data sets to get your LLM started. But what if you want to compete with big players like OpenAI? BrowserCloud helps you begin with a simple API call and work up to more complex techniques. In this guide, we introduce the principles of working with an LLM and training it.

About BrowserCloud

BrowserCloud is a service that manages Chrome and makes it programmatically available to developers. Controlling Chrome usually means working with automation libraries, but most LLM pipelines only need raw data, so they can use BrowserCloud's REST-based APIs, which are lighter-weight than library-level automation and cover the common use cases. Two APIs in particular make retrieving website data much easier: the Scrape API and the Content API.

Using web scraping for LLM training

The Scrape API retrieves website data once the page's JavaScript has been parsed and executed. Its REST-based interface takes a JSON body describing your request and how the extraction should be done. The simple example below targets CNN.com; the API waits for the JavaScript to finish, collects the document data, and returns a response like the shortened one that follows:

curl --request POST \
  --url 'https://chrome.browsercloud.io/scrape?token=YOUR-API-KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://cnn.com",
    "elements": [
      {
        "selector": "body"
      }
    ]
  }'

The API returns the "text" of the website inside a JSON structure, and it is this text that an LLM trains on. It also provides content metadata such as pixel dimensions and position. These additions expand the model's knowledge of the data and can increase its usefulness.

const results = {
  "data": [
    {
      "selector": "body",
      "results": [
        {
          "text": "Audio\nLive TV\nLog In\nHAPPENING NOW\nAnalysis of Donald Trump's historic arraignment on 37 federal criminal charges. Watch CNN\nLive Updates: Ukraine \nTrump arraignment \nTrending: Tori Bowie autopsy \nInstant Brands bankruptcy \nHatch Act \nFather’s Day gifts \nPodcast: 5 Things\nTrump pleads not guilty to mishandling classified intelligence documents\nAnna Moneymaker/Getty Images\nFACT CHECK\nTrump has responded to his federal indictment with a blizzard of dishonesty. Here are 10 of his claims fact-checked\nOpinion: Trump backers are going bonkers with dangerous threats\nDoug Mills/The New York Times/Redux\nGALLERY\nIn pictures: The federal indictment of Donald Trump\n‘Dejected’: Grisham describes Trump’s demeanor as he headed to court\nTrump didn’t speak during the historic hearing, sitting with his arms crossed and a scowl on his face. Here’s what else happened\nLive Updates: Judge says Trump must not communicate with co-defendant about the case\nTakeaways from Trump’s historic court appearance\nWatch: Hear how Trump acted inside the ‘packed courtroom’ during arraignment\nJudge allows E. Jean Carroll to amend her defamation lawsuit to seek more damages against Trump\nInteractive: Former President Donald Trump’s second indictment, annotated\nDoug Mills/The New York Times/Redux\nVIDEO\nTrump stops at famous Cuban restaurant after his arrest...© 2016 Cable News Network.",
          "width": 800,
          "height": 25409,
          "top": 0,
          "left": 0,
          "attributes": [
            {
              "name": "class",
              "value": "layout layout-homepage cnn"
            },
            {
              "name": "data-page-type",
              "value": "section"
            }
          ]
        }
      ]
    }
  ]
}
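A response like the one above can be flattened into plain training text with a few lines of JavaScript. This is a minimal sketch, assuming only the JSON shape shown (a `data` array of selector entries, each carrying a `results` array with a `text` field):

```javascript
// Flatten a Scrape API response into one plain-text string
// suitable for an LLM training pipeline.
function extractText(response) {
  return response.data
    .flatMap((entry) => entry.results) // merge all selector results
    .map((result) => result.text)      // keep only the text payload
    .filter(Boolean)                   // drop empty entries
    .join('\n');
}

// Shortened stand-in for the CNN response shown above.
const results = {
  data: [
    { selector: 'body', results: [{ text: 'Audio\nLive TV\nLog In', width: 800 }] },
  ],
};

console.log(extractText(results)); // prints the three header lines
```

The metadata fields (width, height, position) survive in the original objects, so you can also filter results by size or position before joining if you want to exclude, say, off-screen elements.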

Content API for LLM training

The Content API is similar to the Scrape API in that it also returns content after the JavaScript has been parsed and executed, and it is likewise called with a POST whose JSON body names the URL you want. The difference is that the Content API returns only the site's HTML content, with no additional parsing. Below is an example:

curl --request POST \
  --url 'https://chrome.browsercloud.io/content?token=YOUR-API-KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://cnn.com"
  }'

Only the HTML code of the page is returned. If you wish, you can turn to other libraries to help with content extraction, but many LLM pipelines can parse raw HTML like this directly:

<!DOCTYPE html><html lang="en" data-uri="cms.cnn.com/_pages/clg34ol9u000047nodabud1o2@published" data-layout-uri="cms.cnn.com/_layouts/layout-homepage/instances/homepage-domestic@published" class="userconsent-cntry-us userconsent-reg-ccpa"><head><script type="text/javascript" src="https://cdn.krxd.net/userdata/get?pub=e9eaedd3-c1da-4334-82f0-d7e3ff883c87&amp;technographics=1&amp;callback=Krux.ns._default.kxjsonp_userdata"></script><script type="text/javascript" src="https://beacon.krxd.net/optout_check?callback=Krux.ns._default.kxjsonp_optOutCheck"></script><script type="text/javascript" src="https://beacon.krxd.net/cookie2json?callback=Krux.ns._default.kxjsonp_3pevents"></script><script type="text/javascript" src="https://consumer.krxd.net/consent/get/e9eaedd3-c1da-4334-82f0-d7e3ff883c87?idt=device&amp;dt=kxcookie&amp;callback=Krux.ns._default.kxjsonp_consent_get_0"></script><script type="text/javascript" async="" src="https://lightning.cnn.com/cdp/psm/i/web/release/3.3.1/psm.legacy.min.umd.js"></script><script type="text/javascript" async="" src="//www.i.cdn.cnn.com/zion/zion-mb.min.js"></script><script async="" src="https://cdn.boomtrain.com/p13n/cnn/p13n.min.js"></script><script type="text/javascript" async="" src="https://lightning.cnn.com/cdp/psm/brands/cnn/web/release/psm.min.js"></script><script async="" src="//cdn.krxd.net/ctjs/controltag.js.d58f47095e6041e576ee04944cca45da"></script><script type="text/javascript" defer="" async="" src="https://z.cdp-dev.cnn.com/zfm/zfh-3.js"></script><script id="GPTScript" type="text/javascript" src="https://securepubads.g.doubleclick.net/tag/js/gpt.js"></script><script type="text/javascript" src="https://steadfastseat.com/v2svxFVJ-Mg82zHMJUHkQBWwVF721AsFf1Y3MomzEUqIMQlG6f2VaL6ctdsQc2VgA"></script><script type="text/javascript" async="" src="//www.ugdturner.com/xd.sjs"></script><script async="" src="//static.adsafeprotected.com/iasPET.1.js"></script><script async="" 
src="//c.amazon-adsystem.com/aax2/apstag.js"></script><script async="" src="https://vi.ml314.com/get?eid=64240&amp;tk=GBYTTE9dUG2OqHj1Rk9DPOaLspvMWfLqV236sdkHgf03d&amp;fp="></script><script async="" src="https://cdn.ml314.com/taglw.js"></script><script type="text/javascript" async="" src="https://sb.scorecardresearch.com/beacon.js"></script><script type="text/javascript" async="" src="//s.cdn.turner.com/analytics/comscore/streamsense.5.2.0.160629.min.js"></script><script type="text/javascript" async="" src="//cdn3.optimizely.com/js/geo4.js"></script><style>body,h1,h2,h3,h4,h5{font-family:cnn_sans_display,helveticaneue,Helvetica,Arial,Utkal,sans-serif}h1,h2,h3,h4,h5{font-weight:700}:root{--theme-primary:#cc0000;--theme-background:#0c0c0c;--theme-divider:#404040;--theme-copy:#404040;--theme-copy-accent:#e6e6e6;--theme-copy-accent-hover:#ffffff;--theme-icon-color:#e6e6e6;--t......
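If you do want to pre-clean HTML like this before training, a simple pass can drop scripts, styles, and tags. This is a deliberately naive sketch for illustration; a real pipeline would use a proper HTML parser rather than regular expressions:

```javascript
// Naive HTML-to-text conversion for Content API output.
// Regexes are fine for a demo but brittle on real-world HTML;
// use an actual parser (e.g. an HTML parsing library) in production.
function htmlToText(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ') // drop inline scripts
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')   // drop inline styles
    .replace(/<[^>]+>/g, ' ')                    // strip remaining tags
    .replace(/\s+/g, ' ')                        // collapse whitespace
    .trim();
}

const sample = '<html><head><style>body{}</style></head>' +
  '<body><h1>Breaking</h1> news</body></html>';
console.log(htmlToText(sample)); // → "Breaking news"
```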

Why use BrowserCloud for LLM training?

Many sites use JavaScript to fetch additional resources or run as single-page applications. Such sites cannot simply be fetched with a plain HTTP request or a system call. Even Google uses a headless browser to generate data for its search index. BrowserCloud renders these sites and services so your LLM receives relevant, complete content. In addition, you can configure wait times and ad-blocking behavior, and BrowserCloud's stealth options help it get past bot detection on websites, and much more. To the target site, it looks exactly like a regular web browser!

Your custom JavaScript function for advanced data scraping

If you don't mind writing a little JavaScript, you can use the Function API to extract exactly the data you need. You send code that runs on our service and returns only the data you want. As an example, let's replace newlines and carriage returns and keep the first 1000 characters of CNN's body text:

curl --request POST \
  --url 'https://chrome.browsercloud.io/function?token=YOUR-API-KEY' \
  --header 'Content-Type: application/javascript' \
  --data 'module.exports = async ({ page }) => {
    await page.goto('\''https://cnn.com'\'', { timeout: 120000 });
    const data = await page.evaluate(() => document.body.innerText);
    const cleaned = data.replace(/(\r\n|\n|\r)/gm, '\''. '\'');
    const trimmed = cleaned.substring(0, 1000);

    return {
      data: trimmed,
      type: '\''text'\'',
    };
  };'
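The cleanup step inside that function can be tried locally before deploying it. This standalone sketch mirrors the same replace-and-trim logic:

```javascript
// Mirror of the cleanup performed in the function above:
// replace line breaks with ". " and keep the first 1000 characters.
function cleanBodyText(data) {
  const cleaned = data.replace(/(\r\n|\n|\r)/gm, '. ');
  return cleaned.substring(0, 1000);
}

console.log(cleanBodyText('Audio\nLive TV\nLog In'));
// → "Audio. Live TV. Log In"
```

Running the transformation server-side keeps the response small: only the trimmed string travels over the network instead of the full page text.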

· 5 min read
Mark

Selenium has long been used as a test automation framework and has been the standard for DevOps teams. Selenium is open source, and its suite includes components such as Selenium WebDriver, Selenium IDE and Selenium Grid.

  • WebDriver is used to create test scripts.
  • Selenium IDE is a record-and-playback tool that lets you create tests without writing code, making it suitable for beginners.
  • Selenium Grid allows you to run multiple tests simultaneously and offers multiple physical and virtual devices.

Selenium is considered a convenient tool, but today there are alternatives that attract new users by avoiding its disadvantages. Let's clarify exactly what difficulties arise when working with Selenium:

  • Coding required. DevOps engineers focus on coding and innovation, while non-technical tasks are assigned to testers. Selenium alternatives offer automated testing that requires little or no coding.
  • No built-in image comparison. Selenium doesn't have a built-in visual-diff feature, so you have to use third-party resources.
  • Limited reporting. Selenium has limited reporting capabilities, so the only way around this is third-party plugins.

  • No native or mobile web support. Selenium is a framework for testing web applications, but it does not support testing mobile web applications.

Alternatives for Selenium

BrowserCloud

BrowserCloud is a convenient platform for automated testing. It manages a large network of full-featured Chrome browsers running on high-performance AMD cores and speeds up testing with parallel sessions. BrowserCloud lets you forget about managing dependencies, fonts, memory leaks, and your own infrastructure. The main features of BrowserCloud:

  • A high-performance, scalable solution for browser automation: run hundreds of browsers simultaneously in the cloud, with real-time monitoring and built-in web scraping functionality
  • One tool for many applications: End-to-End Testing, Web Scraping, PDF/Image rendering, ... any web automation!
  • Allows you to switch your automation scripts to the scalable cloud solution by changing just a few lines of code.
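Switching an existing script to the cloud typically means replacing a local browser launch with a remote connection. The sketch below is a hypothetical illustration: the WebSocket host and `token` query parameter are assumptions, not confirmed BrowserCloud API details, so check the official docs for the real endpoint:

```javascript
// Hypothetical endpoint builder — host and `token` parameter
// are assumptions for illustration, not a documented API.
function wsEndpoint(token) {
  return `wss://chrome.browsercloud.io?token=${encodeURIComponent(token)}`;
}

// Before: const browser = await puppeteer.launch();
// After (sketch):
// const browser = await puppeteer.connect({
//   browserWSEndpoint: wsEndpoint('YOUR-API-KEY'),
// });

console.log(wsEndpoint('YOUR-API-KEY'));
// → "wss://chrome.browsercloud.io?token=YOUR-API-KEY"
```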

Cypress

Cypress is an open-source, automated testing platform. The platform has two types of tools:

  • Cypress Test Runner to run tests in the browser
  • Cypress Dashboard for CI

Cypress's tools follow modern development principles, and its version 4 bundle was released on 15/10/20. Additionally, the platform has some nice features that make it a great alternative to Selenium:

  • Allows you to write tests in real time while developing an application
  • Provides test snapshots from the command log
  • Automatically waits for commands and assertions

WebdriverIO

Unlike Cypress, WebdriverIO does not offer commercial options. It is a framework for automating web and mobile applications, used for end-to-end testing of an application in JavaScript/TypeScript. When WebdriverIO is combined with the WebDriver protocol, the tool's functionality expands and new capabilities appear; for example, a programmer can conduct cross-browser testing. Main features of WebdriverIO:

  • Test suite scales and remains stable
  • Allows you to expand your installation using built-in plugins or those provided by the community
  • Includes built-in tools for testing mobile applications
  • Easy to set up

Playwright

Playwright is developed by Microsoft contributors. It is open source and used to automate testing of Chromium, Firefox and WebKit based browsers through a single API. The platform supports several programming languages, including Java, Python, C# and NodeJS. Main features of Playwright:

  • Easy setup and configuration
  • Supports various browsers including Chrome, Edge, Safari and Firefox
  • Multilingual support
  • Provides the ability to test browsers in parallel
  • Provides support for multiple tabs/browsers

Cucumber

Cucumber is an automation tool for behavior-driven development (BDD). Although Cucumber was originally written in Ruby, it has been ported to Java and JavaScript. Unlike Selenium, the tool allows the user to write test scenarios without coding, using the Gherkin language. Main features of Cucumber:

  • Helps involve business stakeholders who can't read code
  • Focus on user experience
  • Enables code reuse thanks to a simple writing style
  • Easy to set up and implement

NightwatchJS

NightwatchJS is a Node.js based framework that uses the Webdriver protocol. It allows you to conduct various tests, including end-to-end testing, component testing, visual regression testing, and so on. In addition, NightwatchJS is used for native testing of mobile applications. Key features of Nightwatch:

  • Test scripts are easy to read
  • They can be used for various types of testing
  • The framework is easily customizable and expandable
  • Supports the page object pattern

In conclusion

There are many alternative tools other than Selenium that offer robust features for both testers and developers. Each of them has its pros and cons, and some can even be used with the Selenium framework. When choosing a specific tool, you need to consider the DevOps team's goals, competencies, testing scope, and other factors related to a specific product.

· 8 min read
Mark

Finding a list of free proxies can feel like striking gold, only to discover that someone else has already claimed the nugget. Many people see proxy lists as a dream come true; however, that dream can quickly become a nightmare.

Sites block new proxies as their usage increases. This means new free web scraping proxy websites are likely too good to be true — they might work for a short time, but then get blocked by all the sites they're used on.

While paid proxies can also be blocked, the risk is lower because they aren't listed on a public proxy list, which makes them harder to identify and block. Free proxies are more susceptible to blacklisting because their providers have little control over who uses them.

Free proxy servers are often banned because their IP addresses are shared among many browsers and anonymous users. By using these servers, you unintentionally share an IP with strangers, which can lead to bans and damages the viability of the servers. The benefit of a paid, dedicated server is that no one else can get your IP address banned by misbehaving on it. There are many free proxy lists to choose from, which is why we created the best proxy lists for web scraping and the top free proxy lists. These lists are completely free and provide an alternative to commercial services.
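The blocking dynamics described above are why scrapers rotate through a pool of addresses rather than hammering one. A minimal round-robin rotation sketch, purely for illustration (the addresses are made up):

```javascript
// Minimal round-robin proxy rotator that skips addresses
// already observed to be banned.
function makeRotator(proxies) {
  let index = 0;
  const banned = new Set();
  return {
    next() {
      for (let i = 0; i < proxies.length; i++) {
        const proxy = proxies[index];
        index = (index + 1) % proxies.length;
        if (!banned.has(proxy)) return proxy;
      }
      return null; // every proxy in the pool is banned
    },
    ban(proxy) { banned.add(proxy); },
  };
}

const rotator = makeRotator(['10.0.0.1:8080', '10.0.0.2:8080']);
console.log(rotator.next()); // → "10.0.0.1:8080"
rotator.ban('10.0.0.2:8080');
console.log(rotator.next()); // → "10.0.0.1:8080" (the banned one is skipped)
```

With a free list, the `banned` set tends to fill up fast; with paid proxies it stays mostly empty, which is the practical difference the paragraphs above describe.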

· 3 min read
Mark

1. BrowserCloud.io

Website: https://browsercloud.io

BrowserCloud is a popular proxy service that many people know.

In comparison to other services, BrowserCloud's proxies are extremely reliable. Thanks to 99.99% uptime and unlimited bandwidth, their proxies are some of the most dependable on the market.

BrowserCloud provides unmatched reliability at scale. Their browser-based cloud service automatically tries multiple methods to complete a single request and can be customized to meet specific needs. Plans start at $19 per month for 120,000 pageviews and go up to $249 per month for 3 million pageviews on a standard plan. If you need more or need to scrape larger volumes, you can create a custom Enterprise Plan. You can begin using their services with free 2000 API credits every month to test their proxies. Plus, they offer a Pay As You Go plan that doesn't require a monthly commitment.

One of the best things about BrowserCloud is its excellent support team. They can help with many issues, including ticketing sites blocked by other hosting companies; with other providers, users usually aren't able to get such sites unblocked.

2. Luminati

Luminati lets users mask their location and activity through a proxy pool of more than 70M IPs, offering four types of proxies: mobile, residential, static residential, and data center. The provider's platform manages cookies, IPs and more. Their proxies have many great features built in, making them very effective for web scraping, and the platform is easy to integrate and use.

3. Oxylabs

Oxylabs provides private data extraction and datasets to its clients. They don't offer a dedicated ticketing agent, but their service is still a great option for anyone looking to get tickets to an event. The company offers an expedited checkout process and live chat to help customers buy and maintain their proxies. Their prices are on the higher end, but the reliability and high-quality service they offer make them an excellent choice.

4. Storm Proxies

With a wide range of options for proxy needs, Storm Proxies is one of the more unique services on the list. They built their own technology to give clients the best service they can offer. For clients looking to get tickets, their rotating backconnect proxies are one of the best choices: the company provides residential rotating proxies specifically for ticketing. Packages start at $160 per month for 20 IPs and go up to $900 per month for 200 IPs, so these proxies are quite expensive.

Customers can select between USA, European Union, USA + European Union, or worldwide proxy servers.

5. Rotating Proxies

Proxy servers from the company Rotating Proxies are used for all purposes; their anonymous, secure servers don't fail or get blocked by Ticketmaster. The company sells proxies on a per-proxy basis, and subscriptions are available for paying for more than one server at a time.

6. High Proxies

With High Proxies, users can purchase packages targeted at specific needs, including private proxies for ticketing and classifieds as well as shared proxies for social media. Specialized proxies cost more than standard ones. The servers are available 24/7 with near-100% uptime, which makes them great for competing with scalpers who resell tickets at high prices. High Proxies says it purchases new servers frequently, so if you choose one of their services, you know the quality of the proxies you're getting.

· 6 min read
Mark

Web scraping can be difficult because many websites attempt to block developers from scraping them. They do this by detecting IP addresses, inspecting HTTP request headers, using CAPTCHAs, inspecting JavaScript code and more. In response, web scrapers can be made extremely hard to detect by imitating the very signals those sites check for. Here are some tips for making scraping a website easy: