4 posts tagged with "proxies"

· 5 min read
Mark

Large language models (LLMs) are trained on data collected from many sources and use it to answer end users. Some projects provide initial data sets to get your LLM started. But what if you want to compete with big players like OpenAI? BrowserCloud lets you collect your own data with a simple API call and also supports more complex techniques. In this guide, we introduce the basics of gathering web data for LLM training.

About BrowserCloud

BrowserCloud is a service that manages Chrome and makes it programmatically available to developers. Usually, you control Chrome through a library. However, most LLM projects only need raw data, so they can use BrowserCloud's REST-based APIs, which are more lightweight than the library-based APIs. These REST APIs cover common use cases, and two of them make retrieving website data much easier: the Scrape API and the Content API.

Using web scraping for LLM training

The Scrape API retrieves website data once the page's JavaScript has been parsed and run. It is REST-based and takes a JSON body that describes your request: the URL to load and the elements to select. The simple example below sends a request for CNN.com. The API waits for the JavaScript to parse and run, collects the data for the document, and returns a response like the shortened example shown after the request:

curl --request POST \
  --url 'https://chrome.browsercloud.io/scrape?token=YOUR-API-KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://cnn.com",
    "elements": [
      {
        "selector": "body"
      }
    ]
  }'
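
The same request can be assembled in JavaScript. The following is a minimal sketch, not an official client: the helper name buildScrapeRequest is hypothetical, and the endpoint, the token query parameter, and the body shape are simply copied from the curl example. It only builds the request; passing the result to fetch (available in Node.js 18 and later) would send it.

```javascript
// Hypothetical helper: builds the URL and fetch options for a Scrape API call.
// The endpoint and body shape mirror the curl example; nothing is sent here.
function buildScrapeRequest(token, pageUrl, selectors) {
  const endpoint = new URL("https://chrome.browsercloud.io/scrape");
  endpoint.searchParams.set("token", token);

  return {
    url: endpoint.toString(),
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        url: pageUrl,
        // One element descriptor per CSS selector, as in the curl body
        elements: selectors.map((selector) => ({ selector })),
      }),
    },
  };
}

const request = buildScrapeRequest("YOUR-API-KEY", "https://cnn.com", ["body"]);
console.log(request.url);
// To send it: const response = await fetch(request.url, request.options);
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before any network call is made.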

The API returns the "text" of the website inside a JSON structure. This "text" is what an LLM is most interested in. The response also includes metadata about the content, such as its size and position in pixels. These extras can give the model more context about the data and how important each piece might be.

const results = {
"data": [
{
"selector": "body",
"results": [
{
"text": "Audio\nLive TV\nLog In\nHAPPENING NOW\nAnalysis of Donald Trump's historic arraignment on 37 federal criminal charges. Watch CNN\nLive Updates: Ukraine \nTrump arraignment \nTrending: Tori Bowie autopsy \nInstant Brands bankruptcy \nHatch Act \nFather’s Day gifts \nPodcast: 5 Things\nTrump pleads not guilty to mishandling classified intelligence documents\nAnna Moneymaker/Getty Images\nFACT CHECK\nTrump has responded to his federal indictment with a blizzard of dishonesty. Here are 10 of his claims fact-checked\nOpinion: Trump backers are going bonkers with dangerous threats\nDoug Mills/The New York Times/Redux\nGALLERY\nIn pictures: The federal indictment of Donald Trump\n‘Dejected’: Grisham describes Trump’s demeanor as he headed to court\nTrump didn’t speak during the historic hearing, sitting with his arms crossed and a scowl on his face. Here’s what else happened\nLive Updates: Judge says Trump must not communicate with co-defendant about the case\nTakeaways from Trump’s historic court appearance\nWatch: Hear how Trump acted inside the ‘packed courtroom’ during arraignment\nJudge allows E. Jean Carroll to amend her defamation lawsuit to seek more damages against Trump\nInteractive: Former President Donald Trump’s second indictment, annotated\nDoug Mills/The New York Times/Redux\nVIDEO\nTrump stops at famous Cuban restaurant after his arrest...© 2016 Cable News Network.",
"width": 800,
"height": 25409,
"top": 0,
"left": 0,
"attributes": [
{
"name": "class",
"value": "layout layout-homepage cnn"
},
{
"name": "data-page-type",
"value": "section"
}
]
}
]
}
]
}
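
To turn a response like this into training text, you can walk the JSON and keep only the "text" fields. The sketch below assumes the response shape shown above (data[].results[].text); the function name extractTexts and the small stand-in response are illustrative and not part of the BrowserCloud API.

```javascript
// Collects the "text" field of every matched element in a Scrape API response.
// Assumes the response shape shown above: data[].results[].text
function extractTexts(response) {
  const texts = [];
  for (const entry of response.data ?? []) {
    for (const result of entry.results ?? []) {
      // Skip missing or empty text so blank elements do not end up in the data set
      if (typeof result.text === "string" && result.text.trim() !== "") {
        texts.push(result.text.trim());
      }
    }
  }
  return texts;
}

// Small stand-in response with the same structure as the example above.
const sampleResponse = {
  data: [
    {
      selector: "body",
      results: [
        { text: "Audio\nLive TV\nLog In", width: 800, height: 25409, top: 0, left: 0 },
      ],
    },
  ],
};

const texts = extractTexts(sampleResponse);
console.log(texts);
```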

Content API for LLM training

The Content API is similar to the Scrape API in that it returns content after the JavaScript has been parsed and executed. As with the Scrape API, you send a POST request with a JSON body that contains the URL you want. The difference is that the Content API returns only the HTML content of the page itself, without additional parsing. Here is an example:

curl --request POST \
  --url 'https://chrome.browsercloud.io/content?token=YOUR-API-KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://cnn.com"
  }'

Only the HTML of the page is returned. If you wish, you can use other libraries to help with content extraction, but many LLMs can work with the raw HTML directly:

<!DOCTYPE html><html lang="en" data-uri="cms.cnn.com/_pages/clg34ol9u000047nodabud1o2@published" data-layout-uri="cms.cnn.com/_layouts/layout-homepage/instances/homepage-domestic@published" class="userconsent-cntry-us userconsent-reg-ccpa"><head><script type="text/javascript" src="https://cdn.krxd.net/userdata/get?pub=e9eaedd3-c1da-4334-82f0-d7e3ff883c87&amp;technographics=1&amp;callback=Krux.ns._default.kxjsonp_userdata"></script><script type="text/javascript" src="https://beacon.krxd.net/optout_check?callback=Krux.ns._default.kxjsonp_optOutCheck"></script><script type="text/javascript" src="https://beacon.krxd.net/cookie2json?callback=Krux.ns._default.kxjsonp_3pevents"></script><script type="text/javascript" src="https://consumer.krxd.net/consent/get/e9eaedd3-c1da-4334-82f0-d7e3ff883c87?idt=device&amp;dt=kxcookie&amp;callback=Krux.ns._default.kxjsonp_consent_get_0"></script><script type="text/javascript" async="" src="https://lightning.cnn.com/cdp/psm/i/web/release/3.3.1/psm.legacy.min.umd.js"></script><script type="text/javascript" async="" src="//www.i.cdn.cnn.com/zion/zion-mb.min.js"></script><script async="" src="https://cdn.boomtrain.com/p13n/cnn/p13n.min.js"></script><script type="text/javascript" async="" src="https://lightning.cnn.com/cdp/psm/brands/cnn/web/release/psm.min.js"></script><script async="" src="//cdn.krxd.net/ctjs/controltag.js.d58f47095e6041e576ee04944cca45da"></script><script type="text/javascript" defer="" async="" src="https://z.cdp-dev.cnn.com/zfm/zfh-3.js"></script><script id="GPTScript" type="text/javascript" src="https://securepubads.g.doubleclick.net/tag/js/gpt.js"></script><script type="text/javascript" src="https://steadfastseat.com/v2svxFVJ-Mg82zHMJUHkQBWwVF721AsFf1Y3MomzEUqIMQlG6f2VaL6ctdsQc2VgA"></script><script type="text/javascript" async="" src="//www.ugdturner.com/xd.sjs"></script><script async="" src="//static.adsafeprotected.com/iasPET.1.js"></script><script async="" 
src="//c.amazon-adsystem.com/aax2/apstag.js"></script><script async="" src="https://vi.ml314.com/get?eid=64240&amp;tk=GBYTTE9dUG2OqHj1Rk9DPOaLspvMWfLqV236sdkHgf03d&amp;fp="></script><script async="" src="https://cdn.ml314.com/taglw.js"></script><script type="text/javascript" async="" src="https://sb.scorecardresearch.com/beacon.js"></script><script type="text/javascript" async="" src="//s.cdn.turner.com/analytics/comscore/streamsense.5.2.0.160629.min.js"></script><script type="text/javascript" async="" src="//cdn3.optimizely.com/js/geo4.js"></script><style>body,h1,h2,h3,h4,h5{font-family:cnn_sans_display,helveticaneue,Helvetica,Arial,Utkal,sans-serif}h1,h2,h3,h4,h5{font-weight:700}:root{--theme-primary:#cc0000;--theme-background:#0c0c0c;--theme-divider:#404040;--theme-copy:#404040;--theme-copy-accent:#e6e6e6;--theme-copy-accent-hover:#ffffff;--theme-icon-color:#e6e6e6;--t......
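
If you do want plain text rather than raw HTML, a rough first pass can be done without any library. The sketch below is a deliberately naive illustration: the function name htmlToText is hypothetical, it relies on regular expressions rather than a real HTML parser, and it does not decode entities such as &amp;. A production pipeline would more likely use a dedicated parsing library.

```javascript
// Naive HTML-to-text conversion for illustration only.
// Regular expressions cannot handle every HTML edge case.
function htmlToText(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ") // drop script and style blocks
    .replace(/<[^>]+>/g, " ") // drop the remaining tags
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim();
}

const sampleHtml =
  '<html><head><script src="a.js"></script><style>body{color:red}</style></head>' +
  "<body><h1>Headline</h1><p>First paragraph.</p></body></html>";

const plainText = htmlToText(sampleHtml);
console.log(plainText); // prints "Headline First paragraph."
```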

Why use BrowserCloud for LLM training?

Many sites use JavaScript to fetch additional resources and run as single-page applications. You cannot simply fetch such sites with a plain HTTP request or a system call and expect to get their content. Even Google uses a headless browser to generate data for its search index. BrowserCloud handles these sites and services so that your LLM receives relevant and complete content. In addition, you can configure wait conditions and ad blocking. BrowserCloud also gets past bot detection on websites using stealth options, and much more. It looks exactly like your own web browser!

Your custom JavaScript function for advanced data scraping

If you don't mind writing a little JavaScript, you can use the function API to extract data. You send code that runs on the service, and it returns only the data you need. As an example, let's replace the newlines and carriage returns in CNN's page text and return the first 1000 characters:

curl --request POST \
  --url 'https://chrome.browsercloud.io/function?token=YOUR-API-KEY' \
  --header 'Content-Type: application/javascript' \
  --data 'module.exports = async ({ page }) => {
  // Load the page, allowing up to two minutes for navigation
  await page.goto('\''https://cnn.com'\'', { timeout: 120000 });
  // Read the visible text of the rendered document
  const data = await page.evaluate(() => document.body.innerText);
  // Replace each line break with ". " and keep the first 1000 characters
  const cleaned = data.replace(/(\r\n|\n|\r)/gm, '\''. '\'');
  const trimmed = cleaned.substring(0, 1000);

  return {
    data: trimmed,
    type: '\''text'\'',
  };
};'
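
The cleaning step can be tried locally before sending the function to the service. This small sketch repeats the same regular expression and substring call on a made-up string; the name cleanText and the maxLength parameter are illustrative assumptions, not part of the API.

```javascript
// Same cleaning steps as the function sent in the request above:
// replace every line break with ". " and keep the first maxLength characters.
function cleanText(raw, maxLength = 1000) {
  const cleaned = raw.replace(/(\r\n|\n|\r)/gm, ". ");
  return cleaned.substring(0, maxLength);
}

const cleanedSample = cleanText("Audio\nLive TV\r\nLog In");
console.log(cleanedSample); // prints "Audio. Live TV. Log In"
```

Because "\r\n" comes first in the pattern, a Windows-style line break is replaced once rather than twice.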

· 5 min read
Mark

For some time, Selenium has been used as a test automation framework and has been the standard for DevOps teams. Selenium is open source, and its suite includes components such as Selenium WebDriver, Selenium IDE, and Selenium Grid.

  • WebDriver is used to create test scripts.
  • Selenium IDE records and plays back tests, making it suitable for beginners.
  • Selenium Grid allows you to run multiple tests simultaneously across multiple physical and virtual machines.

Selenium is considered a convenient tool, but today there are alternatives that attract new users because they avoid some of its disadvantages. Let's clarify exactly what difficulties arise when working with Selenium:

  • Required coding. DevOps engineers focus on coding and innovation, while non-technical tasks are assigned to testers. Selenium requires coding skills, whereas its alternatives offer automated testing with little or no coding.
  • No built-in image comparison. Selenium doesn't have a built-in image diffing feature, so you have to use third-party resources.
  • Limited reporting. Selenium's reporting capabilities are limited, so the only way to solve this problem is to use third-party plugins.
  • No native or mobile web support. Selenium is a portable web application testing framework. However, it does not support testing native or mobile web applications.

Alternatives for Selenium

BrowserCloud

BrowserCloud is a convenient platform for automated testing. It manages a large network of full-featured Chrome browsers running on high-performance AMD cores and speeds up testing with parallel sessions. With BrowserCloud, you can forget about installing dependencies and fonts, chasing memory leaks, and maintaining your own infrastructure. The main features of BrowserCloud:

  • High-performance, scalable browser automation: run hundreds of browsers simultaneously in the cloud, with real-time monitoring and web scraping functionality
  • One tool for many applications: End-to-End Testing, Web Scraping, PDF/Image rendering, ... any web automation!
  • Allows you to switch your automation scripts to the scalable cloud solution by changing just a few lines of code.

Cypress

Cypress is an open source automated testing platform. The platform has two types of tools:

  • Cypress Test Runner to run tests in the browser
  • Cypress Dashboard for the CI toolkit

Cypress's tools follow modern development principles, and its 4 Bundle was released on 15/10/20. Additionally, the platform has some nice features that make it a great alternative to Selenium:

  • Allows you to write tests while developing an application in real time
  • Provides test snapshots from the command log
  • Automatically waits for commands and assertions

WebdriverIO

Unlike Cypress, WebdriverIO does not offer commercial options. It is a framework for automating web and mobile applications, used for end-to-end testing or for testing an application using JavaScript/TypeScript. Combined with the WebDriver protocol, WebdriverIO gains additional capabilities; for example, a programmer can run cross-browser tests. Main features of WebdriverIO:

  • Test suite scales and remains stable
  • Allows you to expand your installation using built-in plugins or those provided by the community
  • Includes built-in tools for testing mobile applications
  • Easy to set up

Playwright

Playwright is developed by Microsoft contributors. It is open source and automates testing of Chromium-, Firefox-, and WebKit-based browsers through a single API. The platform supports several programming languages, including Java, Python, C#, and Node.js. Main features of Playwright:

  • Easy setup and configuration
  • Supports various browsers including Chrome, Edge, Safari and Firefox
  • Multilingual support
  • Provides the ability to test browsers in parallel
  • Provides support for multiple tabs/browsers

Cucumber

Cucumber is an automation tool for behavior-driven development (BDD). Although Cucumber was originally written in Ruby, it has since been ported to Java and JavaScript. Unlike Selenium, the tool lets the user write test scenarios without coding, using the Gherkin language. Main features of Cucumber:

  • Helps involve business stakeholders who cannot read code
  • Focus on user experience
  • Enables code reuse thanks to a simple writing style
  • Easy to set up and implement

NightwatchJS

NightwatchJS is a Node.js-based framework that uses the WebDriver protocol. It allows you to conduct various tests, including end-to-end testing, component testing, visual regression testing, and so on. In addition, NightwatchJS is used for native testing of mobile applications. Key features of NightwatchJS:

  • Test scripts are easy to read
  • They can be used for various types of testing
  • The framework is easily customizable and expandable
  • Supports the page object pattern

In conclusion

There are many alternatives to Selenium that offer robust features for both testers and developers. Each has its pros and cons, and some can even be used together with the Selenium framework. When choosing a specific tool, consider the DevOps team's goals and competencies, the testing scope, and other factors related to the specific product.

· 4 min read
Mark

Proxies are fundamental to any developer's programming arsenal. They're a way to access data and information that would otherwise be blocked or unavailable. Even though they can be hard to set up, they're a necessary component to web scraping success. Many problems can be avoided or remedied by understanding proper proxy usage. This is why it's important to learn what to look for and what to avoid when choosing a proxy service. Here are some of the most common issues that come with choosing a particular proxy service, and the ways in which each problem can be avoided.

· 6 min read
Mark

Web scraping can be difficult because many websites attempt to block developers from scraping their sites. They do this by detecting IP addresses, inspecting HTTP request headers, using CAPTCHAs, inspecting JavaScript code, and more. In response, developers have many techniques of their own for avoiding these blocks, so web scrapers can be made extremely hard to detect. Here are some tips for making it easier to scrape a website: