5 posts tagged with "webscraping"

· 5 min read
Mark

Large language models, or LLMs, collect data from many sources and present it to the end user. Some projects provide initial data sets to get your LLM started. But what if you want to compete with big players like OpenAI? BrowserCloud helps you begin with a simple API call and work up to more complex techniques. In this guide, we introduce the principles of working with an LLM and training it.

About BrowserCloud

BrowserCloud is a service that manages Chrome and makes it programmatically available to developers. Controlling Chrome usually means working with automation libraries, but most LLM pipelines only need raw data, so they can use BrowserCloud's REST-based APIs, which are lighter-weight than library-level automation and cover the common use cases. Two APIs in particular make retrieving website data much easier: the Scrape API and the Content API.

Using web scraping for LLM training

The Scrape API retrieves website data once the page's JavaScript has been parsed and executed. Its REST-based interface takes a JSON body describing your request and how the extraction should be done. The simple example below targets CNN.com; the API waits for the JavaScript to finish, collects the document data, and returns a response like the shortened one that follows:

curl --request POST \
  --url 'https://chrome.browsercloud.io/scrape?token=YOUR-API-KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://cnn.com",
    "elements": [
      {
        "selector": "body"
      }
    ]
  }'

The API returns the "text" of the website inside a JSON structure, and it is this text that an LLM trains on. It also provides content metadata such as pixel dimensions and position. These additions expand the model's knowledge of the data and can increase its usefulness.

const results = {
  "data": [
    {
      "selector": "body",
      "results": [
        {
          "text": "Audio\nLive TV\nLog In\nHAPPENING NOW\nAnalysis of Donald Trump's historic arraignment on 37 federal criminal charges. Watch CNN\nLive Updates: Ukraine \nTrump arraignment \nTrending: Tori Bowie autopsy \nInstant Brands bankruptcy \nHatch Act \nFather’s Day gifts \nPodcast: 5 Things\nTrump pleads not guilty to mishandling classified intelligence documents\nAnna Moneymaker/Getty Images\nFACT CHECK\nTrump has responded to his federal indictment with a blizzard of dishonesty. Here are 10 of his claims fact-checked\nOpinion: Trump backers are going bonkers with dangerous threats\nDoug Mills/The New York Times/Redux\nGALLERY\nIn pictures: The federal indictment of Donald Trump\n‘Dejected’: Grisham describes Trump’s demeanor as he headed to court\nTrump didn’t speak during the historic hearing, sitting with his arms crossed and a scowl on his face. Here’s what else happened\nLive Updates: Judge says Trump must not communicate with co-defendant about the case\nTakeaways from Trump’s historic court appearance\nWatch: Hear how Trump acted inside the ‘packed courtroom’ during arraignment\nJudge allows E. Jean Carroll to amend her defamation lawsuit to seek more damages against Trump\nInteractive: Former President Donald Trump’s second indictment, annotated\nDoug Mills/The New York Times/Redux\nVIDEO\nTrump stops at famous Cuban restaurant after his arrest...© 2016 Cable News Network.",
          "width": 800,
          "height": 25409,
          "top": 0,
          "left": 0,
          "attributes": [
            {
              "name": "class",
              "value": "layout layout-homepage cnn"
            },
            {
              "name": "data-page-type",
              "value": "section"
            }
          ]
        }
      ]
    }
  ]
}
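A response like the one above can be flattened into plain training text with a few lines of JavaScript. This is a minimal sketch, assuming only the JSON shape shown (a `data` array of selector entries, each carrying a `results` array with a `text` field):

```javascript
// Flatten a Scrape API response into one plain-text string
// suitable for an LLM training pipeline.
function extractText(response) {
  return response.data
    .flatMap((entry) => entry.results) // merge all selector results
    .map((result) => result.text)      // keep only the text payload
    .filter(Boolean)                   // drop empty entries
    .join('\n');
}

// Shortened stand-in for the CNN response shown above.
const results = {
  data: [
    { selector: 'body', results: [{ text: 'Audio\nLive TV\nLog In', width: 800 }] },
  ],
};

console.log(extractText(results)); // prints the three header lines
```

The metadata fields (width, height, position) survive in the original objects, so you can also filter results by size or position before joining if you want to exclude, say, off-screen elements.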

Content API for LLM training

The Content API is similar to the Scrape API in that it also returns content after the JavaScript has been parsed and executed, and it is likewise called with a POST whose JSON body names the URL you want. The difference is that the Content API returns only the site's HTML content, with no additional parsing. Below is an example:

curl --request POST \
  --url 'https://chrome.browsercloud.io/content?token=YOUR-API-KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://cnn.com"
  }'

Only the HTML code of the page is returned. If you wish, you can turn to other libraries to help with content extraction, but many LLM pipelines can parse raw HTML like this directly:

<!DOCTYPE html><html lang="en" data-uri="cms.cnn.com/_pages/clg34ol9u000047nodabud1o2@published" data-layout-uri="cms.cnn.com/_layouts/layout-homepage/instances/homepage-domestic@published" class="userconsent-cntry-us userconsent-reg-ccpa"><head><script type="text/javascript" src="https://cdn.krxd.net/userdata/get?pub=e9eaedd3-c1da-4334-82f0-d7e3ff883c87&amp;technographics=1&amp;callback=Krux.ns._default.kxjsonp_userdata"></script><script type="text/javascript" src="https://beacon.krxd.net/optout_check?callback=Krux.ns._default.kxjsonp_optOutCheck"></script><script type="text/javascript" src="https://beacon.krxd.net/cookie2json?callback=Krux.ns._default.kxjsonp_3pevents"></script><script type="text/javascript" src="https://consumer.krxd.net/consent/get/e9eaedd3-c1da-4334-82f0-d7e3ff883c87?idt=device&amp;dt=kxcookie&amp;callback=Krux.ns._default.kxjsonp_consent_get_0"></script><script type="text/javascript" async="" src="https://lightning.cnn.com/cdp/psm/i/web/release/3.3.1/psm.legacy.min.umd.js"></script><script type="text/javascript" async="" src="//www.i.cdn.cnn.com/zion/zion-mb.min.js"></script><script async="" src="https://cdn.boomtrain.com/p13n/cnn/p13n.min.js"></script><script type="text/javascript" async="" src="https://lightning.cnn.com/cdp/psm/brands/cnn/web/release/psm.min.js"></script><script async="" src="//cdn.krxd.net/ctjs/controltag.js.d58f47095e6041e576ee04944cca45da"></script><script type="text/javascript" defer="" async="" src="https://z.cdp-dev.cnn.com/zfm/zfh-3.js"></script><script id="GPTScript" type="text/javascript" src="https://securepubads.g.doubleclick.net/tag/js/gpt.js"></script><script type="text/javascript" src="https://steadfastseat.com/v2svxFVJ-Mg82zHMJUHkQBWwVF721AsFf1Y3MomzEUqIMQlG6f2VaL6ctdsQc2VgA"></script><script type="text/javascript" async="" src="//www.ugdturner.com/xd.sjs"></script><script async="" src="//static.adsafeprotected.com/iasPET.1.js"></script><script async="" 
src="//c.amazon-adsystem.com/aax2/apstag.js"></script><script async="" src="https://vi.ml314.com/get?eid=64240&amp;tk=GBYTTE9dUG2OqHj1Rk9DPOaLspvMWfLqV236sdkHgf03d&amp;fp="></script><script async="" src="https://cdn.ml314.com/taglw.js"></script><script type="text/javascript" async="" src="https://sb.scorecardresearch.com/beacon.js"></script><script type="text/javascript" async="" src="//s.cdn.turner.com/analytics/comscore/streamsense.5.2.0.160629.min.js"></script><script type="text/javascript" async="" src="//cdn3.optimizely.com/js/geo4.js"></script><style>body,h1,h2,h3,h4,h5{font-family:cnn_sans_display,helveticaneue,Helvetica,Arial,Utkal,sans-serif}h1,h2,h3,h4,h5{font-weight:700}:root{--theme-primary:#cc0000;--theme-background:#0c0c0c;--theme-divider:#404040;--theme-copy:#404040;--theme-copy-accent:#e6e6e6;--theme-copy-accent-hover:#ffffff;--theme-icon-color:#e6e6e6;--t......
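If you do want to pre-clean HTML like this before training, a simple pass can drop scripts, styles, and tags. This is a deliberately naive sketch for illustration; a real pipeline would use a proper HTML parser rather than regular expressions:

```javascript
// Naive HTML-to-text conversion for Content API output.
// Regexes are fine for a demo but brittle on real-world HTML;
// use an actual parser (e.g. an HTML parsing library) in production.
function htmlToText(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ') // drop inline scripts
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')   // drop inline styles
    .replace(/<[^>]+>/g, ' ')                    // strip remaining tags
    .replace(/\s+/g, ' ')                        // collapse whitespace
    .trim();
}

const sample = '<html><head><style>body{}</style></head>' +
  '<body><h1>Breaking</h1> news</body></html>';
console.log(htmlToText(sample)); // → "Breaking news"
```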

Why use BrowserCloud for LLM training?

Many sites use JavaScript to fetch additional resources or run as single-page applications. Such sites cannot simply be fetched with a plain HTTP request or a system call. Even Google uses a headless browser to generate data for its search index. BrowserCloud renders these sites and services so your LLM receives relevant, complete content. In addition, you can configure wait times and ad-blocking behavior, and BrowserCloud's stealth options help it get past bot detection on websites, and much more. To the target site, it looks exactly like a regular web browser!

Your custom JavaScript function for advanced data scraping

If you don't mind writing a little JavaScript, you can use the Function API to extract exactly the data you need. You send code that runs on our service and returns only the data you want. As an example, let's replace newlines and carriage returns and keep the first 1000 characters of CNN's body text:

curl --request POST \
  --url 'https://chrome.browsercloud.io/function?token=YOUR-API-KEY' \
  --header 'Content-Type: application/javascript' \
  --data 'module.exports = async ({ page }) => {
    await page.goto('\''https://cnn.com'\'', { timeout: 120000 });
    const data = await page.evaluate(() => document.body.innerText);
    const cleaned = data.replace(/(\r\n|\n|\r)/gm, '\''. '\'');
    const trimmed = cleaned.substring(0, 1000);

    return {
      data: trimmed,
      type: '\''text'\'',
    };
  };'
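The cleanup step inside that function can be tried locally before deploying it. This standalone sketch mirrors the same replace-and-trim logic:

```javascript
// Mirror of the cleanup performed in the function above:
// replace line breaks with ". " and keep the first 1000 characters.
function cleanBodyText(data) {
  const cleaned = data.replace(/(\r\n|\n|\r)/gm, '. ');
  return cleaned.substring(0, 1000);
}

console.log(cleanBodyText('Audio\nLive TV\nLog In'));
// → "Audio. Live TV. Log In"
```

Running the transformation server-side keeps the response small: only the trimmed string travels over the network instead of the full page text.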

· 5 min read
Mark

Selenium has long been used as a test automation framework and has been the standard for DevOps teams. Selenium is open source, and its suite includes components such as Selenium WebDriver, Selenium IDE and Selenium Grid.

  • WebDriver is used to create test scripts.
  • Selenium IDE is a record-and-playback tool that lets you create tests without writing code, making it suitable for beginners.
  • Selenium Grid allows you to run multiple tests simultaneously and offers multiple physical and virtual devices.

Selenium is considered a convenient tool, but today there are alternatives that attract new users by avoiding its disadvantages. Let's clarify exactly what difficulties arise when working with Selenium:

  • Coding required. DevOps engineers focus on coding and innovation, while non-technical tasks are assigned to testers. Selenium alternatives offer automated testing that requires little or no coding.
  • No built-in image comparison. Selenium doesn't have a built-in visual-diff feature, so you have to use third-party resources.
  • Limited reporting. Selenium has limited reporting capabilities, so the only way around this is third-party plugins.

  • No native or mobile web support. Selenium is a framework for testing web applications, but it does not support testing mobile web applications.

Alternatives for Selenium

BrowserCloud

BrowserCloud is a convenient platform for automated testing. It manages a large network of full-featured Chrome browsers running on high-performance AMD cores and speeds up testing with parallel sessions. BrowserCloud lets you forget about managing dependencies, fonts, memory leaks, and your own infrastructure. The main features of BrowserCloud:

  • A high-performance, scalable solution for browser automation: run hundreds of browsers simultaneously in the cloud, with real-time monitoring and built-in web scraping functionality
  • One tool for many applications: End-to-End Testing, Web Scraping, PDF/Image rendering, ... any web automation!
  • Allows you to switch your automation scripts to the scalable cloud solution by changing just a few lines of code.
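Switching an existing script to the cloud typically means replacing a local browser launch with a remote connection. The sketch below is a hypothetical illustration: the WebSocket host and `token` query parameter are assumptions, not confirmed BrowserCloud API details, so check the official docs for the real endpoint:

```javascript
// Hypothetical endpoint builder — host and `token` parameter
// are assumptions for illustration, not a documented API.
function wsEndpoint(token) {
  return `wss://chrome.browsercloud.io?token=${encodeURIComponent(token)}`;
}

// Before: const browser = await puppeteer.launch();
// After (sketch):
// const browser = await puppeteer.connect({
//   browserWSEndpoint: wsEndpoint('YOUR-API-KEY'),
// });

console.log(wsEndpoint('YOUR-API-KEY'));
// → "wss://chrome.browsercloud.io?token=YOUR-API-KEY"
```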

Cypress

Cypress is an open-source, automated testing platform. The platform has two types of tools:

  • Cypress Test Runner to run tests in the browser
  • Cypress Dashboard for CI

Cypress's tools follow modern development principles, and its version 4 bundle was released on 15/10/20. Additionally, the platform has some nice features that make it a great alternative to Selenium:

  • Allows you to write tests in real time while developing an application
  • Provides test snapshots from the command log
  • Automatically waits for commands and assertions

WebdriverIO

Unlike Cypress, WebdriverIO does not offer commercial options. It is a framework for automating web and mobile applications, used for end-to-end testing of an application in JavaScript/TypeScript. When WebdriverIO is combined with the WebDriver protocol, the tool's functionality expands and new capabilities appear; for example, a programmer can conduct cross-browser testing. Main features of WebdriverIO:

  • Test suite scales and remains stable
  • Allows you to expand your installation using built-in plugins or those provided by the community
  • Includes built-in tools for testing mobile applications
  • Easy to set up

Playwright

Playwright is developed by Microsoft contributors. It is open source and used to automate testing of Chromium, Firefox and WebKit based browsers through a single API. The platform supports several programming languages, including Java, Python, C# and NodeJS. Main features of Playwright:

  • Easy setup and configuration
  • Supports various browsers including Chrome, Edge, Safari and Firefox
  • Multilingual support
  • Provides the ability to test browsers in parallel
  • Provides support for multiple tabs/browsers

Cucumber

Cucumber is an automation tool for behavior-driven development (BDD). Although Cucumber was originally written in Ruby, it has been ported to Java and JavaScript. Unlike Selenium, the tool allows the user to write test scenarios without coding, using the Gherkin language. Main features of Cucumber:

  • Helps involve business stakeholders who can't read code
  • Focus on user experience
  • Enables code reuse thanks to a simple writing style
  • Easy to set up and implement

NightwatchJS

NightwatchJS is a Node.js based framework that uses the Webdriver protocol. It allows you to conduct various tests, including end-to-end testing, component testing, visual regression testing, and so on. In addition, NightwatchJS is used for native testing of mobile applications. Key features of Nightwatch:

  • Test scripts are easy to read
  • They can be used for various types of testing
  • The framework is easily customizable and expandable
  • Supports the page object pattern

In conclusion

There are many alternative tools other than Selenium that offer robust features for both testers and developers. Each of them has its pros and cons, and some can even be used with the Selenium framework. When choosing a specific tool, you need to consider the DevOps team's goals, competencies, testing scope, and other factors related to a specific product.

· 8 min read
Mark

Finding a list of free proxies can feel like striking gold, only to discover that someone else has already claimed the nugget. Many people see proxy lists as a dream come true; however, that dream can quickly become a nightmare.

Sites block new proxies as their usage increases. This means new free web scraping proxy websites are likely too good to be true — they might work for a short time, but then get blocked by all the sites they're used on.

While paid proxies can also be blocked, the risk is lower because they aren't listed on a public proxy list, which makes them harder to identify and block. Free proxies are more susceptible to blacklisting because their providers have little control over who uses them.

Free proxy servers are often banned because their IP addresses are shared among many browsers and anonymous users. By using these servers, you unintentionally share an IP with strangers, which can lead to bans and damages the viability of the servers. The benefit of a paid, dedicated server is that no one else can get your IP address banned by misbehaving on it. There are many free proxy lists to choose from, which is why we created the best proxy lists for web scraping and the top free proxy lists. These lists are completely free and provide an alternative to commercial services.
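The blocking dynamics described above are why scrapers rotate through a pool of addresses rather than hammering one. A minimal round-robin rotation sketch, purely for illustration (the addresses are made up):

```javascript
// Minimal round-robin proxy rotator that skips addresses
// already observed to be banned.
function makeRotator(proxies) {
  let index = 0;
  const banned = new Set();
  return {
    next() {
      for (let i = 0; i < proxies.length; i++) {
        const proxy = proxies[index];
        index = (index + 1) % proxies.length;
        if (!banned.has(proxy)) return proxy;
      }
      return null; // every proxy in the pool is banned
    },
    ban(proxy) { banned.add(proxy); },
  };
}

const rotator = makeRotator(['10.0.0.1:8080', '10.0.0.2:8080']);
console.log(rotator.next()); // → "10.0.0.1:8080"
rotator.ban('10.0.0.2:8080');
console.log(rotator.next()); // → "10.0.0.1:8080" (the banned one is skipped)
```

With a free list, the `banned` set tends to fill up fast; with paid proxies it stays mostly empty, which is the practical difference the paragraphs above describe.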

· 3 min read
Mark

1. BrowserCloud.io

Website: https://browsercloud.io

BrowserCloud is a popular proxy service that many people know.

In comparison to other services, BrowserCloud's proxies are extremely reliable. Thanks to 99.99% uptime and unlimited bandwidth, their proxies are some of the most dependable on the market.

BrowserCloud provides unmatched reliability at scale. Their browser-based cloud service automatically tries multiple methods to complete a single request and can be customized to meet specific needs. Plans start at $19 per month for 120,000 pageviews and go up to $249 per month for 3 million pageviews on a standard plan. If you need more or need to scrape larger volumes, you can create a custom Enterprise Plan. You can begin using their services with free 2000 API credits every month to test their proxies. Plus, they offer a Pay As You Go plan that doesn't require a monthly commitment.

One of the best things about BrowserCloud is its excellent support team. They can help with many issues, including ticketing sites blocked by other hosting companies; with other providers, users usually aren't able to get such sites unblocked.

2. Luminati

Luminati lets users mask their location and activity through a proxy pool of more than 70M IPs, offering four types of proxies: mobile, residential, static residential, and data center. The provider's platform manages cookies, IPs and more. Their proxies have many great features built in, making them very effective for web scraping, and the platform is easy to integrate and use.

3. Oxylabs

Oxylabs provides private data extraction and datasets to its clients. They don't offer a dedicated ticketing agent, but their service is still a great option for anyone looking to get tickets to an event. The company offers an expedited checkout process and live chat to help customers buy and maintain their proxies. Their prices are on the higher end, but the reliability and high-quality service they offer make them an excellent choice.

4. Storm Proxies

With a wide range of options for proxy needs, Storm Proxies is one of the more unique services on the list. They built their own technology to give clients the best service they can offer. For clients looking to get tickets, their rotating backconnect proxies are one of the best choices: the company provides residential rotating proxies specifically for ticketing. Packages start at $160 per month for 20 IPs and go up to $900 per month for 200 IPs, so these proxies are quite expensive.

Customers can select between USA, European Union, USA + European Union, or worldwide proxy servers.

5. Rotating Proxies

Proxy servers from the company Rotating Proxies are used for all purposes; their anonymous, secure servers don't fail or get blocked by Ticketmaster. The company sells proxies on a per-proxy basis, and subscriptions are available for paying for more than one server at a time.

6. High Proxies

With High Proxies, users can purchase packages targeted at specific needs, including private proxies for ticketing and classifieds as well as shared proxies for social media. Specialized proxies cost more than standard ones. The servers are available 24/7 with near-100% uptime, which makes them great for competing with scalpers who resell tickets at high prices. High Proxies says it purchases new servers frequently, so if you choose one of their services, you know the quality of the proxies you're getting.

· 6 min read
Mark

Web scraping can be difficult because many websites attempt to block developers from scraping them. They do this by detecting IP addresses, inspecting HTTP request headers, using CAPTCHAs, inspecting JavaScript code and more. In response, web scrapers can be made extremely hard to detect by imitating the very signals those sites check for. Here are some tips for making scraping a website easy: