
Using BrowserCloud to train your LLM

Mark

Large language models (LLMs) are trained on data collected from many sources and use it to answer the end user's questions. Some projects provide initial data sets to get your LLM started. But what if you want to compete with big players like OpenAI? BrowserCloud lets you collect website data with a simple API call and supports more advanced techniques as well. In this guide, we introduce the principles of collecting web data for LLMs and their training.

About BrowserCloud

BrowserCloud is a service that manages Chrome and makes it programmatically available to developers. Controlling Chrome usually requires a library. However, most LLM pipelines only need the raw data, so they can use BrowserCloud's REST-based APIs, which are more lightweight than library-based APIs. These REST APIs cover the most common use cases. Two in particular make retrieving website data much easier: the Scrape API and the Content API.

Using web scraping for LLM training

Our Scrape API retrieves website data once the page's JavaScript has been parsed and run. It is REST-based and takes a JSON body that describes the page to load and the elements to select.

A simple example is the request below, which is sent to CNN.com. It waits for JavaScript to parse and run, then returns data for the document's body element:

curl --request POST \
  --url 'https://chrome.browsercloud.io/scrape?token=YOUR-API-KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://cnn.com",
    "elements": [
      {
        "selector": "body"
      }
    ]
  }'
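
The same request can be assembled in JavaScript. The following is a minimal sketch, not an official client: buildScrapeRequest is a hypothetical helper name, and the endpoint, token parameter, and body shape are simply copied from the curl example above. It only builds the arguments for fetch, so it can be inspected without sending anything.

```javascript
// Hypothetical helper (not part of any BrowserCloud SDK): builds the
// endpoint and fetch() options for a Scrape API call without sending it.
function buildScrapeRequest(token, pageUrl, selectors) {
  const endpoint = new URL('https://chrome.browsercloud.io/scrape');
  endpoint.searchParams.set('token', token);
  return {
    endpoint: endpoint.toString(),
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      // Same body shape as the curl example: one entry per CSS selector.
      body: JSON.stringify({
        url: pageUrl,
        elements: selectors.map((selector) => ({ selector })),
      }),
    },
  };
}

const request = buildScrapeRequest('YOUR-API-KEY', 'https://cnn.com', ['body']);
console.log(request.endpoint);
console.log(request.init.body);
```

In a runtime with a global fetch (for example Node 18 or later), the request could then be sent with fetch(request.endpoint, request.init).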

The API returns the "text" of the website inside a JSON structure, and it is this "text" that an LLM is mainly interested in. The response also includes metadata such as each element's size and position in pixels, which can help inform the model about the data's potential importance. A shortened example of the response:

const results = {
  "data": [
    {
      "selector": "body",
      "results": [
        {
          "text": "Audio\nLive TV\nLog In\nHAPPENING NOW\nAnalysis of Donald Trump's historic arraignment on 37 federal criminal charges. Watch CNN\nLive Updates: Ukraine \nTrump arraignment \nTrending: Tori Bowie autopsy \nInstant Brands bankruptcy \nHatch Act \nFather’s Day gifts \nPodcast: 5 Things\nTrump pleads not guilty to mishandling classified intelligence documents\nAnna Moneymaker/Getty Images\nFACT CHECK\nTrump has responded to his federal indictment with a blizzard of dishonesty. Here are 10 of his claims fact-checked\nOpinion: Trump backers are going bonkers with dangerous threats\nDoug Mills/The New York Times/Redux\nGALLERY\nIn pictures: The federal indictment of Donald Trump\n‘Dejected’: Grisham describes Trump’s demeanor as he headed to court\nTrump didn’t speak during the historic hearing, sitting with his arms crossed and a scowl on his face. Here’s what else happened\nLive Updates: Judge says Trump must not communicate with co-defendant about the case\nTakeaways from Trump’s historic court appearance\nWatch: Hear how Trump acted inside the ‘packed courtroom’ during arraignment\nJudge allows E. Jean Carroll to amend her defamation lawsuit to seek more damages against Trump\nInteractive: Former President Donald Trump’s second indictment, annotated\nDoug Mills/The New York Times/Redux\nVIDEO\nTrump stops at famous Cuban restaurant after his arrest...© 2016 Cable News Network.",
          "width": 800,
          "height": 25409,
          "top": 0,
          "left": 0,
          "attributes": [
            {
              "name": "class",
              "value": "layout layout-homepage cnn"
            },
            {
              "name": "data-page-type",
              "value": "section"
            }
          ]
        }
      ]
    }
  ]
}
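
To turn a response like this into training records, the "text" fields need to be pulled out of the nested structure. The sketch below assumes the response has exactly the shape shown above (data, then results, then text). The name extractTexts and the output record layout are illustrative choices, not part of the API.

```javascript
// Illustrative helper: flattens a Scrape API response (shape as shown
// above) into one record per matched element, skipping empty text.
function extractTexts(scrapeResponse) {
  const records = [];
  for (const entry of scrapeResponse.data ?? []) {
    for (const result of entry.results ?? []) {
      const text = (result.text ?? '').trim();
      if (text.length === 0) continue;
      records.push({
        selector: entry.selector,
        text,
        // Size metadata is kept so later steps can weigh the content.
        width: result.width,
        height: result.height,
      });
    }
  }
  return records;
}

// A tiny stand-in for the real response, for demonstration only.
const sampleResponse = {
  data: [
    {
      selector: 'body',
      results: [
        { text: 'Audio\nLive TV\nLog In', width: 800, height: 25409 },
        { text: '   ', width: 0, height: 0 },
      ],
    },
  ],
};

console.log(extractTexts(sampleResponse));
```

Keeping width and height next to the text is one possible design. A pipeline that only needs plain text could drop them.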

Content API for LLM training

The Content API is similar to the Scrape API in that it returns content after the JavaScript has been parsed and executed. As with the Scrape API, you send a POST request with a JSON body that contains the URL you want. There is one key difference, however: the Content API returns only the HTML content of the site itself, without any additional parsing. Here is an example:

curl --request POST \
  --url 'https://chrome.browsercloud.io/content?token=YOUR-API-KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://cnn.com"
  }'

Only the HTML code of the page is returned. If you wish, you can use other libraries to help with content extraction, but many LLMs can still parse the raw HTML successfully:

<!DOCTYPE html><html lang="en" data-uri="cms.cnn.com/_pages/clg34ol9u000047nodabud1o2@published" data-layout-uri="cms.cnn.com/_layouts/layout-homepage/instances/homepage-domestic@published" class="userconsent-cntry-us userconsent-reg-ccpa"><head><script type="text/javascript" src="https://cdn.krxd.net/userdata/get?pub=e9eaedd3-c1da-4334-82f0-d7e3ff883c87&amp;technographics=1&amp;callback=Krux.ns._default.kxjsonp_userdata"></script><script type="text/javascript" src="https://beacon.krxd.net/optout_check?callback=Krux.ns._default.kxjsonp_optOutCheck"></script><script type="text/javascript" src="https://beacon.krxd.net/cookie2json?callback=Krux.ns._default.kxjsonp_3pevents"></script><script type="text/javascript" src="https://consumer.krxd.net/consent/get/e9eaedd3-c1da-4334-82f0-d7e3ff883c87?idt=device&amp;dt=kxcookie&amp;callback=Krux.ns._default.kxjsonp_consent_get_0"></script><script type="text/javascript" async="" src="https://lightning.cnn.com/cdp/psm/i/web/release/3.3.1/psm.legacy.min.umd.js"></script><script type="text/javascript" async="" src="//www.i.cdn.cnn.com/zion/zion-mb.min.js"></script><script async="" src="https://cdn.boomtrain.com/p13n/cnn/p13n.min.js"></script><script type="text/javascript" async="" src="https://lightning.cnn.com/cdp/psm/brands/cnn/web/release/psm.min.js"></script><script async="" src="//cdn.krxd.net/ctjs/controltag.js.d58f47095e6041e576ee04944cca45da"></script><script type="text/javascript" defer="" async="" src="https://z.cdp-dev.cnn.com/zfm/zfh-3.js"></script><script id="GPTScript" type="text/javascript" src="https://securepubads.g.doubleclick.net/tag/js/gpt.js"></script><script type="text/javascript" src="https://steadfastseat.com/v2svxFVJ-Mg82zHMJUHkQBWwVF721AsFf1Y3MomzEUqIMQlG6f2VaL6ctdsQc2VgA"></script><script type="text/javascript" async="" src="//www.ugdturner.com/xd.sjs"></script><script async="" src="//static.adsafeprotected.com/iasPET.1.js"></script><script async="" 
src="//c.amazon-adsystem.com/aax2/apstag.js"></script><script async="" src="https://vi.ml314.com/get?eid=64240&amp;tk=GBYTTE9dUG2OqHj1Rk9DPOaLspvMWfLqV236sdkHgf03d&amp;fp="></script><script async="" src="https://cdn.ml314.com/taglw.js"></script><script type="text/javascript" async="" src="https://sb.scorecardresearch.com/beacon.js"></script><script type="text/javascript" async="" src="//s.cdn.turner.com/analytics/comscore/streamsense.5.2.0.160629.min.js"></script><script type="text/javascript" async="" src="//cdn3.optimizely.com/js/geo4.js"></script><style>body,h1,h2,h3,h4,h5{font-family:cnn_sans_display,helveticaneue,Helvetica,Arial,Utkal,sans-serif}h1,h2,h3,h4,h5{font-weight:700}:root{--theme-primary:#cc0000;--theme-background:#0c0c0c;--theme-divider:#404040;--theme-copy:#404040;--theme-copy-accent:#e6e6e6;--theme-copy-accent-hover:#ffffff;--theme-icon-color:#e6e6e6;--t......
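
If you would rather feed plain text than raw HTML to your model, most of the markup above can be stripped first. The following is a deliberately naive sketch that uses regular expressions. htmlToText is a hypothetical helper, it decodes only the &amp; entity, and a real HTML parser would be more robust for production use.

```javascript
// Naive HTML-to-text sketch (an assumption-laden illustration, not a
// full parser): drops scripts, styles, comments, and tags.
function htmlToText(html) {
  return html
    // Remove <script> and <style> elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    // Remove HTML comments.
    .replace(/<!--[\s\S]*?-->/g, ' ')
    // Remove every remaining tag, including the doctype.
    .replace(/<[^>]+>/g, ' ')
    // Decode the one entity seen in the sample output above.
    .replace(/&amp;/g, '&')
    // Collapse runs of whitespace into single spaces.
    .replace(/\s+/g, ' ')
    .trim();
}

const sampleHtml =
  '<!DOCTYPE html><html><head><style>body{color:red}</style>' +
  '<script src="a.js"></script></head>' +
  '<body><h1>Title</h1><p>Tom &amp; Jerry</p></body></html>';

console.log(htmlToText(sampleHtml));
```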

Why use BrowserCloud for LLM training?

Many sites use JavaScript to fetch additional resources and run as single-page applications. You cannot simply fetch such sites with curl or a single system call and expect to get their content. Even Google uses a headless browser to generate data for its search index. BrowserCloud renders these sites and services so that your LLM receives relevant and complete content. You can also configure wait times and ad blocking. In addition, BrowserCloud detects bot blockers on websites and gets past them using stealth options and more. To the site, it looks just like your own web browser!

Your custom JavaScript function for advanced data scraping

If you don't mind writing a little JavaScript, you can use the Function API to extract data. You send this API code that runs on the BrowserCloud service, and it returns only the data you need. As an example, let's replace newlines and carriage returns in CNN's page text and return the first 1000 characters:

curl --request POST \
  --url 'https://chrome.browsercloud.io/function?token=YOUR-API-KEY' \
  --header 'Content-Type: application/javascript' \
  --data 'module.exports = async({ page }) => {
    await page.goto('\''https://cnn.com'\'', { timeout: 120000 });
    const data = await page.evaluate(() => document.body.innerText);
    const cleaned = data.replace(/(\r\n|\n|\r)/gm, '\''. '\'');
    const trimmed = cleaned.substring(0, 1000);

    return {
      data: trimmed,
      type: '\''text'\'',
    };
  };'
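
The cleaning step inside that function can be tried locally before sending it to the service. The sketch below repeats the same replace and substring logic in a standalone helper. The name cleanAndTrim and the maxLength parameter are illustrative additions, not part of the Function API.

```javascript
// Standalone copy of the cleaning logic from the function above, so it
// can be tested without a browser. cleanAndTrim is a hypothetical name.
function cleanAndTrim(text, maxLength = 1000) {
  // Replace every line break (\r\n, \n, or \r) with ". ".
  const cleaned = text.replace(/(\r\n|\n|\r)/gm, '. ');
  // Keep only the first maxLength characters.
  return cleaned.substring(0, maxLength);
}

console.log(cleanAndTrim('Audio\nLive TV\r\nLog In'));
```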