I need to parse a string containing HTML content in JavaScript. I tried using the Pure JavaScript HTML Parser library, but it seems to parse the current page’s HTML rather than the string I’m passing in. For example, when I use this code:
var parser = new HTMLtoDOM("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>", document);
It changes the title of my page. My goal is to extract links from an HTML string representing an external page.
Is there an API I can use to parse HTML from a string in JavaScript without affecting the current page’s content?
From my experience, the easiest way to parse HTML in JavaScript, especially on the browser side, is to use DOMParser. It parses the string into a separate document held in memory, so nothing on the page is affected, and scripts inside the parsed markup are not executed. That makes it a good fit if you’re just extracting content or links from an HTML string. Here’s how you can use it:
const htmlString = "<div><a href='test.html'>Link</a></div>";
const parser = new DOMParser();
const doc = parser.parseFromString(htmlString, 'text/html');
const links = doc.querySelectorAll('a');
console.log(links[0].href); // absolute URL, resolved against the current page's address
This method is straightforward and ideal for scenarios where you don’t want to manipulate the existing page content.
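One thing to watch when the string comes from an external page: in the snippet above, links[0].href is resolved against the address of the page running the script, not the page the HTML came from. A minimal sketch of resolving the raw href values yourself is below. The helper name resolveHrefs and the example.com addresses are made up for illustration, and it assumes you know the URL the HTML string was fetched from.

```javascript
// Hypothetical helper (not part of any library): turns raw href values into
// absolute URLs relative to the page the HTML string was fetched from.
// pageUrl is an assumption; the question does not say where the page lives.
function resolveHrefs(rawHrefs, pageUrl) {
  const resolved = [];
  for (const href of rawHrefs) {
    try {
      // The two-argument URL constructor resolves href relative to pageUrl.
      resolved.push(new URL(href, pageUrl).href);
    } catch (err) {
      // new URL throws a TypeError for values it cannot parse; skip those.
    }
  }
  return resolved;
}

const urls = resolveHrefs(
  ['test0', '/about', 'https://other.example/x'],
  'https://example.com/dir/page.html'
);
console.log(urls);
// [ 'https://example.com/dir/test0',
//   'https://example.com/about',
//   'https://other.example/x' ]
```

The raw values can be read with getAttribute('href') on each link. The URL constructor is available in both browsers and Node.js, so the same helper works wherever the parsing happens.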
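Great point! If you want a lightweight approach that is just as safe, you can also use the <template> element. Its content lives in a separate, inert DocumentFragment, so assigning to innerHTML renders nothing on the page and runs no scripts. Keep in mind that the string is parsed as a fragment, so wrapper tags such as <html>, <head>, and <body> are dropped, although the links inside them are still found. Here’s how you can do it: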
const htmlString = "<div><a href='test.html'>Link</a></div>";
const template = document.createElement('template');
template.innerHTML = htmlString;
const links = template.content.querySelectorAll('a');
console.log(links[0].getAttribute('href')); // "test.html" (raw attribute value, not resolved)
This works well for extracting elements without involving an external parser, and it doesn’t touch the visible page.
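Both browser approaches end the same way: you call querySelectorAll on whatever the parser gave you. A rough sketch of a shared helper is below. The name collectLinks is made up for illustration, and it assumes only that the node passed in supports querySelectorAll, which is true of both the Document from DOMParser and template.content.

```javascript
// Hypothetical helper (not part of any library): collects the raw href and
// the link text of every anchor under `root`.
// `root` can be any node that supports querySelectorAll, for example the
// Document returned by DOMParser or the DocumentFragment in template.content.
function collectLinks(root) {
  // 'a[href]' skips anchors that have no href attribute at all.
  return Array.from(root.querySelectorAll('a[href]'), (a) => ({
    href: a.getAttribute('href'),       // raw value, exactly as written in the markup
    text: (a.textContent || '').trim(), // link text without surrounding whitespace
  }));
}

// Browser usage (not run here):
//   collectLinks(new DOMParser().parseFromString(htmlString, 'text/html'));
//   collectLinks(template.content);
```

Returning plain objects rather than the elements themselves keeps the result independent of the parsed document, so it can be logged or serialized directly.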
For server-side JavaScript or Node.js, a library like cheerio is a good choice. It offers a jQuery-like API on the backend and lets you parse HTML without needing a browser DOM. Here’s how you can do it with cheerio:
const cheerio = require('cheerio');
const htmlString = "<a href='link1'>One</a><a href='link2'>Two</a>";
const $ = cheerio.load(htmlString);
$('a').each((i, el) => console.log($(el).attr('href')));
It’s lightweight, efficient, and well suited to parsing HTML in environments that don’t have a browser DOM available.