How can I extract text content from a webpage using document.body.innerHTML in JavaScript?

sakshikuchroo · November 20, 2024, 6:30pm

I want to build a string of the contents of the webpage without any HTML syntax, likely by replacing it with a space to ensure words aren’t conjoined. The expected output is a variable like this:

var content = "Content paragraph 1 paragraph 2 this is some text and this is some more";

Could you provide some guidance on how to achieve this? Thanks!

Additionally, could you show how to implement this in JavaScript, using a try block and starting from var a = document.body.innerHTML?

joe-elmoufak · November 25, 2024, 5:41am

You can use regular expressions to replace HTML tags with spaces and strip unwanted punctuation. Here’s how to implement it:

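try {
    var a = document.body.innerHTML; // Get the HTML content
    var content = a.replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ') // Drop scripts and styles with their contents
                    .replace(/<[^>]*>/g, ' ') // Replace remaining HTML tags with spaces
                    .replace(/&[a-z0-9#]+;/gi, ' ') // Replace HTML entities such as &nbsp; with spaces
                    .replace(/[^\w\s]/g, '') // Remove everything except ASCII letters, digits, underscores, and whitespace
                    .replace(/\s+/g, ' ') // Collapse runs of whitespace into a single space
                    .trim(); // Trim leading/trailing spaces
    console.log(content);
} catch (error) {
    console.error("Error extracting content:", error);
}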

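This approach extracts plain text and keeps words separated, because each tag is replaced with a space before the whitespace is collapsed. Note that the punctuation step also strips non-ASCII characters such as accented letters, since \w only matches A-Z, a-z, 0-9, and underscore.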
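
If you want to keep punctuation and try the logic outside the browser, the same idea can be wrapped in a function that takes an HTML string. This is only a rough sketch: the name htmlToText and the short list of decoded entities are my own assumptions, not a standard API, and a regex is not a full HTML parser, so unusual markup may slip through.

```javascript
// Hypothetical helper (the name is an assumption): converts an HTML string to plain text.
function htmlToText(html) {
  return html
    // Drop <script> and <style> blocks together with their contents
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // Drop HTML comments
    .replace(/<!--[\s\S]*?-->/g, ' ')
    // Replace every remaining tag with a space so adjacent words stay separated
    .replace(/<[^>]*>/g, ' ')
    // Decode a few common entities (assumed short list, not exhaustive)
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&') // decoded last so "&amp;lt;" does not turn into "<"
    // Collapse runs of whitespace into single spaces and trim the ends
    .replace(/\s+/g, ' ')
    .trim();
}

// Made-up sample markup for illustration
const sample =
  '<h1>Content</h1><p>paragraph 1</p><p>paragraph 2</p>' +
  '<script>var x = 1;</script>' +
  '<div>this is some text <b>and</b> this is some more</div>';

console.log(htmlToText(sample));
// prints "Content paragraph 1 paragraph 2 this is some text and this is some more"
```

In the browser you would call it as var content = htmlToText(document.body.innerHTML);. If you don't specifically need to start from innerHTML, document.body.innerText may also be worth a look, since it returns the rendered text directly, though its whitespace handling differs from the regex approach.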