Extracting HTML Elements from a Web Page using JavaScript

HTML, JavaScript, Web Scraping, jQuery

From your browser's Inspector Console, you can run JavaScript on any web page to extract information from it. Doing so requires finding patterns in the page's HTML and telling JavaScript which elements you want. Before you begin, make sure that what you are doing is legal and ethical.

jQuery vs. Pure JavaScript

In the code examples, I primarily use jQuery. However, not every web page loads jQuery, so the jQuery commands may not work everywhere. For this reason, I also provide pure JavaScript for many commands, and any jQuery command can be converted to pure JavaScript if needed.
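Before relying on the jQuery versions, it can help to check whether the page actually loads jQuery. The following is a minimal sketch; hasJQuery is a hypothetical helper name, not part of any library. It checks window.jQuery rather than $ because some browser consoles (Chrome's, for example) define their own $ shortcut for document.querySelector when the page itself does not define $.

```javascript
// Hypothetical helper: reports whether jQuery appears to be loaded in the given window object.
function hasJQuery(win) {
    // jQuery registers itself as a function at window.jQuery
    // and exposes its version string at jQuery.fn.jquery.
    return typeof win.jQuery === 'function' &&
        Boolean(win.jQuery.fn) &&
        typeof win.jQuery.fn.jquery === 'string';
}

// In the Inspector Console, check the current page's window object.
if (typeof window !== 'undefined') {
    console.log(hasJQuery(window)
        ? 'jQuery ' + window.jQuery.fn.jquery + ' is available'
        : 'jQuery not found; use the pure JavaScript versions');
}
```

If the check reports that jQuery is missing, the pure JavaScript version of each command is the safer choice.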

Looking for Patterns and Identifiers

On the web page, find a piece of information that you would like to extract, right-click it, and view it in the Inspector. Notice the following about the information:

  1. What type of element is it in (e.g., <p>, <span>, <div>)?
  2. What is the structure of its parent and children elements?
  3. Does it or its immediate parent have a class attribute assigned?

Most of the time, this is all the information you need to identify other elements like the one you are inspecting. For instance, right-click and inspect the following email address:

Notice the following:

  1. It is a link.
  2. Its immediate parent is a <p> element.
  3. It does not have a class, but its parent has the class email-address.

Now, in the Inspector Console, run either of the following commands:

// jQuery
$('.email-address').each(function() {
    console.log($(this).text());
});

// JavaScript
document.querySelectorAll('.email-address').forEach(function(e) {
    console.log(e.innerText);
});

Notice that this gives you the email address in the console, but it also returns two other email addresses on the page that follow the same pattern (in this case, they are all in <p> elements that have the email-address class).

Let's try this with something else. Let's return all list items in the chapter by running the following:

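// jQuery
$('#chapter-container li').each(function() {
    console.log($(this).text());
});

// JavaScript
// A single descendant selector mirrors the jQuery version and
// does not throw if #chapter-container is missing from the page.
document.querySelectorAll('#chapter-container li').forEach(function(e) {
    console.log(e.innerText);
});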

This command finds the page's element with the id chapter-container, selects all list item elements (<li>) within it, and logs the text of each one to the Inspector Console.

Try this with headings as follows:

// jQuery
$('h2').each(function() {
    console.log($(this).text());
});

// JavaScript
document.querySelectorAll('h2').forEach(function(e) {
    console.log(e.innerText);
});

This logs the text of every <h2> element on the page.

How about all links in the chapter?

// jQuery
$('#chapter-container a').each(function() {
    console.log($(this).attr('href'));
});

// JavaScript
// A single descendant selector mirrors the jQuery version and
// does not throw if #chapter-container is missing from the page.
var links = document.querySelectorAll('#chapter-container a');
links.forEach(function(link) {
    console.log(link.getAttribute('href'));
});

Notice that here, instead of .text(), we used .attr('href') (or getAttribute('href') in pure JavaScript) to return the value of each link's href attribute rather than its displayed text.
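Keep in mind that href attribute values are often relative (for example, /chapter2), which is less useful once the links leave the page. One way to turn them into full addresses is the standard URL constructor. The following is only a sketch: toAbsoluteUrls is a hypothetical helper name, and it resolves the values against whatever base address you pass in.

```javascript
// Hypothetical helper: resolves raw href values against a base address.
function toAbsoluteUrls(hrefs, base) {
    return hrefs
        // Skip links that have no href attribute (getAttribute returns null for them).
        .filter(function(href) { return typeof href === 'string' && href !== ''; })
        // new URL(relative, base) resolves the value the same way a browser would.
        // Note: it throws a TypeError for values it cannot parse.
        .map(function(href) { return new URL(href, base).href; });
}

// In the Inspector Console, collect the raw href values and resolve them
// against the current page's base address.
if (typeof document !== 'undefined') {
    var rawHrefs = Array.from(document.querySelectorAll('#chapter-container a'))
        .map(function(link) { return link.getAttribute('href'); });
    console.log(toAbsoluteUrls(rawHrefs, document.baseURI));
}
```

In pure JavaScript, reading link.href instead of link.getAttribute('href') also gives the resolved address directly; the helper is mainly useful when you already have a list of raw values.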

Now, let's try one more thing and collapse all of our results together into a single comma-separated list. To do this with headings, we can run the following command:

// jQuery
var headings = [];
$('h2').each(function() {
    headings.push($(this).text());
});
console.log(headings.join(', '));

// JavaScript
var headings = document.querySelectorAll('h2');

var headingTexts = Array.from(headings).map(function(heading) {
    return heading.textContent.trim();
});
console.log(headingTexts.join(', '));

In the jQuery example, we first created an array to capture the results, then added each result to it, and then output the results to the console as a single string with the items joined by commas. The pure JavaScript example does the same thing, using Array.from and map to build the array in one step.
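The same steps (collect, add, join) can be wrapped in a small reusable function so that only the selector changes between runs. This is only a sketch; joinedText is a hypothetical name, and it takes the root to search (normally document) as a parameter.

```javascript
// Hypothetical helper: returns the trimmed text of every element
// matching a selector, joined into one comma-separated string.
function joinedText(root, selector) {
    var texts = [];
    root.querySelectorAll(selector).forEach(function(element) {
        texts.push(element.textContent.trim());
    });
    return texts.join(', ');
}

// In the Inspector Console, search the whole page for headings.
if (typeof document !== 'undefined') {
    console.log(joinedText(document, 'h2'));
}
```

Passing a different selector as the second argument reuses the same logic for other kinds of elements.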

Activity: Extract All Emails as a Comma-Separated List

Using the previous examples as a guide, output all email addresses on the page into a comma-separated list in the Inspector Console.

Helpful Examples

ERIC Search Results

For this example, let's say that you are doing a literature review on the topic of "social media," and you want to get all results from an ERIC search results page on the topic. First, access the search results normally by going to https://eric.ed.gov/?q=social+media 

Figure 1

ERIC Results for "Social Media"

Then, in the Inspector Console, you can run a jQuery command to get all of the article names as follows:

$('.r_t').each(function() {
    console.log($(this).text());
});

Or, if you want the direct links for all articles, try the following:

$('.r_f a').each(function() {
    console.log($(this).attr('href'));
});

Or, if you only want the direct links to articles that have a full-text download, try the following:

$('.r_f a').each(function() {
    // Read the href once; skip links that have no href attribute.
    var href = $(this).attr('href');
    if (href && href.includes('fulltext')) console.log(href);
});

This process works because (a) each article title on the site is inside an element that has a class called "r_t," (b) each direct link is within an element that has a class called "r_f," and (c) each full-text direct link goes to a file that has the phrase "fulltext" in the URL.
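If jQuery turns out not to be available on the page, the same filtering can be sketched in pure JavaScript. This assumes the class name described above (r_f); fullTextLinks is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the href of every link inside an element
// with the class r_f whose URL contains the phrase "fulltext".
function fullTextLinks(root) {
    return Array.from(root.querySelectorAll('.r_f a'))
        .map(function(link) { return link.getAttribute('href'); })
        // Links without an href give null, so check the type before searching.
        .filter(function(href) {
            return typeof href === 'string' && href.includes('fulltext');
        });
}

// In the Inspector Console, search the whole results page.
if (typeof document !== 'undefined') {
    console.log(fullTextLinks(document));
}
```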

Neat as this is, the results aren't very useful yet, because you probably want to move them into a spreadsheet or somewhere else where you can manipulate them. Again, JavaScript can help by collecting the data we want into an array of objects and then displaying it as a table in the Inspector Console like this:

var results = [];
$('.r_t').each(function(i) {
    results.push({
        'title': $(this).text(),
        'authors': $('.r_a').eq(i).text(),
        'abstract': $('.r_d').eq(i).text(),
        'href': $('.r_f a').eq(i).attr('href')
    });
});
console.table(results);

Figure 2

Tabular Results of an ERIC Search
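To move a table like this into a spreadsheet, one option is to convert the array to CSV text first. The following is a minimal sketch rather than a complete CSV library: toCsv is a hypothetical helper name, and it assumes that every row has the same keys as the first row.

```javascript
// Hypothetical helper: converts an array of flat objects into CSV text.
// Assumes every row uses the same keys as the first row.
function toCsv(rows) {
    if (rows.length === 0) return '';
    var columns = Object.keys(rows[0]);

    // Wrap each cell in double quotes and double any quotes inside it,
    // so commas and line breaks in titles or abstracts stay in one cell.
    function escapeCell(value) {
        var text = (value === undefined || value === null) ? '' : String(value);
        return '"' + text.replace(/"/g, '""') + '"';
    }

    var lines = [columns.map(escapeCell).join(',')];
    rows.forEach(function(row) {
        lines.push(columns.map(function(column) {
            return escapeCell(row[column]);
        }).join(','));
    });
    return lines.join('\n');
}

// After building the results array in the Inspector Console:
// console.log(toCsv(results));
```

The logged text can be pasted into a file with a .csv extension and opened in a spreadsheet program. In the consoles of Chrome and Firefox, copy(toCsv(results)) should place the text on the clipboard directly, though that utility belongs to the browser console rather than to JavaScript itself.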

This content is provided to you freely by EdTech Books.

Access it online or download it at https://edtechbooks.org/elearning_hacker/extracting_information_from_a_web_page.