Extracting Structured Content with an API and JavaScript


Application Programming Interfaces (APIs) are tools provided by developers to give other developers access to content or software features in more direct ways than going through a web interface or HTML. As Figure 1 illustrates, when it comes to scraping content, APIs have two advantages over parsing web pages. First, APIs can return structured data that is already parsed, allowing you to avoid finding patterns in the rendered HTML, and second, APIs can sometimes return data that is not available in the HTML.

Prior to using APIs, you should ensure that your activities are legal and ethical.

Figure 1

Accessing Information via a Web Page vs. an API

To access API information via the web, you can often use your web browser or JavaScript to open what is called an endpoint, which is simply a URL that returns structured information. For instance, EdTech Books has an open API that allows you to access content and other information via the following endpoint: https://edtechbooks.org/api.v2.php.

If you attempt to access this endpoint in your web browser, you will receive an error. This happens because the endpoint expects you to provide some information to help it know what data to return to you. In this case, try the link https://edtechbooks.org/api.v2.php?action=search&term=technology&entity_type=Book to tell it that you want to search for books about technology. This link tells the API to perform the "search" action, to look for the term "technology," and to return only "Book" entities. If you go to this URL, it will return any books on the site with the term "technology" in the title. Notice that the results are formatted in JavaScript Object Notation (JSON), which may be difficult for a human to read but is easy for a computer to parse.
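The query string in that link can also be assembled in code rather than typed by hand. The following is a minimal sketch using the standard URLSearchParams class; the helper name buildSearchUrl is hypothetical and is not part of the EdTech Books API, and the three parameters are the ones described above.

```javascript
// Base endpoint taken from the text above.
const ENDPOINT = 'https://edtechbooks.org/api.v2.php';

// buildSearchUrl is a hypothetical helper name used only for illustration.
// It assembles the three parameters the text describes: action, term, and entity_type.
function buildSearchUrl(term, entityType = 'Book') {
  const params = new URLSearchParams({
    action: 'search',
    term: term,
    entity_type: entityType,
  });
  // URLSearchParams takes care of encoding special characters in the term.
  return `${ENDPOINT}?${params.toString()}`;
}

console.log(buildSearchUrl('technology'));
// prints "https://edtechbooks.org/api.v2.php?action=search&term=technology&entity_type=Book"
```

Building the URL this way means a term with spaces or punctuation is encoded for you instead of producing a broken link.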

To see the value of this approach, let's try accessing the same endpoint in the Inspector Console with JavaScript by running the following command:

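// Request the search results and print the name of each matching book.
fetch('https://edtechbooks.org/api.v2.php?action=search&term=technology&entity_type=Book')
  .then(response => response.json())
  .then(data => {
    // Object.values works whether data.books is an array or a keyed object.
    for (const book of Object.values(data.books || {})) {
      console.log(book.name);
    }
  })
  // Report network or parsing problems instead of failing silently.
  .catch(error => console.error('Request failed:', error));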

Fiddle with this a little to get results with different terms, like "research" or "chemistry," or check out the EdTech Books API documentation for more options.
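To make trying different terms easier, the request can be wrapped in a small reusable function. This is only a sketch: searchBooks and extractBookNames are hypothetical helper names, and the response shape (a books collection whose items have a name property) is assumed from the command above rather than taken from the API documentation.

```javascript
// Pull book names out of a parsed search response.
// Assumes the shape used above: data.books holds items with a `name` property.
// Object.values accepts either an array or a keyed object.
function extractBookNames(data) {
  const books = (data && data.books) || {};
  return Object.values(books).map(book => book.name);
}

// searchBooks is a hypothetical helper, not part of the API itself.
// It resolves to a list of book names for the given search term.
async function searchBooks(term) {
  const url = 'https://edtechbooks.org/api.v2.php?action=search&entity_type=Book&term='
    + encodeURIComponent(term);
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return extractBookNames(await response.json());
}
```

In the Inspector Console, you could then run searchBooks('research').then(names => console.log(names)) and swap in any other term.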

Each API endpoint is different, meaning that it accepts different parameters and returns different types of results. As a result, APIs are great if you want to extract a lot of data from a single site, such as a repository, but they are less useful if you only want a little information from many sites.

Additionally, access to APIs is managed by the software developers, who determine what content can be accessed and by whom. Some sites allow open access to their APIs, some require a developer login or key, and others don't allow anyone outside of the organization to access them.

Some useful APIs that I have used in my work include the following:

For most of these, you can have some level of free access, but you must create an account to request an API key to begin. You would then pass this key to the endpoint in your calls so that the API knows to give you access.

For instance, if you wanted to get a list of articles from the Scopus API for the journal Educational Technology Research and Development (ISSN 1556-6501), you could use the following endpoint after signing up and getting your own API key and instance token:

https://api.elsevier.com/content/search/scopus?query=issn(1556-6501)&apiKey=YOURAPIKEY&insttoken=YOURINSTTOKEN&count=25

Note that opening this URL directly in a web browser returns an XML-formatted document rather than JSON. If you fetch the same URL from your Inspector Console with JavaScript, however, you will get JSON, which you can then parse for titles or other information as follows:

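// Replace YOURAPIKEY and YOURINSTTOKEN with your own credentials before running.
fetch('https://api.elsevier.com/content/search/scopus?query=issn(1556-6501)&apiKey=YOURAPIKEY&insttoken=YOURINSTTOKEN&count=25')
  .then(response => response.json())
  .then(data => {
    // Each entry in the search results describes one article.
    for (const entry of Object.values(data['search-results'].entry || {})) {
      console.log(entry['dc:title']);
    }
  })
  // Report network or parsing problems instead of failing silently.
  .catch(error => console.error('Request failed:', error));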

This will return results like the following:

Figure 2

Example Results from a Scopus API Search
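Because the same URL can come back as XML or JSON depending on how it is requested, one option is to ask for JSON explicitly with the standard HTTP Accept header. The sketch below assumes the Scopus server honors that header and that the response has the shape used above (a search-results object whose entry items carry a dc:title); fetchTitles and extractTitles are hypothetical helper names.

```javascript
// Placeholders exactly as in the text; substitute your own credentials.
const SCOPUS_URL = 'https://api.elsevier.com/content/search/scopus'
  + '?query=issn(1556-6501)&apiKey=YOURAPIKEY&insttoken=YOURINSTTOKEN&count=25';

// extractTitles is a hypothetical helper that collects the dc:title of each entry.
// It returns an empty list if the expected structure is missing.
function extractTitles(data) {
  const results = data && data['search-results'];
  const entries = (results && results.entry) || [];
  return Object.values(entries).map(entry => entry['dc:title']);
}

// fetchTitles is a hypothetical helper. Sending "Accept: application/json"
// is an assumption about how the server chooses its response format.
async function fetchTitles(url) {
  const response = await fetch(url, { headers: { Accept: 'application/json' } });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return extractTitles(await response.json());
}
```

With real credentials in place, fetchTitles(SCOPUS_URL).then(titles => console.log(titles)) would print the list of titles in one step.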

A Note on Security

If you must use a password or secret key to access an API, be aware that passing these values to the API server directly from JavaScript (as in the Elsevier example above) exposes them to anyone who can read your code, and those people could then use them to gain unauthorized access. So, if you are sharing your code with others, be sure to remove keys and other authentication data first to maintain security.
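One way to reduce the chance of leaking a key when you copy a URL into shared code or notes is to blank out the sensitive parameters first. The following is a minimal sketch; redactUrl is a hypothetical helper name, and the parameter names apiKey and insttoken are taken from the Scopus example above.

```javascript
// Parameter names taken from the Scopus example above.
const SENSITIVE_PARAMS = ['apiKey', 'insttoken'];

// redactUrl is a hypothetical helper that replaces the value of each
// sensitive query parameter with a placeholder, leaving the rest intact.
function redactUrl(urlString, placeholder = 'REDACTED') {
  const url = new URL(urlString);
  for (const name of SENSITIVE_PARAMS) {
    if (url.searchParams.has(name)) {
      url.searchParams.set(name, placeholder);
    }
  }
  return url.toString();
}

console.log(redactUrl(
  'https://api.elsevier.com/content/search/scopus?query=issn(1556-6501)&apiKey=abc123&insttoken=xyz789&count=25'
));
```

The printed URL is equivalent to the original except for the two redacted values, although characters such as parentheses may appear percent-encoded. This only protects copies you share; the key is still visible in any code that actually makes the request from the browser.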

If you are developing an application to share with others that uses your credentials to access an API, then your application needs a way to communicate with the API server without revealing sensitive data (like keys) to your users. The typical way to do this is to run your own server that receives requests from your application and routes them to the API server on the user's behalf, adding or removing sensitive data as needed, as in Figure 3.

Figure 3

Example of an Application that Uses an API

In this example, the API key is stored on your server and sent to the API server with each request, so the end user of your webpage or application never sees the API key or how your server communicates with the API server. This book will not guide you in setting up a server environment for hosting secure applications in this way. I point it out here only so that you will realize that any code you create or use in this book will need to be adapted for such uses and should not be shared with others in a way that includes compromising information (like security keys).
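Without setting up any server environment, the key-overlay step itself can be sketched as plain functions that would run on your server. Everything here is illustrative: buildUpstreamUrl, proxySearch, and the list of allowed parameters are hypothetical, the credentials object stands in for whatever server-side configuration you use, and the code assumes a JavaScript runtime with a built-in fetch.

```javascript
// Upstream endpoint from the Scopus example above.
const UPSTREAM = 'https://api.elsevier.com/content/search/scopus';

// Only these parameters may come from the end user; anything else is dropped.
// This list is an assumption made for the sake of the example.
const ALLOWED_PARAMS = ['query', 'count'];

// buildUpstreamUrl is a hypothetical helper. `clientQuery` is the query string
// sent by the user's browser; `credentials` lives only on the server.
function buildUpstreamUrl(clientQuery, credentials) {
  const incoming = new URLSearchParams(clientQuery);
  const outgoing = new URLSearchParams();
  for (const name of ALLOWED_PARAMS) {
    if (incoming.has(name)) {
      outgoing.set(name, incoming.get(name));
    }
  }
  // The secrets are added on the server, so the browser never sees them.
  outgoing.set('apiKey', credentials.apiKey);
  outgoing.set('insttoken', credentials.instToken);
  return `${UPSTREAM}?${outgoing.toString()}`;
}

// proxySearch is a hypothetical server-side handler: it forwards the user's
// request to the API server and returns only the parsed JSON body.
async function proxySearch(clientQuery, credentials) {
  const response = await fetch(buildUpstreamUrl(clientQuery, credentials), {
    headers: { Accept: 'application/json' },
  });
  if (!response.ok) {
    throw new Error(`Upstream request failed with status ${response.status}`);
  }
  return response.json();
}
```

Copying only an allowed list of parameters, rather than forwarding everything, keeps a user from supplying their own apiKey value or other options you did not intend to expose.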


This content is provided to you freely by EdTech Books.

Access it online or download it at https://edtechbooks.org/elearning_hacker/extracting_structured_content_with_an_api_and_javascript.