Cleaning HTML

In eLearning development, you will often encounter tools that do a terrible job of converting content to HTML and CSS. This creates all kinds of formatting problems and increases file size for your end users.

For example, here is a bloated HTML snippet that Microsoft Word produced just from copying and pasting a single paragraph of text:

<p class="MsoNormal"><span style="font-size: 18px; font-family: 'Roboto',serif; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-themecolor: text1; color: #212529;">Example paragraph <o:p></o:p></span></p>

It renders as follows:

Example paragraph

Now, here is a cleaner version:

<p>Example paragraph</p>

And how it renders:

Example paragraph

Notice that the two versions render the same, but by removing unnecessary elements, we have drastically reduced the amount of HTML markup around the text (from about 260 characters to fewer than 10, a reduction of roughly 97%) and also reduced the likelihood of formatting errors for our end users and other developers. Also notice that because the original HTML uses the style attribute, a nested span, and other unnecessary markup, the page's CSS may not be applied to the element as intended, since inline styles take precedence over stylesheet rules. This is not an isolated example; it is common for imported HTML to be bloated by over 90%.
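To check how much of an imported snippet is markup rather than text, the comparison can be scripted. The following is a minimal sketch rather than part of the original workflow: the function names markupLength and reductionPercent are hypothetical, the bloated string is a shortened stand-in for the Word snippet, and stripping tags with a regular expression is only an approximation that suits simple snippets like these.

```javascript
// Rough estimate of how many characters in an HTML string are markup.
// Assumption: no tag contains a ">" character inside an attribute value.
function markupLength(html) {
  var textOnly = html.replace(/<[^>]*>/g, '');
  return html.length - textOnly.length;
}

// Percentage by which the cleaned markup is smaller than the original.
function reductionPercent(originalLength, cleanedLength) {
  return ((originalLength - cleanedLength) / originalLength) * 100;
}

// Shortened stand-in for the Word output, used only for illustration
var bloated = '<p class="MsoNormal"><span style="font-size: 18px; color: #212529;">Example paragraph <o:p></o:p></span></p>';
var clean = '<p>Example paragraph</p>';

console.log(markupLength(clean)); // 7, for "<p>" plus "</p>"
console.log(reductionPercent(markupLength(bloated), markupLength(clean)).toFixed(1) + '% less markup');
```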

In addition, imported HTML will typically not make use of accessibility features enabled by your web platform, such as adaptive scaling of images to fit different screen sizes.

For these and other reasons, it is good practice to clean HTML that you import from other sources, such as Microsoft Word, into a web authoring platform, such as a Learning Management System or Content Management System.

Remove Styles

Inline styles can conflict with your page's CSS and also introduce extreme bloat because the formatting is declared on every element instead of just once for the element type or class. Some web authoring platforms have built-in tools to remove styles for you. For instance, in EdTech Books, you can click on Tools > Code Cleaning > Remove Styles to remove every style in a document.

Figure 1

Removing Styles in EdTech Books

This function uses a simple jQuery command to remove each style attribute as follows, and it could be used in the Inspector Console on other web authoring tools that use jQuery by replacing chapter-container with your own encapsulating element id:

$('#chapter-container').find('*').removeAttr("style");

Alternatively, you can also do this with JavaScript in the Inspector Console as follows by replacing chapter-container with your own encapsulating element id:

document.getElementById('chapter-container').querySelectorAll('*').forEach(function(element) {
  element.removeAttribute('style');
});

You can also achieve this with some text editors that allow for searching and replacing using regular expressions, such as Visual Studio. In the editor, paste your HTML, do a find on the following expression, and replace all results with a blank string:

style=".*?"
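The same search-and-replace can be scripted on an HTML string. This is a sketch under assumptions: the function name removeInlineStyles is hypothetical, and the pattern is a slightly stricter variant of the expression above that also consumes the whitespace before the attribute and accepts single-quoted values. Regular expressions do not truly parse HTML, so check the result on unusual markup.

```javascript
// Strips style="..." and style='...' attributes from an HTML string.
// Assumption: the HTML is simple enough for a regular expression, for example
// it contains no literal text that looks like a style attribute.
function removeInlineStyles(html) {
  return html.replace(/\s+style\s*=\s*("[^"]*"|'[^']*')/gi, '');
}

var pasted = '<p class="MsoNormal" style="color: #212529;">Example paragraph</p>';
console.log(removeInlineStyles(pasted));
// <p class="MsoNormal">Example paragraph</p>
```

Requiring whitespace before the attribute name keeps the pattern from touching attributes whose names merely end in "style".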

Remove Spans

Many tools nest span elements within other elements, but these are typically not needed and can really mess up formatting. In EdTech Books, you can click Tools > Code Cleaning > Remove Spans to remove every span in a document without losing its contents.

Figure 2

Removing Spans in EdTech Books

This function uses a simple jQuery command to replace each span element with its contents, and it could be used in the Inspector Console on other web authoring tools that use jQuery by replacing chapter-container with your own encapsulating element id:

$('#chapter-container').find('span').each(function() {
  $(this).replaceWith($(this).contents());
});

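This can also be achieved using JavaScript as follows, again replacing chapter-container with your own encapsulating element id:

var container = document.getElementById('chapter-container');
var spanElements = container.getElementsByTagName('span');
// Loop backward because the live collection shrinks as spans are removed
for (var i = spanElements.length - 1; i >= 0; i--) {
  var spanElement = spanElements[i];
  var spanParent = spanElement.parentNode; // may be a nested element, not the container
  // Move each child out in front of the span, then remove the empty span
  while (spanElement.firstChild) {
    spanParent.insertBefore(spanElement.firstChild, spanElement);
  }
  spanParent.removeChild(spanElement);
}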

You can also achieve this in a text editor with regular expression support by searching and replacing <span.*?> and </span> with blank strings.
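If you would rather script this replacement than run it in an editor, it can be applied to an HTML string. The sketch below is illustrative only: the function name removeSpanTags is hypothetical, and it merges the two searches into one pattern that matches both opening and closing span tags.

```javascript
// Removes opening and closing span tags from an HTML string, keeping their contents.
// The \b keeps the pattern from matching other tag names that begin with "span".
function removeSpanTags(html) {
  return html.replace(/<\/?span\b[^>]*>/gi, '');
}

var pasted = '<p><span style="color: #212529;">Example <span>paragraph</span></span></p>';
console.log(removeSpanTags(pasted));
// <p>Example paragraph</p>
```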

Remove Tool-Specific Elements

Some tools inject their own unique elements into HTML that should be removed. These will vary by tool but can be removed with jQuery as follows, while replacing chapter-container with your own encapsulating id and o\:p with the name of the tool-specific elements you want to replace:

$('#chapter-container').find('o\\:p').each(function() {
  $(this).replaceWith($(this).contents());
});

This finds all elements of type <o:p>, which is a common Microsoft Word addition, and removes them without losing any content inside them.

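You can also do this with JavaScript as follows, replacing chapter-container with your own encapsulating id and o:p with the name of the tool-specific element:

var container = document.getElementById('chapter-container');
var wrappedElements = container.getElementsByTagName('o:p');
// Loop backward because the live collection shrinks as elements are removed
for (var i = wrappedElements.length - 1; i >= 0; i--) {
  var wrappedElement = wrappedElements[i];
  var wrappedParent = wrappedElement.parentNode; // usually a nested element, not the container
  // Move each child out in front of the element, then remove the empty element
  while (wrappedElement.firstChild) {
    wrappedParent.insertBefore(wrappedElement.firstChild, wrappedElement);
  }
  wrappedParent.removeChild(wrappedElement);
}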

You can also achieve this in a text editor by finding and replacing the opening and closing tags with blank strings.
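For reuse across several tool-specific tags, the find-and-replace can be wrapped in a small function that works on an HTML string. This is a sketch with assumed names (escapeRegExp and unwrapTags are hypothetical helpers); it builds the pattern from whatever tag name you pass in and, like any regular expression approach, is only suitable for straightforward markup.

```javascript
// Escapes characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Removes the opening, closing, and self-closing tags of the given element
// name from an HTML string, keeping whatever content sits between them.
function unwrapTags(html, tagName) {
  var pattern = new RegExp('<\\/?' + escapeRegExp(tagName) + '(\\s[^>]*)?\\/?>', 'gi');
  return html.replace(pattern, '');
}

var pasted = '<p>Example paragraph <o:p></o:p></p>';
console.log(unwrapTags(pasted, 'o:p'));
// <p>Example paragraph </p>
```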

Using Artificial Intelligence (AI)

In addition to the above methods, you can also look at the HTML, decide what you want done, and then ask an AI, like ChatGPT, to write a JavaScript function for you. You can then use the function in the Inspector Console.

For example, you might use the following as a prompt:

Write a JavaScript function that removes all style attributes within an HTML element with the id "chapter-container".
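The response will differ from one AI tool and session to the next, so treat any generated code as a draft to review before running it. As an illustration only, a function answering this prompt might resemble the sketch below; the name removeStyleAttributes and the choice to pass the container in as a parameter are assumptions made here, not output from any particular tool.

```javascript
// Removes every inline style attribute inside the given container element.
// Returns the number of elements that were cleaned.
function removeStyleAttributes(container) {
  if (!container) {
    return 0; // nothing to clean if the element was not found
  }
  // Select only the descendants that actually carry a style attribute
  var styled = container.querySelectorAll('[style]');
  for (var i = 0; i < styled.length; i++) {
    styled[i].removeAttribute('style');
  }
  return styled.length;
}

// In the Inspector Console, you might call it like this:
// removeStyleAttributes(document.getElementById('chapter-container'));
```

Returning a count gives you a quick way to confirm in the console that the function found the elements you expected.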

This content is provided to you freely by EdTech Books.

Access it online or download it at https://edtechbooks.org/elearning_hacker/cleaning_html.