Careful What You Cut and Paste

If you have been fortunate enough to be designated as the content editor for your company’s website (or a site you own), you probably find yourself adding content to your web pages. I have known many people who create their content in Microsoft Word and then simply cut and paste it into their web pages. Today I am going to show you that, depending on the editor you use, this can bloat your web pages and may even introduce code that is not ADA compliant. So, let’s begin.

I am going to begin by creating a simple web page with a free text editor, Notepad++, which allows me to create and edit a variety of text documents including HTML files.

Source of a simple web page edited with Notepad++

For those of you who may not be familiar with HTML coding, this short 9-line file defines a simple but complete web page. The maroon items in angle brackets are called tags. They define segments of the document and are almost always found in pairs. Some tags, like <meta> and <hr>, do not require a closing tag, but that is a story for another time. The opening tags like <HTML>, <HEAD>, and <BODY> define the start of a segment. The closing tags like </HTML>, </HEAD>, and </BODY> define the end of a segment. Tag pairs can appear sequentially, as <HEAD> and <BODY> do, or nested, as <HEAD> and <BODY> are within the <HTML> tag pair. (Nesting of tags is another important topic for another day.)
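For readers who like to experiment, the pairing rule can be checked mechanically. The following Python sketch uses the standard library's html.parser module; the class name PairChecker, the function check_pairs, and the short list of unpaired tags are my own illustrative choices, not part of any tool mentioned in this article, and real browsers are far more forgiving than this check.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (a partial, illustrative list).
VOID_TAGS = {"meta", "hr", "br", "img", "link", "input"}


class PairChecker(HTMLParser):
    """Track opening and closing tags and record any that do not pair up."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of tags still waiting for a close
        self.problems = []    # human-readable list of mismatches

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <HTML> arrives as "html".
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


def check_pairs(html):
    """Return a list of pairing problems; an empty list means all tags pair up."""
    checker = PairChecker()
    checker.feed(html)
    checker.close()
    return checker.problems + [f"unclosed <{t}>" for t in checker.open_tags]


page = "<HTML><HEAD><TITLE>Hi</TITLE></HEAD><BODY>Hello readers!<HR></BODY></HTML>"
print(check_pairs(page))                  # []
print(check_pairs("<p><b>text</p></b>"))  # tags closed in the wrong order
```

An empty list means every opening tag found its partner in the right order, which is what the simple page above should produce.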

Without turning this into an HTML class, let me just summarize what is going on in this file. The <HTML></HTML> tag pair defines the page’s start and end. The <HEAD></HEAD> tag pair defines metadata about the page, including the page title which, when present, appears in the window’s title bar. The <BODY></BODY> pair defines all content that appears on the page. Therefore, the text ‘Hello readers!’ appears when this page is displayed in a browser. For simple files, you can even use Microsoft’s File Explorer (which in some ways is also a browser but does not display all formatting tags properly). In fact, the following figure shows this file selected in File Explorer.

A simple text HTML page displayed in File Explorer

Now, suppose I want to add another paragraph to this content. Because I am familiar with Microsoft Word, I may choose to create that content there. (Or perhaps I have been lucky and others created the content for me in Word.) Let’s assume that the paragraph below is a portion of the content I received from a colleague.

Content received from a colleague written in Microsoft Word

Notice that some of the content has been formatted in Word: some words are bold or italic, and some appear larger and perhaps in a different font. I select the text and copy it to the clipboard. Then, returning to the web page open in Notepad++, I paste the text as shown in the figure below.

MS Word content pasted in the HTML document in Notepad++

After pasting the copied content, I added two more tags, <p> and </p>, to define the text as a separate paragraph. Otherwise, the browser would display the new content immediately after the exclamation mark in the first line, because HTML collapses white space in the source, including carriage returns and line feeds. However, when I look at the page now, I notice several things.

File Explorer shows the updated web page with the MS Word content

First, the apostrophe in the word “Don’t” appears to have been replaced with a series of unexpected characters even though it appeared fine in the source file. Also, text that was bold, italic, or a larger size has lost the formatting defined in Word. Notepad++, like the version of Notepad that comes with Windows, strips out most style formatting from text pasted from the clipboard. With a little knowledge of HTML formatting, these issues can be easily fixed. But what if you were using an HTML editor other than Notepad++ or Notepad? Well, let’s see.
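Stray characters like these are usually a sign of a character-encoding mismatch rather than a problem with the text itself. The following Python sketch reproduces the effect under one assumption about what happened here, which may not match every setup: the file was saved as UTF-8 but displayed as Windows-1252. Word’s curly apostrophe (U+2019) takes three bytes in UTF-8, and each byte then shows up as its own character.

```python
# Word's "smart" apostrophe is U+2019, not the plain ASCII apostrophe (').
original = "Don\u2019t"

# Assumption: the editor saved the file as UTF-8 ...
raw_bytes = original.encode("utf-8")

# ... and the viewer decoded those same bytes as Windows-1252 (cp1252).
garbled = raw_bytes.decode("cp1252")
print(garbled)  # Donâ€™t

# One possible workaround: write the character as an HTML entity,
# which survives no matter how the file is decoded.
safe = original.replace("\u2019", "&rsquo;")
print(safe)  # Don&rsquo;t
```

Another common remedy is to declare the encoding in the <head> section with <meta charset="utf-8"> so the viewer does not have to guess.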

I am going to create this simple web page in a free tool called BlueGriffon. Unlike Notepad++, which does not support WYSIWYG formatting, BlueGriffon provides a wide selection of formatting tools to help build web pages. It even shows the page side-by-side with the source code for the page. The following figure shows the initial starter web page with the text ‘Hello Reader!’ in the content area. Notice that it automatically added a <meta> tag and <title></title> tags to the <head> section. It also created a paragraph block after the single line of text because I pressed the Return key at the end of that line when I created it. In any case, you can see that the HTML source for the web page is still quite small, taking a total of 12 lines.

Recreating the base web page using BlueGriffon with Dual View turned on

Now let me copy that same text from Microsoft Word that is still in my clipboard. I paste the text into the left window which shows the WYSIWYG version of the page. The following figure shows a portion of the result:

Example of the additional formatting added by MS Word to a web page using BlueGriffon

Even though this image truncates most of the inserted source, you can see that the page has ballooned from 12 lines to 346 lines. You can recognize the first 8 lines and the lines after 336 as essentially the same as before, with the addition of the pasted text. However, all the other added text comes from Microsoft Word, because BlueGriffon does not automatically remove the formatting as Notepad++ does.

On the positive side, you can see the <b></b> tags where bold text appeared and <i></i> tags where italic text appeared. Another pair of tags, <span></span>, defines a segment of text with its own font characteristics. In this case, the font size is set to 14 pt, the line height to 107% of normal, and the font family to Arial Black. Actually, the font-family property within the style attribute lists two font families. When more than one font family is specified, the browser can fall back to the next one if the preferred font family does not exist on the viewing device. In this case, the browser, reading from left to right, attempts to use Arial Black if it exists. Otherwise, it falls back to the generic sans-serif family. It is possible to specify more than two font families, but do so with caution because each font family has its own character sizes and could therefore cause the text to reflow in unexpected ways.
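To make the fallback order concrete, here is a small Python sketch that splits a style attribute into its properties and then walks the font list from left to right. The exact style string is my reconstruction from the values described above, and the functions parse_style, font_stack, and pick_font are hypothetical helpers written for illustration; a real browser’s font matching is considerably more involved.

```python
# Generic families are resolved by the browser itself, so treat them as always available.
GENERIC_FAMILIES = {"serif", "sans-serif", "monospace", "cursive", "fantasy"}


def parse_style(style):
    """Split an inline style attribute into a {property: value} dictionary.

    Simplification: assumes no semicolons inside values.
    """
    props = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            props[name.strip().lower()] = value.strip()
    return props


def font_stack(value):
    """Turn a font-family value into an ordered list of family names."""
    return [name.strip().strip("\"'") for name in value.split(",") if name.strip()]


def pick_font(stack, installed):
    """Return the first family that is installed or generic, reading left to right."""
    available = {name.lower() for name in installed}
    for family in stack:
        if family.lower() in available or family.lower() in GENERIC_FAMILIES:
            return family
    return None


# Assumed style string, modeled on the values described in the text.
style = 'font-size:14.0pt;line-height:107%;font-family:"Arial Black",sans-serif'
props = parse_style(style)
stack = font_stack(props["font-family"])

print(stack)                               # ['Arial Black', 'sans-serif']
print(pick_font(stack, {"Arial Black"}))   # Arial Black
print(pick_font(stack, {"Georgia"}))       # sans-serif
```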

But the big concern is the extra instructions between lines 9 and 335. This formatting is carried along with the copying of text from Microsoft Word. Is it needed for your web page? No. Does it hurt anything? Visually, no. However, it makes the web page significantly larger than needed and therefore affects the time needed for a browser to download and create the page before displaying it. In most cases, this extra time may be insignificant, but if the site has hundreds of pages, the wasted space begins to add up.

In addition to adding to the size of your web pages, the copy process may add content that is not strictly ADA compliant. For example, the <b></b> tags for bold text should be replaced with <strong></strong> tags, and <i></i> should be replaced with <em></em> tags. Why? Because <b> and <i> describe only how text looks, while <strong> and <em> mark it as important or emphasized, so people who rely on screen readers and other assistive technology may otherwise miss this information. Of course, replacing these tags is easy with the Replace function found in most editors.
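If you would rather script the replacement, be aware that a naive search for "<b" also matches <body> and <br>. The following Python sketch, with a function name of my own choosing (use_semantic_tags), uses a regular expression that only matches the exact tags <b> and <i>, with or without attributes. Treat it as a starting point under those assumptions rather than a complete accessibility fix.

```python
import re

# Map each presentational tag to its semantic counterpart.
TAG_MAP = {"b": "strong", "i": "em"}

# Match <b>, </b>, <i>, </i>, optionally with attributes, but not <body>, <br>, or <img>:
# the tag name must be followed directly by whitespace or the closing bracket.
TAG_PATTERN = re.compile(r"<(/?)(b|i)(\s[^>]*)?>", re.IGNORECASE)


def use_semantic_tags(html):
    """Replace <b>/<i> tags with <strong>/<em>, keeping any attributes."""
    def swap(match):
        slash = match.group(1)
        name = match.group(2).lower()
        attrs = match.group(3) or ""
        return f"<{slash}{TAG_MAP[name]}{attrs}>"

    return TAG_PATTERN.sub(swap, html)


sample = "<body><p><b>Don't</b> <i>paste</i> blindly.<br></p></body>"
print(use_semantic_tags(sample))
# <body><p><strong>Don't</strong> <em>paste</em> blindly.<br></p></body>
```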

One way to avoid the bloat that occurs when you copy and paste directly from Microsoft Word is to first paste the text into Notepad++ or Notepad to strip out all the extra formatting. Then copy and paste the remaining text from there. Of course, you will then have to reapply any formatting the text had before. However, the resulting source file is now a much trimmer 15 lines, as shown in the following image.
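The same stripping step can be approximated in code. The following Python sketch uses the standard library’s html.parser to keep only the visible text and discard tags and style blocks, roughly what pasting through Notepad does for you. The class TextOnly, the function strip_formatting, and the sample markup are hypothetical and only loosely modeled on what Word produces.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text while skipping the bodies of <style> and <script>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # greater than zero while inside <style> or <script>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_formatting(html):
    """Return the text content of html with all markup and extra white space removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Sample markup, loosely imitating a Word paste (an assumption, not the real output).
pasted = (
    "<style>p.MsoNormal{margin:0in}</style>"
    '<p class="MsoNormal"><b>Don\u2019t</b> paste '
    '<span style="font-size:14.0pt">blindly</span>.</p>'
)
print(strip_formatting(pasted))  # Don’t paste blindly.
```

As with the Notepad approach, the bold and larger text must be formatted again afterward, this time with only the tags you actually want.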

MS Word content trimmed of formatting copied into BlueGriffon web page

The bottom line is this. If you are using an HTML editor other than Notepad++ or Notepad, first test what happens when you copy and paste text from Microsoft Word. Also, check whether your HTML editor has an option to paste the contents of the clipboard without the formatting, as shown here:

Some HTML editors provide an option to paste text without formatting

That’s all for this time.
