Revamping the List-O-Books, Part Two


As noted in a previous post, I’ve revamped my List-O-Books™ pages to include IndieBound affiliate links instead of unaffiliated links to Amazon.  Creating these list pages had become more difficult via Amazon, IndieBound has a good affiliate program, and philosophically I prefer to support independent book stores.  That was my rationale for changing the List-O-Books pages.  The remainder of this post deals with how I implemented those changes.

I normally don’t write technical articles; the depth of my technical knowledge is surpassed many times over if you scan the web looking for tech articles.  But the process of revamping these pages may be useful to others who want to switch to IndieBound links on their blogs, and it is a good, practical introduction to regular expressions, a handy tool in a blogger’s toolbox.  If this doesn’t interest you, please check out the rest of my site, skipping what will otherwise be a very boring post.

Note:  Writing this article was a tedious exercise, due mostly to my inexperience using the somewhat limited code editing capabilities through WordPress.com.  If you happen to use the regular expression I annotate herein and find an error, please let me know.  I’d appreciate the opportunity to hear your feedback.

IndieBound has a basic link generator that gives you the option to enter link text, include a book’s cover image, or have an alternate generic image urging readers to “Shop IndieBound.”  Reworking these links is time consuming and tedious, but necessary if you want the links to fit stylistically with the rest of your blog site.   I wanted to include the cover image, floating to the right of the book title, with the author below, and both the cover image and the title needed to link to IndieBound.  The HTML code for this is more complicated than the standard code provided by IndieBound.

A Basic IndieBound Link

Most of the elements I wanted are included in these generated links: a snappy cover image, a link to the book, and, if you include it as the link text, the title of the book. The web form for creating a link allows you to enter link text of your choice, so you have to type or cut-and-paste the title into the form to override the default text of “Shop IndieBound.”  In the code below, I’ve created a link for the classic science fiction novel Dune by Frank Herbert. I’ve formatted this across multiple lines for readability.

<a href="http://www.indiebound.org/book/9780441172719?aff=cvanhasselt">
<img  style="border: 1px solid #000"
src="http://images.booksense.com/images/books/719/172/FC9780441172719.JPG"
onerror="this.src = 'http://www.indiebound.org/files/book_not_found.jpg';" /><br />Dune
</a>

The above code will result in a link that looks like this:


Dune

Note: I have removed the JavaScript onerror attribute from the above rendering; WordPress would have stripped it out if I hadn’t. But y’all get the idea, right?

IndieBound’s servers seem to respond very quickly; your typing and internet speed will be the critical factors in how long it takes to create links for your site.  Creating each link and pasting it into a text editor takes about 15-20 seconds.  At that rate, with over 250 links to change, it took about 1.5 to 2 hours to create all the links, though not all at one sitting.

I pasted these links into Notepad++, my editor of choice for working with text and HTML code, a free editor based on the open-source Scintilla project.  There are more sophisticated editors, but for the cutting and pasting work I was doing, Notepad++ worked well, and did I mention it is free?  But if you prefer using another editor, that is OK.  There are many text editors that support regular expression search and replace, which is the real meat-and-potatoes of this article.  The regular expression explicated herein has been tested in Notepad++, but should work in other text editors that support regular expressions, with perhaps some minor tweaking.

What I need…

There are a number of things I needed to change from the above generated link:

  1. The JavaScript onerror attribute needed to be removed; JavaScript is not allowed on WordPress.com.
  2. The style attribute needed to go, as I would use more flexible styling from this blog’s custom CSS styles.
  3. The book title needed to be rendered in bold, as the inner HTML of a <strong> element.
  4. After the title, a line-break needed to be added, followed by  the word “by”, after which I would type the author’s name(s).
  5. The entire link needed to be wrapped as the inner HTML of an <li> tag, as I wanted this to be an ordered list.

My goal was to accomplish all of these changes in one search and replace operation, using regular expressions.  By the way, the term “regex” is a common shorthand for regular expression, and I’ll be using that shorthand throughout this article.

Get Regular!

The regular expression used to break apart the various pieces of the link is below:

^(<[^>]+>)(<img\s+).*(src="[^"]+\"\s+)onerror="[^"]+\"(\s+/>)<br />(.*)(</a>)
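As a rough cross-check outside Notepad++, the expression can be tried in Python's re module. This is only a sketch under two assumptions: the whole link sits on one line (the multi-line layout in this article is for readability only), and the generated link has a <br /> between the image and the title, as the last part of the expression expects. Python's engine is not the one Notepad++ uses, so identical behavior is likely here but not guaranteed.

```python
import re

# The article's pattern as a Python raw string (quotes escaped as in the
# part-by-part breakdown). Cross-check only; Notepad++ uses another engine.
LINK_PATTERN = re.compile(
    r'^(<[^>]+>)(<img\s+).*(src="[^"]+\"\s+)onerror="[^"]+\"(\s+/>)<br />(.*)(</a>)'
)

# Assumed input: the Dune link joined onto ONE line, with <br /> before the title.
SAMPLE_LINK = (
    '<a href="http://www.indiebound.org/book/9780441172719?aff=cvanhasselt">'
    '<img  style="border: 1px solid #000" '
    'src="http://images.booksense.com/images/books/719/172/FC9780441172719.JPG" '
    "onerror=\"this.src = 'http://www.indiebound.org/files/book_not_found.jpg';\" />"
    '<br />Dune</a>'
)

match = LINK_PATTERN.match(SAMPLE_LINK)
if match:
    # Print each captured group next to its backreference name (\1, \2, ...).
    for number, text in enumerate(match.groups(), start=1):
        print(f"\\{number}: {text!r}")
else:
    print("no match: check that the link is on a single line")
```

Six groups come back, one per pair of parentheses; the style and onerror attributes are matched but land in none of them.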

Break it Down!

Now I’ll break down the regular expression one step at a time, explaining what each part of the expression finds, or matches, within the generated link.  Keep in mind what the whole link structure is:

<a href="http://www.indiebound.org/book/9780441172719?aff=cvanhasselt">
<img  style="border: 1px solid #000"
src="http://images.booksense.com/images/books/719/172/FC9780441172719.JPG"
onerror="this.src = 'http://www.indiebound.org/files/book_not_found.jpg';" /><br />Dune
</a>

Together, all the regex parts will crack apart the above link, separating all the useful pieces and throwing away the pieces I don’t need.  We’ll save the good bits, throw out the bad, recombine the good bits, and replace text based on all this work.  The end result will be a new link formatted to meet my specific needs.

RegEx Part One:

^(<[^>]+>)
  1. This section of the regex will look for a text string starting from the beginning of a line, indicated by the “^” caret character.
  2. The opening parenthesis tells the regular expression parser to continue searching, and to save, or capture, in memory for later use whatever string matches the expression contained within the parentheses.
  3. The left angle bracket (less-than sign) is a character literal.  The regex parser will try to match the beginning of the line followed by an angle bracket.
  4. The square brackets represent a character class, meaning a whole set of defined characters.  Within the character class definition, bounded by square brackets, a leading caret means Boolean not.  Therefore, this character class definition represents all characters not equal to the right angle bracket.
  5. The plus sign following the closing square bracket directs the regex engine to find one or more characters from the preceding character class.  The right angle bracket after it is the literal text character “>”, which must come next.
  6. The closing parenthesis tells the regular expression engine to take everything found starting with the first character matched after the opening parenthesis, and save it to memory for later use.  Everything from the beginning of the line to the first closing angle bracket will be matched.

Whew!  That’s a lot of explanation for one little regex, right?  Did I mention regular expressions were terse? At this point, the regex engine will set up one memory slot, named \1, that will store the result of its search efforts thus far.  If you consider the original link text from above, that means \1 will contain

<a href="http://www.indiebound.org/book/9780441172719?aff=cvanhasselt">

This is the full opening link tag for the link, including the href attribute.   We’ll definitely need to keep this.  By the way, “cvanhasselt” is my affiliate account link.  Feel free to use it if you want, but the affiliate proceeds will go to me instead of you!
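To see why part one uses the negated character class [^>]+ rather than a catch-all, here is a small Python sketch comparing the two. The shortened link and the "cover.jpg" file name are made up for illustration, and Python's re module stands in for the Notepad++ engine.

```python
import re

# Hypothetical one-line link, shortened for the comparison.
line = (
    '<a href="http://www.indiebound.org/book/9780441172719?aff=cvanhasselt">'
    '<img src="cover.jpg" />Dune</a>'
)

# [^>]+ cannot cross a ">", so the capture stops at the end of the first tag.
negated = re.match(r'^(<[^>]+>)', line).group(1)

# .* is greedy: it runs to the end of the line and backs up only as far as
# the LAST ">", swallowing the whole link.
greedy = re.match(r'^(<.*>)', line).group(1)

print(negated)
print(greedy)
```

The first capture holds only the opening <a> tag; the second holds the entire line, which would leave nothing for the later parts of the expression to work with.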

RegEx Part Two:

(<img\s+).*
  1. The opening parentheses tells the regex engine to remember everything it finds matching the expression within parentheses.  This section of the string will be stored as \2.
  2. The regex engine will look for a string starting with an open angle bracket, followed by the literal text “img”, the start of an HTML <img> tag.
  3. \s+ directs the regex engine to then find one or more whitespace characters (spaces, tabs and the like) following the literal text “<img”.
  4. The closing parenthesis indicates the boundary of the matched expression that will be stored in \2.
  5. “.*” is regex shorthand for any character (the dot), zero or more times (the asterisk).   In this case, looking back at the original string, following the start of the img tag is a style attribute, defining the border of the cover image.  I don’t want to keep that style, so there is no reason to keep the matched characters for later use.  Essentially, by not enclosing the “.*” in parentheses, I am telling the regex engine to scan over this text, and then throw it away.

At this point, \1 has the opening <a> tag, \2 has the opening part of the <img> tag.
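The "matched but thrown away" behavior of the unparenthesized “.*” can be seen in a short Python sketch. The pattern below is a cut-down stand-in for parts two and three, not the article's full expression, and "cover.jpg" is a placeholder.

```python
import re

# Hypothetical image tag carrying the style attribute we want to drop.
fragment = '<img  style="border: 1px solid #000" src="cover.jpg" />'

# Cut-down pattern: capture the tag start, skip anything, capture "src=".
m = re.search(r'(<img\s+).*(src=)', fragment)

print(m.group(0))   # everything the pattern matched, style attribute included
print(m.groups())   # only what the parentheses captured
```

The style attribute appears in the overall match but in neither captured group, which is why it disappears once the replacement is built from the groups alone.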

RegEx Part Three:

(src="[^"]+\"\s+)
  1. The opening parentheses says to the regex engine, “Hey, save this next little bit, up until the close parentheses.”  The matching text will be stored as \3.
  2. src=" directs the regex engine to find the literal text “src=” followed by a quotation mark.
  3. [^"] is a character class that will include every character that is not a quotation mark.
  4. Following the character class by a plus sign directs the regex engine to find one or more characters from that class, in this case anything that is not a quote symbol.
  5. \" is an escaped character.  A quotation mark does not normally have a special meaning in regular expressions, so escaping the quote sign here is a precaution rather than a requirement.  The backslash tells the regex engine to consider the next character to be literal text, ignoring any special meaning the character may have to the regex engine.  For example, we’ve seen that a plus sign has a special meaning for the regex engine, so how might you search for a plus sign?   By using an escape!  You would search for \+.  At this point, the regex expression is trying to match the literal text “src=”, followed by a quotation mark, a bunch of characters that are not the quotation mark, and finally a quotation mark.  In other words, the entire quoted src attribute.
  6. \s is regex shorthand for a whitespace character.  Following it by a plus directs the regex engine to match one or more whitespace characters following the src attribute.
  7. Finally, the closing parentheses ends the regex pattern that will be searched for and stored in \3.
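The effect of the backslash escape can be checked with a few lines of Python. This sketch describes Python's re module only; other engines, Notepad++ included, may treat a stray quantifier differently. The sample text is made up.

```python
import re

text = 'price="1+1"'

# An escaped quote and a bare quote find exactly the same characters.
escaped_quotes = re.findall(r'\"', text)
bare_quotes = re.findall(r'"', text)

# An escaped plus matches a literal "+".
literal_plus = re.findall(r'\+', text)

# A bare plus on its own is a quantifier with nothing to repeat, which
# Python rejects as an invalid pattern.
try:
    re.compile(r'+')
    bare_plus_is_valid = True
except re.error:
    bare_plus_is_valid = False

print(escaped_quotes == bare_quotes, literal_plus, bare_plus_is_valid)
```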

So, \3 should match the src attribute of the <img> tag, including the quoted  URL string for the  attribute.  This is a great example of what makes regular expressions powerful.  We know every src attribute is going to have a unique URL, so we can’t do a normal find operation.   But by matching a pattern, namely a pattern of “looks like a src attribute” followed by a URL string,  we can find all the src attributes.
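A quick way to convince yourself that the pattern follows the shape of a src attribute rather than any particular URL is to run it over tags with different addresses. In this Python sketch the example.com URLs are placeholders, not real cover images.

```python
import re

# Part three of the article's expression, on its own.
SRC_ATTRIBUTE = re.compile(r'(src="[^"]+\"\s+)')

# Hypothetical image tags with different URLs.
tags = [
    '<img src="http://example.com/covers/one.jpg" />',
    '<img src="http://example.com/a/much/longer/path/two.png" />',
]

# The same pattern picks out the src attribute in every tag.
found = [SRC_ATTRIBUTE.search(tag).group(1) for tag in tags]
for attribute in found:
    print(repr(attribute))
```

Each capture includes the trailing whitespace matched by \s+, so the space after the closing quote comes along for the ride.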

RegEx Part Four:

onerror="[^"]+\"
  1. This section of the search expression will search for the literal text “onerror”, followed by a quote symbol.
  2. The character class [^"] should be familiar now; it is any character that is not a quote symbol.
  3. The plus sign following the character class tells the regex engine to search for one or more characters from the previous character class.
  4. Finally, a match for this part of the expression should end in a quote, which I’ve again escaped.

Since no part of this section is enclosed in parentheses, none of this section will be saved.  The character class is used here just because I didn’t want to type out the Javascript code for the onerror attribute.  Remember that WordPress.com does not allow inclusion of active Javascript code within user content areas, like post text, widget containers, or in pages.  For that reason, I’m just telling the regex engine to skip over the onerror attribute as it won’t be saved for later.
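The character class works as a stand-in for the JavaScript because the onerror value contains single quotes and a semicolon but no double quote until the very end. A small Python sketch, with a made-up "fallback.jpg" in place of the real fallback image:

```python
import re

# Part four of the article's expression, on its own.
ONERROR = re.compile(r'onerror="[^"]+\"')

# Hypothetical tag with a shortened fallback file name.
tag = '<img src="cover.jpg" onerror="this.src = \'fallback.jpg\';" />'

# [^"]+ runs through spaces, single quotes and the semicolon, and stops
# only at the closing double quote of the attribute.
skipped = ONERROR.search(tag).group(0)
print(skipped)
```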

RegEx Part Five:

(\s+/>)<br />(.*)(</a>)

This final section of the regular expression closes out the end of the IndieBound link.  Essentially, this will complete the capture of everything we need to move forward with the replacement of text.

  1. (\s+/>) captures one or more spaces prior to the HTML closing slash-bracket, storing this value in \4.
  2. The HTML line-break is not enclosed in parentheses, and thus skipped over – matched but not saved.
  3. (.*) captures anything and everything after the HTML line-break, stopping at the close of the <a> tag.  It will be stored as \5.  If you’ve added
  4. The close of the <a> tag is captured and stored as \6.  This isn’t entirely necessary; you could skip over this, as long as you used the closing <a> tag as a stop to the preceding “catch-all” expression.
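The point in step 4 can be sketched in Python: with or without parentheses around the closing tag, the catch-all stops at the same place. Note that when part five runs on its own the groups are numbered from 1, not from 4 as in the full expression. The tail string is a shortened, assumed ending of a generated link.

```python
import re

# Assumed tail end of a generated link.
tail = ' /><br />Dune</a>'

# Closing tag captured, as in the article.
with_group = re.search(r'(\s+/>)<br />(.*)(</a>)', tail)

# Closing tag matched but not captured; it still stops the catch-all.
without_group = re.search(r'(\s+/>)<br />(.*)</a>', tail)

print(with_group.groups())
print(without_group.groups())
```

If the closing tag is left uncaptured, it has to be typed literally into the replacement string in place of the last backreference.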

And that does it!  A long explanation for a rather short regular expression.  Now, on to how to replace things.

Shake, Stir, and Replace:

Believe it or not, we’ve reached the easy part of all this mess: putting back together what we have torn asunder.  We have arranged the regular expression to save all the relevant pieces of the IndieBound link temporarily, and now we need to rearrange them.

  1. The new link should be contained in an HTML list-item, wrapped with an <li> tag.
  2. The link text should be in bold.
  3. We want to include the word “by” after the link, so we can easily paste in author names.

All this can easily be done.  The relevant replacement string is:

<li>\1\2\3\4<strong>\5</strong>\6<br />\nby \n</li>

The “\n” section of the replacement string forces a line break into our replacement. This makes the resulting HTML a little easier to read, but isn’t entirely necessary. Watching Notepad++ shift everything around is a satisfying moment.  With a few short statements, you can completely reformat 60, 100, or 1,000 IndieBound links.  The rest of the look I’ve achieved on the List-O-Books pages is through careful use of CSS.  Note: You must have the Custom Design upgrade to handle custom CSS styles for your site.
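For readers who would rather script the replacement than run it in an editor, here is a minimal Python sketch of the same search and replace. It assumes one link per line with a <br /> before the title; generated_link is a hypothetical helper, and the second ISBN, its title, and the images.example.com address are placeholders. Python's replacement syntax happens to accept the same \1 through \6 and \n used above, though other tools may want $1 style references instead.

```python
import re

# The article's search pattern; MULTILINE lets ^ match at every line start.
LINK_PATTERN = re.compile(
    r'^(<[^>]+>)(<img\s+).*(src="[^"]+\"\s+)onerror="[^"]+\"(\s+/>)<br />(.*)(</a>)',
    re.MULTILINE,
)
# The article's replacement string.
REPLACEMENT = r'<li>\1\2\3\4<strong>\5</strong>\6<br />\nby \n</li>'


def generated_link(isbn: str, title: str) -> str:
    """Build a one-line link shaped like the generator's output (assumed shape)."""
    return (
        f'<a href="http://www.indiebound.org/book/{isbn}?aff=cvanhasselt">'
        '<img  style="border: 1px solid #000" '
        f'src="http://images.example.com/{isbn}.JPG" '
        "onerror=\"this.src = 'http://www.indiebound.org/files/book_not_found.jpg';\" />"
        f'<br />{title}</a>'
    )


links = "\n".join([
    generated_link("9780441172719", "Dune"),
    generated_link("9780000000000", "A Placeholder Title"),  # made-up entry
])

# subn returns the rewritten text plus the number of links it changed.
reformatted, count = LINK_PATTERN.subn(REPLACEMENT, links)
print(count)
print(reformatted)
```

Each link comes out wrapped in a list item, with the title in bold and a “by ” line waiting for the author's name; the style and onerror attributes are gone.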

And in conclusion…

Please contact me if this article has sparked your interest in regular expressions.  I hope it has been instructive.  There are a number of good books that can help you if you are interested in learning more about regular expressions.  By far the best, at least in my opinion, is Jeffrey E.F. Friedl’s Mastering Regular Expressions.  You’ll also find some great tutorials, and links to great tools, at Regular-Expressions.info.  One tool, available through this site, is RegexBuddy, which can help you develop complicated expressions for almost any situation.

About Chris van Hasselt

I eat, sleep, play guitar...but wait, there's more!