
Sunday, August 08, 2010

Retrieve and Edit a Web Page in Emacs

While browsing the web, I often find a web page I want to save locally. As often as not, I'll want to do some editing on it. Of course Firefox and other browsers allow me to save a web page as a local file. And, yes, I could then pull the local file into emacs (what I use most often for editing web pages) and do whatever editing I wish. But I go through this routine often enough that I'd like a faster and more streamlined method. This is a computer, after all. And using open source software, I don't have to settle for just those capabilities that are built into the software. If there's some functionality that I need and I can write the code for it, open source allows me to add it to what's already provided out of the box. Of course, once you start programming, it's just too tempting to add on another little feature... and then another... and then (why not?) another one.

One of two such add-ons has to do with text files created by an application on a Microsoft™ system. Such files commonly contain a Control-M character (a carriage return) at the end of each line, which appears as ^M on Linux and UNIX systems. It's half of the carriage return and linefeed pair that has ended lines in Microsoft™ files since the first days of DOS™. UNIX and Linux systems end lines with just a linefeed (^J), so the ^M at the end of every line serves no purpose and is just a nuisance. Removing these characters always takes a couple of extra steps, so I decided to insert a little extra bit of code to wipe them all out automatically as soon as the file is loaded into emacs.

Another feature I wanted was a citation. That is, when I download a web page, I want to save the URL it came from, in case I want to cite the source (as is done in academic work) or go back to the same website later to see what else it might have. So the code below automatically inserts the web address as well.

So, to review in brief, here's what this code does:

  1. Fetches a web page. You're prompted for the web address in the minibuffer.
  2. Strips out the ^M characters. If you're using Windows™, you might want to remove, or just comment out, the two lines of code for that.
  3. Inserts a paragraph containing the URL of the source web page. This paragraph is inserted at the top of the web page. You can change the location by replacing my re-search-forward search string with something which works for you. In a Linux or UNIX shell, do info elisp "Regular Expressions" for documentation.

    ;url-fetch-web-page
    ;From within emacs, download & locally edit a remote web page.
    ;Copyright (c) 2010, Kenneth Fisler, Cleveland, Ohio, USA.

    ;This program is free software; you can redistribute it and/or
    ;modify it under the terms of the GNU General Public License
    ;as published by the Free Software Foundation; either version 2
    ;of the License, or (at your option) any later version.
    ;
    ;This program is distributed in the hope that it will be useful,
    ;but WITHOUT ANY WARRANTY; without even the implied warranty of
    ;MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    ;GNU General Public License for more details.
    ;
    ;You should have received a copy of the GNU General Public
    ;License along with this program; if not, write to the Free
    ;Software Foundation, Inc., 59 Temple Place, Suite 330,
    ;Boston, MA 02111-1307 USA

    ;Using the url.el library, so probably need to load it.
    (require 'url)

    (defun url-fetch-web-page (url)
      "Retrieve minibuffer-specified web page, load into a new buffer,
    then call another function for editing."
      (interactive "sLoad URL: ")
      (with-temp-buffer
        (url-retrieve url 'edit-web-page (list url 'status))))

    ;url-retrieve passes its own status list as the first argument
    ;(received here as REDIRECT), followed by the listed arguments.
    (defun edit-web-page (&optional redirect url status)
      "Switch to the buffer returned by `url-retrieve'.
    Automatically strip out all C-m characters and then insert after
    html body tag the buffer's URL, appropriately html-tagged."
      (switch-to-buffer (current-buffer))
      ;; remove all instances of ^M (found in MS-created files).
      (goto-char 0)
      (perform-replace "\r" "" nil nil nil t nil nil nil)
      (let ((case-fold-search t))       ;case-insensitive search
        (goto-char 0)                   ;go to top of buffer
        (re-search-forward "<[\t\n ]*BODY[^>]*>" nil t))
      ;insert URL into page
      (insert "\n\n<p>From: <a href=\"" url "\">" url "</a>\n </p>\n\n"))

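In the first function, interactive prompts for the URL of the web page you want. The url-retrieve function then fetches that web page asynchronously into a buffer of its own and, once the page has arrived, calls the second function, edit-web-page, with that buffer current for further processing. (The temporary buffer created by with-temp-buffer does not actually hold the page, since url-retrieve creates its own buffer, so that wrapper could be dropped.)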
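Here's a minimal sketch of that calling convention on its own: url-retrieve hands the callback a status list first, followed by whatever extra arguments were listed in the call. The names my-fetch-page and my-fetch-page-callback are hypothetical, made up for illustration, and the error check is an addition that the code above does not have.

```elisp
;; Sketch only: the my-* names are hypothetical, not from the post.
(require 'url)

(defun my-fetch-page-callback (status url)
  "Show the buffer fetched from URL, or report the error found in STATUS.
`url-retrieve' calls this with the retrieval buffer current."
  (let ((err (plist-get status :error)))
    (if err
        (message "Could not fetch %s: %S" url err)
      (switch-to-buffer (current-buffer)))))

(defun my-fetch-page (url)
  "Prompt for URL and fetch it asynchronously."
  (interactive "sLoad URL: ")
  ;; The third argument lists the extra arguments that are passed to
  ;; the callback after the status list.
  (url-retrieve url #'my-fetch-page-callback (list url)))
```

With this loaded, M-x my-fetch-page prompts for an address just as the original command does. The callback runs only after the transfer finishes, so emacs stays usable while the page downloads.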

The edit-web-page function, called with the fetched web page as the current buffer, invokes switch-to-buffer to display it, then moves point to the top of the buffer with goto-char 0. Then perform-replace searches for any ^M (Control-M) characters and deletes them. (More precisely, each ^M is replaced by the empty string.) This function also gives a count of those "replacements" in the minibuffer.
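The ^M removal can also be sketched as a small standalone function built on a plain search loop, which is the form the Emacs documentation for perform-replace suggests for use inside Lisp programs. The name my-strip-carriage-returns is hypothetical, and, like the code above, this deletes every ^M in the buffer, not only those at the ends of lines.

```elisp
;; Sketch only: my-strip-carriage-returns is a hypothetical name.
(defun my-strip-carriage-returns ()
  "Delete every carriage return (^M) in the current buffer.
Return the number of characters deleted."
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      ;; Each successful search leaves the match data set for replace-match.
      (while (search-forward "\r" nil t)
        (replace-match "" t t)
        (setq count (1+ count))))
    count))
```

Because it returns the count instead of printing it, the caller can decide whether to show a message at all.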

The remaining lines of code bind the variable case-fold-search for a case-insensitive search, move point to the top of the buffer (again), then find the HTML <body> tag. Finding this tag leaves point at the end of that tag, which is where the HTML code displaying the URL is inserted. If no such tag is found, point stays at the top of the buffer and the URL is inserted there. (Many thanks to Denny on the gnu-emacs-help mailing list for coming up with the search string for locating the HTML body tag with all its imaginable variations and combinations.)
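To try the citation step by itself on any buffer, something along these lines should work. It's a sketch that reuses the same search string; the name my-insert-source-link is hypothetical.

```elisp
;; Sketch only: my-insert-source-link is a hypothetical name.
(defun my-insert-source-link (url)
  "Insert a paragraph linking to URL just after the opening body tag.
If no body tag is found, insert at the top of the buffer instead."
  (let ((case-fold-search t))           ;match <body>, <BODY>, <Body>...
    (goto-char (point-min))
    ;; With a non-nil third argument, a failed search leaves point alone.
    (re-search-forward "<[\t\n ]*body[^>]*>" nil t)
    (insert "\n\n<p>From: <a href=\"" url "\">" url "</a>\n</p>\n\n")))
```

Evaluating (my-insert-source-link "http://example.com/") in a buffer holding an HTML page puts the link paragraph right after the opening body tag, whatever attributes that tag carries.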

Note that the code above does not save the buffer. You have to do that yourself manually. I left this out by design so I could examine, and even edit, the buffer first and then decide where I wanted to save it. Or, if I decided I didn't want to save it at all, I could just kill it. Again, as this code is open source, you're free to add code to automatically save the fetched buffer to a file.

There are a lot of other ways this code could be altered or rewritten for other purposes. For example, when a web page is fetched, along with the actual HTML code, some output from the web server (the HTTP response headers) shows up at the top of the buffer. I could have written code to automatically delete these several lines of text. But I find that information interesting, so I left it in. You may prefer not to ever see it. Or, this might be precisely the information you happen to be looking for, the HTML being of no interest. A little hacking of the above could accomplish either purpose.
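For anyone who would rather not see the server output, here's a minimal sketch of one way to drop it. It assumes the headers end at the first blank line of the buffer, and the name my-delete-response-headers is hypothetical.

```elisp
;; Sketch only: my-delete-response-headers is a hypothetical name.
(defun my-delete-response-headers ()
  "Delete the server's response headers from the top of the current buffer.
Assumes the headers end at the first blank line; if there is no blank
line, leave the buffer unchanged."
  (save-excursion
    (goto-char (point-min))
    ;; A blank line may still carry a ^M before the newline.
    (when (re-search-forward "^\r?\n" nil t)
      (delete-region (point-min) (point)))))
```

It could be called from a callback like edit-web-page either before or after the ^M characters are stripped, since the search allows for both kinds of line ending.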

Many other purposes are possible, especially if we consider that the specified URL can use internet protocols other than http. What's nice about the modular structure of the above code is that any function which works on an emacs buffer can, with little modification, be executed on any emission from any accessible URL. In effect, this code is a tool loosely coupled with a working example. So it's very much an invitation too.
