While browsing the web, I often find a web page I want to save locally. As often as not, I'll want to do some editing on it. Of course Firefox and other browsers allow me to save a web page as a local file. And, yes, I could then pull the local file into emacs (what I use most often for editing web pages) and do whatever editing I wish. But I go through this routine often enough that I'd like a faster and more streamlined method. This is a computer, after all. And using open source software, I don't have to settle for just those capabilities that are built in to the software. If there's some functionality that I need and I can write the code for it, open source allows me to add functionality to what's already provided out of the box. Of course, once you start programming, it's just too tempting to add on another little feature... and then another... and then— why not?— another one.
One of two such add-ons had to do with text files created by an application on a Microsoft™ system. Such files commonly contain a Control-M character at the end of each line. On Linux and UNIX systems this appears as a ^M. It's part of the carriage return and linefeed pair of characters used in Microsoft™ files since the first days of DOS™. UNIX and Linux systems use just a ^J instead, so all the ^M characters at the end of every line are without purpose and just a nuisance. It's always a couple extra steps to remove all these characters, so I decided to insert a little extra bit of code to wipe out all of them automatically as soon as the file is loaded into emacs.
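Here is a minimal sketch of that clean-up step on its own. The function name my/strip-carriage-returns is made up for illustration, and this is not necessarily how the code in this post does it; it just shows one way to delete every ^M in the current buffer and report how many were removed.

```elisp
;; Sketch only: the function name is hypothetical, not from the original code.
(defun my/strip-carriage-returns ()
  "Delete every carriage return (^M) in the current buffer.
Return the number of characters removed."
  (let ((count 0))
    (save-excursion                       ; leave point where the user had it
      (goto-char (point-min))
      (while (search-forward "\r" nil t)  ; "\r" is the ^M character
        (replace-match "" t t)            ; replace it with the empty string
        (setq count (1+ count))))
    count))
```

Calling it in a buffer full of DOS-style lines leaves only the ^J line endings behind.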
Another feature I wanted was the inclusion of a citation. That is,
when I download a web page, I want to save the
So, to review in brief, here's what this code does:
- Fetches a web page. You're prompted for the web address in the minibuffer.
- Strips out the ^M characters. If you're using Windows™, you might want to remove, or just comment out, the two lines of code for that.
- Inserts a paragraph containing the URL of the source web page. This paragraph is inserted at the top of the web page. You can change the location by replacing my re-search-forward search string with something which works for you. In a Linux or UNIX shell, do info elisp "Regular Expressions" for documentation.
From: " url "\n
\n\n"))
In the first function, interactive prompts for the URL of the web page you want. The url-retrieve function then fetches that web page, puts it in a buffer created by the with-temp-buffer function, and calls the second function, edit-web-page, for further processing.
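For readers who haven't used url-retrieve before, here is a rough sketch of the prompt-and-fetch pattern. It is an illustration under my own assumptions, not the code from this post: the names my/fetch-web-page and my/show-fetched-page are hypothetical, and the callback simply displays the response buffer instead of handing it to edit-web-page.

```elisp
(require 'url)

;; Sketch only: both function names are hypothetical.
(defun my/show-fetched-page (status url)
  "Callback for `url-retrieve': display the response buffer for URL.
STATUS is the event plist that `url-retrieve' passes to its callback."
  (if (plist-get status :error)
      (message "Could not fetch %s: %S" url (plist-get status :error))
    ;; The callback runs with the response buffer current.
    (switch-to-buffer (current-buffer))
    (goto-char (point-min))))

(defun my/fetch-web-page (url)
  "Prompt for URL in the minibuffer and fetch it asynchronously."
  (interactive "sURL: ")
  ;; The third argument is a list of extra arguments for the callback.
  (url-retrieve url #'my/show-fetched-page (list url)))
```

With this loaded, M-x my/fetch-web-page asks for an address and pops up the raw response once it arrives.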
The edit-web-page function, given the text of the web page from the first function, invokes switch-to-buffer to display it, then moves point to the top of the buffer with goto-char 0. Then re-search-forward searches for any ^M (Control-M) characters and deletes them. (More precisely, each ^M is replaced by an empty string.) This function also gives a count of those "replacements" in the minibuffer.
The last four lines of code set the variable case-fold-search for a case-insensitive search, move point to the top of the buffer (again), then find the HTML <body> tag. Finding this tag leaves point at the end of that tag, which is where the HTML code displaying the URL is inserted. (Many thanks to Denny on the gnu-emacs-help mailing list for coming up with the search string for locating the HTML body tag with all its imaginable variations and combinations.)
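The citation step can be sketched in isolation like this. Note the assumptions: the function name my/insert-source-citation is hypothetical, the regular expression is a deliberately simplified stand-in (it is not Denny's search string), and wrapping the citation in a <p> element is my own guess at reasonable markup.

```elisp
;; Sketch only: name, regexp, and inserted markup are all assumptions.
(defun my/insert-source-citation (url)
  "Insert a paragraph naming URL just after the opening <body> tag.
Return non-nil if a <body> tag was found, nil otherwise."
  (let ((case-fold-search t))            ; match <BODY> as well as <body>
    (save-excursion
      (goto-char (point-min))
      ;; Simplified pattern: \"<body\", any attributes, then \">\".
      (when (re-search-forward "<body[^>]*>" nil t)
        ;; Point is now just past the tag, so insert right here.
        (insert "\n<p>From: " url "</p>\n")
        t))))
```

Because the search leaves point at the end of the match, no extra movement is needed before inserting.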
Note that the code above does not save the buffer. You have to do that yourself manually. Or you can add code to do it if you want. I left this out by design so I could examine, and even edit the buffer first and then decide where I wanted to save it. Or, if I decided I didn't want to save it at all, I could just kill it. Again, as this code is open source, you're free to add code to automatically save the fetched buffer to a file.
There are a lot of other ways this code could be altered or rewritten for other purposes. For example, when a web page is fetched, along with the actual HTML code, some output from the web server shows up at the top of the buffer. I could have written code to automatically delete these several lines of text. But I find that information interesting, so I left it in. You may prefer not to ever see it. Or, this might be precisely the information you happen to be looking for, the HTML being of no interest. A little hacking of the above could accomplish either purpose.
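If you would rather not see the server output, something along these lines might do. It is only a sketch with a made-up name, my/delete-response-headers, and it assumes the headers are separated from the HTML by the first blank line in the buffer, which is how an HTTP response is laid out.

```elisp
;; Sketch only: the function name is hypothetical.
(defun my/delete-response-headers ()
  "Delete everything up to and including the first blank line.
Return non-nil if a blank line was found, nil otherwise."
  (save-excursion
    (goto-char (point-min))
    ;; A blank line, tolerating a ^M before the newline.
    (when (re-search-forward "^\r?\n" nil t)
      (delete-region (point-min) (point))
      t)))
```

To keep only the headers instead, delete from point to (point-max) rather than from (point-min) to point.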
Many other purposes are possible, especially if we consider that the specified URL can use internet protocols other than http.
What's nice about the modular structure of the above code is that any
function which works on an emacs buffer can, with little modification,
be executed on any emission from any accessible URL. In effect, this
code is a tool loosely coupled with a working example. So it's very
much an invitation too.