[Date Prev][Date Next][Subject Prev][Subject Next][ Date Index][ Subject Index]

Re: Possible activity heading towards XyWrite replacement



Harry Binswanger wrote:
>Wally Bass wrote
>>(The functionality of being able to quickly insert or delete
>>bytes in the *middle* of what is essentially a million byte
>>string (for a 1MB file), and provide the *appearance* of
>>continued continuity of the resulting "string" for search and
>>other purposes, but doing so without having to move 500000
>>bytes each insert/delete to "make room" (for insert) or "close
>>ranks" (for delete), is a pretty significant piece of
>>functionality.)

>How is it done?

Well, of course, that's the $64000 question.

To my knowledge, there have been two main strategies for an editor to
buffer file data in memory while editing it. One is the "link list"
strategy, and the other is the "buffer-gap" strategy.

Linked list buffering was vary popular in the days of mainframes, where
files very often consisted fixed length 80 byte "lines" (as per punch
cards). With link list buffering, the file is split in to "lines" when it
is read. With "fixed length file formats" on mainframes, that simply meant
taking 80 bytes at a time from the file, and for variable length file
formats on mainframes, each line started with a two or four byte prefix
word which gives the length of the line in binary (and which also allowed
you to find the start of the "next" line, to find the next length field).
(In neither case were there actually any CR or LF characters in the file
to indicate line endings, so if one were to use a "buffer gap" strategy on
the mainframe, CRs and/or LFs would have to be added to the data as it was
read in in that environment -- this, of course, made linked list buffering
much more of a "natural" choice on mainframes.)

With Unix/DOS files, though, separating the file into "lines" means
scanning the data for linefeed chars or CR/LF pairs -- the CR/LFs are then
discarded (e.g., not included in each line buffer) but the CR/LFs are
reinserted between lines when the file is written.

With linked list buffering, a separate buffer is allocated for every line,
and the buffers are linked together in memory in a "doubly linked list,"
with forward and backward "pointers" to the buffers for the next and
previous lines. Moving a line to a different place in a file under such a
scheme is merely a matter of reconnecting pointers for the buffer and for
the adjacent buffers. If a line gets a change that makes it longer than
it's buffer, one allocates a longer buffer, copyies the new longer line
into it, "hooks it" into the chain in place of the old buffer, and frees
the old buffer. But no single change, including an insertion, involves the
movement of very much data.

IBM's XEDIT mainframe editor is an example of such an editor. The PC
editor KEDIT (Mansfield software, www.kedit.com) is largely a clone of
XEDIT, and is a linked list editor also. So is the open source XEDIT clone
called "THE". Visual Slick Edit (an editor that usually scores very high
on "programmer's editors" comparisons) is also a linked list type editor.

XyWrite, epsilon, and emacs probably use the "buffer gap" approach. That
approach says you use a large buffer for the entire file, with line
boundaries (paragraph boundaries in XyWrite's case) just being represented
by CR/LF pairs (or linefeeds alone) just as they were in the original
file. But you place a "gap" of extra, unused, bytes somewhere in the
string representing the file -- maybe 1000 bytes (or perhaps 100000, with
the large amounts of memory available in today's machines) or so in
length. Normally, you put that gap is at the point in the file where the
cursor is at the moment. Of course, you always keep track of exactly where
the gap starts and ends. If the cursor moves one byte forward in the
buffer, you take the first byte after the gap and move it end of the data
proceeding the gap, so that the gap moves with the cursor. But you only
have to move one byte of data to move the gap forward one byte.

This strategy means you will always have some space (the gap) to insert
data at the point where the cursor is. Or to delete. Or change, even if
the change involves a net insertion or net deletion of characters. On
insert, the gap shrinks, on deletion, the gap grows. But you don't worry
about that until the gap has shrunk to nearly zero (lots of insertions) or
grown to, say, 2000 instead of 1000 (lots of deletions), at which time you
probably do the big 500000 byte (or whatever, depending on file size) size
move as required to make the gap bigger or smaller.

The "gap," of course, is not internally transparent to, say, routine which
searches the buffer, and the low level search code becomes more complex
for having to dealing with the buffer gap. Same for the code that displays
the buffer on-screen. But, it is possible to specify interfaces by which
most "higher level" routines which access a buffer can access and add or
remove data from buffers, using calls whereby the existence of any such
gap is completely hidden by the way the interface is specified and
implemented. This then means that the higher level code using the
functionality of the buffers can be simpler, and since there is a lot of
such code, that makes the overall program simpler.

You can look at Epsilon's abstractions as to how to interface with
buffers, as the Epsilon "virtual machine" provides them to higher level
EEL code. It's in the epsilon manual page at

www.lugaru.com/man/Buffer.Primitives.html

Take a look. It just takes a minute to get the idea.

But I'm sure there are a thousand variations on the buffer gap
implementation scheme. For example, I could have a rule that the gap is
always placed at the end of the current paragraph -- in this case, I might
have to move a paragraph's worth of data to do an insert, but I don't have
to move 500000 bytes.

Of course, none of this matters much at all if you're simply entering data
at the keyboard -- moving 500000 bytes per keystroke to insert typed
input, when necessary, is probably no problem on today's machines. But
when you do change commands that touch every line in the file, and change
each line's length (as was the case with the XY IV "wildcard search and
replace" example that you gave a post or two ago), it starts making an
enormous difference. I don't know about the current Windows Notepad
program, but the NT 4 Notepad program was an example of a program using a
large, linear buffer, but with no gap. If you do a changeall in that
version of Notepad that changes the length of every line in say, a 5MB
file, you might as well go and fix dinner and watch a movie while you wait
for that single change request to complete, because it will take about
that long. Really. Twenty minutes to execute a single "change all" command
was common with that version of Notepad.

When an editor is a "link list" editor, by the way, that fact is usually
reflected in several ways at the user interface. Most such editors, for
example, have mechanisms to elide groups of lines in a file from the
display, and, say, display only those lines in a file which contain the
string "Mrs. Murphy" (or whatever). That that kind of functionality can be
*very* handy. (In Kedit, for example, the "all" command does such a
selective display based on a search string, and the "more" and "less"
commands can subsequently be used to add or subtract lines from the
current set, also based on search arguments, or on notions like "also
include the next two lines following each currently displayed line." This
gives you a setup where you can quickly scroll through all (but just) the
lines of a file that are relevant to some issue that concerns you at the
moment.)

There are some folks who will tell you that the "all" functionality alone
is worth more than whatever it is that a program like XyWrite might do,
and they won't consider an editor that doesn't have "all"-like
functionality. All of this happens in link-list editors because it is so
much easier to "skip over lines" in a linked list structure (which merely
involves following links) than it is to skip forward over lines in a
linear buffer (which involves searching). And also because the existence
of separate buffers and their related control blocks for each line gives
one a handy place to store flags as to which lines are to currently to be
displayed and which are to be skipped. I keep a copy of Kedit around
mostly because of the "all" functionality, but I only use it for short
periods from time to time, when I need that specific functionality.

Also, another big functional difference between "link list" editors and
other ones using linear buffers with a buffer gap is that, in the latter,
CR/LF (or LF) are just character(s), which can be used in search strings,
deleted to combine lines, added to split lines, and so on, using
search/replace or other mechanisms. You usually can't do that so easily in
editors which use linked list buffers. In fact, the first thing that I
usually do in looking at an editor is to look for that (CR/LF editablity)
functionality, as an indication as to whether the editor uses link lists
vs. a linear buffer. I find the capability to edit CR/LF's to be very
useful, and wouldn't consider (for my primary editor) one where I can't do
that.

Anyway, all of this is actually a topic which interests me a fair amount.
If anyone knows of any other buffering strategies being used in editors,
I'd like to hear what they are (and a URL would also be great).

Wally Bass