
Automated Clean-up of Ragged Text ?



Hi Carl,

The jumbo .U2 archive is great, but it's also kind of like a
Paul Bunyan-sized Swiss Army Knife. There's a considerable
amount of stuff in it that will be esoteric to many, and such
an array of stuff overall that it probably has many things in
it that I would use if I knew they were there. All well and
fine, though, with no good reason to remove any of it. I know
there is an index, but I seldom take time out to browse through
it, not having much idea just what I'm looking for, if indeed I
even have something specific in mind.

For now (despite printing out the 26 pages of material that
comprise the XyWWWeb site), it can be hard to get a handle on it
all, and I've only discovered about 4% of what the .U2 program
archive has to offer. A few things like DELTAGS and the Circular
Search (which was discussed on the List) I am using with some
regularity.

So, from time to time, please excuse me if I inquire about
something I did not know the archive could do.

Now I don't suppose there happens to be any smart routine in
there that parses a document or a block of text you have come
across which is in very raggedy condition formatting-wise, cleans
it up, and re-formats it ? Let me get more specific. I'm talking
about ASCII text, not foreign codes from some "inferior" word
processor. The ReadMe files from some shareware used to be
notorious for this sort of thing. The non-HTML version of the
CompuServe newsletter still often comes to me this way.
(Fortunately, that one is almost never worth saving !) When you've
stripped all the HTML from the text of web-pages that have
links and _other stuff_ in columns on one or both sides, you
can get this. And a lot of other junk to remove, but that's
another story. Or, consider the sample I'm appending below
(actually *far* less bad than most of the examples I had in
mind), from a reference page of a website that did _not_ have
any of that stuff located adjacent to the text.

By ragged formatting, I mean that there are frequent, random gaps
of 5 to 20 spaces between words, the lines are not of any even
approximately consistent length, and carriage returns are applied
quite haphazardly. You wonder how anything could have been written,
posted, or distributed that way, but it was. (Or maybe it all
magically looks right if you load it into OUTLOOK, I dunno.)

Another example might be text going between different mail
clients, with incorrect or incompatible word-wrap settings, so
you get line lengths like:

xxxxxxxxxxxxxxxxxx
yyyy
zzzzzzzzzzz

Many years ago, I made a series of very rudimentary macros. One
removed all carriage returns from a document, except for double
CRs, which were assumed to be paragraph demarcators. Another one
successively took a 20 space gap down to 5, 19 down to 5, and so
on, then the 5s or fewer were reduced to single spaces, but two
spaces would be left after a sentence-ending period. No XPL was
employed (because I couldn't hack it back then, and very likely
couldn't do much better today), so this was crude, and of course
there were plenty of exceptions it would not adequately serve.
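
For illustration only, here is roughly what those two old macros
did, sketched in Python rather than XPL. This is not a routine from
the .U2 archive; the function names are made up for the example,
and it collapses gaps in one regular-expression pass instead of
stepping 20 down to 5, 19 down to 5, and so on. It is just as crude
as the originals about what counts as a sentence-ending period.

```python
import re

def unwrap_paragraphs(text):
    """Remove single line breaks; keep blank lines as paragraph breaks."""
    # Treat CR/LF and bare CR the same as LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A blank line (the "double CR") is assumed to separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    joined = []
    for para in paragraphs:
        # Inside a paragraph, every remaining line break becomes a space.
        lines = [line.strip() for line in para.split("\n") if line.strip()]
        if lines:
            joined.append(" ".join(lines))
    return "\n\n".join(joined)

def squeeze_spaces(text):
    """Collapse gaps to one space, then leave two after each period."""
    # Any run of spaces or tabs, however long, becomes a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Crude rule: every "period + space" is taken to end a sentence.
    return re.sub(r"\. (?=\S)", ".  ", text)

def clean(text):
    return squeeze_spaces(unwrap_paragraphs(text))
```

Feeding it something like "have never done this      *after* WARP
is\ninstalled" gives back one line with single spaces. Tables,
indented text, and abbreviations would all be mangled, which is
exactly the weakness described above.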

I'm sure it is very possible to apply a bunch of "If X exists,
then Do Y" tests / corrections within a single, comprehensive
cleanup routine. The first problem might be in arriving at a
definition of what the "standard" formatting should be. Then
it's a matter of taking into account most of the common
formatting anomalies one encounters, and some common exceptions to
them, such as abbreviations not being mistaken for end-of-sentence,
colons followed by deliberate blank lines, tables or indented or
outlined text, among others.
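
As a rough sketch of one such "If X exists, then Do Y" test, the
Python below handles only the abbreviation exception: it widens the
space after a period to two spaces unless the word before the
period is on a list of known abbreviations. The list and the
function name are hypothetical, and it assumes the text has already
been reduced to single spaces. Colons, tables, and indented or
outlined text are not addressed here.

```python
import re

# Hypothetical starter list, stored without the final period.
ABBREVIATIONS = {"mr", "mrs", "dr", "e.g", "i.e", "etc", "vs"}

def space_sentences(paragraph, abbreviations=ABBREVIATIONS):
    """Put two spaces after a sentence-ending period, skipping abbreviations."""
    def fix(match):
        word = match.group(1)
        # Ignore an opening bracket or quote glued to the word.
        bare = word.lstrip("([\"'").lower()
        if bare in abbreviations:
            return match.group(0)   # leave "Dr. Smith" alone
        return word + ".  "         # treat as end of sentence
    # A word, a period, one space, and then an uppercase letter.
    return re.sub(r"(\S+)\. (?=[A-Z])", fix, paragraph)
```

An abbreviation that really does end a sentence ("... and so on,
etc. Then ...") is left with a single space, so a test like this
trades one kind of mistake for a rarer one.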

Actually, that last part may be too ambitious. Some things just
need plenty of reformatting, if it is important enough and one
intends to make further use of it. But I am still interested in
significantly cutting down the amount of processing involved,
particularly on the routine and repetitive problems. Hopefully
something smarter, better, and more versatile than those ancient
macros.


Jordan Fox

-----------------------------------------------------------------------------------------------------------------------------------------------

[Brief Sample Excerpt]

  WARP already installed:
  You still need to do the config.sys modification, however, I
have never done this            *after* WARP is
installed, so I don't know if that is enough to get WARP
to            access the sound
card/CD.           2. A: A number of suggestions were
made   to change settings for the 2940UW card. None
            of the   changes made any
difference. I just tried again, only this   time I inserted
            media into  both the ZIP and  JAZ
drives. The CDROM is   now properly recognized.

------------------------------------------------------------------------------------------------------------------------------------------------

{If you have no right margin set, you should be seeing the above
block as it displays to me in XY. Your browser or mail program may
do otherwise.}