Re: Cleaning up html





Good idea, Richard.
I tried it, and it does not work, at least on a Detroit News file. I don't have the expertise to figure out how they do it, but they really make it a pain to get at the text.

I'm sending you privately the file that did not work.

Jay



At 10:23 PM 6/4/02 -0500, you wrote:
Wouldn't it be easier to bring up a browser in local mode, open the
HTML file, then use Save As in text (txt) format? This should give ONLY the
text of the contents and none of the HTML tags.

Jay McNally wrote:
>
> Can anyone offer me some advice for this problem?
>
> I often need to take text from a web document that has been saved in html.
>
> My somewhat tedious but simple process for some years is to simply loop an
> xpl routine that defines then deletes everything from the first "less
> than" bracket to the next "greater than" bracket. I then manually clean up
> the rest of the junk. It works.
>
> Can I run a CHange command with wildcards that would erase the whole string
> between the brackets, such as the following junk?
>
> 
> 
> 
> 
> 
> 
> 
>
> I'm thinking it would be nice to have one command that would clean out
> everything between the brackets, then another command deleting the
> brackets. I tinkered with it briefly yesterday and got nowhere.
>
> Is there a simpler way around this problem?
>
> Thanks
>
> Jay
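
For illustration only, here is a minimal sketch of the "delete everything from the less-than bracket to the next greater-than bracket" idea described above. It is written in Python rather than XPL, and it is not the routine Jay uses. The function name strip_tags and the sample string are made up for this example. It assumes the tags are well formed; text inside script or style elements, and comments that themselves contain a ">", would still need manual cleanup.

```python
import re
from html import unescape

# Matches one tag: a "<", then any characters that are not ">", then ">".
# The negated class also matches newlines, so tags split across lines are caught.
TAG = re.compile(r"<[^>]*>")

def strip_tags(html_text: str) -> str:
    """Delete every <...> span, then decode entities such as &amp;."""
    without_tags = TAG.sub("", html_text)
    return unescape(without_tags)

# Hypothetical sample input, not taken from the file discussed in this thread.
sample = '<p class="x">Hello, <b>world</b> &amp; friends</p>'
print(strip_tags(sample))
```

This removes the brackets and their contents in a single pass, so no second command is needed to delete the brackets themselves.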