
Re: Duplicates Pattern Search



Reply to note from "J. R. Fox"  Wed, 19 Dec 2001
13:38:18 -0800

> Based on a couple of trial runs, I find this approach awfully
> slow. (Contrast to the way SPELL works on a named but Unopen
> file: a Spell.Tmp file is generated before you know it.)
> Secondly, there are no brakes on this thing, ... And if I go
> down past the point where it stops, to initiate a new run
> (which you suggested), the routine returns to TOF and resumes
> ...
> Given my observations above, it feels like I would be better
> off with a generated list ala SPELL.TMP, and then removing the
> duplications manually.

Well, fair enough, you did ask for a program that generates a list.
It just seemed to me that a list is not much use if the object is to
pare duplicate URLs, because you still have to go back and find them
in the subject file, and SEarching for URLs is laborious. But it
would be easy enough to write a routine that compiles a list of
URLs, grouping duplicates together, and that also provides character
positions for each URL which could be JuMPed to in the subject file.
OK, I'll do that -- in fact, I've already done it (see below).
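
In outline, the logic is simple. Here's a rough sketch of it in
Python -- purely illustrative, since the real routine, in XPL, is
below; the URL pattern and the grouping are my approximations, not
what the frame literally does:

import re
from collections import defaultdict

URL_RX = re.compile(r'https?://[^\s"<>]+')   # approximate URL pattern

def list_urls(text):
    """Return {url: [character positions]}, duplicates grouped together."""
    groups = defaultdict(list)
    for m in URL_RX.finditer(text):
        # Case-folded key, so variant capitalizations group together
        groups[m.group().lower()].append(m.start())
    return groups

# Each stored position is what a jump (a la the XMACRO statement)
# would use to land on that occurrence in the subject file.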

Of course, the SPELL command is lightning fast, but it won't work
here because the dots, slashes, etc. in URLs will cause them to be
treated as multiple words. So bits and pieces of individual URLs
will end up all over the list. But try the frame URLS below. The
command, with the subject file in the current window, is
URLS. When the list is done, you can move the cursor to
any URL in the list and execute the XMACRO
statement on the CMline. That will JuMP you to that particular
instance of the URL in the subject file. You can then delete the
URL, edit it (or the surrounding text), or do nothing at all -- your
choice.
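
If it isn't obvious why word-splitting wrecks a URL, this little
Python comparison shows the effect (the first pattern is just a
stand-in for SPELL's word rules, not the real thing):

import re

s = "See http://users.datarealm.com/xywwweb/ for the frames."

print(re.findall(r"[A-Za-z0-9]+", s))
# ['See', 'http', 'users', 'datarealm', 'com', 'xywwweb', 'for',
#  'the', 'frames']

print(re.findall(r"https?://\S+", s))
# ['http://users.datarealm.com/xywwweb/']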

I've also rewritten the original frame, FDU, to give it some
"brakes"; to allow (via a cursor-position toggle) a visual
comparison between the original occurrence of the URL and the
duplicate; and to permit the program, at the user's request, to
ignore a duplicate URL and just continue scanning the file. This
probably still won't be your cup of tea, but I think it's better
than before.
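
Schematically, the new FDU loop goes like this (a Python sketch of
the control flow only; the prompt and key handling are stand-ins for
the XPL's PR/RK calls, and the real frame rescans from TOF after you
delete a duplicate):

def fdu(urls_with_pos):
    """urls_with_pos: [(char_position, url), ...] in file order."""
    seen = {}                          # case-folded URL -> original position
    for pos, url in urls_with_pos:
        key = url.lower()
        if key not in seen:
            seen[key] = pos
            continue
        while True:                    # the "brakes": stop and ask
            c = input(f"Duplicate at {pos} (original at {seen[key]}): "
                      "1=view other, 2=continue, Esc=quit? ")
            if c == "1":               # toggle the visual comparison
                print("...jump between", seen[key], "and", pos)
            elif c == "2":             # ignore this duplicate, keep going
                break
            else:
                print("Delete the duplicate manually and run FDU again.")
                return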

> I am also wondering about near-matches and variations. Most --
> but not all -- of these are apt to occur in the URL _after_
> ".com/" ".net/" or whatever.

The new URLS frame addresses that issue to some extent (it compares
URLs case-insensitively, for one thing); there's also a rough idea
sketched after the routines. But we can talk about that. Let's get
the basic procedure down first. Here are the routines; issue DECODE
to decode them.

XPLeNCODE v2.0
b-gin [UNTITLED]
{{;5URLs}} List all URLs (duplicates together) [CLD 12/19/01]
[cr|lf]{2}[XH_]{<}IF{<}VA$WS{>}<>1{>}{<}PRNo file{>}{<}EX{>}{
<}EI{>}{<}SX01,{<}VA$WA{>}{>}{<}IF{<}PV01{>}<1{>}{<}PRNo wind
ow{>}{<}EX{>}{<}EI{>}[BX_]es 1[Q2_][TF_]{<}SV02,!{>}{<}SU03,{
<}SX04,{<}VA$DS{>}{>}{<}SX04,{<}IS04{>}+"{tab}"{>}{<}SU05,[AS
_][255+48+68][AS_][JM_]2.ReJuMP[Q2_]{>}[AS_]{<}GT04{>}[CP_][2
55+48+68][AS_]{>};*;[cr|lf]{<}LBa{>}[YD_][BX_][Q2_][JM_]2.Fin
dNextURL[Q2_][DO_][DX_];*;[cr|lf]{<}IF{<}VA$ER{>}==10{>}[TF_]
[YD_]{<}IF{<}VA|02{>}<1{>}[AS_][BC_]xmacro DXNPPPYDDFBX(se/f{32}
 {tab})DF[sv01],YDASBX(jmp [pv01])[TF_][LD_][LD_][LD_]{<}PRPu
t cursor on URL and hit  to go to URL{>}{<}EX{>}{<}E
I{>}{<}PRNo URLs{>}{<}EX{>}{<}EI{>}{<}IF{<}VA|02{>}>0{>}{<}SX
02,"{<}SZ12PT{>}{<}UFSTANDARD{>}{<}OF{>}{<}TP{>}{<}BT{>}{<}FD
256LI{>}{<}PL256LI,256LI,256LI{>}{<}HY0{>}{<}NJ{>}{<}RT0{>}{<
}LM{>}{<}PW8IN{>}{<}IP0,12DI{>}{<}TS11DI{>}URLs in "+{<}VA$FP
{>}+"[cr|lf]CharPos{tab}URL[cr|lf]======={tab}===[cr|lf]"{>}[
BX_]func #{<}PV01{>}[Q2_][BX_]ne/{<}IF{<}VA$VE{>}<"V4.1"{>}9{
<}GLb{>}{<}EI{>}25{<}LBb{>}[Q2_]{<}GT02{>}[AS_]{<}SV02,{>}{<}
EI{>};*;[cr|lf]{<}SX50,{<}CP{>}{>}{<}SX06,{<}VA$DS{>}{>}{<}IF
{<}VA|02{>}<1{>}[AS_][TF_]{<}SV01,{>}[BX_]se [wC]{<}PV06{>}{
tab}[Q2_]{<}IF{<}ER{>}{>}{<}SV01,!{>}{<}EI{>}[BF_][AS_]{<}IF{
<}VA|01{>}<1{>}{<}GLa{>}{<}EI{>}{<}EI{>}{<}SV01{>}{<}SX06,"Wo
rking on "+{<}IS01{>}{>}{<}PR@06{>}{<}GT03{>}{<}IF{<}VA|01{>}
<81{>}{<}LBc{>}[BX_]se/f [999]{<}PV01{>}[999][Q2_]{<}IF{<}ER{
>}{>}{<}GT05{>}{<}GLa{>}{<}EI{>}[JM_]2.FindNextURL[Q2_]{<}GT0
3{>}{<}GLc{>}{<}EI{>}{<}LBd{>}[JM_]2.FindNextURL[Q2_]{<}IF{<}
VA$ER{>}==10{>}{<}GT05{>}{<}GLa{>}{<}EI{>}{<}SV06{>}{<}IF@upr
({<}IS01{>})==@upr({<}IS
06{>}){>}{<}GT03{>}{<}EI{>}{<}GLd{>}{2}[cr|lf][cr|lf]{{;5fdu}
} Find Duplicate URLs [CLD 12/18/01][cr|lf]{2};*;   If dupl
icate is found:[cr|lf];*;    Press "1" to toggle between{32}
original URL and duplicate[cr|lf];*;    Press "2" to cont
inue scanning file[cr|lf];*;    Press Escape to quit[cr|l
f];*;   FDU does not delete duplicate URLs[cr|lf];*;   {32}
Delete manually and run FDU again[cr|lf];*;[cr|lf][XH_]{<}IF{
<}VA$WS{>}<>1{>}{<}PRNo file{>}{<}EX{>}{<}EI{>}[BX_]es 1[Q2_]
[TF_];*;[cr|lf]{<}LBa{>}{<}SX01,0{>}[YD_][BX_][Q2_][JM_]2.Fin
dNextURL[Q2_];*;[cr|lf]{<}IF{<}VA$ER{>}==10{>}[TF_][YD_]{<}PR
Done{>}{<}EX{>}{<}EI{>}[DO_][DX_]{<}SX50,{<}CP{>}{>}{<}SV02{>
}[YD_];*;[cr|lf]{<}LBb{>}{<}IF@siz({<}IS01{>})<2!({<}IS01{>}+
"A"){240}"0A"{>}{<}SX03,"Searching for duplicate - Testing "+
{<}IS01{>}{>}{<}PR@03{>}{<}EI{>};*;[cr|lf][JM_]2.FindNextURL[
Q2_]{<}IF{<}VA$ER{>}==10{>}[JM_]2.ReJuMP[Q2_]{<}GLa{>}{<}EI{>
}{<}SX01,{<}PV01{>}+1{>}{<}SV03{>}{<}IF@upr({<}IS03{>})==@upr
({<}IS02{>}){>}{<}SV02,dupe{>}{<}SV03,{>};*;[cr|lf]{<}LBc{>}{
<}IF{<}IS02{>}{240}"dupe"{>}{<}SV03,Duplicate{>}{<}SV02,orig{
>}{<}GLd{>}{<}EI{>}{<}SV03,Original{>}{<}SV02,dupe{>}{<}LBd{>
}{<}SX04,{<}CP{>}{>}{<}SX04,{<}IS03{>}+" URL (pos="+{<}IS04{>
}+"): 1=View "+{<}IS02{>}+" 2=Continue [Esc quits]"{>}{<}LBe{
>}{<}PR|@04{>}[DO_][DX_]{<}SX04,{<}RK{>}{238}"12"{>}{<}IF{<}V
A$KC{>}<2{>}[DE_]{<}PRDelete duplicate URL & run FDU again{>}
{<}EX{>}{<}EI{>}{<}IF{<}PV04{>}<0{>}{<}GLe{>}{<}EI{>}{<}IF{<}
PV04{>}<1{>}{<}IF{<}VA$IN{>}>0{>}[JM_]2.ReJuMP[Q2_]{<}GLc{>}{
<}EI{>}[DE_]{<}GLc{>}{<}EI{>}[DE_][DO_][DX_]{<}EI{>}{<}GLb{>}
{2}[cr|lf][cr|lf]
-nd
XPLeNCODE
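
P.S. On the near-match question: one rough approach -- again a
Python sketch, and only an idea; the URLS frame doesn't do this --
is to group URLs by scheme and host, so that variations after
".com/", ".net/" and the like land next to each other for
eyeballing:

from urllib.parse import urlsplit
from collections import defaultdict

def near_matches(urls):
    """Group URLs sharing scheme+host; differing paths cluster together."""
    groups = defaultdict(list)
    for u in urls:
        p = urlsplit(u.lower())
        groups[(p.scheme, p.netloc)].append(u)
    # Keep only hosts that turn up more than once
    return {k: v for k, v in groups.items() if len(v) > 1}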

--
Carl Distefano
cld@xxxxxxxx
http://users.datarealm.com/xywwweb/