Re: DOSEMU/DOSBOX question



At 04:49 PM 1/20/08 -0500, Paul Lagasse wrote:
As a test, I just created ...
Paul, thank you very much for conducting and documenting those experiments.
As it turns out, I was also discussing the DOSEMU question in another forum
(this time, brought up by folks interested in using WordPerfect 5.1 or 6.0
for DOS under DOSEMU), and that discussion got me to the point that I
downloaded the source for DOSEMU from sourceforge.net. I spent some time
with the source code. Your test case results were just the right ingredient
to, I think, bring the whole picture into some amount of focus for me, and
verify what I thought I was figuring out from the source.
For the benefit of anyone who's interested, and also to organize my own
thoughts, I will try to write down what I think I know.
Firstly, the topic is DOSEMU under Linux. It has been discussed before, but
mostly from the point of view of whether people could get XY (or whatever)
to run in that environment. I'm simply assuming that running XY will be
possible, and the total focus here is, given that one runs XY under DOSEMU,
exactly what files will one be able to access from XY.
On my primary data partition, I have about 25,000 files, about 9/10 of
which are ascii text, and when I migrate that partition to Linux, I would
like to be able to access any of those 25,000 files with XY, just as I do
now. I want to anticipate what problems I might have.

So what I (think I) know is:
1. First, the bad news. The file access code in DOSEMU seems to be of very low development priority, and hasn't been updated noticeably since about 1997, when it was first written, largely in Russia. So, if there is a showstopper now, it may be there for a long time.
2. When trying to find stuff on the web about the 8.3 problem, the word
"mangling" is very useful. The general process of generating an 8.3 name,
that is to be usable by DOS programs in lieu of a "long" filename, is
referred to by the Linux community as "filename mangling." (Seems colorful
enough.) Filename mangling is actually better understood in the Linux
community, or more specifically, the Samba community (Samba is the name for
the open source community's MS compatible file server code), than I would
have thought, because Samba has been serving up files to MSDOS in 8.3 mode
for a long time now. DOSEMU and Samba were, I believe, very similar in
their filename mangling in the beginning, but Samba is a *much* bigger deal
to the open source crowd than is DOSEMU and has evolved quite a bit. DOSEMU
has very much fallen behind.
3. Microsoft systems do filename mangling also (don't know that they use
that name, though), and have been doing it since they started doing long
filenames. They actually do better than Linux, because supporting DOS
access was a requirement for them since day 1. In particular, in Windows,
the NTFS and VFAT filesystems understand mangled names, and retain in each
filesystem subdirectory entry, both the long and the corresponding mangled
filename, for every file which has a long filename. Linux file systems
don't do that, in general.
4. The basic algorithms for producing mangled filenames, for both MS and
Linux, don't necessarily produce filenames that are unique. But after
producing a mangled filename, MS Windows systems check that the name is
unique, amongst the filenames in the same directory, and if the name is not
unique, the Windows system code twiddles with the name until it becomes
unique. Since it can store the mangled name in the directory entry, it now
stores the unique mangled name, which persists in that directory un
Not so with DOSEMU. For long filenames, there is a possibility of
non-uniqueness of corresponding mangled filenames under DOSEMU, and when
that occurs, only the first (in directory search order) of the files with
matching mangled filenames will be accessible under DOSEMU.
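
To put a rough number on that possibility, here is a back-of-the-envelope sketch in Python. It assumes the two-character suffix described in item 6a below is spread evenly over its 1296 possible values, which is an idealization and has not been checked against the real DOSEMU hash. It also only applies to files in one directory whose mangled names already agree in every other respect (same leading characters and same extension); `collision_probability` is a name made up for this illustration.

```python
def collision_probability(n, buckets=1296):
    """Chance that at least two of n names get the same suffix.

    Assumes every suffix value is equally likely (an idealization,
    not a property verified for the DOSEMU hash).
    """
    if n > buckets:
        return 1.0  # more names than suffixes: a clash is certain
    p_all_different = 1.0
    for i in range(n):
        # The i-th name must avoid the i suffixes already taken.
        p_all_different *= (buckets - i) / buckets
    return 1.0 - p_all_different


if __name__ == "__main__":
    for n in (2, 10, 25, 50, 100):
        print(n, round(collision_probability(n), 4))
```

Under that assumption, two such files clash with probability 1/1296, and the odds reach roughly even somewhere in the low forties of files sharing the same leading characters and extension.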
5. For Microsoft, the mangling algorithm is usually simple, and very
predictable. Scan backwards from the right, for a period, to get a name
part and an extension part. Delete all illegal DOS filename chars from both
parts (including additional periods, if any, in the name part), fold to
upper case, truncate the name at six chars, and the extension at three
chars. Add "~1" to the end of the name part. And we're done, usually.
But then, MS scans the subdirectory to make sure that this mangled name is
unique. If not, replace "~1" with "~2" and try again. If that fails more
than about 9 times, they do some additional mangling which involves lopping
a couple more characters off of the end of the original name, and is much
harder to predict. But that only occurs when there are a lot of files with
the same 6 first characters, and that can usually be avoided.
But the point is, unless the name was bumped in a search for uniqueness, an
MS mangled name is very easy to construct, by a user at the keyboard, by
just looking at the long filename and doing the obvious manipulations. That
can be very helpful.
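
The steps in item 5 can be sketched in a few lines of Python. This is only an illustration of the procedure as described above, not Microsoft's actual code: the set of "illegal" characters is an assumed one, the function name `windows_style_short_name` is invented here, and the extra mangling that kicks in after repeated collisions is deliberately left out because its details are not given. It is meant for names that are not already valid 8.3 names.

```python
# Characters treated as illegal in a DOS 8.3 name.
# This exact set is an assumption made for the illustration.
ILLEGAL = set(' ."*+,/:;<=>?[\\]|')


def clean(part):
    """Drop illegal characters (including stray periods) and fold to upper case."""
    return "".join(c for c in part if c not in ILLEGAL).upper()


def windows_style_short_name(long_name, existing=()):
    """Build an 8.3 name following the steps described in item 5."""
    # Scan from the right for a period to split name and extension.
    name, dot, ext = long_name.rpartition(".")
    if not dot:  # no period at all: the whole thing is the name part
        name, ext = long_name, ""
    base = clean(name)[:6]  # truncate the name part at six characters
    ext = clean(ext)[:3]    # and the extension at three
    taken = {n.upper() for n in existing}
    # Try ~1, then ~2, ... until the result is unique in the directory.
    for n in range(1, 10):
        candidate = f"{base}~{n}" + (f".{ext}" if ext else "")
        if candidate not in taken:
            return candidate
    raise ValueError("too many collisions; the further mangling step is not sketched here")


if __name__ == "__main__":
    print(windows_style_short_name("afilename.tar.zip"))  # AFILEN~1.ZIP
    print(windows_style_short_name("My Long Document.text", ["MYLONG~1.TEX"]))
```

The `existing` argument stands in for the scan of the subdirectory: it is the list of short names already present, which is what makes the "~1" to "~2" bump possible.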

6. The basic DOSEMU mangling algorithm is similar. Differences are:
a) DOSEMU keeps 5 chars from the name, rather than six, and adds something
like "~XX" to that. The "XX" is computed by hashing the whole original long
filename and taking the result modulo 1296, so it is predictable and
repeatable, but unlike the "~1" that Windows adds, it is not a suffix that
a user can typically work out by simple inspection. There are 1296 (36
squared -- 26 letters and 10 digits) possibilities for the value of XX, and
DOSEMU counts on that fact for uniqueness, rather than checking for
uniqueness and trying a different suffix if the first try wasn't unique.
Name duplication is unlikely, but possible.
b) with DOSEMU, scanning for a dot is from the left, not from the right, so
whereas "afilename.tar.zip" would become something like "AFILEN~1.ZIP"
under Windows (and therefore retain its "file type"), it would become
something like "AFILE~XX.TAR" under DOSEMU. This can result in it being
opened by the wrong app, so Windows does better on this one.
c) DOSEMU adds an extension of "___" (three underscores) if the original
filename had no extension.
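
For comparison, the three differences in item 6 can be sketched the same way. Note the loud caveat: `placeholder_hash` below is NOT the DOSEMU hash. It is a stand-in shift-and-add hash chosen only so that the example runs, so the "XX" it produces will not match what DOSEMU generates. The illegal-character set and the name `dosemu_style_short_name` are likewise assumptions made for this illustration.

```python
import string

BASE36 = string.digits + string.ascii_uppercase  # 10 digits + 26 letters
ILLEGAL = set(' ."*+,/:;<=>?[\\]|')              # assumed set, for illustration


def placeholder_hash(name):
    """Stand-in hash, NOT the one DOSEMU uses."""
    h = 0
    for byte in name.encode("utf-8"):
        h = ((h << 5) + h + byte) & 0xFFFFFFFF
    return h


def clean(part):
    """Drop illegal characters and fold to upper case."""
    return "".join(c for c in part if c not in ILLEGAL).upper()


def dosemu_style_short_name(long_name):
    """Build a "NNNNN~XX.EEE" name following the description in item 6."""
    # b) split at the FIRST period, scanning from the left
    name, _dot, ext = long_name.partition(".")
    base = clean(name)[:5]          # a) keep five characters, not six
    ext = clean(ext)[:3] or "___"   # c) three underscores if no extension
    n = placeholder_hash(long_name) % 1296      # 1296 = 36 * 36
    suffix = BASE36[n // 36] + BASE36[n % 36]   # two base-36 characters
    return f"{base}~{suffix}.{ext}"


if __name__ == "__main__":
    print(dosemu_style_short_name("afilename.tar.zip"))
    print(dosemu_style_short_name("README_for_everyone"))
```

There is no `existing` argument here, which mirrors the point made above: the suffix depends only on the long name itself, so nothing in the procedure can notice or repair a duplicate.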
7. The need to store the mangled filenames, once computed, in the directory
entries or somewhere else that is equally accessible, seems to be
recognized by the Samba folks. The Samba 3.0 announcement says that Samba
3.0 has a "new filename mangling system. The filename mangling system has
been completely rewritten. An internal database now stores mangling maps
persistently." Persistent storage of the mangled names would be a
prerequisite to checking for duplicates, and picking a new mangled name
when duplicates occur, as Microsoft does, so I'm guessing that Samba 3.0
does that. But it doesn't look like that is going to make it into DOSEMU
anytime soon.
There is also talk in the Samba community about, when the file being
accessed is actually on a FAT or NTFS partition (but mounted on Linux, and
accessed through Linux by mounting a Linux directory on DOSEMU), using the
mangled name that is actually recorded in the FAT or NTFS directory entry,
rather than having Samba create its own, different, mangled name. This
would seem to be good, but I don't know where it stands.
8. Interestingly, under DOSEMU, it may be that 8.3 names, and not long
filenames, cause the most trouble for folks who want to access everything
from their DOS apps. Best I can tell, 8.3 filenames are not mangled, ever.
Samba has a convention of folding 8.3 names from old DOS apps to lower case
before using them to access the filesystem, and I think that DOSEMU does
the same. (By "old DOS apps" I mean apps using the original 8.3 file system
calls. There are some newer calls that a DOS app can use to access files
with both case specificity and longer names, but you won't find them in use
in the older DOS apps. In Samba, this folding is an installation
configuration option, but I don't think that it is in DOSEMU.) So, the
bottom line, as I
understand it, is that any file you create from DOS will therefore be 8.3,
but lower case, as it actually is created within Linux. Similarly, any
Linux file that is 8.3, but other than all lower case, will be inaccessible
to old DOS apps, period. However, I don't have a high confidence level that
I really know all the details of 8.3 handling, and would suggest some kind
of independent verification before counting on what I have said here.
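
If the lower-case folding works the way item 8 guesses (and that guess is explicitly unverified), its effect can be modeled with a toy lookup. Everything here is hypothetical: `dos_lookup` and `names_at_risk` are invented names, and the directory is just a Python set standing in for a case-sensitive Linux directory.

```python
def dos_lookup(requested, directory):
    """Model an old DOS app opening an 8.3 name.

    The request is folded to lower case and then matched exactly,
    as a case-sensitive Linux filesystem would. Returns the Linux
    name that is reached, or None if nothing matches.
    """
    folded = requested.lower()
    return folded if folded in directory else None


def names_at_risk(directory):
    """Names that a lower-cased request can never match."""
    return sorted(name for name in directory if name != name.lower())


if __name__ == "__main__":
    linux_dir = {"report.txt", "Notes.TXT", "README"}
    print(dos_lookup("REPORT.TXT", linux_dir))  # reaches report.txt
    print(dos_lookup("NOTES.TXT", linux_dir))   # None: only "Notes.TXT" exists
    print(names_at_risk(linux_dir))
```

Something like `names_at_risk` run over a real directory tree would give a list of files to rename before migrating, if the folding behaviour is confirmed.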
9. DOSEMU code looks like it is going to be very slow, in searching
directories, and hence in opening files, when there are a lot of long
filenames in the directory. To search a directory, each name from the
directory is obtained, mangled, and compared with the search argument. So,
to search a directory with 1000 files in it, there are 1000 calls to the
mangle routine. The mangle routine doesn't appear to be all that fast -- in
just the hashing part of the algorithm to generate the "XX" part of the
suffix, there are a number of shifts and adds and what not for each
character in the long name, and that is a lot of CPU cycles to add to an
inner search loop. But with a Linux file system, there is no permanent copy
of the mangled name, so there may be few alternatives.
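
The search pattern described in item 9 looks roughly like the following Python sketch. It is a model of the behaviour as described, not the DOSEMU code: `find_by_short_name` is an invented name, and `toy_mangle` is a trivial stand-in for the real mangle routine, used only so the example runs.

```python
def find_by_short_name(short_name, directory_listing, mangle):
    """Linear search: mangle every entry until one matches.

    Returns (long_name, number_of_mangle_calls). The first match in
    directory order wins, so a later entry that mangles to the same
    short name is never reached.
    """
    calls = 0
    for long_name in directory_listing:
        calls += 1
        if mangle(long_name) == short_name:
            return long_name, calls
    return None, calls


def toy_mangle(name):
    """Stand-in only: upper-case and keep the first eight characters."""
    return name.upper()[:8]


if __name__ == "__main__":
    listing = [f"file{i:04d}_notes" for i in range(1000)]
    # Worst case: the wanted file is the last entry in the directory.
    print(find_by_short_name("FILE0999", listing, toy_mangle))
    # Two long names that mangle alike: only the first is reachable.
    print(find_by_short_name("LONGNAME", ["longname_a.txt", "longname_b.txt"], toy_mangle))
```

The call count makes the cost visible: opening the last file in a 1000-entry directory means 1000 trips through the mangle routine, and a failed lookup always costs a full pass.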
Bottom line: if you think you will be moving to Linux (which I will, since
I won't buy a product with the tether called activation, and W2K will
eventually starve for drivers), you have a lot of data that you care about,
and you hope to keep using XY, you should probably start thinking about the
form of your data now.

Comments on and corrections to the above would be appreciated.

Wally Bass