 |
There are two fundamentally different types of files in the
computer world. They differ in both the kind of content they hold,
and the way they are internally organized. (And, since life is
never simple, and there are ALWAYS exceptions to every rule, you
will also find certain files that mix the two types.)
Because of these differences, it is important during file
transfers to choose the appropriate type of transfer for the file.
If you aren't sure of the filetype, extensions commonly used on
filenames will often give you a clue.
Here's a summary of the content and internal organization of
TEXT and BINARY files, with notes about transferring
them and which common filename extensions indicate each.
- Text Files
- Binary Files
- Mixed (text and binary) Files
- How the number of bits in each byte affects
them
(back to contents list)
| Content |
Plain old printable characters -- the
7-bit ASCII character set. Among the common types are:
- Readable documents -- the "normal" kind of text
file.
- Web pages -- written in HTML, which is plain text
- Encoded files -- binary files (see below) which have
been turned into plain text to protect them from destruction when
they are being sent through e-mail or file transfers that can't
handle the special needs of binary files.
NOT text files -- Word processor
files, except when saved as some form of "text" or "ascii" file,
are binary, not text files, because they contain special
non-ascii codes for formatting. |
| Organization |
Text files are organized into lines. By definition, a
line must have an end. It's important to know that these line ends,
which consist of the ASCII characters carriage-return (CR, a
control-M) and/or line-feed (LF, a control-J), are different on PCs
and Macs and Unix:
| PC | CR + LF |
| Mac | CR |
| Unix | LF |
| IBM mainframe | |
|
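As a rough illustration of the line-end conventions above, the following Python sketch counts each kind of line end in a block of bytes and reports the most common one. The function name `detect_line_end` and its labels are made up for this example, and the IBM mainframe case is not covered.

```python
def detect_line_end(data: bytes) -> str:
    """Return a label for the most common line-end convention in data."""
    crlf = data.count(b"\r\n")
    bare_cr = data.count(b"\r") - crlf   # CRs that are not part of CR+LF
    bare_lf = data.count(b"\n") - crlf   # LFs that are not part of CR+LF
    counts = {"PC (CR+LF)": crlf, "Mac (CR)": bare_cr, "Unix (LF)": bare_lf}
    label, n = max(counts.items(), key=lambda item: item[1])
    return label if n > 0 else "no line ends found"
```

A file with mixed line ends is reported by whichever convention occurs most often, so treat the result as a hint rather than a guarantee.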
| Transfers |
Choosing an ascii or text type of transfer for
text files is important because of the two main functions performed
by such a transfer:
- handling the differences in line-ends on different systems,
and
- performing character-set translations when necessary, such as
on
- IBM mainframes, which use the "EBCDIC" character set instead of
the ASCII set used by most other computers, or
- Apple II's, whose "ASCII" characters have their 8th bit turned on.
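The two functions of a text-type transfer can be sketched in Python roughly as follows. This is only an illustration, not how any particular transfer program works: the function names are invented here, and the choice of Python's `cp037` codec assumes the mainframe uses that particular EBCDIC code page.

```python
import re

def convert_line_ends(data: bytes, target: bytes) -> bytes:
    """Rewrite every CR+LF, bare CR, or bare LF as the target line end."""
    # CR+LF is listed first so that a pair is replaced once, not twice.
    return re.sub(rb"\r\n|\r|\n", target, data)

def ebcdic_to_ascii(data: bytes) -> bytes:
    """Translate EBCDIC text to ASCII (assumes EBCDIC code page 037)."""
    # Characters with no ASCII equivalent are replaced with "?".
    return data.decode("cp037").encode("ascii", errors="replace")
```

For example, `convert_line_ends(pc_text, b"\n")` would turn PC line ends into Unix ones, where `pc_text` stands for the bytes of a PC text file.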
Binary style transfers leave everything untouched, which is
disastrous when character translations are necessary -- you get
gobbledygook. And if the line-ends on the two systems are
different, the results of a binary transfer range from funny to a
royal pain. Looking at the table of line-ends (under
"Organization", above), you can see why binary transfers would
result in
- PC files sent to Macs having a control-J (linefeed) in front of
every line
- PC files on Unix having a control-M (carriage return) at the
end of every line
- Unix files on PC's or Macs looking like one large mass of text
sprinkled with control-J's where line-ends should be
- any other files on the IBM mainframe having an extra blank line
between each line.
A common example of this can be observed in the HTML source for a
web page downloaded by a web browser from a different kind of
system from your own, and saved with a "Save As" or related
command. HTML files are text files, but Web browsers do binary
downloads. Browsers can ignore CR's and LF's, and display the file
just fine, but anything else looking at the file (especially text
editors) generally can't.
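The leftover control characters described above are easy to reproduce. In this small Python sketch the sample file contents are invented, and splitting on a single byte sequence stands in for "how the receiving system finds its line ends".

```python
pc_file = b"first\r\nsecond\r\n"     # PC line ends: CR + LF
unix_file = b"first\nsecond\n"       # Unix line ends: LF

# A Mac ends lines at CR, so each LF is left at the front of the next line.
as_seen_on_mac = pc_file.split(b"\r")
# [b'first', b'\nsecond', b'\n']

# Unix ends lines at LF, so each CR is left at the end of the line.
as_seen_on_unix = pc_file.split(b"\n")
# [b'first\r', b'second\r', b'']

# A PC looks for CR + LF, finds none, and sees one long line.
unix_seen_on_pc = unix_file.split(b"\r\n")
# [b'first\nsecond\n']
```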
The only time a binary transfer is safe for text files is
when the sending and receiving machines use the same system --
and even that isn't always true (e.g., the IBM mainframe).
|
| Filename Extensions |
| Readable text | .txt, .asc |
| WWW home pages | .htm, .html |
| PC "uuencoded" files | .uu |
| Mac "binhex" encoded files | .hqx |
|
(back to contents list)
| Content |
Binary files include
- graphics, sound and video files
- computer programs
- archives (files that contain one or more files that have been
grouped together for convenience and compressed to save space)
- formatted documents produced by word processors, spread-sheets,
databases, etc. (As noted above, word processor documents not saved
explicitly as "text" or "ascii" are binary files).
Unlike text files, whose characters generally need only 7 bits,
binary files are made up of bytes which depend on having all 8 bits
intact. Sending such files through e-mail or text-style transfers,
which often respect only 7 bits, tends to destroy them. |
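One quick, imperfect way to tell the two kinds of file apart is to check whether any byte uses its 8th bit. The Python sketch below does that. The function name is invented for this example, and the test is only a heuristic: a binary file could happen to contain nothing but low-valued bytes.

```python
def looks_like_7bit_text(data: bytes) -> bool:
    """True when no byte in data has its 8th bit set (every byte is below 128)."""
    return all(byte < 0x80 for byte in data)
```

Plain ASCII text passes this test, while the first byte of a typical graphics file header (for instance the 0x89 that starts a PNG file) fails it.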
| Organization |
Because binary files don't consist of readable text, they tend
not to have a conceptual need for "lines", and thus tend to have
neither lines nor line-ends. They are just one long stream of 8-bit
bytes -- hence the "octet stream" term frequently encountered in
Web browsing.
Mac note -- Macs have a special kind of binary file
called MacBinary. Such a file includes not only the plain
"binary" (a.k.a. "raw binary") information of the file itself, but
also any associated "resource" information (e.g., fonts). Computer
programs and formatted documents (mentioned above) tend to be
MacBinary. Most other files (graphics, sound, video), though they
may include information such as the creator-application for the
file (which helps define the icon used for the file), are
essentially just plain "binary" files.
|
| Transfers |
Binary transfers leave the file untouched. As noted in the
"Text" section above, they do no character translation or line-end
adjustment. The only adaptation they will do is handling the
exchange of physical file information (size, creation/modification
dates, etc.)
MacBinary -- most Mac file transfer programs have a
variant type of transfer for MacBinary files. Use this when
transferring files between Macs, or between a Mac and an
intermediary machine in what will ultimately be a Mac-to-Mac
transfer. Transferring Mac graphic/sound/video files should be done
as "raw binary", as should formatted documents intended for use
with the same application on a PC.
|
| Filename Extensions |
| PC | Executable programs | .exe, .com |
| PC | Compressed archives | .zip |
| PC | Graphics, sound, etc. | .gif, .jpg, .tif, .au |
| PC | Formatted documents | .doc, .wpd, .pdf, .xls, .xlw, .mdb . . . |
| Mac | Self-extracting archives | .sea |
| Mac | Compressed archives | .sit, .cpt |
| Mac | Graphics, sound, etc. | .gif, .jpeg, .tiff, .au |
| Mac | Formatted documents | .doc, .pdf, .xls, .xlw . . . |
|
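The extension lists in this document can be turned into a simple lookup for choosing a transfer type. The Python sketch below is only a convenience built from those lists. The names `TEXT_EXTENSIONS`, `BINARY_EXTENSIONS` and `guess_transfer_type` are invented here, and an extension is a clue, not proof of what a file contains.

```python
import os

# Extensions taken from the text-file and binary-file tables in this document.
TEXT_EXTENSIONS = {".txt", ".asc", ".htm", ".html", ".uu", ".hqx"}
BINARY_EXTENSIONS = {
    ".exe", ".com", ".zip", ".sea", ".sit", ".cpt",
    ".gif", ".jpg", ".jpeg", ".tif", ".tiff", ".au",
    ".doc", ".wpd", ".pdf", ".xls", ".xlw", ".mdb",
}

def guess_transfer_type(filename: str) -> str:
    """Return "text", "binary", or "unknown" based on the filename extension."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in TEXT_EXTENSIONS:
        return "text"
    if extension in BINARY_EXTENSIONS:
        return "binary"
    return "unknown"
```

When the answer is "unknown", a binary transfer at least leaves the file untouched, so nothing is lost if it has to be converted later.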
(back to contents list)
| Content |
The only type of file we have encountered which is a mixture of
binary bytes and text characters is the special output of the SAS
and SPSS statistical packages, known as the SAS dataset and
the SPSS system file. Variable names and labels are kept in
text form, while numbers tend to be stored as binary values. |
| Organization |
Since these files consist of binary numbers and text words or
phrases with no line structure, they have the same "endless stream"
quality as binary files, though on the UMDD IBM mainframe, as with
other binary files, they are stored in arbitrary fixed-length
records. |
| Transfers |
Because much of the content of these files is binary, they must
be sent with a binary file transfer.
Since there is no line structure, the failure to use a text-type
transfer doesn't create problems as long as no
character-translation is needed. So transfer between PCs and Macs
and Unix can be done in simple binary fashion.
The only problem in transferring these files occurs with
transfers between the IBM mainframe and other systems. The IBM
mainframe's EBCDIC character set is incompatible with the ASCII
character set used for text on most other computer systems, and thus
requires translation. So,
- a binary transfer, which leaves the text portion of the
file untranslated, will make a mess out of it
- a text transfer will translate not only the text part of
the file, but also the binary part (since it cannot distinguish
between the two parts), making a mess of the binary
part.
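The dilemma can be seen in a small Python sketch. The record layout here is invented for illustration (a three-letter variable name followed by one 8-byte number) and is not the real SAS or SPSS format. The `cp037` codec is an assumed stand-in for the mainframe's EBCDIC code page.

```python
import struct

label = "AGE".encode("cp037")        # text part, stored in EBCDIC
number = struct.pack(">d", 42.5)     # binary part, an 8-byte floating-point value
record = label + number              # hypothetical mixed record on the mainframe

# Binary transfer: nothing changes, so the label is still EBCDIC
# and does not read as "AGE" on an ASCII system.
binary_result = record

# Text transfer: every byte is translated, including the number's bytes.
text_result = record.decode("cp037").encode("latin-1", errors="replace")

label_readable_after_binary = binary_result[:3] == b"AGE"   # False
label_readable_after_text = text_result[:3] == b"AGE"       # True
number_intact_after_text = text_result[3:] == number        # False
```

Either way, one part of the record arrives damaged.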
This catch-22 can only be resolved by using the statistical package
involved to generate its own portable file -- SAS's transport file
or SPSS's export file -- and using a binary transfer to send it. These
use ASCII for the text portion, and can be reconverted to the
original system file once they are on the desired system. |
Filename
Extensions |
There are no standard extensions for these files. |
(back to contents list)
Characters (in text files) and bytes (in binary
files) refer to the same thing -- a set of eight binary bits:
| binary byte |
|
* * * * * * * * |
| ASCII character |
|
0 * * * * * * * |
where 0 is a zero bit and *
can be either a one or a zero.
Since only 7 bits are required to create all 128
characters in the ASCII set, programs that deal with transferring
characters (such as e-mail or text-type file transfers) very often
ignore the left-hand 8th bit or, worse, deliberately clear it or
use it for other purposes, such as the error-checking technique
known as parity.
Since a binary file absolutely requires that the eighth bit be
trusted -- after all, it might be part of an instruction code in a
program, or represent a pixel in a picture, or signal the start of
a bold-face passage in a word processor file -- using a text-type
transfer for a binary file virtually guarantees the file's
destruction.
|
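To make the damage concrete, here is a short Python sketch of the two things a text-oriented program might do with the 8th bit. Both function names are invented for this example, and even parity is just one possible parity scheme.

```python
def clear_eighth_bit(data: bytes) -> bytes:
    """Force the left-hand (8th) bit of every byte to zero."""
    return bytes(byte & 0x7F for byte in data)

def add_even_parity(data: bytes) -> bytes:
    """Reuse the 8th bit so that each byte has an even number of 1 bits."""
    result = bytearray()
    for byte in data:
        seven_bits = byte & 0x7F
        parity = bin(seven_bits).count("1") % 2   # 1 when the count of 1 bits is odd
        result.append(seven_bits | (parity << 7))
    return bytes(result)
```

ASCII text survives `clear_eighth_bit` unchanged, but a byte such as 0x89 comes out as 0x09, and there is no way to tell afterwards which bytes were altered.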