[ TechnoCage | Caskey | dos2unix ]
The three main computer operating systems in use today diverged (unfortunately) long ago in their treatment of line-endings. In most documents, the author has some control over how the information they have written is presented. Of major importance is the notion of a line-ending. When a written document is encoded by a computer, special characters signal where one line ends and the next begins. These invisible control characters are one specific area of computing that frustrates many computer users to this day.
Under DOS (Windows/PC) the end of a line of text is signalled using the ASCII code sequence CarriageReturn,LineFeed, alternately written as CR,LF or the bytes 0x0D,0x0A. On the classic Macintosh platform, only the CR character (0x0D) is used; Mac OS X follows the UNIX convention. Under UNIX, the opposite is true and only the LF character (0x0A) is used. Document formats such as HTML, PDF and XML do not rely on these raw control characters to decide where a displayed line of text ends and the next begins, thus avoiding this entire class of problems (among others).
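To see which convention a piece of text actually uses, the three byte sequences described above can simply be counted. The following is a minimal sketch rather than a finished tool; the function name count_line_endings is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count each line-ending style found in a string of raw bytes.
# The CR,LF alternative is tried first, so the CR and LF of a DOS
# line ending are never counted a second time on their own.
sub count_line_endings {
    my ($text) = @_;
    my %count = ( CRLF => 0, CR => 0, LF => 0 );
    while ( $text =~ /(\r\n|\r|\n)/g ) {
        my $ending = $1 eq "\r\n" ? 'CRLF' : $1 eq "\r" ? 'CR' : 'LF';
        $count{$ending}++;
    }
    return %count;
}

# One line in each style: DOS, classic Macintosh, UNIX.
my %seen = count_line_endings("dos\r\nmac\runix\n");
printf "CRLF=%d CR=%d LF=%d\n", @seen{qw(CRLF CR LF)};
# prints: CRLF=1 CR=1 LF=1
```

A file that reports more than one non-zero count has mixed line endings, which is usually the sign that two editors with different conventions have touched it.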
As luck would have it, a very powerful tool, Perl, exists and has many, many ways of solving this simple problem. In keeping with the Perl motto of "There's More Than One Way To Do It", I provide several samples of just how this could be done. All of these convert to/from the unix format of a single LF. These examples are written around the context of having a number of Java source code files that need to be converted due to the insanity of a particular tool you may have used to edit the files, which forced its notion of line-endings upon you. In Perl the CarriageReturn character is represented by "\r" and LineFeed (aka NewLine) is represented as "\n".
The simplest perl script is this one:
perl -pi -e 's/\r\n/\n/;' *.java
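This does the reverse. The (?<!\r) lookbehind skips any LF that already follows a CR, so running it on a file that is already in DOS format does not double the CR:
perl -pi -e 's/(?<!\r)\n/\r\n/;' *.java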
If you wish to be a little more complicated, you can do the same
in two lines of perl. This enables you to simply name the file(s)
you wish to convert on the command line. It would be used like so:
dos2unix-2line *.java
Here is what dos2unix-2line looks like:
#!/usr/bin/perl -pi
s/\r\n/\n/;
Here is what unix2dos-2line looks like:
#!/usr/bin/perl -pi
s/(?<!\r)\n/\r\n/;
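The short versions above only translate between CR,LF and LF; a classic Macintosh file that uses CR alone passes through them unchanged. As a hedged sketch of how all three conventions could be handled on a whole string at once, the two functions below (to_unix and to_dos are names invented for this example) normalise any mix of line endings:

```perl
use strict;
use warnings;

# Convert CR,LF and bare CR line endings to a single LF.
sub to_unix {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}

# Convert any line ending (CR,LF, bare CR or bare LF) to CR,LF.
# An existing CR,LF is matched as a unit, so its CR is never doubled.
sub to_dos {
    my ($text) = @_;
    $text =~ s/\r\n?|\n/\r\n/g;
    return $text;
}
```

Both functions assume the whole file has been read into one string, because a CR-only file contains no "\n" for Perl's default line-by-line reading to split on.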
To do it "right" takes much more complicated code. The tragic thing about this version is that it is only marginally more readable than the previous two versions, yet contains fifteen times more lines than the longer of the two prior. In bytes, it is over 20 times larger.
Nonetheless, here is the code for dos2unix:
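#!/usr/bin/perl
#
# A script to convert a number of java source
# files from dos line ending format to unix line format.
#
# WARNING: THIS SCRIPT DOES NOT CHECK FOR PRE-EXISTING
# FILES, USE WITH CAUTION.
#
# Usage: dos2unix <DIRECTORY>
#
use strict;
use warnings;

my $directory = shift @ARGV;
$directory = '.' unless $directory;
chdir( $directory ) || die "Unable to enter directory '$directory'.\n$!\n";
my @files = <*.java>;
$| = 1;    # unbuffered output, so each file name appears immediately
foreach my $file ( @files ) {
    my $linesFixed = 0;
    print "$file\t";
    # Keep the original as FILE.bak and read from that copy.
    rename( $file, "$file.bak" ) || die "Unable to rename $file\n$!\n";
    open( my $input, '<', "$file.bak" ) || die "Unable to read $file.bak\n$!\n";
    open( my $output, '>', $file ) || die "Unable to write $file\n$!\n";
    # Raw mode, so no platform I/O layer translates the line endings first.
    binmode $input;
    binmode $output;
    while ( <$input> ) {
        if ( s/\r\n/\n/ ) {
            $linesFixed++;
        }
        print $output $_;
    }
    close( $input );
    close( $output ) || die "Unable to finish writing $file\n$!\n";
    print "($linesFixed)\n";
}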
Its matching partner, unix2dos, has only one line that differs:
if ( s/\r\n/\n/ ) {
becomes
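if ( s/(?<!\r)\n/\r\n/ ) {
The (?<!\r) lookbehind leaves lines that already end in CR,LF untouched, so they are neither given a second CR nor counted as fixed.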
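The WARNING in the dos2unix script header is there because renaming to FILE.bak silently replaces any backup left over from an earlier run. One possible way around that, shown here only as a sketch, is to pick a backup name that is not yet taken. The helper unused_backup_name and its ".bak.N" naming scheme are assumptions made for this example, not part of the original script.

```perl
use strict;
use warnings;

# Return a backup name for $file that does not exist yet:
# "$file.bak" if it is free, otherwise "$file.bak.1", "$file.bak.2", ...
sub unused_backup_name {
    my ($file) = @_;
    my $backup = "$file.bak";
    my $n      = 0;
    while ( -e $backup ) {
        $n++;
        $backup = "$file.bak.$n";
    }
    return $backup;
}
```

In the script it would stand in for the fixed "$file.bak" name in the rename() and the first open(). Note that the check and the rename are two separate steps, so this does not protect against another process creating the same name in between.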
More information about line-feeds and character sets can be found on the internet and at the following locations.
Comments welcome.

Last updated: 2004-08-23