Fix Encoding Issues With Perl
When working with text files such as log exports, you may run into odd encoding problems: stray bytes, non-breaking spaces, or mixed line endings. If you need to fix encoding issues with Perl, you can use a script like the one shown further down.
The script opens each matching file as raw bytes, replaces a handful of problem bytes with spaces, and writes the result out as ASCII. You can check what type Linux thinks your files are by running the “file” command with the -i option. It reports whether a file is binary, UTF-8 encoded, or ASCII encoded.
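If you would rather do a similar check from inside Perl, the sketch below classifies a byte string as ASCII, UTF-8, or something else. It is only an approximation of the charset part of what “file -i” reports, not a reimplementation of it. The helper name guess_charset and the label unknown-8bit are assumptions made for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# guess_charset is a hypothetical helper, not part of the fixer script.
# It takes a string of raw bytes and returns a rough charset label.
sub guess_charset {
    my ($bytes) = @_;

    # Only 7-bit bytes present: plain ASCII.
    return 'us-ascii' unless $bytes =~ /[^\x00-\x7F]/;

    # Strict UTF-8 decode; FB_CROAK makes decode die on malformed input,
    # LEAVE_SRC keeps the input string unmodified.
    my $ok = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 'utf-8' : 'unknown-8bit';
}

print guess_charset("plain text"), "\n";     # us-ascii
print guess_charset("caf\xC3\xA9"), "\n";    # utf-8 (e-acute as two bytes)
print guess_charset("caf\xE9"), "\n";        # unknown-8bit (a lone Latin-1 byte)
```

To check a whole file this way, read it through a handle opened with the :raw layer so that Perl hands you the bytes unchanged.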
Here is the Perl script that replaces specific problem bytes and converts the files to ASCII encoding:
#!/usr/bin/perl
# fixer: fixes weird chars in a bunch of text files
use warnings; # standard best practices
use strict; # standard best practices
use 5.010; # require Perl 5.10 or later
opendir my $dir, "." or die "Cannot open directory: $!";
while ( my $file = readdir($dir) ) {
# skip certain types/etc
#next if $file =~ /^\./;
#next if $file =~ /\.zip$/;
#next if $file =~ /\.pl$/;
# or just pick a certain type
next unless $file =~ /\.csv$/;
# replace whitespace in the file name with underscores
# (originally needed for glob; not required now that opendir is used)
my $renamed = $file;
$renamed =~ s/\s+/_/g;
rename $file, $renamed or die "Cannot rename '$file': $!";
print "FILE: '$file' ($renamed)\n";
# open the input file
open IN, '<', $renamed or die "Cannot open '$renamed': $!";
# open the output file (same name with a trailing "-")
open OUT, '>', "${renamed}-" or die "Cannot write '${renamed}-': $!";
# binmode, select encoding
binmode(IN, ":raw");
binmode(OUT, ":encoding(ascii)");
# read file line by line
while (<IN>) {
my $line = $_;
chomp $line;
$line =~ s/\xC2/ /g;
$line =~ s/\xA0/ /g;
$line =~ s/\x93/ /g;
$line =~ s/\x80/ /g;
$line =~ s/\xE2/ /g;
$line =~ s/\x0a/ /g;
$line =~ s/\x0d/ /g;
print OUT "$line\r\n"; # make it end in windows format
#print OUT "$line\n"; # make it end in linux format
}
# close file handles
close IN;
close OUT;
}
closedir $dir;
# 0d \r = CR = carriage return = ASCII code 13 (decimal), 015 (octal), 0d (hex)
# 0a \n = LF = line feed = ASCII code 10 (decimal), 012 (octal), 0a (hex)
# 0d0a = windows new line
# 0a = linux new line
# On Windows, the combination of those two control characters, i.e. \r\n,
# is used to indicate a newline, while on Linux/Unix, a single \n is used as newline.
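The bytes the script replaces appear consistent with UTF-8 sequences: \xC2\xA0 encodes a no-break space (U+00A0) and \xE2\x80\x93 encodes an en dash (U+2013). Replacing each byte on its own leaves two or three spaces per character. The sketch below is a variation, not the script's original behavior: it replaces whole sequences, then normalizes the line ending as described in the notes above. The helper name clean_line, the choice of a hyphen for the en dash, and the sample input are assumptions made for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# clean_line is a hypothetical helper for illustration only.
# It expects raw bytes, as read from a handle using the :raw layer.
sub clean_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;           # strip a trailing LF or CRLF
    $line =~ s/\xC2\xA0/ /g;        # UTF-8 no-break space -> one space
    $line =~ s/\xE2\x80\x93/-/g;    # UTF-8 en dash -> ASCII hyphen
    $line =~ s/[^\x00-\x7F]/ /g;    # any other non-ASCII byte -> space
    return "$line\r\n";             # end in Windows format (CRLF)
}

my $cleaned = clean_line("10\xC2\xA0km \xE2\x80\x93 done\n");
print $cleaned;    # "10 km - done" followed by CRLF
```

To end lines in Linux format instead, return "$line\n". When printing the result, use an output handle in :raw mode on Windows so the CRLF is not translated a second time.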