Information Technology Grimoire

Version .0.0.1

IT Notes from various projects because I forget, and hopefully they help you too.

Fix Encoding Issues With Perl

Text files such as log exports sometimes contain stray bytes from a different encoding, which show up as weird characters. The Perl script further down this page cleans them up.

The script reads each file as raw bytes, replaces the problem bytes with spaces, and writes the result as ASCII. To check what Linux thinks your files are, run the “file” command with the -i option (file -i filename). It reports the MIME type and character set, so it will tell you whether a file is binary, UTF-8 encoded, or ASCII encoded.
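If you want a similar check from inside Perl, here is a minimal sketch. It is only a rough approximation and not how the “file” command works internally. The helper name guess_charset and the labels it returns are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: roughly classify a byte string as binary, ASCII,
# valid UTF-8, or some other 8-bit encoding.
sub guess_charset {
    my ($bytes) = @_;
    return 'binary'   if $bytes =~ /\x00/;          # NUL bytes suggest binary data
    return 'us-ascii' if $bytes !~ /[^\x00-\x7F]/;  # only 7-bit bytes present

    # A strict decode dies on malformed UTF-8, so eval tells us whether it is valid.
    my $copy = $bytes;    # decode may modify its input when a CHECK value is given
    return 'utf-8' if eval { decode( 'UTF-8', $copy, FB_CROAK ); 1 };

    return 'unknown-8bit';
}

print guess_charset("plain text"),   "\n";   # us-ascii
print guess_charset("caf\xC3\xA9"),  "\n";   # utf-8 (C3 A9 is a valid sequence)
print guess_charset("caf\xE9"),      "\n";   # unknown-8bit (lone E9 is not valid UTF-8)
```

A file that comes back as unknown-8bit is a likely candidate for the byte replacement approach used in the script.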


Here is the Perl script that replaces specific weird characters and writes the output with ASCII encoding:

#!/usr/bin/perl

# fixer: fixes weird chars in a bunch of text files

use warnings;                 # standard best practices
use strict;                   # standard best practices
use 5.010;                    # require Perl 5.10 or newer

opendir my $dir, "." or die "Cannot open directory: $!";

    while ( my $file = readdir($dir) ) {
      # skip certain types/etc
      #next if $file =~ /^\./;
      #next if $file =~ /\.zip$/;
      #next if $file =~ /\.pl$/;

      # or just pick a certain type
      next unless $file =~ /\.csv$/;

      # replace whitespace in the file name with underscores
      # (left over from an earlier glob-based version; not required with opendir)
      my $renamed = $file;
         $renamed =~ s/\s+/_/g;
      rename $file, $renamed or die "Cannot rename '$file': $!";
      print "FILE: '$file' ($renamed)\n";

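      # open the input file as raw bytes
      open my $in, '<:raw', $renamed or die "Cannot read '$renamed': $!";

      # open the output file (input name plus a trailing "-") with ASCII encoding
      open my $out, '>:encoding(ascii)', "${renamed}-" or die "Cannot write '${renamed}-': $!";

      # read file line by line
      while ( my $line = <$in> ) {
         chomp $line;
         # replace problem bytes with spaces; these typically come from UTF-8
         # sequences (e.g. C2 A0, a non-breaking space) or Windows-1252 punctuation
         $line =~ s/\xC2/ /g;
         $line =~ s/\xA0/ /g;
         $line =~ s/\x93/ /g;
         $line =~ s/\x80/ /g;
         $line =~ s/\xE2/ /g;
         # replace any stray line feeds and carriage returns as well
         $line =~ s/\x0a/ /g;
         $line =~ s/\x0d/ /g;
         print {$out} "$line\r\n";  # make it end in windows format
         #print {$out} "$line\n";  # make it end in linux format
      }

      # close file handles
      close $in;
      close $out or die "Cannot close '${renamed}-': $!";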
   }

closedir $dir;

# 0d \r = CR = carriage return = ASCII code 13 (decimal), 015 (octal), 0d (hex)
# 0a \n = LF = line feed = ASCII code 10 (decimal), 012 (octal), 0a (hex)

# 0d0a = windows new line
# 0a = linux new line

#On Windows, the combination of those two control characters, i.e. \r\n,
#is used to indicate a newline, while on Linux/Unix, a single \n is used as newline.
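
The newline notes above can be checked with a small sketch like the following. The helper names normalize_eol and hex_bytes are hypothetical and only here for illustration; the sketch converts mixed line endings to one style and prints the resulting bytes in hex.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: rewrite every line ending (CRLF, lone CR, or LF) as $eol.
sub normalize_eol {
    my ( $text, $eol ) = @_;
    $text =~ s/\r\n|\r|\n/$eol/g;    # try CRLF first so it counts as one ending
    return $text;
}

# Hypothetical helper: show a string as space-separated hex bytes.
sub hex_bytes {
    my ($text) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $text;
}

my $mixed = "a\r\nb\nc\r";    # one Windows, one Linux, and one bare CR ending

print hex_bytes( normalize_eol( $mixed, "\r\n" ) ), "\n";   # 61 0d 0a 62 0d 0a 63 0d 0a
print hex_bytes( normalize_eol( $mixed, "\n" ) ),   "\n";   # 61 0a 62 0a 63 0a
```

Printing hex instead of the raw text keeps the result readable on any platform, since the control characters themselves are invisible.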
Last updated on 10 Oct 2018
Published on 10 Oct 2018