Join Similar CSV Files Using Perl
When exporting logs from Check Point's SmartLog feature, the column order sometimes varies between exports. If you have multiple files to join, especially ones with different column orders, you don't want to do it by hand.
1 Million Record Limit
Check Point's SmartLog application caps each log export at 1 million records, and Excel similarly limits a worksheet to roughly 1 million rows (1,048,576). Yet you will often find multiple CSV files stored as flat text databases that together total far more than 1 million records, and you still need to parse them.
Text::CSV_XS to Join CSVs
Joining multiple CSVs by hand means opening each file, checking that the column order is correct, rearranging any misplaced columns, and then pasting everything together. And since Excel tops out at about 1 million rows, what happens when your CSV data totals 15 million records? This Perl script was written to automate that tedious puzzle without Excel.
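Before dealing with mismatched columns, it helps to see the naive version of such a join: concatenate the files and keep only the first header. A minimal sketch, assuming every file already shares the same column order (the output name combined.csv is illustrative):

```perl
use strict;
use warnings;

# Naive join: only safe when every input file has the same column order
my @files = glob "./*.csv";
my $header_done = 0;
open my $out, '>', 'combined.csv' or die "combined.csv: $!\n";
for my $file (@files) {
    open my $in, '<', $file or die "$file: $!\n";
    defined( my $head = <$in> ) or next;      # skip empty files
    print {$out} $head unless $header_done++; # keep only the first header
    while (my $line = <$in>) {                # copy the remaining rows
        print {$out} $line;
    }
    close $in;
}
close $out;
```

This breaks down as soon as one export orders its columns differently, which is what the script below handles.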
Join CSV Files with Different Columns
The first version of the script simply opened a set of files and appended them to a new file, skipping every header after the first. This version uses a hash to map the columns, then rewrites each input CSV into the output format you specify.
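In isolation, the hash-mapping idea looks like this. The field names and sample row here are illustrative, not taken from a real export:

```perl
use strict;
use warnings;

# Desired column order for the joined output (illustrative names)
my @want = ('Time', 'Source', 'Destination');

# Header row as one particular export happens to order it
my @header = ('Destination', 'Junk', 'Time', 'Source');

# Hash maps each column name to its position in THIS file
my %pos;
$pos{ $header[$_] } = $_ for 0 .. $#header;

# One data row, in the input file's order
my @row = ('10.0.0.5', 'x', '12:00:01', '192.168.1.9');

# Rewrite the row into the desired output order
my @out = map { $row[ $pos{$_} ] } @want;
print join(',', @out), "\n";    # 12:00:01,192.168.1.9,10.0.0.5
```

Because the map is rebuilt from each file's own header row, files with different column orders all come out in the same shape.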
Error Checking on CSV
This Perl script does not validate your CSV itself; it trusts Text::CSV_XS to do that job. If there are encoding errors, it may choke. For the purpose of joining multiple Check Point log exports into a single CSV, it performs very well. You could easily add a few lines to remove stray commas or otherwise massage the data if required; luckily, that wasn't necessary for the data sets we were using.
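If such massaging ever became necessary, a line or two per field would do it. For example, a hypothetical cleanup (not part of the script below) that strips embedded commas and quotes from a field before it is written out:

```perl
use strict;
use warnings;

# Hypothetical cleanup: delete embedded commas and double quotes
# from a field before writing it into the combined CSV
my $desc = 'Accept, inbound "web" traffic';
$desc =~ tr/,"//d;      # tr///d deletes every listed character
print "$desc\n";        # Accept inbound web traffic
```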
Need custom Perl scripting for your projects? We can help you parse data and extract the information you want, just like this Perl script does:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Text::CSV_XS;
use Benchmark;
# start the overall timer
my $t0 = Benchmark->new;
# output file name comes from the command line
my $outfile = $ARGV[0] // '';
die "Usage: $0 <output.csv>\n" if length $outfile < 3;
# only print the header row once across all input files
my $header = 0;
# find our list of CSVs first, then loop over them
my @csvs = glob "./*.csv";
foreach my $in (@csvs) {
    my $t1 = Benchmark->new;
    print "FILE: $in\n";
    # %base describes the output format we want:
    #   [0] spreadsheet-style column letter, in the proper output order
    #   [1] numeric sort key for that order
    #   [2] seen/not-seen counter for this input file
    #   [3] (pushed later) the column index in the current input file
    my %base = (
        'Time'             => ['A',   0, 0],
        'Description'      => ['B',   1, 0],
        'Destination'      => ['K',  10, 0],
        'Policy Name'      => ['Q',  16, 0],
        'Protocol'         => ['T',  19, 0],
        'Source Port'      => ['W',  22, 0],
        'Destination Port' => ['X',  23, 0],
        'Service Name'     => ['Z',  25, 0],
        'Source'           => ['AA', 26, 0],
        'ICMP Type'        => ['AE', 30, 0],
    );
    # build a spreadsheet-style letter-to-index map (A..CZ, 104 columns)
    # so we never have to type one up by hand; extra columns are ignored
    my %map;
    my $abc = 0;
    foreach my $prefix ('', 'A', 'B', 'C') {
        foreach my $letter ('A' .. 'Z') {
            $map{ $prefix . $letter } = $abc++;
        }
    }
    my %rmap = reverse %map;    # index back to letter, kept for reference
    # read the header row of this input file
    my @fields;
    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
    open my $fh, "<:encoding(utf8)", $in or die "$in: $!\n";
    if (my $row = $csv->getline($fh)) { @fields = @$row; }
    close $fh;
    # walk every header field, even ones we don't want, and record
    # where each wanted column actually sits in this file
    my $col = 0;
    foreach my $f (@fields) {
        if (exists $base{$f}) {
            # only keep the first occurrence of a duplicated header
            if ($base{$f}[2] > 0) { $col++; next }
            push @{ $base{$f} }, $col;    # [3] = input column index
            $base{$f}[2]++;               # mark it as seen
        }
        $col++;    # advance to the next input column
    }
    # %base isn't sorted the way we want, so sort by the numeric key
    my @keys = sort { $base{$a}[1] <=> $base{$b}[1] } keys %base;
    # show the rewrite map: output position vs. input position
    foreach my $f (@keys) {
        my $inpos = $base{$f}[3] // '?';    # '?' if the column is missing
        print "BASE: $base{$f}[0] $base{$f}[1]\tIN:$inpos\t$f\n";
    }
    # $base{...}[3] now holds the input position of every wanted column;
    # open the output CSV in append mode, utf8 encoding
    open my $fh3, ">>:encoding(utf8)", $outfile or die "$outfile: $!\n";
    # reopen the input and copy it row by row into the output format
    my $csv2 = Text::CSV_XS->new({
        binary             => 1,
        allow_loose_quotes => 1,    # needed for stray quotes inside fields
        auto_diag          => 1,
    });
    open my $fh2, "<:encoding(utf8)", $in or die "$in: $!\n";
    while (my $row = $csv2->getline($fh2)) {
        # skip this file's header row if one has already been printed
        if ($header == 1) { $header++; next }
        # pull each wanted field out of the row by its mapped input index
        my $time    = $row->[ $base{'Time'}[3] ];
        my $desc    = $row->[ $base{'Description'}[3] ];
        my $dest    = $row->[ $base{'Destination'}[3] ];
        my $policy  = $row->[ $base{'Policy Name'}[3] ];
        my $proto   = $row->[ $base{'Protocol'}[3] ];
        my $sport   = $row->[ $base{'Source Port'}[3] ];
        my $dport   = $row->[ $base{'Destination Port'}[3] ];
        my $service = $row->[ $base{'Service Name'}[3] ];
        my $source  = $row->[ $base{'Source'}[3] ];
        my $icmp    = $row->[ $base{'ICMP Type'}[3] ] // " ";  # often blank
        # print the reordered row to the output file
        print $fh3 qq~"$time","$desc","$dest","$policy","$proto","$sport","$dport","$service","$source","$icmp"\n~;
    }
    close $fh2;
    close $fh3;
    # $header is 0 only for the very first file; after each file it is
    # reset to 1 so the next file's header row gets skipped
    $header = 1;
    my $t2 = Benchmark->new;
    my $td = timediff($t2, $t1);
    print "FILE DONE!! took '", timestr($td), "'\n";
}
my $t1 = Benchmark->new;
my $td = timediff($t1, $t0);
print " ALL DONE!! took '", timestr($td), "'\n";