Convert .csv File to MySQL Database via Perl

Have you ever had a spreadsheet file or a large .csv file that you wanted to manipulate, but you want more power than a spreadsheet program could offer?

Before I started using MySQL, I would usually throw the .csv file into a desktop database program, like FileMaker. FileMaker would allow you to import the .csv file and it would automatically create the column headers for you. Recently, I was given a spreadsheet with 27,000 rows in it. I still use FileMaker for some databases, but I wanted the power of MySQL to manipulate the information contained in this file. So, I could have easily just typed out the database column names manually into a MySQL “create table” statement, guessed at the types and sizes of the columns and then imported the .csv file. Instead, I decided to write a Perl script to do the dirty work for me. Plus, this spreadsheet had 45 columns of varying lengths. Ouch.

Please keep in mind that this was a quick hack. I did some testing on some small data, and then crunched the big spreadsheet. Everything seemed to work out okay, but I did get a bunch of Perl errors “Use of uninitialized value…”. So, I just turned off “warnings” and proceeded. Feel free to show me how to correct these errors – I just didn’t want to take the time. I am not an expert at writing Perl code anyway.

The script is fairly simple. The .csv file has to have header information in order to be able to create the column names, the delimiter must be double-quotes and a comma – “,” – and so each line must look something like this:

"Name","Address","City","State","Zip","Phone","Amount1","Amount2","Amount3"

There are times when you will have a .csv file that doesn’t have the full “,” delimiter – such as when a column is a number, the “s are left off and only a comma is used as the delimiter – and I could have incorporated that functionality into the script, but my .csv file had the full delimiter for both numbers and text.

I created a test .csv file, and this is the data that I used to initially test the script:

"Name","Address","City","State","Zip","Phone","Amount1","Amount2","Amount3"
"Tony Davidson","100 Main Street","Suwanee","Georgia","30024","6785552222","440.03","4522.00","82837"
"Jack Smith","5000 Washington Ave.","Chattanooga","California","31069","(770)5552941","400","456.787","2828"
"Paul Davis","P.O. Box AB123","Jackson","Georgia","31069","(770)555-6532","325.1","837.3","3848"
"Jacky Smith-Davidson","400 Metropolitan Ave","Macon","Georgia","30024-2032","770-555-8987","503.920","392.4","292833"
"Davd Jackson","100 Oak Drive","Savannah","Tennessee","40024","770) 555- 8941","5492.0","15674.01","43"

The script only determines three data types – varchar, integer and decimal. If your data doesn’t fit into these three types, or if you want to use different data types, then the program probably won’t work for you. The script calculates the longest length for each column, so if you will be changing and increasing the length of the data in the columns, you will want to edit the length of the columns in the “create table” statement that is created.

Once the script was run against the above data, two files were created – mysql_create_table.sql, which is the SQL that you can use to create the actual table – and mysql_data_values.sql, which is the “insert into” file that you can use to load the data into the table.

Here is the SQL statement that was created by the script (mysql_create_table.sql), which can use used to create the table for the data above:

CREATE TABLE `Addresses` (
`Name` varchar (20),
`Address` varchar (20),
`City` varchar (11),
`State` varchar (10),
`Zip` varchar (10),
`Phone` varchar (14),
`Amount1` decimal (7,3),
`Amount2` decimal (8,3),
`Amount3` int (6)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

And here is the “insert” data (mysql_data_values.sql) that is created that you can use to load the database:

insert into Addresses (Name, Address, City, State, Zip, Phone, Amount1, Amount2, Amount3)
values ('Tony Davidson', '100 Main Street', 'Suwanee', 'Georgia', '30024', '6785552222', '440.03', '4522.00', '82837');
insert into Addresses (Name, Address, City, State, Zip, Phone, Amount1, Amount2, Amount3)
values ('Jack Smith', '5000 Washington Ave.', 'Chattanooga', 'California', '31069', '(770)5552941', '400', '456.787', '2828');
insert into Addresses (Name, Address, City, State, Zip, Phone, Amount1, Amount2, Amount3)
values ('Paul Davis', 'P.O. Box AB123', 'Jackson', 'Georgia', '31069', '(770)555-6532', '325.1', '837.3', '3848');
insert into Addresses (Name, Address, City, State, Zip, Phone, Amount1, Amount2, Amount3)
values ('Jacky Smith-Davidson', '400 Metropolitan Ave', 'Macon', 'Georgia', '30024-2032', '770-555-8987', '503.920', '392.4', '292833');
insert into Addresses (Name, Address, City, State, Zip, Phone, Amount1, Amount2, Amount3)
values ('Davd Jackson', '100 Oak Drive', 'Savannah', 'Tennessee', '40024', '770) 555- 8941', '5492.0', '15674.01', '43');

Here is the MySQL command that you can use to import your .sql file into the MySQL database:

mysql -u -p database_name < mysql_data_values.sql

You will want to edit the script and edit/enter your relevant information – such as .csv file name, output file names and storage engine.

I tried this on my 27,000 line .csv file, and it appeared to work. If the data in the .csv file had been a bit cleaner, I could state that it positively did work, but there was so much data that it was difficult to check. The theory is that with this script, you can take a unwieldy .csv file, import it into MySQL, do your work, and then export it again as a .csv file. And, you can always write a script to automatically execute the creation of your table and the importing of the data.

So, feel free to give this a shot and let me know if it worked for you or not.


#!/usr/bin/perl

use File::Copy;
#use warnings;

# edit these next lines to match your preferences
$TABLE_NAME = "Addresses";
$DATABASE_ENGINE = "InnoDB";
$DEFAULT_CHARSET = "latin1";

# enter the file name that you want to convert to the SQL statements and values
$filename = "test.csv";

# create your output files
open(TABLE, ">mysql_create_table2.sql") || die "Can't redirect stdout";
open(VALUES, ">mysql_data_values2.sql") || die "Can't redirect stdout";

# count how many lines are processed
$count = 0;

# set to be blank
$Columns_Values = "";

# open the file to be processed
open FILE, "$filename" or die $!;

# assign the first line which contains headers to the variable $columns
my $columns = <FILE>;

# remove carriage return (for Windows, if you have a CR and a LF, you will need to chop twice)
chop $columns;
# uncomment the next line for Windows files
#chop $columns;

# check to see if the field contains a ' - and if so, add a slash \ in front
$columns =~ s/'/\\'/g;

# ...remove the first " and then "," will be our delimiter
$columns =~ s/\"//;
# ...remove the last " from the end of the line so that "," will be our delimiter
chop $columns;

# remove spaces, add an underscore _
$columns =~ s/ /_/g;
#print "$columns\n";

# split first line into individual field names
@Field_Names = split("\",\"",$columns);

# total number of field names
$Field_Names_Count = $#Field_Names;

# add one to the $Field_Names_Count
$Field_Names_Count_Plus_One = $Field_Names_Count + 1;

# start the field count at zero
$field_count = 0;

# create the column names (values) for the "insert into" part of the SQL statement
if ($count == 0)

{

$column_count = 0;

   while ($column_count <= $Field_Names_Count)
   
   {
      if ($column_count < $Field_Names_Count)
   
      {
         $Columns_Values = $Columns_Values . $Field_Names[$column_count] . ", ";
      }
      
      
      if ($column_count == $Field_Names_Count)
   
      {
         $Columns_Values = $Columns_Values . $Field_Names[$column_count];
      }

      $column_count++;
   }
   
# end if ($count == 0)
}

$count = 0;

# continue to parse the rest of the file which contains the data
while (<FILE>)

{

# remove the carriage return
chomp $_;

# remove the first " and then...
$_ =~ s/\"//;

# ...remove the last " from the end of the line so that "," will be our delimiter
chop $_;

# split the first line into what will be used as the column names
@Field_Values = split("\",\"",$_);

while ($field_count <= $Field_Names_Count )

{

   # check to see if the field contains a ' - and if so, add a slash \ in front
   $Field_Values[$field_count] =~ s/'/\\'/g;

         # if a field is blank, set it to zero, and then remove the zero later
         if (length($Field_Values[$field_count]) < 1)
         
         {
            $Field_Values[$field_count] = "0";
         }

         # check to see if the field value contains any alphabet characters
         if ( $Field_Values[$field_count] =~ m/[a-zA-Z]/)
         
         {
               $type[$field_count] = "varchar";
               
               # find the longest length of the data in the column
               if ($length[$field_count] < 'length($Field_Values[$field_count])')
            
               {
                  $length[$field_count] = length($Field_Values[$field_count]);
               }
         }
   
   
   # once a field has been designated as a varchar, we don't need to test it any further
   # as we aren't going to change a varchar field back to a number or decimal field
   if ($type[$field_count] ne "varchar")
   
   {
         # check to see if the field value does not contain any alphabet characters
         if ( $Field_Values[$field_count] =~ m/[^a-zA-Z]/)
   
         {
            # if the field was already determined to be a decimal, then keep it a decimal
            # if not, then set it to be a number
            if ($type[$field_count] ne "decimal")
            
            {
               $type[$field_count] = "int";
               
               # find the longest length of the data in the column
               if ($length[$field_count] lt 'length($Field_Values[$field_count])')
               {
                  $length[$field_count] = length($Field_Values[$field_count]);
               }
            }
         }
   
         # if the field contains numbers and a period
         if ( $Field_Values[$field_count] =~ m/[0-9.]/)
   
         {
               @count_periods = split("\\.",$Field_Values[$field_count]);
               $number_of_periods = $#count_periods;
            
            
            # if there are two periods in the field, then it is a varchar
            if ($number_of_periods > 1)
            
            {
   
            $type[$field_count] = "varchar";
            
         
               # check for the length of the field to make sure we have the highest field length
               if ($length[$field_count] < 'length($Field_Values[$field_count])')
               {
                  $length[$field_count] = length($Field_Values[$field_count]);
               }
   
   
                  # set these values to be zero - in case the previous field contained a decimal number
                  $decimal_length1[$field_count] = "";
                  $decimal_length2[$field_count] = "";
               }
   
            # if there is only one period in the field, then it is a decimal with X number of decimal places
            if ($number_of_periods == 1)
            
            {
               $type[$field_count] = "decimal";
               
               # split the number to find out the length of each side of the decimal
               # example 1234.56 = 4,2
               @split_decimal_number = split("\\.",$Field_Values[$field_count]);
               
               # find the length of each side of the decimal and keep the highest value
               # this is for the number to left of the decimal
               if ($decimal_length1[$field_count] lt length($split_decimal_number[0]))
               
               {
                  $decimal_length1[$field_count] = length($split_decimal_number[0]);
               }
               
               # find the length of each side of the decimal and keep the highest value
               # this is for the number to right of the decimal
               if ($decimal_length2[$field_count] lt length($split_decimal_number[1]))
               
               {
                  $decimal_length2[$field_count] = length($split_decimal_number[1]);
               }
                           
            # end if ($number_of_periods == 1)
            }
   
         # end if ( $Field_Values[$field_count] =~ m/[0-9.]/)
         }
                  
         # if the field contains anything else besides a 0-9 or a period (.)
         if ( $Field_Values[$field_count] =~ m/[^0-9.]/)
         
         {
               $type[$field_count] = "varchar";
   
               # find the longest length of the data in the column
               if ($length[$field_count] lt 'length($Field_Values[$field_count])')
            
               {
                  $length[$field_count] = length($Field_Values[$field_count]);
               }
   
         # end if ( $Field_Values[$field_count] =~ m/[^0-9.]/)         
         }
   
   # end if ($type[$field_count] ne "varchar")
   }
   
   else
   
   {         
   
               # check for the length of the field to make sure we have the highest field length
               if ($length[$field_count] < length($Field_Values[$field_count]))
            
               {
                  $length[$field_count] = length($Field_Values[$field_count]);
               }
   
   
   # end else
   }
   
   
   # uncomment this line if you want to see the data being processed - as well as another line below
   # print "$Field_Values[$field_count] $type[$field_count] $length[$field_count] $decimal_length1[$field_count] $decimal_length2[$field_count] | ";
   
   
         # if a field is blank, we set it to zero earlier, now we remove the zero
         if (length($Field_Values[$field_count]) < 1)
         
         {
            $Field_Values[$field_count] = "";
         }

   
      # create the syntax needed for the "insert into" statement    
      if ($field_count == 0)
      
      {
         print VALUES "insert into $TABLE_NAME ($Columns_Values) \nvalues ('$Field_Values[$field_count]'";
      }
      
         if ($field_count > 0 && $field_count < $Field_Names_Count_Plus_One)
         
         {
            print VALUES ", '$Field_Values[$field_count]'";
         }
         
      $field_count++;
      # end while ($field_count < $Field_Names_Count_Plus_One )
      }
   
         # check for last entry and then start over on next line
         if ($field_count == $Field_Names_Count_Plus_One)
         
         {
            $field_count = 0;
            $count++;
         
            # close the print statement for the column values
            print VALUES ");\n";
         }
   
   # uncomment this line if you want to see the data being processed
   # print "\n";

# end while (<FILE>)
}

# print the create table statement
print TABLE "\n\nCREATE TABLE `$TABLE_NAME` (\n";

$count_columns = 0;

# loop through the columns and print the type and length for each
while ($count_columns < $Field_Names_Count_Plus_One)

{
   # make sure that we don't have a blank field value
   if (length($Field_Names[$count_columns]) > 0)
   
   {
      if ($type[$count_columns] =~ "decimal")
      
      {
         $decimal_field_length = $decimal_length1[$count_columns] + $decimal_length2[$count_columns];
         print TABLE " `$Field_Names[$count_columns]` $type[$count_columns] ($decimal_field_length,$decimal_length2[$count_columns])";
      }
      
      else
      
      {
         print TABLE " `$Field_Names[$count_columns]` $type[$count_columns] ($length[$count_columns])";
      }
   
      if ($count_columns < $Field_Names_Count)
      
      {
         print TABLE ",\n";
      }
      
      if ($count_columns == $Field_Names_Count_Plus_One)
      
      {
         print TABLE "\n\n";
      }
      
   # end if (length($Field_Names[$count_columns]) > 0)
   }

$count_columns++;

# end while ($count_columns < $Field_Names_Count_Plus_One)
}

# print an output to show how many lines were processed
print "Processed $column_count columns and $count lines.\n";

print TABLE "\n) ENGINE=$DATABASE_ENGINE DEFAULT CHARSET=$DEFAULT_CHARSET\n";

print TABLE "\n\n";

close(FILE);

exit;

 


Tony Darnell is a Principal Sales Consultant for MySQL, a division of Oracle, Inc. MySQL is the world’s most popular open-source database program. Tony may be reached at info [at] ScriptingMySQL.com and on LinkedIn.
Tony is the author of Twenty Forty-Four: The League of Patriots

 

Visit http://2044thebook.com for more information.

18 Responses to Convert .csv File to MySQL Database via Perl

  1. perlhoser says:

    Your post reminds me of a similar, but smaller script that I wrote some time ago. Both the script and some of the comments left may be of interest to you:

    http://blog.josephhall.com/2008/11/importing-usda-sr21-into-mysql.html

  2. henrikingo says:

    And are you aware that MySQL can natively import CSV data with the LOAD DATA INFILE command? It also exports with SELECT INTO OUTFILE.

    • Tony Darnell says:

      Yes, I am aware that you can import a .csv file into a MySQL database. But this script allows you to create a new table and produces the names of the columns for you, and it figures out the necessary lengths of each column.

  3. Marcin Pohl says:

    Do you have any scripts to go backwards, from sql dumps (of various flavours) to CSV?

    • Tony Darnell says:

      I don’t have a way to take a MySQL dump file and convert it to a .csv file, but that is a good suggestion for a future post.

      But you can export data as a .csv file from MySQL.

      SELECT name, address, etc. INTO OUTFILE ‘/tmp/filename.csv’
      FIELDS TERMINATED BY ‘,’ OPTIONALLY ENCLOSED BY ‘”‘
      ESCAPED BY ‘\\’
      LINES TERMINATED BY ‘\n’
      FROM users WHERE …..

  4. Dave Cross says:

    Some comments on your Perl.

    1/ You should always “use strict” and “use warnings”. Commenting out “use warnings” is not an acceptable fix for your problems.

    2/ No need to quote “$filename” in open FILE, “$filename”.

    3/ Use lexical filehandles and three-argument open – open my $file, ‘<', $filename

    4/ Use 'chomp' instead of 'chop'

    5/ Your split looks horrible. How about just split(/","/). The $_ is implied.

    And most importantly, a lot of your program can be massively simplified by using the module Text::CSV. If, for some reason, you don't want to install new modules, Text::ParseWords vomes with Perl and has much of the same functionality.

    Hope this helps.

  5. Scott Noyes says:

    Here’s a version written in PHP which lets MySQL do all the heavy lifting of figuring out what field type to use.

    <?php
    $file = 'test.csv';
    $table = 't';
    $fileLineTerminator = '\r\n';

    mysql_connect('localhost', 'root', 'password');
    mysql_select_db('test');

    // Read the column headers
    $f = fopen($file, 'r');
    $line = trim(fgets($f));
    fclose($f);

    // Replace spaces with underscores
    $line = str_replace(' ', '_', $line);

    // Create the table using the column headers
    // All fields are created as blobs. Will alter table to optimize type.
    $fields = explode('","', trim($line, '"'));
    $sql = "CREATE TABLE $table (`";
    $sql .= implode('` blob, `', $fields);
    $sql .= '` blob)';
    mysql_query($sql) or die(mysql_error());

    // Load the data
    $sql = <<
    .....

    [Editor]
    I either exceed comment length, or the comment system doesn’t like heredoc syntax. Here’s the pastebin:
    http://pastebin.com/dthgD3X1

    • Ben says:

      @Scott Noyes … that’s fantastic! You rock!

      Tony, thanks for starting this thread!

      @Dave Cross … Seriously? Your critique of Tony’s Perl is just a bit obnoxious. Yeah, you know it all man…

      • Dave Cross says:

        Obnoxious? Really? So how should I have phrased it?

        Perl has evolved. There are better ways to write Perl programs. Tony obviously has a really basic understanding of Perl. He gets his job done, but a couple of days reading something like “Modern Perl” (http://www.onyxneon.com/books/modern_perl/) would make his programs smaller, easier to write and easier to maintain.

      • Ben says:

        @Dave Cross/ Uh, yes, definitely obnoxious, as in, “Your split looks horrible.” Your followup comment to mine makes my point even more valid, “but a couple of days reading something like ‘Modern Perl'” …? Seriously??? again you bore us with disenchanting discourse. This thread is not about coding as much as it is about innovation and getting things done (RTFM, or sidebar in this case). I ran the Perl script and it gets the job done.

        What should you have said? Well, nothing, I guess. You should have taken the participation/contributor route like @Scott Noyes and pasted your self-proclaimed, beautifully crafted code, instead of sitting on the self-assigned seat of Perl Parliament! Better yet, take the time to contribute your improvements on github.com.

      • Dave Cross says:

        I seriously didn’t mean to upset people. I was just trying to post some pointers that would allow Tony to improve his code. I really don’t understand why a suggestion to read a (really good) book can be taken as obnoxious. It’s not even like you have to pay for the book – there’s a version available for free download.

        I never suggested that Tony’s code didn’t work in exactly the way that he said it did. I only meant that better written code is easier to maintain.

        But your point about criticising and not offering alternatives is well made. Here’s a link to my alternative version. Feel free to criticise it as much as you like.


        #!/usr/bin/perl
        use strict;
        use warnings;
        use Data::Dumper;
        use Text::ParseWords;
        use List::Util 'max';
        my $file = shift || die "Need a csv file to process\n";
        my $table_name = 'Addresses';
        my $engine = 'InnoDB';
        my $charset = 'latin1';
        # Read and process header line
        open my $data, '<', $file or die "Cannot open $file for reading\n";
        my $header = <$data>;
        chomp $header;
        # @cols will be an array of hashes. Each has will contain details
        # of one of the columns
        my @cols = map { { name => $_ } } parse_line ',', 0, $header;
        for (@cols) {
        $_->{name} =~ s/\s+/_/g;
        $_->{name} =~ s/'/\\'/g;
        }
        # Read the rest of the data
        while (<$data>) {
        chomp;
        my @row = parse_line ',', 0, $_ ;
        for my $v (@row) {
        s/'/\\'/g;
        }
        foreach my $col_no (0 .. $#cols) {
        push @{$cols[$col_no]{values}}, $row[$col_no];
        }
        }
        # Analyse the data
        foreach (@cols) {
        guess_type($_);
        }
        # Output the table
        open my $table, '>', 'mysql_create_table2.sql' or
        die "Can't open file for table: $!";
        print $table table_def(@cols);
        # Output the data
        open my $values, '>', 'mysql_data_values2.sql' or
        die "Can't open file for values: $!";
        for my $row (0 .. $#{$cols[0]{values}}) {
        print $values insert($row, @cols);
        }
        die Dumper \@cols;
        # Analyse an array of column hashes and fill in various information
        # about the columns by looking at the data values in each column.
        sub guess_type {
        my $column = shift;
        $column->{type} = 'varchar';
        foreach my $val (@{$column->{values}}) {
        if ($val !~ /^-?\d+(\.\d+)?$/) {
        $column->{type} = 'varchar';
        last;
        }
        if ($val =~ /^-?\d+\.\d+$/) {
        $column->{type} = 'decimal';
        } else {
        $column->{type} = 'int';
        }
        }
        if ($column->{type} eq 'decimal') {
        $column->{dec1} = max map { length +(split /\./)[0] } @{$column->{values}};
        $column->{dec2} = max map { length +(split /\./)[1] } @{$column->{values}};
        } else {
        $column->{length} = max map { length $_ } @{$column->{values}};
        }
        }
        # Return a table definition string given an array of column hashes.
        sub table_def {
        my @columns = @_;
        return "\n\nCREATE TABLE `$table_name` (\n" .
        join(",\n", map { column_def($_) } @columns) .
        "\n) ENGINE=$engine DEFAULT CHARSET=$charset\n" .
        "\n\n";
        }
        # Given a column hash, return a string containing the SQL column definition.
        sub column_def {
        my $column = shift;
        my $def = " `$column->{name}` $column->{type} ";
        if ($column->{type} eq 'decimal') {
        my $dec_length = $column->{dec1} + $column->{dec2};
        $def .= "($dec_length,$column->{dec2})";
        } else {
        $def .= "($column->{length})";
        }
        return $def;
        }
        # Given a row number and an array of column hashes, return a string
        # containing an SQL insert statement for the given row.
        sub insert {
        my ($row, @columns) = @_;
        return "insert into $table_name (" .
        join(', ', map { $_->{name} } @columns) .
        ") \nvalues (" .
        join(', ', map { "'$_->{values}[$row]'" } @cols) .
        ");\n";
        }

        view raw

        mysql-import

        hosted with ❤ by GitHub

        It seems to me that given Tony’s sample data, this code prints exactly the same output as the original program except for one character where I think I’ve fixed a bug. Tony’s version declared the Amount3 column as int(2). Mine declares it as int(6). I believe that mine is correct given the sample data. I think that this bug is down to the problems explained by morungos in comment 9.

      • Scott Noyes says:

        > Tony’s version declared the Amount3 column as int(2). Mine declares it as int(6).

        Unless you have zerofill defined, the difference between int(2) and int(6) is nothing (“doodly-squat”, as my father would say).

      • Ben says:

        Well @Dave Cross, hats off to you and your choice to contribute. Your code looks very well put together.

  6. Scott Noyes says:

    I either exceed comment length, or the comment system doesn’t like heredoc syntax. Here’s the pastebin:
    http://pastebin.com/dthgD3X1

  7. Netta B says:

    Thank you so much! Scott – you made my day :)

  8. morungos says:

    The expression “($decimal_length1[$field_count] lt length($split_decimal_number[0]))” is almost certainly broken. The “lt” operator is a string comparison (equiv to strcmp functionality), not a numeric comparison, so, for example, 10 lt 2 is actually true, even though 10 and 2 don’t look like strings. You probably want the “<" operator.

    And "if ($length[$field_count] lt 'length($Field_Values[$field_count])')" is probably even more broken, as the second operand is a quoted string.

  9. Pingback: Being Helpful – Perl Hacks

  10. Fernando "Eagle" de Sousa says:

    A great post. Nice to read.

Leave a reply to Tony Darnell Cancel reply