MySQL versus MongoDB for storing JSON documents in a doc-store database – which is faster?

MySQL added the ability to store JSON documents in a doc-store format with MySQL version 5.7 (general availability date – October 21, 2015). Although MySQL’s initial release of their doc-store feature did have some limitations, MySQL has made some really nice improvements in their version 8.0 release, and more recently, in their 8.0.19 release.

I came across a paper written by Erik Andersson and Zacharias Berggren titled “A Comparison Between MongoDB and MySQL Document Store Considering Performance“. I think the paper was written for a college project (and it might have been their senior thesis as they were in the bachelor’s program in computer science). In the paper, they compared MongoDB 3.4.4 against MySQL version 5.7.12, which was an early release of MySQL version 5.7. And, it wasn’t a surprise to see that MongoDB’s document store was much faster than MySQL.

But, with all of the improvements to MySQL’s document store in the latest release (version 8.0.19), I attempted to re-create their tests to see if MySQL had improved since version 5.7.12. And both products should have improved since the original tests in 2017.

Disclaimer

I am a solutions engineer at MySQL, and I am not a MongoDB user. There may be ways of configuring both databases to be faster, but I did these tests without modifying any of the default variables which could affect performance. The only exception is that for MySQL, I did a separate test with the binary log disabled (I explain this later in the post).

There were two variables, unrelated to performance, that I had to change. For MySQL, I did have to set max_join_size=11000000 when running the queries for the largest database. For MongoDB, I had to set DBQuery.shellBatchSize=1000, or MongoDB would only return 20 rows at a time in the terminal window.

The Equipment

I ran the tests on my own server, and then on a Virtual Machine in Oracle’s Cloud Infrastructure (OCI). My server’s hardware was as follows:

  • Intel 8th Gen Core i7-8700K Processor (6 Cores / 12 Threads)
  • 32GB DDR4 DRAM 2666MHz
  • 500GB PC SSD SATA III 6 Gb/s M.2
  • Gigabyte Z370 AORUS motherboard
  • Mac OS 10.13.6

For OCI, I used the VM.Standard.B1.1 shape, which consisted of the following:

  • Single OCPU – 2.2 GHz Intel® Xeon® E5-2699 v4
  • 11.8 GB of Memory
  • 100 GB Remote Block Volume

Installation

NOTE: I am not going to go into detail about how to install MongoDB, NodeJS or MySQL, but I will provide the links.

I began by installing MongoDB via yum. Per the installation instructions, I created a MongoDB repo file (/etc/yum.repos.d/mongodb-org-4.2.repo) and added the following:

[mongodb-org-4.2]
name=MongoDB Repository
baseurl=https://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/4.2/x86_64/
gpgcheck=1
enabled=1
gpgkey=https://www.mongodb.org/static/pgp/server-4.2.asc

Note: gpgcheck enables GNU Privacy Guard (GPG) signature checking, which helps verify that you are installing the correct package from MongoDB, and not from a third party.

I then used this command to install MongoDB via yum:

sudo yum install -y mongodb-org

I also had to install NodeJS. To install, I used the following commands:

curl -sL https://rpm.nodesource.com/setup_10.x | sudo bash -
sudo yum install nodejs

I downloaded the RPM for the latest (version 8.0.19) MySQL community release, and installed it via yum:

sudo yum localinstall mysql80-community-release-el7-3.noarch.rpm -y
sudo yum install mysql-community-server -y

I started both MongoDB and MySQL:

sudo service mongod start
sudo service mysqld start

And I confirmed the status of each:

sudo service mongod status
sudo service mysqld status
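Since the comparison depends on the specific releases being tested, it is also worth confirming the installed versions. A quick, hypothetical check (the output should report the 8.0.19 and 4.2 releases used here):

mysql --version
mongod --version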

Now that I had MongoDB and MySQL installed, I needed some data.

Data

I found some basic JSON data on the data.gov web site – which is a great site for finding generic JSON data. In this case, I needed a large data set, so I found some JSON data which consisted of a list of businesses in the state of Washington (USA). The file I downloaded contained 2.6 million records. The JSON records contained this data:

{
"Ubi": "1234567890",
"Title": "GOVERNOR",
"FirstName": "CHRISTOPHER",
"MiddleName": "WALKEN",
"LastName": "KRAMER",
"Address": "324 SMITHY DR",
"City": "NOWHERE",
"State": "XA",
"Zip": "05252"
}

I know this example isn’t a large document, but it was the only dataset that I could find with millions of records.

Managing the Data

I took the 2.6 million (2,600,000) records, and split the original single file into three files (using the split command). This produced two files containing one million records, and the last file contained 600,000 records. I discarded the last file.

I used one of the files containing one million records, and split it into ten files containing 100,000 records. I then took one of the files with 100,000 records, and split it into ten files with 10,000 records each. I did the same for the 10,000 record file, splitting it into ten 1,000 record files.
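Here is a rough sketch of the splitting, assuming the downloaded export (called wa_businesses.json here – the file name is hypothetical) contains one JSON document per line:

# two 1,000,000-record files plus a 600,000-record remainder
split -l 1000000 wa_businesses.json records-1000k-
# ten 100,000-record files from one of the 1,000,000-record files
split -l 100000 records-1000k-aa records-100k-
# ten 10,000-record files, then ten 1,000-record files
split -l 10000 records-100k-aa records-10k-
split -l 1000 records-10k-aa records-1k-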

I used the same JSON files for both MongoDB and MySQL.

Note: I am not going to share the data I used. Even though the data is public data, it does contain personal information, and as an employee of Oracle, it might violate their privacy policy.

The Databases

I created the same databases on both servers as well. The naming convention was as follows:

  • db_json_test10k = 10 x 1k records imported – 10k records total
  • db_json_test100k = 10 x 10k records imported – 100k records total
  • db_json_test1000k = 10 x 100k records imported – 1,000k records total
  • db_json_test10000k = 10 x 1000k records imported – 10,000k records total

When importing JSON documents, the databases are automatically created in MongoDB. With MySQL, you have to create them manually.

create database db_json_test10k;
create database db_json_test100k;
create database db_json_test1000k;
create database db_json_test10000k;

The collections inside the databases are created automatically by both MongoDB and MySQL when the documents are imported.

The Tests

I did not add any configuration variables for either database – except for what I put in the MongoDB repo file – so the default variables were used for each. For MySQL, the binary log is enabled by default, so I ran the tests with the binary log turned on and turned off. For MySQL, the binary log contains all of the transactions which could change data. In other words, all insert, update and delete transactions are written to a (binary) log on disk. Obviously, running the tests without the binary log enabled was faster in MySQL.
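For the runs labeled MySQL-noBL, binary logging was turned off. One way to do that – a minimal sketch, assuming the stock /etc/my.cnf from the RPM install, where [mysqld] is the last section – is to add disable-log-bin to the option file and restart the server:

# append disable-log-bin to the [mysqld] section of the option file
sudo sh -c 'echo "disable-log-bin" >> /etc/my.cnf'
sudo service mysqld restart

# verify - log_bin should now report OFF
mysql -uroot -p -e "SHOW VARIABLES LIKE 'log_bin';"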

Each test consisted of the following:

  • Drop all databases (if required)
  • Create the MySQL databases
  • Import 1k, 10k, 100k and 1,000k records (10 times each with unique data in each import)
  • Create indexes – executed once per round per database
  • Perform 100 search queries x 10 different times
  • Perform 100 update queries x 10 different times
  • Perform 100 delete queries x 10 different times
  • Repeat four times, for a total of five tests

I recorded the time it took for each part of the test to be executed, and inserted the test results into a spreadsheet. I used the same data files and queries for both MongoDB and MySQL.

The graphs will show you the time for each action and the results are labeled Mongo, MySQL and MySQL-noBL (with the binary log disabled). For the search, update and delete tests, I performed the same 100 unique queries, 10 times against each database. Each series of 10 tests was executed five times, and the time needed to complete each of the individual 10 tests for each database size (10k, 100k, 1000k and 10000k records) was then averaged across all five tests to create the graphs. (Note: for this test, “k” equals 1,000 and not 1,024)

While I was running the tests, the only applications I had open were a text editor, three terminal windows and a spreadsheet program. However, there were some small anomalies in the test results where a test would take a bit longer than the normal run time. I did not correct the data when an anomaly occurred.

Also, not all of the queries returned or modified the same number of rows. For example, one update query might change 10 rows while another changed 50, so the times for different queries are not directly comparable to each other. But each query was executed in the same order, so the time it takes to run query #1 should be comparable across all five executions of the tests.

Importing Data

For each of the four databases, I imported the same set of records. This involved importing 1k, 10k, 100k and 1,000k records ten times each into their respective databases.

For MySQL, I used the MySQL Shell, which is a command-line client. Here is an example of the import command:

mysqlsh root:password@localhost/db_json_test10k --import /Volumes/HD1/0mongo_v_mysql/1000/1000-0 json_test10k

For MongoDB, I used the mongoimport utility. Here is an example of the import command:

mongoimport --jsonArray --db db_json_test10k --collection json_test10k --file /Volumes/HD1/0mongo_v_mysql/1000/1000-0m

Note: With MySQL, you have to be careful when you import JSON documents – if the syntax is incorrect, the file can import but it will not create a collection (it will create a table instead). The syntax on the import command is important.
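Since each database is loaded from ten separate files, the imports lend themselves to a simple timed loop. Here is a hypothetical sketch that reuses the commands above, assuming the 1k files are named 1000-0 through 1000-9 (with an "m" suffix on the MongoDB copies):

# MySQL: import each of the ten 1k-record files into db_json_test10k, timing each run
for f in /Volumes/HD1/0mongo_v_mysql/1000/1000-[0-9]; do
  time mysqlsh root:password@localhost/db_json_test10k --import "$f" json_test10k
done

# MongoDB: the same files (with the "m" suffix), imported into the matching collection
for f in /Volumes/HD1/0mongo_v_mysql/1000/1000-[0-9]m; do
  time mongoimport --jsonArray --db db_json_test10k --collection json_test10k --file "$f"
done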

Here are the results of the imports. The Y-axis (vertical line) on the graph represents the time in seconds to perform the action.

MongoDB was much faster than MySQL in importing data, even with the MySQL binary log turned off.

Creating Indexes

Whenever you import data into a collection for either MongoDB or MySQL, both instances will automatically create some indexes. After the initial import, I took a look at the indexes:

For MongoDB:

> use db_json_test10k
switched to db db_json_test10k
> db.json_test10k.getIndexes()
[
	{
		"v" : 2,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_",
		"ns" : "db_json_test10k.json_test10k"
	}
]

For MySQL:

mysql> use db_json_test10k;
Database changed
mysql> SHOW INDEXES FROM json_test10k\G
*************************** 1. row ***************************
        Table: json_test10k
   Non_unique: 0
     Key_name: PRIMARY
 Seq_in_index: 1
  Column_name: _id
    Collation: A
  Cardinality: 98627
     Sub_part: NULL
       Packed: NULL
         Null: 
   Index_type: BTREE
      Comment: 
Index_comment: 
      Visible: YES
   Expression: NULL
1 rows in set (0.00 sec)

Both MongoDB and MySQL automatically create an index on the column _id (which is a primary key as well).

I needed to create an index on the column UBI, as I was going to use this column for my searches, updates and deletes.

I would only need to create the indexes once for every series of tests, but I still tracked how long it took for both MongoDB and MySQL to create the indexes. Here are the commands to create an index for each one:

MySQL

use db_json_test10k
ALTER TABLE json_test10k ADD COLUMN UBI_2 INT GENERATED ALWAYS AS (doc->"$.Ubi");
ALTER TABLE json_test10k ADD INDEX (UBI_2);

MongoDB

use db_json_test10k
db.json_test10k.createIndex( { Ubi: 1 })

Note: MySQL indexes JSON documents via virtual columns. See this blog post for a detailed explanation of virtual columns.

For the tests, I would be creating an index on all four databases for each instance. Here are the results of the index creation for each database. The Y-axis (vertical line) on the graph represents the time in seconds needed to perform the action.

MySQL was a bit faster than MongoDB in index creation.

Searches

The search test consisted of executing ten scripts, where each script performed 100 search queries. The test was executed against all four databases, each with its own set of search queries (there was a total of 40 separate scripts, running 100 separate queries each). The test was repeated five times and the average times are shown in the graphs.

Here is an example of the search syntax: (Only the first three queries are shown)

MySQL

use db_json_test10k
select * from json_test10k where UBI_2 = '603013962'; 
select * from json_test10k where UBI_2 = '603598341'; 
select * from json_test10k where UBI_2 = '601574968';
...

MongoDB

use db_json_test10k
DBQuery.shellBatchSize = 1000
db.json_test10k.find({"Ubi" : "603013962"})
db.json_test10k.find({"Ubi" : "603598341"})
db.json_test10k.find({"Ubi" : "601574968"})
...

The search queries were placed in files – SQL files for MySQL, and JavaScript files for MongoDB – and each file was executed as follows: (Only the first three executions are shown)

MySQL

time mysql -uroot -p < /Volumes/HD1/0mongo_v_mysql/3_search/mysql_search_sql_10k/search_mysql_100_10k_0.sql
time mysql -uroot -p < /Volumes/HD1/0mongo_v_mysql/3_search/mysql_search_sql_10k/search_mysql_100_10k_1.sql
time mysql -uroot -p < /Volumes/HD1/0mongo_v_mysql/3_search/mysql_search_sql_10k/search_mysql_100_10k_2.sql
...

MongoDB

time mongo < /Volumes/HD1/0mongo_v_mysql/3_search/mongo_search_10k/search_mongo_100_10k_00.js
time mongo < /Volumes/HD1/0mongo_v_mysql/3_search/mongo_search_10k/search_mongo_100_10k_02.js
time mongo < /Volumes/HD1/0mongo_v_mysql/3_search/mongo_search_10k/search_mongo_100_10k_04.js
...

Here are the results of the searches for each database. The Y-axis (vertical line) on the graph represents the time in seconds needed to perform the action.

The results were much different than in the original tests. MySQL was much faster than MongoDB in searches.
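One way to sanity-check these numbers is to confirm that both servers are actually using the Ubi index for the searches. Here is a hypothetical check of the query plans, using one of the queries above:

# MySQL: the plan should show the index on the UBI_2 generated column being used
mysql -uroot -p db_json_test10k -e "EXPLAIN SELECT * FROM json_test10k WHERE UBI_2 = '603013962'\G"

# MongoDB: the winning plan should show an IXSCAN on the Ubi index
mongo db_json_test10k --eval 'printjson(db.json_test10k.find({"Ubi" : "603013962"}).explain())'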

Update

The update test consisted of executing ten scripts, where each script contained 100 unique update queries. The test was executed against all four databases, each containing their own update queries (there was a total of 40 separate scripts). The test was repeated five times and the average times are shown in the graphs.

The update syntax was as follows:

MySQL

use db_json_test10k;
SET max_join_size=11000000;
SET SQL_SAFE_UPDATES=0;
UPDATE json_test1k SET doc = JSON_SET(doc, '$.State', 'KS') 
WHERE UBI_2 = '604052443';

MongoDB

use db_json_test10k
DBQuery.shellBatchSize = 1000
db.json_test1k.updateMany({"Ubi" : "604052443"}, {$set: { "State" : "KS"}});

Here are the results of the updates for each database. The Y-axis (vertical line) on the graph represents the time in seconds needed to perform the action. Note that not all of the queries updated the same number of rows, so the results aren’t going to be the same for each group of queries.

Delete

The delete test consisted of executing ten scripts, where each script contained 100 unique delete queries. The test was executed against all four databases, each containing their own delete queries (there was a total of 40 separate scripts). The test was repeated five times and the average times are shown in the graphs. (I did have to add the changes to max_join_size and DBQuery.shellBatchSize as I explained earlier.)

The delete syntax was as follows: (Only the first three queries are shown)

MySQL

use db_json_test10k;
SET max_join_size=11000000;
SET SQL_SAFE_UPDATES=0;
DELETE FROM json_test1k WHERE UBI_2 = '603013962';
DELETE FROM json_test1k WHERE UBI_2 = '603598341';
DELETE FROM json_test1k WHERE UBI_2 = '601574968';
...

MongoDB

use db_json_test10k
DBQuery.shellBatchSize = 1000
db.json_test1k.deleteMany({"Ubi" : "603013962"});
db.json_test1k.deleteMany({"Ubi" : "603598341"});
db.json_test1k.deleteMany({"Ubi" : "601574968"});
... 

Again, the results were much different than in the original tests. MySQL was much faster than MongoDB in deletions.


Oracle Cloud Infrastructure Test Results

Here are the test results for running the same tests on Oracle Cloud. MongoDB’s imports were still much faster. On the queries, MongoDB performed better, and in some tests, MySQL was competitive only if you had the binary log disabled.

Import

Create Index

Searches

Updates

Deletes


The results from running with only one CPU via OCI are surprising.

Overall, I think the latest version of MySQL is certainly a great alternative to MongoDB for storing JSON documents. If you are using MySQL version 5.7 for storing your JSON documents, I would recommend upgrading to the latest 8.0 version.

Finally – remember, I am not a MongoDB expert. If you have any suggestions on how to tune MongoDB to run faster, please leave a comment below.

 


Tony Darnell is a Principal Sales Consultant for MySQL, a division of Oracle, Inc. MySQL is the world’s most popular open-source database program. Tony may be reached at info [at] ScriptingMySQL.com and on LinkedIn.
Tony is the author of Twenty Forty-Four: The League of Patriots 
Visit http://2044thebook.com for more information.
Tony is the editor/illustrator for NASA Graphics Standards Manual Remastered Edition 
Visit https://amzn.to/2oPFLI0 for more information.

MySQL Server Deployment with Docker – Basic Installation Instructions for Both the Community and Enterprise Versions

An easy way to set up one or multiple MySQL server deployments on a single server is to use Docker – a computer program that performs operating-system-level virtualization. Docker is simple to use and allows you to run multiple containers at once.

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings. (Source: https://www.docker.com/resources/what-container)

In other words, think of a container as a virtual machine without the graphical user interface (GUI). There are third-party GUIs available, but for this post, I am going to use a terminal window.

I am not a Docker expert, so I did have to spend some time figuring out the basics. But with this tutorial, you should be able to install the Docker software and a MySQL server in less than fifteen minutes.


Let’s get started

First, you will need to download and install Docker. I am not going to cover this part, but installation is fairly straightforward. I downloaded and installed the Docker Desktop for my Mac.

I already have a MySQL instance installed on my server, so I will install this new instance using a different port number than the default port of 3306. On the server side, I will use port 3307 to connect to the default MySQL port of 3306 inside the Docker container. By using a different external port number, I can install multiple MySQL instances on one server, but still use the default port for the MySQL instance inside the container. MySQL maintains its own set of Docker container images on Docker Hub, and I can install MySQL directly from the command line – I don’t have to download anything separately. You can create your own local repository, but for this example, Docker will pull the latest version from MySQL’s Docker Hub repository.

To install MySQL, I opened a terminal window and ran the following command – changing the first port number (the external port) to 3307. The second port number is the port for the MySQL instance inside the container. You will notice that Docker first checks the local repository, and once it can’t locate the image there, it pulls it from Docker Hub. Installation is done via the Docker run command.

$ docker run -p 3307:3306 -d --name mysql -e MYSQL_ROOT_PASSWORD=password mysql/mysql-server
Unable to find image 'mysql/mysql-server:latest' locally
latest: Pulling from mysql/mysql-server
35defbf6c365: Pull complete 
e13cf68584a3: Pull complete 
259d03b6a792: Pull complete 
892ac46af8c0: Pull complete 
Digest: sha256:8dd16a45d0e3e789f2006b608abb1bb69f1a8632a338eef89aec8d6fccda7793
Status: Downloaded newer image for mysql/mysql-server:latest
d8695b074a014f31c65112fb00ec1e5ad79d4c5ba94eb3be1d0fa424f14f414c

I can then verify to see if the MySQL container is up and running via the Docker container command:

$ docker container ls
CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS                            PORTS                               NAMES
0b55334fedcb        mysql/mysql-server   "/entrypoint.sh mysq…"   2 minutes ago      Up 3 seconds (health: starting)   33060/tcp, 0.0.0.0:3307->3306/tcp   mysql

Note: To start or stop the container, simply use the Docker start/stop command, where mysql is the name of the container – and not the application being run inside the container:

$ docker stop mysql
mysql
$ docker start mysql
mysql

Note: If the container isn’t running and you need to start it, you will see an error like this when you try and connect to the container:

$ docker exec -it mysql bash
Error response from daemon: Container d8695b074a014f31c65112fb00ec1e5ad79d4c5ba94eb3be1d0fa424f14f414c is not running

I now have a copy of the MySQL container image stored locally on my server. I can look at all of the Docker images installed so far with the Docker images command:

$ docker images -a
REPOSITORY           TAG                 IMAGE ID            CREATED             SIZE
mysql/mysql-server   latest              39649194a7e7        2 weeks ago         289MB

I can check the status of the MySQL container by using the Docker container command:

$ docker container ls -a
CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS                     PORTS               NAMES
0b55334fedcb        mysql/mysql-server   "/entrypoint.sh --ip…"   23 seconds ago      Exited (1) 22 seconds ago                       mysql

Now that I have MySQL installed and I have verified that the container is running, I can connect to the container using the Docker exec command: (The word mysql is the container name, and not the mysql database instance)

$ docker exec -it mysql bash
bash-4.2# 

After connecting, I am now at a regular Linux prompt. The MySQL data directory is stored in /var/lib/mysql, and the configuration file is in /etc/my.cnf.

bash-4.2# cd /var/lib/mysql
bash-4.2# ls -l
total 174160
drwxr-x--- 2 mysql mysql     4096 May  9 17:10 #innodb_temp
-rw-r----- 1 mysql mysql       56 May  9 17:10 auto.cnf
-rw-r----- 1 mysql mysql      178 May  9 17:10 binlog.000001
-rw-r----- 1 mysql mysql      155 May  9 17:10 binlog.000002
-rw-r----- 1 mysql mysql       32 May  9 17:10 binlog.index
-rw------- 1 mysql mysql     1676 May  9 17:10 ca-key.pem
-rw-r--r-- 1 mysql mysql     1112 May  9 17:10 ca.pem
-rw-r--r-- 1 mysql mysql     1112 May  9 17:10 client-cert.pem
-rw------- 1 mysql mysql     1676 May  9 17:10 client-key.pem
-rw-r----- 1 mysql mysql     5456 May  9 17:10 ib_buffer_pool
-rw-r----- 1 mysql mysql 50331648 May  9 17:10 ib_logfile0
-rw-r----- 1 mysql mysql 50331648 May  9 17:10 ib_logfile1
-rw-r----- 1 mysql mysql 12582912 May  9 17:10 ibdata1
-rw-r----- 1 mysql mysql 12582912 May  9 17:10 ibtmp1
drwxr-x--- 2 mysql mysql     4096 May  9 17:10 mysql
-rw-r----- 1 mysql mysql 29360128 May  9 17:10 mysql.ibd
srwxrwxrwx 1 mysql mysql        0 May  9 17:10 mysql.sock
-rw------- 1 mysql mysql        2 May  9 17:10 mysql.sock.lock
drwxr-x--- 2 mysql mysql     4096 May  9 17:10 performance_schema
-rw------- 1 mysql mysql     1676 May  9 17:10 private_key.pem
-rw-r--r-- 1 mysql mysql      452 May  9 17:10 public_key.pem
-rw-r--r-- 1 mysql mysql     1112 May  9 17:10 server-cert.pem
-rw------- 1 mysql mysql     1676 May  9 17:10 server-key.pem
drwxr-x--- 2 mysql mysql     4096 May  9 17:10 sys
-rw-r----- 1 mysql mysql 12582912 May  9 17:10 undo_001
-rw-r----- 1 mysql mysql 10485760 May  9 17:10 undo_002
bash-4.2# ls -l /etc/my.cnf
-rw-r--r-- 1 root root 1239 May  9 17:10 /etc/my.cnf

I can log into MySQL the same way as if it was a regular MySQL instance. (When I created the container, I used “password” as the password, but you will want a more secure password)

bash-4.2# mysql -uroot -ppassword
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 18
Server version: 8.0.16 MySQL Community Server - GPL

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> 

Remember – since I am not connecting to the instance from outside of Docker, I don’t have to use port 3307. But, I will have to do that if I want to connect via MySQL Workbench.

Before I connect via MySQL Workbench, I will want to create a different user for this connection, and use this user for my Workbench connection:

mysql> CREATE USER 'docker'@'%' IDENTIFIED BY 'password';
Query OK, 0 rows affected (0.02 sec)

mysql> GRANT ALL PRIVILEGES ON * . * TO 'docker'@'%';
Query OK, 0 rows affected (0.01 sec)

mysql> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.01 sec)
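Before setting up Workbench, I can verify the new user and the 3307 port mapping from a terminal on the host (this assumes a mysql client is installed on the host itself):

# connect from the host through the mapped port; inside the container MySQL still listens on 3306
mysql -h 127.0.0.1 -P 3307 -u docker -ppassword -e "SELECT @@hostname, @@version;"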

I can now create a MySQL Workbench connection. I will open Workbench, and click on the plus symbol to create a new connection.

I need to provide a connection name (Docker Container 3307), the hostname (127.0.0.1) and I need to specify port 3307. If you don’t have another installation of MySQL on your server, you can use the default port of 3306. I will store the password in my keychain by clicking on “Store in Keychain”.

To test and see if you have the correct information, click the “Test Connection” button.

I can now use MySQL Workbench to connect to the MySQL Docker container:

That’s it. I now have MySQL installed as a Docker container, and I can access it via a terminal window or via Workbench.


MySQL Enterprise Version

The MySQL Community Edition is a great database server, but if you are going to run a database in a production environment, I would recommend purchasing a MySQL Enterprise Edition license.

The MySQL Enterprise Edition includes the most comprehensive set of advanced features, management tools and technical support to achieve the highest levels of MySQL scalability, security, reliability, and uptime. It reduces the risk, cost, and complexity in developing, deploying, and managing business-critical MySQL applications. (Source)

The steps for installing the Enterprise Edition are almost the same as for the Community Edition, but you have to download the Docker image from the Oracle Support portal (My Oracle Support), and you will need a license to access support. If you are already a customer, log in to the support web site, go to the “Patches and Updates” tab, click on “Patch Search” and then, under “Product or Family (Advanced)”, search for the MySQL version you want, enter a description of “Docker”, and click the “Search” button.

MySQL only has the Linux version of the Enterprise Edition, but I can still install and run it on my Mac. The download file contains a tar file and a README file. For this example, the tar file is named mysql-enterprise-server-8.0.16.tar. I placed this file in my home directory, and from a terminal window, I will need to load the file into the repository using the Docker load command:

$ docker load -i mysql-enterprise-server-8.0.16.tar
d6b2dcf96e3d: Loading layer [==================================================>]  220.6MB/220.6MB
b84b6c2a237e: Loading layer [==================================================>]  8.704kB/8.704kB
141e4cf4cec5: Loading layer [==================================================>]  2.048kB/2.048kB
Loaded image: mysql/enterprise-server:8.0

I can now see the Enterprise Edition image along with the Community Edition image:

$ docker images -a
REPOSITORY                TAG                 IMAGE ID            CREATED             SIZE
mysql/mysql-server        latest              39649194a7e7        2 weeks ago         289MB
mysql/enterprise-server   8.0                 d4410562024a        2 weeks ago         337MB

To install the Enterprise Edition, I only need to change a few of the variables from before. I will also want to use port 3308, since 3306 and 3307 are in use. (Remember – the first port number is the “external” server port number and the second is the port number inside the container)

$ docker run -p 3308:3306 -d --name mysqlEE -e MYSQL_ROOT_PASSWORD=password mysql/enterprise-server:8.0

I can now see the Enterprise Edition container:

$ docker container ls -a
CONTAINER ID        IMAGE                         COMMAND                  CREATED             STATUS                            PORTS                               NAMES
5b4df641d044        mysql/enterprise-server:8.0   "/entrypoint.sh mysq…"   6 seconds ago       Up 4 seconds (health: starting)   33060/tcp, 0.0.0.0:3308->3306/tcp   mysqlEE
0b55334fedcb        mysql/mysql-server            "/entrypoint.sh mysq…"   2 hours ago         Up 2 hours (healthy)              33060/tcp, 0.0.0.0:3307->3306/tcp   mysql

I can connect to docker, and open MySQL – using the container name of mysqlEE:

$ docker exec -it mysqlEE bash

bash-4.2# mysql -uroot -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 8.0.16-commercial MySQL Enterprise Server - Commercial

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

I will want to create a new user like before, and then I can create a Workbench connection as well, using port 3308.

That’s it. I now have two instances of MySQL running in two separate containers on my server.


Deleting images and containers

Here are the commands to delete any images or containers. To remove one or more specific images, use the Docker images command to see what images are available:

$ docker images -a
REPOSITORY                TAG                 IMAGE ID            CREATED             SIZE
mysql/mysql-server        latest              39649194a7e7        2 weeks ago         289MB
mysql/enterprise-server   8.0                 d4410562024a        2 weeks ago         337MB

And you can delete an image by passing its IMAGE ID to the Docker rmi command:

$ docker rmi 39649194a7e7
Untagged: mysql/mysql-server:latest
Untagged: mysql/mysql-server@sha256:8dd16a45d0e3e789f2006b608abb1bb69f1a8632a338eef89aec8d6fccda7793
Deleted: sha256:39649194a7e780713ee5681d3bc5ff9e1fddaca744113d4a64ed61f67b7de601
Deleted: sha256:46837581982573a52d3af65de8ac243749c3f8bdf16043541e1a3cfcac721f6b
Deleted: sha256:e311a637abb5186c3bafe967fbb4d10c16258b4b878258ed0ceaff9a07969930
Deleted: sha256:348e9a791d8deb3d6f7ea979c768db0086dbd5172fdbe065649aebfebe509c46
Deleted: sha256:c4a7cf6a6169fb6af5316b4917b6f3417d419b5b5c1e5befd74746996088fc57

To remove a container, use the Docker container command to get a list of containers:

$ docker container ls -a
CONTAINER ID        IMAGE                         COMMAND                  CREATED             STATUS                            PORTS                               NAMES
5b4df641d044        mysql/enterprise-server:8.0   "/entrypoint.sh mysq…"   6 seconds ago       Up 4 seconds (health: starting)   33060/tcp, 0.0.0.0:3308->3306/tcp   mysqlEE
0b55334fedcb        mysql/mysql-server            "/entrypoint.sh mysq…"   2 hours ago         Up 2 hours (healthy)              33060/tcp, 0.0.0.0:3307->3306/tcp   mysql

And you can delete a container by passing its CONTAINER ID to the Docker container rm command:

$ docker container rm 5b4df641d044
5b4df641d044

For more information on installing MySQL with Docker, see Deploying MySQL on Linux with Docker.

 


 


Using a trigger to parse non-conforming data in MySQL

In another post, I explained how to use the MySQL Query Re-write Plugin to manipulate data that didn’t exactly match the SQL standards of MySQL. In this post, I am going to give you another example of how to use a trigger to parse non-conforming data so the data can be written to a MySQL database.

A customer came to me with a problem. They were using third-party software which produced comma-separated data (like a .csv file), but each record wasn’t on a single row. Instead, each record consisted of multiple rows of data, separated by commas and with a line feed after each row. Here is a shortened example of the data (the original data was 32 lines long):

Header information which may, or may not, contain one or more commas
Item Name,Type,Description,Value
Potato Chips - Sealed Bag,,,
Manifest Number,Number,MANIFEST_NUMBER,8480014
Manufacturer Code,Number,BRAND_CODE,80375993
Information 1,Number,INFO_ONE,15869563
Information 2,Number,INFO_TWO,6569569665
Information 3,Number,INFO_THREE,562
Planned Start Date,String,PROD_START_DATE,9/13/2018
Planned Start Time,String,PROD_START_TIME,081234

For some reason, the output couldn’t be separated, but the software could use an ODBC connection. They needed to be able to insert this data into a MySQL database, but they only needed certain values – they didn’t need a lot of the descriptive information. From the example above, they only wanted to insert the item name (“Potato Chips – Sealed Bag”) and the value at the end of each of the remaining rows:

“Header information which may, or may not, contain one or more commas.”
Item Name | Type | Description | Value
Potato Chips – Sealed Bag | | |
Manifest Number | Number | MANIFEST_NUMBER | 8480014
Manufacturer Code | Number | BRAND_CODE | 80375993
Information 1 | Number | INFO_ONE | 15869563
Information 2 | Number | INFO_TWO | 6569569665
Information 3 | Number | INFO_THREE | 562
Planned Start Date | String | PROD_START_DATE | 9/13/2018
Planned Start Time | String | PROD_START_TIME | 081234

At first, I suggested using a script (Python, Perl, PHP, etc.) to parse the data before inserting it into the database. But the software which produced the data could insert the data directly into a MySQL database – although all of the data could only be inserted into one field. They didn’t have an option to add an extra step outside of the database connection – and they were new to MySQL, so they needed some help.

I decided to write a trigger for the table, in order to parse the data as it came into the database, but before it was inserted into a row. I also wanted to keep the original data in a column.

The first step was to create a database and a table. I mapped the column names to the first value of each row, skipping the header information which wasn’t needed. The first value on each row provides the column name, and the value to be stored comes from the end of the row:

“Header information which may, or may not, contain one or more commas.”
Item Name | Type | Description | Value
Potato Chips – Sealed Bag | | |
Manifest Number | Number | MANIFEST_NUMBER | 8480014
Manufacturer Code | Number | BRAND_CODE | 80375993
Information 1 | Number | INFO_ONE | 15869563
Information 2 | Number | INFO_TWO | 6569569665
Information 3 | Number | INFO_THREE | 562
Planned Start Date | String | PROD_START_DATE | 9/13/2018
Planned Start Time | String | PROD_START_TIME | 081234

I can then match the columns I want to the data values:

“Header information which may, or may not, contain one or more commas.”
Item Name | Potato Chips – Sealed Bag
Manifest Number | 8480014
Manufacturer Code | 80375993
Information 1 | 15869563
Information 2 | 6569569665
Information 3 | 562
Planned Start Date | 9/13/2018
Planned Start Time | 081234

Now, I can create the database and the table. I substituted the spaces in the field names with underscores (_). I also added a primary key column (id_MANIFEST_DATA) and a proper DATETIME column – as I want to combine the two columns PROD_START_DATE and PROD_START_TIME into one column to match MySQL’s DATETIME format. This way, they can perform a search on this column later (if necessary).

mysql> create database MANIFEST_DATA_DB;
Query OK, 1 row affected (0.00 sec)

mysql> use MANIFEST_DATA_DB;
Database changed

mysql> CREATE TABLE `MANIFEST_DATA_DB`.`MANIFEST_DATA_TABLE` (
    ->   `id_MANIFEST_DATA` INT NOT NULL AUTO_INCREMENT,
    ->   `MANIFEST_DATA` VARCHAR(4096) NULL,
    ->   `ITEM_NAME` VARCHAR(1024) NULL,
    ->   `MANIFEST_NUMBER` INT NULL,
    ->   `MANUFACTURER_CODE` VARCHAR(1024) NULL,
    ->   `INFO_ONE` CHAR(32) NULL,
    ->   `INFO_TWO` CHAR(32) NULL,
    ->   `INFO_THREE` CHAR(32) NULL,
    ->   `PROD_START_DATE` CHAR(10) NULL,
    ->   `PROD_START_TIME` CHAR(6) NULL,
    ->   `PROD_TIMESTAMP` DATETIME NULL,
    ->   PRIMARY KEY (`id_MANIFEST_DATA`))
    -> AUTO_INCREMENT = 1000000;
Query OK, 0 rows affected (0.00 sec)

The initial “non-conforming data” will be inserted into the MANIFEST_DATA field, so all I have to do is to create a trigger to parse this field before the data is inserted.

Even though the data is on separate lines, parsing this data will be relatively easy, since the data is comma-delimited. I can use the SUBSTRING_INDEX function to pull individual values out of the data, treating the comma-delimited values like an array. But, since the first line of the data may or may not contain a comma, instead of counting commas from the beginning of the data, I will start at the end. Also, in this example, they don’t need the first line of data, as it is header information.

Let’s take a look at why I want to count backwards. Here are three rows of data – where the first column may or may not contain a comma – or it might contain two or more commas. I really only want to capture the last two columns of data.

"Hello, it's me",12345,ABCDE
"Hello it's me",67890,FGHIJ
"Hello, yes, it's me",112233,A1B2C3

If I parse the data based upon commas and start at the beginning, I will get different results whenever the first column of data contains one or more commas. And I only want the last two values from each row:

1 | 2 | 3 | 4 | 5
“Hello it’s me” | 67890 | FGHIJ | |
“Hello | it’s me” | 12345 | ABCDE |
“Hello | yes | it’s me” | 112233 | A1B2C3

But if I count backwards, I will get the same result set regardless of how many commas are in the first line of data:

-5 | -4 | -3 | -2 | -1
 | | “Hello it’s me” | 67890 | FGHIJ
 | “Hello | it’s me” | 12345 | ABCDE
“Hello | yes | it’s me” | 112233 | A1B2C3

In the actual data I want to use, I don’t want to store the first row of data anyway – so it can be ignored. If I did need the first line, I would have to search for the quotes and parse that column separately.

Since the initial data will contain a line feed, I will want to replace the line feeds with a comma, so I can have all of the data on a single line and be able to use the SUBSTRING_INDEX function. Here is the original data again, but this time, I have added a column for the line feeds. Also, I am going to count backwards from the last data value. The numbers are in (bold):


(-37)“Header information which may, or may not, contain one or more commas.”
(-36) Item Name (-35) Type (-34) Description (-33) Value -line feed-
(-32) Potato Chips – Sealed Bag (-31) (-30) (-29) -line feed-
(-28) Manifest Number (-27) Number (-26) MANIFEST_NUMBER (-25) 8480014 -line feed-
(-24) Manufacturer Code (-23) Number (-22) BRAND_CODE (-21) 80375993 -line feed-
(-20) Information 1 (-19) Number (-18) INFO_ONE (-17) 15869563 -line feed-
(-16) Information 2 (-15) Number (-14) INFO_TWO (-13) 6569569665 -line feed-
(-12) Information 3 (-11) Number (-10) INFO_THREE (-9) 562 -line feed-
(-8) Planned Start Date (-7) String (-6) PROD_START_DATE (-5) 9/13/2018 -line feed-
(-4) Planned Start Time (-3) String (-2) PROD_START_TIME (-1) 081234

Of course, you don’t have to map out the fields like I did – you can just count backwards.

The SUBSTRING_INDEX function works similarly to grabbing a value out of an array – except the count value is where you want to stop grabbing data. You specify the string you want to parse, the delimiter you want to use, and the count of the value from the string where you want to stop grabbing data.

The format is:

SUBSTRING_INDEX(string,delimiter,count) 

Note: When the count value is a positive number, the count value is the ending position within the string. It doesn’t select the element itself.

Here is an example using the phrase “Hello, I must be going, see you later, bye.” which contains four values separated by commas. I want to capture the third value of “see you later”.

1 | 2 | 3 | 4
Hello | I must be going | see you later | bye

To select “see you later”, I choose the third column as the place where I will stop grabbing data – but, since this is the stopping point, I will get everything from the beginning of the phrase through the third column.

mysql> select SUBSTRING_INDEX("Hello, I must be going, see you later, bye.",",",3) as PARSE;
+---------------------------------------+
| PARSE                                 |
+---------------------------------------+
| Hello, I must be going, see you later |
+---------------------------------------+
1 row in set (0.00 sec)

I don’t want all of the phrase, I only want “see you later”. With a positive count value of three, I am getting the ending point – it stops at the third delimited value. If I use a negative count value, then I am getting the starting point of the string. If I count backwards, which would give me a negative count value (and I want to do this anyway because the first line of data in the main example may have multiple commas), I can use the value of negative two (-2) and get this:

mysql> select SUBSTRING_INDEX("Hello, I must be going, see you later, bye.",",",-2) as PARSE;
+----------------------+
| PARSE                |
+----------------------+
|  see you later, bye. |
+----------------------+
1 row in set (0.00 sec)

That gives me a starting point for grabbing “see you later”, but I don’t want any data (the single word “bye”) after it. So, I can wrap an additional SUBSTRING_INDEX around the first one, and then only choose the first delimited value from the result of the inner SUBSTRING_INDEX.

It looks like this:

The statement SELECT SUBSTRING_INDEX("Hello, I must be going, see you later, bye.", ",", -2) grabs the last two of these columns (3 and 4):

1 | 2 | 3 | 4
Hello | I must be going | see you later | bye

But then I can use the second SUBSTRING_INDEX command to only select the first column of the remaining two columns:

SUBSTRING_INDEX(SUBSTRING_INDEX("Hello, I must be going, see you later, bye.", ",", -2), ",", 1)

1 | 2
see you later | bye

I can test it:

mysql> select SUBSTRING_INDEX(SUBSTRING_INDEX("Hello, I must be going, see you later, bye.",",",-2), ",", 1) as PARSE;
+----------------+
| PARSE          |
+----------------+
|  see you later |
+----------------+
1 row in set (0.01 sec)

This is essentially how I am going to parse the long line of data (after I convert the line feeds to commas). I will go backwards to the starting point, and then only select the first column beginning at that starting point.
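To see the whole approach on a small, made-up string (the values here are hypothetical), the line feeds can be converted to commas with REPLACE, and then the nested SUBSTRING_INDEX calls pull out a single value counted from the end:

# MySQL interprets the \n inside the string literal as a line feed;
# REPLACE turns it into a comma, and the nested SUBSTRING_INDEX calls
# return the third value from the end ("d" in this example)
mysql -uroot -p -e "SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE('a,b\nc,d\ne,f', '\n', ','), ',', -3), ',', 1) AS PARSE;"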

I can then use the table I created to map out the data values I want to extract.

Column Name | Backwards Value
ITEM_NAME | -32
MANIFEST_NUMBER | -25
MANUFACTURER_CODE | -21
INFO_ONE | -17
INFO_TWO | -13
INFO_THREE | -9
PROD_START_DATE | -5
PROD_START_TIME | -1

Example: To extract the PROD_START_TIME, which is at position -1, I will use this in the trigger:

SET NEW.PROD_START_TIME = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -1), ",", 1);

I can now create my trigger to parse all of the data being inserted into the MANIFEST_DATA column. In the trigger, I will use some temporary variables to parse the MANIFEST_DATA, and I will also use the SUBSTRING function to parse the PROD_START_DATE and PROD_START_TIME fields and rearrange them into the correct MySQL DATETIME format. I will use the CONCAT function to combine them into a new DATETIME field. I have added some comments in the trigger to help you figure out what I am doing.

DELIMITER $$
  
CREATE TRIGGER _convert_MANIFEST_DATA
BEFORE INSERT ON MANIFEST_DATA_DB.MANIFEST_DATA_TABLE
FOR EACH ROW
BEGIN

# Declare temporary variables

# This is the temporary field which contains all of the data
DECLARE _parse_MANIFEST_DATA varchar(4096);

# The temporary date and time fields needed to create the DATETIME field named PROD_TIMESTAMP
DECLARE _parse_TIME_DATE_DATE char(10);
DECLARE _parse_TIME_DATE_TIME char(10);

DECLARE _parse_TIME_DATE_date_final char(10);

DECLARE _parse_TIME_DATE_hour char(2);
DECLARE _parse_TIME_DATE_minutes char(2);
DECLARE _parse_TIME_DATE_sec char(2);

# Take the incoming large dataset which has line feeds and
# Replace the line feeds with a comma
set _parse_MANIFEST_DATA = replace(NEW.MANIFEST_DATA, "\n", ",");

# Parse each new column from the temporary field
SET NEW.ITEM_NAME = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -35), ",", 1);
SET NEW.MANIFEST_NUMBER = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -25), ",", 1);
SET NEW.MANUFACTURER_CODE = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -21), ",", 1);
SET NEW.INFO_ONE = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -17), ",", 1);
SET NEW.INFO_TWO = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -13), ",", 1);
SET NEW.INFO_THREE = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -9), ",", 1);
SET NEW.PROD_START_DATE = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -5), ",", 1);
SET NEW.PROD_START_TIME = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -1), ",", 1);

# Get the values from these two fields in order to combine them into a DATETIME field
SET _parse_TIME_DATE_DATE = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -5), ",", 1);
SET _parse_TIME_DATE_TIME = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -1), ",", 1);

# Convert the date from MM/DD/YYYY to YYYY-MM-DD
SET _parse_TIME_DATE_date_final = STR_TO_DATE(_parse_TIME_DATE_DATE, '%c/%d/%Y');

# Parse the time so we can add colons between the hour, minutes and seconds
SET _parse_TIME_DATE_hour = SUBSTRING(_parse_TIME_DATE_TIME, 1, 2);
SET _parse_TIME_DATE_minutes = SUBSTRING(_parse_TIME_DATE_TIME, 3, 2);
SET _parse_TIME_DATE_sec = SUBSTRING(_parse_TIME_DATE_TIME, 5, 2);

# Combine the newly-formatted date with the newly-formatted time so it conforms to the DATETIME format
SET NEW.PROD_TIMESTAMP = CONCAT(_parse_TIME_DATE_date_final, " ", _parse_TIME_DATE_hour, ":", _parse_TIME_DATE_minutes, ":", _parse_TIME_DATE_sec) ;

END$$

DELIMITER ;

Now I can insert the sample data into the database – where all of the data will go into the MANIFEST_DATA column, and the trigger will populate the rest of the columns by parsing the data.

mysql> INSERT INTO MANIFEST_DATA_TABLE (MANIFEST_DATA) VALUES ('Header information which may, or may not, contain one or more commas
    '> Item Name,Type,Description,Value
    '> Potato Chips - Sealed Bag,,,
    '> Manifest Number,Number,MANIFEST_NUMBER,8480014
    '> Manufacturer Code,Number,BRAND_CODE,80375993
    '> Information 1,Number,INFO_ONE,15869563
    '> Information 2,Number,INFO_TWO,6569569665
    '> Information 3,Number,INFO_THREE,562
    '> Planned Start Date,String,PROD_START_DATE,9/13/2018
    '> Planned Start Time,String,PROD_START_TIME,081234');
Query OK, 1 row affected (0.07 sec)

When I look at the contents of the table, I can see the trigger executed successfully. All of the “non-conforming” data is stored in the MANIFEST_DATA field, but the other fields were populated:

mysql> select * from MANIFEST_DATA_TABLE\G
*************************** 1. row ***************************
 id_MANIFEST_DATA: 1000000
    MANIFEST_DATA: Header information which may, or may not, contain one or more commas
Item Name,Type,Description,Value
Potato Chips - Sealed Bag,,,
Manifest Number,Number,MANIFEST_NUMBER,8480014
Manufacturer Code,Number,BRAND_CODE,80375993
Information 1,Number,INFO_ONE,15869563
Information 2,Number,INFO_TWO,6569569665
Information 3,Number,INFO_THREE,562
Planned Start Date,String,PROD_START_DATE,9/13/2018
Planned Start Time,String,PROD_START_TIME,081234
        ITEM_NAME: Type
  MANIFEST_NUMBER: 8480014
MANUFACTURER_CODE: 80375993
         INFO_ONE: 15869563
         INFO_TWO: 6569569665
       INFO_THREE: 562
  PROD_START_DATE: 9/13/2018
  PROD_START_TIME: 081234
   PROD_TIMESTAMP: 2018-09-13 08:12:34
1 row in set (0.00 sec)

Of course, this is only a short example of all of the parsing you can do with triggers in MySQL. And, if you want to look at your trigger, you can query the INFORMATION_SCHEMA.TRIGGERS table:

mysql> SELECT * FROM INFORMATION_SCHEMA.TRIGGERS WHERE TRIGGER_NAME='_convert_MANIFEST_DATA'\G
*************************** 1. row ***************************
           TRIGGER_CATALOG: def
            TRIGGER_SCHEMA: MANIFEST_DATA_DB
              TRIGGER_NAME: _convert_MANIFEST_DATA
        EVENT_MANIPULATION: INSERT
      EVENT_OBJECT_CATALOG: def
       EVENT_OBJECT_SCHEMA: MANIFEST_DATA_DB
        EVENT_OBJECT_TABLE: MANIFEST_DATA_TABLE
              ACTION_ORDER: 1
          ACTION_CONDITION: NULL
          ACTION_STATEMENT: BEGIN

# Declare temporary variables

# This is the temporary field which contains all of the data
DECLARE _parse_MANIFEST_DATA varchar(4096);

# The temporary date and time fields needed to create the DATETIME field named PROD_TIMESTAMP
DECLARE _parse_TIME_DATE_DATE char(10);
DECLARE _parse_TIME_DATE_TIME char(10);

DECLARE _parse_TIME_DATE_date_final char(10);

DECLARE _parse_TIME_DATE_hour char(2);
DECLARE _parse_TIME_DATE_minutes char(2);
DECLARE _parse_TIME_DATE_sec char(2);

# Take the incoming large dataset which has line feeds and
# Replace the line feeds with a comma
set _parse_MANIFEST_DATA = replace(NEW.MANIFEST_DATA, "
", ",");

# Parse each new column from the temporary field
SET NEW.ITEM_NAME = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -35), ",", 1);
SET NEW.MANIFEST_NUMBER = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -25), ",", 1);
SET NEW.MANUFACTURER_CODE = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -21), ",", 1);
SET NEW.INFO_ONE = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -17), ",", 1);
SET NEW.INFO_TWO = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -13), ",", 1);
SET NEW.INFO_THREE = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -9), ",", 1);
SET NEW.PROD_START_DATE = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -5), ",", 1);
SET NEW.PROD_START_TIME = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -1), ",", 1);

# Get the values from these two fields in order to combine them into a DATETIME field
SET _parse_TIME_DATE_DATE = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -5), ",", 1);
SET _parse_TIME_DATE_TIME = SUBSTRING_INDEX(SUBSTRING_INDEX(_parse_MANIFEST_DATA, ',', -1), ",", 1);

# Convert the date from MM/DD/YYYY to YYYY-MM-DD
SET _parse_TIME_DATE_date_final = STR_TO_DATE(_parse_TIME_DATE_DATE, '%c/%d/%Y');

# Parse the time so we can add colons between the hour, minutes and seconds
SET _parse_TIME_DATE_hour = SUBSTRING(_parse_TIME_DATE_TIME, 1, 2);
SET _parse_TIME_DATE_minutes = SUBSTRING(_parse_TIME_DATE_TIME, 3, 2);
SET _parse_TIME_DATE_sec = SUBSTRING(_parse_TIME_DATE_TIME, 5, 2);

# Combine the newly-formatted date with the newly-formatted time so it conforms to the DATETIME format
SET NEW.PROD_TIMESTAMP = CONCAT(_parse_TIME_DATE_date_final, " ", _parse_TIME_DATE_hour, ":", _parse_TIME_DATE_minutes, ":", _parse_TIME_DATE_sec) ;

END
        ACTION_ORIENTATION: ROW
             ACTION_TIMING: BEFORE
ACTION_REFERENCE_OLD_TABLE: NULL
ACTION_REFERENCE_NEW_TABLE: NULL
  ACTION_REFERENCE_OLD_ROW: OLD
  ACTION_REFERENCE_NEW_ROW: NEW
                   CREATED: 2018-09-20 22:13:28.54
                  SQL_MODE: ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
                   DEFINER: root@localhost
      CHARACTER_SET_CLIENT: utf8mb4
      COLLATION_CONNECTION: utf8mb4_0900_ai_ci
        DATABASE_COLLATION: utf8mb4_0900_ai_ci
1 row in set (0.02 sec)

 



MySQL Enterprise Edition Database Firewall – Control and Monitor SQL Statement Executions

As of MySQL 5.6.24, MySQL Enterprise Edition includes MySQL Enterprise Firewall, an application-level firewall (it runs within the mysql database process) that enables database administrators to permit or deny SQL statement execution based on matching against whitelists of accepted statement patterns. This helps harden MySQL Server against attacks such as SQL injection or attempts to exploit applications by using them outside of their legitimate query workload characteristics.

Each MySQL account registered with the firewall has its own whitelist of statement patterns (a tokenized representation of a SQL statement), enabling protection to be tailored per account. For a given account, the firewall can operate in recording or protecting mode, for training in the accepted statement patterns or protection against unacceptable statements. The diagram illustrates how the firewall processes incoming statements in each mode.

MySQL Enterprise Firewall Operation

(from https://dev.mysql.com/doc/refman/5.6/en/firewall.html)

If you do not have a MySQL Enterprise Edition license, you may download a trial version of the software via Oracle eDelivery. The MySQL Firewall is included in the MySQL Product Pack, specifically for MySQL Database 5.6.24 or higher.

MySQL Enterprise Firewall has these components:

  • A server-side plugin named MYSQL_FIREWALL that examines SQL statements before they execute and, based on its in-memory cache, renders a decision whether to execute or reject each statement.
  • Server-side plugins named MYSQL_FIREWALL_USERS and MYSQL_FIREWALL_WHITELIST implement INFORMATION_SCHEMA tables that provide views into the firewall data cache.
  • System tables named firewall_users and firewall_whitelist in the mysql database provide persistent storage of firewall data.
  • A stored procedure named sp_set_firewall_mode() registers MySQL accounts with the firewall, establishes their operational mode, and manages transfer of firewall data between the cache and the underlying system tables.
  • A set of user-defined functions provides an SQL-level API for synchronizing the cache with the underlying system tables.
  • System variables enable firewall configuration and status variables provide runtime operational information.

(from https://dev.mysql.com/doc/refman/5.6/en/firewall-components.html)

Installing the Firewall

Installing the firewall is fairly easy. After you install MySQL version 5.6.24 or greater, you simply execute an SQL script that is located in the $MYSQL_HOME/share directory. There are two versions of the script, one for Linux and one for Windows (the firewall isn’t supported on the Mac yet).

The scripts are named win_install_firewall.sql for Windows and linux_install_firewall.sql for linux. You may execute this script from the command line or via MySQL Workbench. For the command line, be sure you are in the directory where the script is located.

shell> mysql -u root -p mysql < win_install_firewall.sql
Enter password: (enter root password here)

The script creates the firewall tables, functions and stored procedures, and installs the necessary plugins. The script contains the following:

# Copyright (c) 2015 Oracle and/or its affiliates. All rights reserved.
# Install firewall tables
USE mysql;
CREATE TABLE IF NOT EXISTS mysql.firewall_whitelist( USERHOST VARCHAR(80) NOT NULL, RULE text NOT NULL) engine= MyISAM;
CREATE TABLE IF NOT EXISTS mysql.firewall_users( USERHOST VARCHAR(80) PRIMARY KEY, MODE ENUM ('OFF', 'RECORDING', 'PROTECTING', 'RESET') DEFAULT 'OFF') engine= MyISAM;

INSTALL PLUGIN mysql_firewall SONAME 'firewall.dll';
INSTALL PLUGIN mysql_firewall_whitelist SONAME 'firewall.dll';
INSTALL PLUGIN mysql_firewall_users SONAME 'firewall.dll';

CREATE FUNCTION set_firewall_mode RETURNS STRING SONAME 'firewall.dll';
CREATE FUNCTION normalize_statement RETURNS STRING SONAME 'firewall.dll';
CREATE AGGREGATE FUNCTION read_firewall_whitelist RETURNS STRING SONAME 'firewall.dll';
CREATE AGGREGATE FUNCTION read_firewall_users RETURNS STRING SONAME 'firewall.dll';
delimiter //
CREATE PROCEDURE sp_set_firewall_mode (IN arg_userhost VARCHAR(80), IN arg_mode varchar(12))
BEGIN
IF arg_mode = "RECORDING" THEN
  SELECT read_firewall_whitelist(arg_userhost,FW.rule) FROM mysql.firewall_whitelist FW WHERE FW.userhost=arg_userhost;
END IF;
SELECT set_firewall_mode(arg_userhost, arg_mode);
if arg_mode = "RESET" THEN
  SET arg_mode = "OFF";
END IF;
INSERT IGNORE INTO mysql.firewall_users VALUES (arg_userhost, arg_mode);
UPDATE mysql.firewall_users SET mode=arg_mode WHERE userhost = arg_userhost;

IF arg_mode = "PROTECTING" OR arg_mode = "OFF" THEN
  DELETE FROM mysql.firewall_whitelist WHERE USERHOST = arg_userhost;
  INSERT INTO mysql.firewall_whitelist SELECT USERHOST,RULE FROM INFORMATION_SCHEMA.mysql_firewall_whitelist WHERE USERHOST=arg_userhost;
END IF;
END //
delimiter ;

After you run the script, the firewall should be enabled. You may verify it by running this statement:

mysql> SHOW GLOBAL VARIABLES LIKE 'mysql_firewall%';
+-------------------------------+-------+
| Variable_name                 | Value |
+-------------------------------+-------+
| mysql_firewall_max_query_size |  4096 |
| mysql_firewall_mode           |    ON |
| mysql_firewall_trace          |   OFF |
+-------------------------------+-------+

Testing the Firewall

To test the firewall, you may use a current mysql user, but we are going to create a test user for this example – webuser@localhost. (The user probably doesn’t need all privileges, but for this example we will grant everything to this user)

CREATE USER 'webuser'@'localhost' IDENTIFIED BY 'Yobuddy!';
GRANT ALL PRIVILEGES ON *.* TO 'webuser'@'localhost' WITH GRANT OPTION;

OPTIONAL: For our test, we will be using the sakila schema provided by MySQL. You may download the sakila database schema (requires MySQL 5.0 or later) at http://dev.mysql.com/doc/index-other.html. If you don’t want to use the sakila database, you may use your own existing database or create a new database.

After downloading the sakila schema, you will have two files, named sakila-schema.sql and sakila-data.sql. Execute the sakila-schema.sql first, and then sakila-data.sql to populate the database with data. If you are using the command line, simply do the following: (substitute UserName for a mysql user name)

# mysql -uUserName -p < sakila-schema.sql
# mysql -uUserName -p < sakila-data.sql

After creating the sakila schema and importing the data, we now set the firewall to record those queries which we want to allow:

mysql> CALL `mysql`.`sp_set_firewall_mode`("webuser@localhost","RECORDING");
+-----------------------------------------------+
| read_firewall_whitelist(arg_userhost,FW.rule) |
+-----------------------------------------------+
| Imported users: 0  Imported rules: 0          |
+-----------------------------------------------+
1 row in set (0.14 sec)

+-------------------------------------------+
| set_firewall_mode(arg_userhost, arg_mode) |
+-------------------------------------------+
| OK                                        |
+-------------------------------------------+
1 row in set (0.22 sec)
Query OK, 5 rows affected (0.28 sec)

We can check to see the firewall mode via this statement, to be sure we are in the recording mode:

mysql> SELECT * FROM MYSQL.FIREWALL_USERS;
+-------------------+------------+
| USERHOST          | MODE       |
+-------------------+------------+
| webuser@localhost |  RECORDING |
+-------------------+------------+
1 row in set (0.02 sec)

Now that we have recording turned on, let’s run a few queries:

mysql> use sakila
Database changed
mysql> show tables;
+----------------------------+
| Tables_in_sakila           |
+----------------------------+
| actor                      |
| actor_info                 |
| address                    |
| category                   |
| city                       |
| country                    |
| customer                   |
| customer_list              |
| film                       |
| film_actor                 |
| film_category              |
| film_list                  |
| film_text                  |
| inventory                  |
| language                   |
| nicer_but_slower_film_list |
| payment                    |
| rental                     |
| sales_by_film_category     |
| sales_by_store             |
| staff                      |
| staff_list                 |
| store                      |
+----------------------------+
23 rows in set (0.00 sec)

mysql> select * from actor limit 2;
+----------+------------+-----------+---------------------+
| actor_id | first_name | last_name | last_update         |
+----------+------------+-----------+---------------------+
|        1 | PENELOPE   | GUINESS   | 2006-02-15 04:34:33 |
|        2 | NICK       | WAHLBERG  | 2006-02-15 04:34:33 |
+----------+------------+-----------+---------------------+
2 rows in set (0.13 sec)

mysql> select first_name, last_name from actor where first_name like 'T%';
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| TIM        | HACKMAN   |
| TOM        | MCKELLEN  |
| TOM        | MIRANDA   |
| THORA      | TEMPLE    |
+------------+-----------+
4 rows in set (0.00 sec)

We turn off the recording by turning on the protection mode:

mysql> CALL `mysql`.`sp_set_firewall_mode`("webuser@localhost","PROTECTING");
+-------------------------------------------+
| set_firewall_mode(arg_userhost, arg_mode) |
+-------------------------------------------+
| OK                                        |
+-------------------------------------------+
1 row in set (0.00 sec)

We can check to see the firewall mode via this statement:

mysql> SELECT * FROM MYSQL.FIREWALL_USERS;
+-------------------+------------+
| USERHOST          | MODE       |
+-------------------+------------+
| webuser@localhost | PROTECTING |
+-------------------+------------+
1 row in set (0.02 sec)

And we can look at our whitelist of statements:

mysql>  SELECT * FROM MYSQL.FIREWALL_WHITELIST;
+-------------------+-------------------------------------------------------------------+
| USERHOST          | RULE                                                              |
+-------------------+-------------------------------------------------------------------+
| webuser@localhost | SELECT * FROM actor LIMIT ?                                       |
| webuser@localhost | SELECT SCHEMA ( )                                                 |
| webuser@localhost | SELECT first_name , last_name FROM actor WHERE first_name LIKE ?  |
| webuser@localhost | SHOW TABLES                                                       |
+-------------------+-------------------------------------------------------------------+
4 rows in set (0.00 sec)

The firewall is now protecting against non-whitelisted queries. We can execute a couple of the queries we previously ran, which should be allowed by the firewall.

mysql> show tables;
+----------------------------+
| Tables_in_sakila           |
+----------------------------+
| actor                      |
| actor_info                 |
| address                    |
| category                   |
| city                       |
| country                    |
| customer                   |
| customer_list              |
| film                       |
| film_actor                 |
| film_category              |
| film_list                  |
| film_text                  |
| inventory                  |
| language                   |
| nicer_but_slower_film_list |
| payment                    |
| rental                     |
| sales_by_film_category     |
| sales_by_store             |
| staff                      |
| staff_list                 |
| store                      |
+----------------------------+
23 rows in set (0.01 sec)

Now we run two new queries, which should be blocked by the firewall.

mysql> select * from rental;
ERROR 1045 (42000): Firewall prevents statement

mysql> select * from staff;
ERROR 1045 (42000): Firewall prevents statement

The server will write an error message to the log for each statement that is rejected. Example:

2015-03-21T22:59:05.371772Z 14 [Note] Plugin MYSQL_FIREWALL reported:
'ACCESS DENIED for webuser@localhost. Reason: No match in whitelist.
Statement: select * from rental '

You can use these log messages in your efforts to identify the source of attacks.
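
If one of the rejected statements turns out to be legitimate, you can add it to the whitelist by switching the account back to recording mode, running the statement (as the account being trained), and then turning protection back on:

mysql> CALL `mysql`.`sp_set_firewall_mode`("webuser@localhost","RECORDING");
mysql> select * from rental limit 10;
mysql> CALL `mysql`.`sp_set_firewall_mode`("webuser@localhost","PROTECTING");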

To see how much firewall activity you have, you may look at the status variables:

mysql> SHOW GLOBAL STATUS LIKE 'Firewall%';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Firewall_access_denied  | 42    |
| Firewall_access_granted | 55    |
| Firewall_cached_entries | 78    |
+-------------------------+-------+

The variables indicate the number of statements rejected, accepted, and added to the cache, respectively.

The MySQL Enterprise Firewall Reference is found at https://dev.mysql.com/doc/refman/5.6/en/firewall-reference.html.

 


Tony Darnell is a Principal Sales Consultant for MySQL, a division of Oracle, Inc. MySQL is the world’s most popular open-source database program. Tony may be reached at info [at] ScriptingMySQL.com and on LinkedIn.
Tony is the author of Twenty Forty-Four: The League of Patriots 

Visit http://2044thebook.com for more information.

MySQL Enterprise Audit – parsing audit information from log files, inserting into MySQL table via LOAD DATA INFILE and Perl script

The MySQL Enterprise Audit plug-in is part of the MySQL Enterprise Edition (available through a paid license). Basically, Enterprise Audit tracks everything that is happening on your MySQL server, and can be used to protect/detect the misuse of information, and to meet popular compliance regulations including HIPAA, Sarbanes-Oxley, and the PCI Data Security Standard.

MySQL Enterprise Audit uses the open MySQL Audit API to enable standard, policy-based monitoring and logging of connection and query activity executed on specific MySQL servers. Designed to meet the Oracle audit specification, MySQL Enterprise Audit provides an out of box, easy to use auditing and compliance solution for applications that are governed by both internal and external regulatory guidelines.

When installed, the audit plugin enables MySQL Server to produce a log file containing an audit record of server activity. The log contents include when clients connect and disconnect, and what actions they perform while connected, such as which databases and tables they access.
(from https://dev.mysql.com/doc/refman/5.6/en/audit-log-plugin.html)

When you enable MySQL Enterprise Audit, log files are generated in your MySQL data directory. You can use tools like MySQL Workbench (Enterprise Edition) or Oracle Audit Vault to import the log data, to view the information and to generate reports.

I was talking with a client, and he wanted to know if the audit data could be stored in a table. Currently (as of MySQL 5.6.25), the audit information is stored as XML in the audit log files. There are several ways to do this, and I will cover two methods.

The first is to use the LOAD XML [LOCAL] INFILE command. You will need to create a table to store the audit information:

CREATE TABLE audit_log (
   RECORD_ID varchar(40) NOT NULL,
   NAME varchar(64),
   TIMESTAMP timestamp,
   COMMAND_CLASS varchar(64),
   CONNECTION_ID bigint unsigned,
   DB varchar(64),
   HOST varchar(60),
   IPv4 int unsigned,
   IPv6 varbinary(16),
   MYSQL_VERSION varchar(64),
   OS_LOGIN varchar(64),
   PRIV_USER varchar(16),
   PROXY_USER varchar(16),
   SERVER_ID int unsigned,
   SQLTEXT longtext,
   STARTUP_OPTIONS text,
   STATUS int unsigned,
   STATUS_CODE int unsigned,
   USER varchar(168),
   VERSION int unsigned,
   PRIMARY KEY(RECORD_ID)
) DEFAULT CHARSET utf8mb4;

You can then load the data as:

LOAD XML LOCAL INFILE 'audit.log'
    INTO TABLE audit_log
    CHARACTER SET utf8mb4
    ROWS IDENTIFIED BY '<AUDIT_RECORD>'
         (RECORD_ID, NAME, @TIMESTAMP, COMMAND_CLASS, CONNECTION_ID, DB, HOST, @IP, MYSQL_VERSION, OS_LOGIN, PRIV_USER, PROXY_USER, SERVER_ID, SQLTEXT, STARTUP_OPTIONS, STATUS, STATUS_CODE, USER, VERSION)
     SET TIMESTAMP = CONVERT_TZ(STR_TO_DATE(@TIMESTAMP, '%Y-%m-%dT%H:%i:%s UTC'), 'UTC', 'Australia/Sydney'),
         IPv4 = IF(IS_IPV4(@IP), INET_ATON(@IP), NULL),
         IPv6 = IF(IS_IPV6(@IP), INET6_ATON(@IP), NULL);

Important notes for the above example:

  • The example converts the IP address to the numeric version and stores it in IPv4 or IPv6 depending on the type of IP address. It is also possible to store the IP address in a shared varchar() column.
  • Replace the target time zone in CONVERT_TZ() with the time zone of your system.
  • The use of CONVERT_TZ() requires named time zones to be loaded or that you use a numeric offset such as +10:00 (see the sketch just below).
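
If the named time zone tables have not been loaded, the same SET clause can be written with numeric offsets instead (a sketch, using +10:00 for Sydney):

     SET TIMESTAMP = CONVERT_TZ(STR_TO_DATE(@TIMESTAMP, '%Y-%m-%dT%H:%i:%s UTC'), '+00:00', '+10:00'),
         IPv4 = IF(IS_IPV4(@IP), INET_ATON(@IP), NULL),
         IPv6 = IF(IS_IPV6(@IP), INET6_ATON(@IP), NULL);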

RECORD_ID is guaranteed unique with the following limitations:

  • If you change audit_log_format, the counter will reset. However, as the timestamp is also included in the RECORD_ID, this will in general not be an issue.
  • It is only unique for a given MySQL instance. If you want to import the audit logs for multiple instances, you can, for example, add the server_uuid to the table and use it as part of the primary key (see the sketch after this list).
  • server_uuid is available starting from MySQL 5.6.
  • The matching between the elements in the XML file and the column names is case sensitive. All the elements in the audit log are in upper case.
  • The ROWS IDENTIFIED BY '<AUDIT_RECORD>' clause is required unless the table is named AUDIT_RECORD, as the table name is the default element to look for as rows.
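
Here is a sketch of the multi-instance idea mentioned above – the column name and size are my own choice:

ALTER TABLE audit_log
   ADD COLUMN SERVER_UUID char(36) NOT NULL DEFAULT '',
   DROP PRIMARY KEY,
   ADD PRIMARY KEY (SERVER_UUID, RECORD_ID);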

Here are some links for more reading on using LOAD XML [LOCAL] INFILE:

https://dev.mysql.com/doc/refman/5.6/en/audit-log-plugin-options-variables.html#sysvar_audit_log_format
https://dev.mysql.com/doc/refman/5.6/en/load-xml.html
https://dev.mysql.com/doc/refman/5.6/en/date-and-time-functions.html#function_convert-tz
https://dev.mysql.com/doc/refman/5.6/en/time-zone-support.html
https://dev.mysql.com/doc/refman/5.6/en/replication-options.html#sysvar_server_uuid


For the second option, I wrote a quick Perl script that parses the XML log files and inserts the information into a MySQL database. You will need to set the size of your audit log files in your my.cnf or my.ini configuration file via the audit_log_rotate_on_size variable. You might need to adjust the size of your log files based upon database activity and how well the script parses the log files. If your log files are very large, the Perl script might have trouble processing them, so you might want to decrease the size of your log files and run the script more frequently.
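
For example, to rotate the audit log after roughly 1MB, you could set the variable at runtime (a sketch – pick a size that matches your activity level, and set it in my.cnf as well so it survives a restart):

mysql> SET GLOBAL audit_log_rotate_on_size = 1048576;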


CAVEAT
Enterprise Audit does require a license from MySQL. If you are interested in an Enterprise subscription, contact me via the comment section below. If you are an Enterprise customer, you will need to configure Enterprise Audit first. See the Enterprise Audit online documentation page for more information, or contact MySQL Support.


For the data fields, I used the audit log file format information found at The Audit Log File page on MySQL.com.

My MySQL server doesn’t have a lot of activity, so I tried to size the data fields as best as possible to accommodate the data likely to appear in each field. There may be instances where you will have to increase the size of these fields or change their data types. The largest field is the SQL_TEXT field, which will contain your SQL statements. Every table has a maximum row size of 65,535 bytes, so for this example the largest possible size of the SQL_TEXT field is around 63,200 bytes (65,535 bytes minus the sum of the sizes of all of the other fields, and minus the 1-byte or 2-byte length prefix used for each varchar field). In this example, the SQL_TEXT field is set to 8,096 bytes, so you may need to increase or decrease this value.

I used varchar data types for each field, excluding the primary key field named ID. I did not spend a lot of time on the database schema, so you might want to modify it a little. I am sure that some of the fields are integers, but I did not have enough data in my log files to positively determine all of the possible values for each field. I did read the online manual, and it stated that CONNECTION_ID, SERVER_ID, STATUS, STATUS_CODE and VERSION were unsigned integers – but I left them as varchar.
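
If you later want to tighten the schema (the CREATE TABLE statements are shown below), here is a sketch of the kind of changes described above – switching a couple of the counters to unsigned integers and giving SQL_TEXT more room by using a TEXT type, which is stored mostly off the row:

ALTER TABLE audit_information.audit_parsed
   MODIFY COLUMN CONNECTIONID int unsigned,
   MODIFY COLUMN STATUS_CODE int unsigned,
   MODIFY COLUMN SQL_TEXT mediumtext;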


NOTICE
This script requires the use of the new format for the audit log files, which is available in MySQL versions 5.6.20 or later.


I created a database along with two tables; one to store the log file information, and a history table to keep track of what files had already been parsed and inserted into MySQL, as well as the number of log file entries. The CREATE DATABASE and CREATE TABLE syntax is as follows:

CREATE DATABASE `audit_information` /*!40100 DEFAULT CHARACTER SET latin1 */

CREATE TABLE `audit_parsed` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `COMMAND_CLASS` varchar(64) DEFAULT NULL,
  `CONNECTIONID` varchar(32) DEFAULT NULL,
  `DB_NAME` varchar(64) DEFAULT NULL,
  `HOST_NAME` varchar(256) DEFAULT NULL,
  `IP_ADDRESS` varchar(16) DEFAULT NULL,
  `MYSQL_VERSION` varchar(64) DEFAULT NULL,
  `COMMAND_NAME` varchar(64) DEFAULT NULL,
  `OS_LOGIN` varchar(64) DEFAULT NULL,
  `OS_VERSION` varchar(256) DEFAULT NULL,
  `PRIV_USER` varchar(16) DEFAULT NULL,
  `PROXY_USER` varchar(16) DEFAULT NULL,
  `RECORD_ID` varchar(64) DEFAULT NULL,
  `SERVER_ID` varchar(32) DEFAULT NULL,
  `SQL_TEXT` varchar(8096) DEFAULT NULL,
  `STARTUP_OPTIONS` varchar(1024) DEFAULT NULL,
  `COMMAND_STATUS` varchar(64) DEFAULT NULL,
  `STATUS_CODE` varchar(11) DEFAULT NULL,
  `DATE_TIMESTAMP` varchar(24) DEFAULT NULL,
  `USER_NAME` varchar(128) DEFAULT NULL,
  `LOG_VERSION` varchar(11) DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1

CREATE TABLE `audit_history` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `AUDIT_LOG_NAME` varchar(64) DEFAULT NULL,
  `PARSED_DATE_TIME` datetime DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `LOG_ENTRIES` int(11) DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1

The Perl script finds the non-active log files (which end in .xml – example: audit.log.14357895017796690.xml), parses the data, creates an SQL file with INSERT statements, imports the data via the mysql command-line program, and then moves the log file(s) and SQL file(s) to a directory. The history table records what files have been processed, so you don’t accidentally process the same file twice.

In the beginning of the Perl script, there are several values you need to replace to match your system. The values are under the section titled “values needed”. Here is the Perl script (named audit.pl):

#!/usr/bin/perl
# audit.pl

use DBI;
use CGI;
use XML::Simple;

#----------------------------------------------------------
# values needed
$Database = "audit_information";
$MYSQL_DATA_DIR = "/usr/local/mysql/data";
$MySQL_Host_IP_Name = "192.168.1.2";
$mysql_user = "root";
$mysql_password = "password_needed";

# directory to store old audit files after parsing
$audit_directory = "$MYSQL_DATA_DIR/audit_files";

# make an audit_files directory if one does not exist
mkdir($audit_directory) unless(-d $audit_directory);
#----------------------------------------------------------


#----------------------------------------------------------
#for each file do this
# grab the rotated audit log files (*.xml) from the data directory
@files = <$MYSQL_DATA_DIR/*.xml>;
foreach $file_name_to_parse (@files) {

	#----------------------------------------------------------
	# check to see if file has already been parsed
	$dbh1 = ConnectToMySql($Database);
	$query1 = "select AUDIT_LOG_NAME from audit_history where AUDIT_LOG_NAME = '$file_name_to_parse'";
	$sth1 = $dbh1->prepare($query1);
	$sth1->execute();

          while (@data = $sth1->fetchrow_array()) {
            
         	   $audit_log_name = $data[0];

			}

	# if length of audit_log_name is less than 1, process file
	if (length($audit_log_name) < 1)
	{
		# optional print output - uncomment if desired
		# print "Parsing - $file_name_to_parse\n";

		# name the parsed SQL file after the audit log file
		# (assumption: this matches the _parsed.sql names shown in the sample output)
		$PARSED_FILE = $file_name_to_parse;
		$PARSED_FILE =~ s/\.xml$/_parsed.sql/;

		open(LOGOUT, "> $PARSED_FILE") or die print "Couldn't open log_file: $!";
		
		$count = 0;
		
		# XML::Simple variable - SuppressEmpty => 1   ignore empty values
		$xml = XML::Simple->new(SuppressEmpty => 1);
		$data = $xml->XMLin("$file_name_to_parse");
		
		foreach $info (@{$data->{AUDIT_RECORD}})
		{
			# replace tick marks ' with \' in the SQL TEXT
			$info->{"SQLTEXT"} =~ s/'/\\'/g;
		
			print LOGOUT "INSERT INTO audit_information.AUDIT_PARSED (COMMAND_CLASS, CONNECTIONID, DB_NAME, HOST_NAME, IP_ADDRESS, MYSQL_VERSION, COMMAND_NAME, OS_LOGIN, OS_VERSION, PRIV_USER, PROXY_USER, RECORD_ID, SERVER_ID, SQL_TEXT, STARTUP_OPTIONS, COMMAND_STATUS, STATUS_CODE, DATE_TIMESTAMP, USER_NAME, LOG_VERSION) values ('" . $info->{"COMMAND_CLASS"} . "', '" . $info->{"CONNECTION_ID"} . "', '" . $info->{"DB"} . "', '" . $info->{"HOST"} . "', '" . $info->{"IP"} . "', '" . $info->{"MYSQL_VERSION"} . "', '" . $info->{"NAME"} . "', '" . $info->{"OS_LOGIN"} . "', '" . $info->{"OS_VERSION"} . "', '" . $info->{"PRIV_USER"} . "', '" . $info->{"PROXY_USER"} . "', '" . $info->{"RECORD_ID"} . "', '" . $info->{"SERVER_ID"} . "', '" . $info->{"SQLTEXT"} . "', '" . $info->{"STARTUP_OPTIONS"} . "', '" . $info->{"STATUS"} . "', '" . $info->{"STATUS_CODE"} . "', '" . $info->{"TIMESTAMP"} . "', '" . $info->{"USER"} . "', '" . $info->{"VERSION"} . "');\n";
			$count++;
		
		# end foreach $info (@{$data->{AUDIT_RECORD}})
		}
		
		# load parsed file into MySQL - hide warnings
		system("mysql -u$mysql_user -p$mysql_password  /dev/null 2>&1");
										
		$dbh2 = ConnectToMySql($Database);
		$query2 = "insert into audit_information.audit_history (AUDIT_LOG_NAME, LOG_ENTRIES) values ('$file_name_to_parse', '$count')";
		
		# optional print output - uncomment if desired
		# print "$query2\n";
																													
		$sth2 = $dbh2->prepare($query2);
		$sth2->execute();

		# close audit log file
		close(INFILE);

		# optional print output - uncomment if desired
		# print "Moving audit log ($file_name_to_parse) and log file ($PARSED_FILE) to $audit_directory.\n";
		
		# strip directories off $file_name_to_parse
		@file_name_to_move_array = split("\/",$file_name_to_parse);
		$directory_count = $#file_name_to_move_array;
		$file_name_to_move = $file_name_to_move_array[$directory_count];
		
		
		# optional print output - uncomment if desired
		# print "mv $file_name_to_move $file_name_to_parse\n";
		# print "mv $PARSED_FILE $audit_directory\n";

		# move audit log files and parsed log files to $audit_directory
		system("mv $file_name_to_parse $audit_directory");
		system("mv $PARSED_FILE $audit_directory");

	# end - if (length($audit_log_name) < 1)
	}

	else

	{
		# optional print output - uncomment if desired
		# print "$audit_log_name already processed\n";
		system("mv $file_name_to_parse $audit_directory");
	}

# end - foreach $file_name_to_parse (@files) 
}

sub ConnectToMySql {

   $connectionInfo="dbi:mysql:$Database;$MySQL_Host_IP_Name:3306";

   # make connection to database
   $l_dbh = DBI->connect($connectionInfo,$mysql_user,$mysql_password);
   return $l_dbh;

}

It should not matter where you execute audit.pl, as long as you have correctly entered the required values in the script. You might get errors if you try to run this script on a log file that has not been rotated, which is the current log file in your MySQL data directory. The current log file is named audit.log.

# pwd
/usr/local/mysql/data
# ls -l audit.log
-rw-rw----  1 mysql  _mysql  9955118 Jul  2 15:25 audit.log

The script will only work on files ending in .xml. For testing, I used four small (and identical) audit log files:

# pwd
/usr/local/mysql/data
# ls -l *xml
-rw-rw----  1 mysql  wheel   15508 Jul  2 12:20 audit.log.14357895017796690.xml
-rw-r-----  1 mysql  _mysql  15508 Jul  2 13:46 audit.log.14357895017796691.xml
-rw-r-----  1 mysql  _mysql  15508 Jul  2 13:46 audit.log.14357895017796692.xml
-rw-r-----  1 mysql  _mysql  15508 Jul  2 13:46 audit.log.14357895017796693.xml

I have commented-out the print statements in the Perl script, but if you uncomment them, running the script gives you this output for each log file:

# perl audit.pl
Parsing - /usr/local/mysql/data/audit.log.14357895017796690.xml
insert into audit_information.audit_history (AUDIT_LOG_NAME, LOG_ENTRIES) values ('/usr/local/mysql/data/audit.log.14357895017796690.xml', '34')
Moving audit log (/usr/local/mysql/data/audit.log.14357895017796690.xml) and log file (/usr/local/mysql/data/audit.log.14357895017796690_parsed.sql) to /usr/local/mysql/data/audit_files.
mv audit.log.14357895017796690.xml /usr/local/mysql/data/audit.log.14357895017796690.xml
mv /usr/local/mysql/data/audit.log.14357895017796690_parsed.sql /usr/local/mysql/data/audit_files
....

After running my test script, the following data is what is in the audit_history table:

mysql> use audit_information
Database changed
mysql> select * from audit_history;
+----+-------------------------------------------------------+---------------------+-------------+
| ID | AUDIT_LOG_NAME                                        | PARSED_DATE_TIME    | LOG_ENTRIES |
+----+-------------------------------------------------------+---------------------+-------------+
|  1 | /usr/local/mysql/data/audit.log.14357895017796690.xml | 2015-07-02 15:25:07 | 34          |
|  2 | /usr/local/mysql/data/audit.log.14357895017796691.xml | 2015-07-02 15:25:08 | 34          |
|  3 | /usr/local/mysql/data/audit.log.14357895017796692.xml | 2015-07-02 15:25:08 | 34          |
|  4 | /usr/local/mysql/data/audit.log.14357895017796693.xml | 2015-07-02 15:25:09 | 34          |
+----+-------------------------------------------------------+---------------------+-------------+
4 rows in set (0.00 sec)

And here is an example of one line from the audit_parsed table.

mysql> select * from audit_parsed limit 1 \G
*************************** 1. row ***************************
             ID: 1
  COMMAND_CLASS: select
   CONNECTIONID: 10093
        DB_NAME: 
      HOST_NAME: localhost
     IP_ADDRESS: 127.0.0.1
  MYSQL_VERSION: 
   COMMAND_NAME: Query
       OS_LOGIN: 
     OS_VERSION: 
      PRIV_USER: 
     PROXY_USER: 
      RECORD_ID: 1614933_2015-07-01T22:08:58
      SERVER_ID: 
       SQL_TEXT: SELECT (UNIX_TIMESTAMP(now()) - CAST(variable_value AS SIGNED))*1000 as serverStartMillis 
  FROM information_schema.global_status 
 WHERE variable_name='uptime'
STARTUP_OPTIONS: 
 COMMAND_STATUS: 0
    STATUS_CODE: 0
 DATE_TIMESTAMP: 2015-07-01T22:08:58 UTC
      USER_NAME: root[root] @ localhost [127.0.0.1]
    LOG_VERSION: 
1 row in set (0.00 sec)

After parsing the log files, you can then write your own queries for searching through your audit data. You can even include this script in cron, so it runs and parses your files automatically. But as always, test this script and use it with caution before putting it in a production environment. You could also modify the Perl script to filter out values you did not want or need to store.
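
For example, here are two simple searches against the parsed data (assuming the audit_parsed table above) – every statement that returned an error, and everything a particular user ran:

SELECT DATE_TIMESTAMP, USER_NAME, SQL_TEXT
  FROM audit_information.audit_parsed
 WHERE COMMAND_STATUS <> '0';

SELECT DATE_TIMESTAMP, COMMAND_NAME, SQL_TEXT
  FROM audit_information.audit_parsed
 WHERE USER_NAME LIKE 'root%'
 ORDER BY DATE_TIMESTAMP;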

If you do use this script or if you have any suggestions or other questions, please leave a comment below.

Thanks to Jesper Krogh for providing the information on the LOAD XML [LOCAL] INFILE.

 


Tony Darnell is a Principal Sales Consultant for MySQL, a division of Oracle, Inc. MySQL is the world’s most popular open-source database program. Tony may be reached at info [at] ScriptingMySQL.com and on LinkedIn.
Tony is the author of Twenty Forty-Four: The League of Patriots

 

Visit http://2044thebook.com for more information.

Easy-to-use Perl scripts to backup your MySQL database with mysqldump and FTP the files to a remote server

Most users of MySQL utilize the mysqldump utility to backup their database. While mysqldump is handy and easy-to-use (and free), if you have data which is important to your business, then you should take a look at the MySQL Enterprise Edition – and use MySQL Enterprise Backup instead. The MySQL Enterprise Backup allows you to backup your database without the table locking you get with mysqldump. And, it is extremely fast – especially when you have to restore your database. Here is a sample speed comparison between MySQL Enterprise Backup and using mysqldump:

49x Better Performance: Backup

80x Better Performance: Restore

From the images, you can see how much longer it takes mysqldump to dump or restore a fairly large (73 gigabyte) database compared to MySQL Enterprise Backup. Even if your database isn’t this large (and most people don’t care how long their backups take to complete), when it comes time to restore your database in a production environment, the quicker you can restore your database, the better.

If your only option is to use mysqldump, here are two Perl scripts to make the task easier. The first script will backup your database(s) and send a copy of your backup to a remote server via FTP. The second script will connect to your FTP server and delete your old backup files – in case you have a storage limit on your FTP server. You can put these scripts in cron or Windows Task Scheduler, or run them manually. You can have the backup script run as often as possible (maybe once an hour) – but keep in mind there will be table-level locking. The script to delete the old backups only needs to be run once a day.

THE BACKUP SCRIPT

For the backup script, you will have to enter a few variables to match your system. You will also need to create a configuration file of all of the databases you want to back up. I could have connected to the MySQL database and run a query (SHOW DATABASES;) to retrieve all of the databases, but I prefer to manually manage the list of databases to back up. With this method, you can skip an hourly backup of static or read-only databases and only back up the databases which are being changed. This configuration file is a text file with a list of the databases to be backed up, and you can use a # (pound sign) to comment out databases you want to skip.

NOTE: You don’t want to back up the following databases: PERFORMANCE_SCHEMA, INFORMATION_SCHEMA or SYS SCHEMA.
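
If you would rather build the list from the server instead of maintaining a configuration file, a query like this (a sketch) returns every schema except the ones noted above:

SELECT schema_name
  FROM information_schema.schemata
 WHERE schema_name NOT IN ('information_schema', 'performance_schema', 'sys');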

# set the directory where you will keep the backup files
$backup_folder = '/Users/tonydarnell/cron/mysqlbackups';

# the config file is a text file with a list of the databases to backup
# this should be in the same location as this script, but you can modify this
# if you want to put the file somewhere else
my $config_file = dirname($0) . "/mysql_backup.config";

# Here is where you will put your FTP server name (or IP address)
# and your FTP username and password
my $host = "server_name.com";
my $user = "username";
my $password = "password";

You can also modify the mysqldump command which will be used to back up your databases, or use what I have in the script. You will need to add your mysql password where I have the word "PassWord".

`/usr/local/mysql/bin/mysqldump -R -h192.168.1.2 --events --triggers -u mysqlbackup --password=PassWord --routines --add-drop-database --set-gtid-purged=OFF --add-drop-table $database $table | compress > $folder/$file.Z`;

I created a separate FTP user which has its own home directory, so I don’t have to change the directory once I login via FTP. If you can’t do this, you will need to uncomment the last three lines of this and navigate your FTP user to the correct directory:

# uncomment the last three lines if you can't set the home directory of the FTP user to a specific directory
# the directory on the FTP server where you want to save the backup files
# my $dir = "mysqlbackups";
# print "Accessing FTP - changing to $dir folder\n";
# $f->cwd($dir) or die "Can't cwd to $dir\n";

That is all you need to modify in the first script for it to work. If you run this script on the command line, or if you run it in cron, you should see something like this:

mysqladmin: [Warning] Using a password on the command line interface can be insecure.

It isn’t a good idea to have your password in plain text anywhere, so you should create a user which only has the limited read-only permissions needed to run mysqldump. You will need to change the value of "database_name" in the GRANT statement to match each database you want to back up. You will need to run the GRANT statement for every database you want to back up, or you can use an asterisk "*" in place of the database name.

CREATE USER 'mysqlbackup'@'192.168.1.2' IDENTIFIED WITH sha256_password BY '';
GRANT SELECT, LOCK TABLES, SHOW VIEW, EVENT, TRIGGER ON `database_name`.* TO 'mysqlbackup'@'192.168.1.2';

Or, to grant permissions to the mysqlbackup user on all of the tables:

GRANT SELECT, LOCK TABLES, SHOW VIEW, EVENT, TRIGGER ON *.* TO 'mysqlbackup'@'192.168.1.2';
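
You can verify what the backup user ended up with:

SHOW GRANTS FOR 'mysqlbackup'@'192.168.1.2';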

Here is a sample output from the script:

# perl mysql_backup.pl
Beginning 2017-01-06-16:35:57
Backing up database_01 - 2017-01-06-1635 ... mysqldump: [Warning] Using a password on the command line interface can be insecure.
Finished exporting database_01 - as 2017-01-06-1635/database_01.sql.Z
Backing up database_02 - 2017-01-06-1635 ... mysqldump: [Warning] Using a password on the command line interface can be insecure.
Finished exporting database_02 - as 2017-01-06-1635/database_02.sql.Z
Backing up database_03 - 2017-01-06-1635 ... mysqldump: [Warning] Using a password on the command line interface can be insecure.
Finished exporting database_03 - as 2017-01-06-1635/database_03.sql.Z
....
Backing up database_04 - 2017-01-06-1635 ... mysqldump: [Warning] Using a password on the command line interface can be insecure.
Finished exporting database_04 - as 2017-01-06-1635/database_04.sql.Z
------------------------------------------------------------------
 Compressing backup as:  2017-01-06-1635.tar.gz 
------------------------------------------------------------------
Creating a tar file from the backup...
tar -czf 2017-01-06-1635.tar.gz 2017-01-06-1635
FTP'ing the file - 2017-01-06-1635.tar.gz 
Deleting the original backup directory and files.
Finished 2017-01-06-16:36:49

The backup script creates a new directory with a timestamp as the name of the directory. In this example, the directory was named “2017-01-06-1635“. After the backup is finished, it creates a single tar file and then deletes the backup directory and files. I delete the backup directory and files as the backup is now contained in the single tar file. It is easier to copy and delete a single tar file via FTP than to do the same with a directory containing multiple files.

DELETING OLD BACKUPS

The script to delete the old backups is fairly simple to configure. You will need to enter the host, username and password of your FTP user, and specify how many backups you want to keep on the server. Since I do a backup once an hour, I keep a week’s worth (168 copies) of backups on my server. I could have checked the date/time on the files and deleted the older files which were X number of days old, but I decided to just go with a certain number of files to keep. I also included a “print_output” variable if you want to suppress any output – simply change this value to anything but “yes” and the script won’t print any output.

# Here is where you will put your FTP server name (or IP address)
# and your username and password
my $host = "server_name.com";
my $user = "username";
my $password = "password";

# how many copies of the backup do you want to keep?
$total_files_to_keep = 168;

$print_output = "yes";

Uncomment this line if you want to see a list of the other files which will not be deleted.

# optional output - remove # to have it print remaining files
# if ($print_output eq "yes") { print "| $count of $total_files_available |   Keeping: $filename\n"; }

Here is a sample output from running the script:

root# perl mysql_delete_backup.pl 
--------------------------------------------
           Total files: 194
    Total backup files: 192
   Total files to keep: 168
 Total files to delete: 24
----------------------------------------------
| x of 192 |  Skipping: .
| x of 192 |  Skipping: ..
| 1 of 192 |  Deleting: 2017-12-29-1352.tar.gz
| 2 of 192 |  Deleting: 2017-12-29-1452.tar.gz
| 3 of 192 |  Deleting: 2017-12-29-1552.tar.gz
| 4 of 192 |  Deleting: 2017-12-29-1652.tar.gz
....
| 24 of 192 |  Deleting: 2017-12-30-1252.tar.gz
----------------------------------------------
Finished 2017-01-06-15:21:58

When I run this script on my FTP server, and I do a listing of the backup directory, it shows the single-period (.) or current directory and the double-period (..) or the parent directory. I take this into account by skipping all files which do not have “20” in the name (as in the first two letters of the year).

If you login to your FTP server and you don’t see the “.” and “..“, then you will need to remove the “-2” in this part of the script, and simply have $total_files_available = $total_files:

# subtract two because of the . and ..
$total_files_available = $total_files - 2;

Here is what I see when I FTP to my server: (notice the “.” and “..“)

root# ftp backups@scripts.com
Trying 1234:f1c0:4738:5088:cb9a:dksi:ebfa:3829...
Connected to scripts.com.
220 FTP Server ready.
331 Password required for backups
Password: 
230 User backups logged in
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> dir
229 Entering Extended Passive Mode (|||58906|)
150 Opening ASCII mode data connection for file list
drwx---r-x   2 u63541528-backups ftpusers     4096 Jan  6 16:52 .
drwx---r-x   2 u63541528-backups ftpusers     4096 Jan  6 16:52 ..
-rw----r--   1 u63541528-backups ftpusers 45522630 Jan  5 22:52 2017-01-05-2252.tar.gz
-rw----r--   1 u63541528-backups ftpusers 45539118 Jan  5 23:52 2017-01-05-2352.tar.gz
-rw----r--   1 u63541528-backups ftpusers 45558328 Jan  6 01:52 2017-01-06-0152.tar.gz
-rw----r--   1 u63541528-backups ftpusers 45560794 Jan  6 07:52 2017-01-06-0752.tar.gz
....

The scripts are available on GitHub – https://github.com/ScriptingMySQL/PerlFiles. The scripts are named: mysql_backup.pl and mysql_delete_old_backups.pl.

I am not the best Perl programmer, so there may be a better/easier way to do this. Good luck with the scripts and let me know how they work for you. And follow me on Twitter at ScriptingMySQL.

 


Tony Darnell is a Principal Sales Consultant for MySQL, a division of Oracle, Inc. MySQL is the world’s most popular open-source database program. Tony may be reached at info [at] ScriptingMySQL.com and on LinkedIn.
Tony is the author of Twenty Forty-Four: The League of Patriots

 

Visit http://2044thebook.com for more information.

Use MySQL to store data from Amazon’s API via Perl scripts

I really like Amazon.com and I have been a Prime member for several years. Along with millions of different items for sale, Amazon has an affiliate program, where you can earn money advertising products on your web site. When a visitor to your site clicks on a link and orders a product from Amazon, you earn a small commission on the total sale. As an affiliate, you have access to Amazon’s Product Advertising API for obtaining product information. But, you can use this information for many other purposes.

The Amazon API is like most other APIs, and their API web site provides you with code examples and explains how it all works. I am going to show you a Perl program which you can use to access the API data and store it in a MySQL database. This program was modified from one of the examples on the Amazon API web site.

I wrote a book in 2014, and I wanted to be able to track the book’s ranking on Amazon. I have a couple of friends who wrote books as well, so I tracked their ranking at the same time. By using the API, I can get a lot of information about any product – including my book’s ranking. And, I can keep a historical record of the ranks in a MySQL database. You could also use this to track the price of a product, and you could have the script notify you if the price changed or went below a certain threshold. Example – if you want to know when the price of that 55″ 4K television drops below $599 – you can use the API to check the price once an hour – and send you an alert when the price drops. Most prices and ranks only change (at the most) once an hour, so running the script every five minutes isn’t necessary.

To access Amazon’s API, you will need to register as an Amazon affiliate, and obtain your own Access Keys (Access Key ID and Secret Access Key). In the Perl script, this information goes here:

use constant myAWSId	    => 'Access Key ID';
use constant myAWSSecret    => 'Secret Access Key';
use constant myEndPoint	    => 'ecs.amazonaws.com';

The following Perl script is an example of getting the rank for a particular item. To use the script and to access Amazon’s API, you have to supply the product’s ASIN (Amazon Standard Identification Number), or for a book, you can supply the ISBN (International Standard Book Number). The ASIN is usually found in the URL for a product – as in http://www.amazon.com/gp/product/B00R0ZM5W4 – where B00R0ZM5W4 is the ASIN. I use a MySQL table to store the ranking information (you can modify the table to also include other information). Here is the CREATE TABLE statement for this table:

CREATE TABLE `ranks` (
  `idranks` int(11) NOT NULL AUTO_INCREMENT,
  `product_id` varchar(15) DEFAULT NULL,
  `product_title` varchar(100) DEFAULT NULL,
  `product_rank` varchar(15) DEFAULT NULL,
  `rank_datetime` datetime DEFAULT NULL,
  PRIMARY KEY (`idranks`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1
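
Once the script has been running for a while, the rank history is easy to pull back out. For example, the most recent checks for the sample ASIN mentioned above:

SELECT product_rank, rank_datetime
  FROM ranks
 WHERE product_id = 'B00R0ZM5W4'
 ORDER BY rank_datetime DESC
 LIMIT 24;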

The script also has a sub-routine (named sub ConnectToMySql) which uses connection information stored in a text file one directory below the directory where the script is executed. In this example, the file name is accessAMAZON, and the file contains this connection information (in this order):

database_name
IP_address
mysql_user_name
mysql_password

You can hardcode the connection information directly into the script if you prefer.

The script will also output all of the potential variables and values for each ASIN/ISBN, and this information is stored in a text file with the name equal to the ASIN/ISBN and a suffix of ".txt". Example: B00R0ZM5W4.txt. There are over a hundred different variables returned for each item.

The script is named amazonrank.pl and may be found on github at: https://github.com/ScriptingMySQL/PerlFiles.

Here is the script: (remember to add your own Amazon access key information into the script)

#!/usr/bin/perl -w

##############################################################################################
# Copyright 2009,2013 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file 
# except in compliance with the License. A copy of the License is located at
#
#	   http://aws.amazon.com/apache2.0/
#
# or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS"
# BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations under the License. 
#
#############################################################################################
#
#  Amazon Product Advertising API
#  Signed Requests Sample Code
#
#  API Version: 2009-03-31/2013-08-01
#
#############################################################################################

#use strict;
#use warnings;

#no warnings 'uninitialized';

use Data::Dumper;
use DBD::mysql;
use DBI;

use RequestSignatureHelper;
use LWP::UserAgent;
use XML::Simple;

use constant myAWSId		=> 'Access Key ID';
use constant myAWSSecret	=> 'Secret Access Key';
use constant myEndPoint		=> 'ecs.amazonaws.com';

my $Database = "amazonrank";

# see if user provided ItemId on command-line
# my $itemId = shift @ARGV || '0545010225';

# ----------------------------------------------------------------------------------
# this has to be near the top - as other parts of the script rely on these figures
# ----------------------------------------------------------------------------------

my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime time;

$year = $year + 1900;
$mon = $mon + 1;

# add a zero if the value is less than 10

if ($sec < 10) { $sec = "0$sec"; }
if ($min < 10) { $min = "0$min"; }
if ($hour < 10) { $hour = "0$hour"; }
if ($mday < 10) { $mday = "0$mday"; }
if ($mon < 10) { $mon = "0$mon"; }
if ($year < 10) { $year = "0$year"; }
if ($wday < 10) { $wday = "0$wday"; }
if ($yday < 10) { $yday = "0$yday"; }
if ($isdst < 10) { $isdst = "0$isdst"; }

$DateTime = "$year-$mon-$mday $hour:$min:$sec";

# ----------------------------------------------------------------------------------
# list of ASINs/ISBNs to look up - substitute your own product IDs here
# ----------------------------------------------------------------------------------
@itemIds = ("B00R0ZM5W4");

foreach $itemId (@itemIds) {

# the full API response for each item is written to a text file named after the ASIN/ISBN
$fileout = "$itemId.txt";
open(OUT, "> $fileout") or die print "Couldn't open log_file: $!";

print "Retrieving rank for $itemId - ";

# Set up the helper
my $helper = new RequestSignatureHelper (
	+RequestSignatureHelper::kAWSAccessKeyId => myAWSId,
	+RequestSignatureHelper::kAWSSecretKey => myAWSSecret,
	+RequestSignatureHelper::kEndPoint => myEndPoint,
);

# A simple ItemLookup request
my $request = {
	Service => 'AWSECommerceService',
	Operation => 'ItemLookup',
	Version => '2013-08-01',
	ItemId => $itemId,
	AssociateTag => 'scmy-20',
	ResponseGroup => 'Large',
};

# Sign the request
my $signedRequest = $helper->sign($request);

# We can use the helper's canonicalize() function to construct the query string too.
my $queryString = $helper->canonicalize($signedRequest);
my $url = "http://" . myEndPoint . "/onca/xml?" . $queryString;
#print "Sending request to URL: $url \n";

my $ua = new LWP::UserAgent();
my $response = $ua->get($url);
my $content = $response->content();
#print "Recieved Response: $content \n";

my $xmlParser = new XML::Simple();
my $xml = $xmlParser->XMLin($content);

# This prints out all of the item information into a text file
print OUT "Parsed XML is: " . Dumper($xml) . "\n";

if ($response->is_success()) {

	# Here is where you extract the information for each item
	my $title = $xml->{Items}->{Item}->{ItemAttributes}->{Title};
	my $rank = $xml->{Items}->{Item}->{SalesRank};
	my $price = $xml->{Items}->{Item}->{Offers}->{Offer}->{OfferListing}->{Price}->{FormattedPrice};

	# option to print to screen - uncomment this next line
	#	print "Item $itemId is titled \"$title\" and ranked $rank\n";

if (length($rank) > 1)

{
	$dbh = ConnectToMySql($Database);	
	$query = "insert into ranks (product_id, product_rank, rank_datetime, product_title) values ('$itemId', '$rank', '$DateTime','$title')";
	$sth = $dbh->prepare($query);
	$sth->execute();
	$dbh->disconnect;

	print "$rank || $title || $itemId || \$$price\n\n";
}

else

{
	print "Rank for: $title is unavailable.\n\n";
}

# optional sleep 
# sleep 5;


} else {
	my $error = findError($xml);
	if (defined $error) {
	print "Error: " . $error->{Code} . ": " . $error->{Message} . "\n";
	} else {
	print "Unknown Error!\n";
	}
}

sub findError {
	my $xml = shift;
	
	return undef unless ref($xml) eq 'HASH';

	if (exists $xml->{Error}) { return $xml->{Error}; };

	for (keys %$xml) {
	my $error = findError($xml->{$_});
	return $error if defined $error;
	}

	return undef;
}


# end - foreach
}

#----------------------------------------------------------------------
sub ConnectToMySql {
#----------------------------------------------------------------------

  # my ($db) = @_;

   open(PW, "<..\/accessAMAZON") || die "Can't access login credentials";
   my $db= <PW>;
   my $host= <PW>;
   my $userid= <PW>;
   my $passwd= <PW>;

   chomp($db);
   chomp($host);
   chomp($userid);
   chomp($passwd);
   
   my $connectionInfo="dbi:mysql:$db;$host:3306";
   close(PW);

   # make connection to database
   my $l_dbh = DBI->connect($connectionInfo,$userid,$passwd);
   return $l_dbh;

}

I am not the best Perl programmer, nor am I an expert at the Amazon API, so there may be a better/easier way to do this. Good luck with the scripts and let me know how they work for you. And follow me on Twitter at ScriptingMySQL and TonyDarnell.

 


Tony Darnell is a Principal Sales Consultant for MySQL, a division of Oracle, Inc. MySQL is the world’s most popular open-source database program. Tony may be reached at info [at] ScriptingMySQL.com and on LinkedIn.
Tony is the author of Twenty Forty-Four: The League of Patriots

 

Visit http://2044thebook.com for more information.

Add RSS feeds to your Twitter stream using MySQL and Perl

Adding good content to Twitter can be a pain. I can’t do it during working hours, and I don’t have much time at night. But, the more content you have, the more followers you can gain, and the more your original tweets can be seen (hopefully). I have written several posts about using the latest Perl-Twitter API – Net::Twitter::Lite::WithAPIv1_1, so you might want to check these out as well.

Use MySQL and Perl to automatically find, follow and unfollow twitter users

Using Perl to retrieve direct messages from Twitter, insert messages into a MySQL database and then delete the messages

Using Perl and MySQL to automatically respond to retweets on twitter

Using Perl to send tweets stored in a MySQL database to twitter

However, finding good supplemental content is easy. There are plenty of companies and web sites which offer RSS (Rich Site Summary or Really Simple Syndication) feeds that you can use on Twitter. In this post, I will show you how to capture the RSS feed from a web site and tweet the links with your Twitter account.

One warning – Twitter has strict guidelines for using APIs on their site. The rate limits may be found at https://dev.twitter.com/rest/public/rate-limiting. In my previous post, I included a Perl script (rate_limit.pl) that prints out the API rate limits for your application. You can use this script to monitor your usage. The key with using the Twitter API is to not be too aggressive, or your app will be banned by Twitter. For example, Twitter does not allow bulk follows and unfollows – so having patience is important. And if you use these scripts to send SPAM, your app and Twitter account will be banned.

For this post, I created two Perl scripts. The first grabs the RSS information and inserts it into a MySQL database. The second script pulls the information from the MySQL database and sends a tweet one record at a time. You can set up a cron job to do both of these scripts, but remember the RSS script needs to run first.

The key when parsing RSS feeds is to find a unique number/identifier which you can use for each feed item. Some RSS feeds will include a <guid> tag, but often you have to parse the URL link to get a unique number/identifier. The script uses this unique identifier to check and see if this RSS feed item is already in the database, so you don’t have duplicate RSS items. I tested these scripts on a dozen different RSS feeds, and it works really well.

Before you attempt to use these scripts, you will need to register your application with twitter via apps.twitter.com, and obtain the following:

consumer_key
consumer_secret
access_token
access_token_secret

We will need to create two tables: one to hold the RSS feed information, and another to hold the tweet history. I use the tweet history to delete older tweets with another Perl script I wrote. Here is the SQL to create the tables:

CREATE TABLE `rss_oracle` (
  `id` int(9) NOT NULL AUTO_INCREMENT,
  `id_post` bigint(14) NOT NULL DEFAULT '0',
  `post_title` varchar(256) DEFAULT NULL,
  `post_url` varchar(256) DEFAULT NULL,
  `post_author` varchar(48) DEFAULT NULL,
  `post_date` datetime DEFAULT NULL,
  `tweet_sent` varchar(3) DEFAULT NULL,
  `tweet_sent_date` datetime DEFAULT NULL,
  PRIMARY KEY (`id`,`id_post`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1

CREATE TABLE `history` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `tweet` char(140) DEFAULT NULL,
  `tweet_id` varchar(30) DEFAULT NULL,
  `tweet_update` datetime DEFAULT NULL,
  `error` char(3) DEFAULT '',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1
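
With these two tables in place, you can easily see what has been captured from the feed but not yet tweeted, and what has already gone out:

SELECT id_post, post_title
  FROM rss_oracle
 WHERE tweet_sent IS NULL;

SELECT tweet, tweet_update
  FROM history
 ORDER BY tweet_update DESC
 LIMIT 5;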

For our RSS feed, we will be grabbing the RSS from Oracle‘s web site. You will need to change the $source variable to whatever RSS feed you want to use. As you grab the data, you will need to test your script to see what non-ASCII characters are included in the title. This example script should translate most characters to ASCII ones, but you will want to check the output before actually tweeting the posts. When you find strange characters, you can simply truncate the RSS database table and re-run the script, or you can uncomment the print statements to see what is being returned.

In the subroutine ConnectToMySql used in the Perl scripts, I store the MySQL login credentials in a text file one directory below where my Perl script is located. This file – named accessTweets contains this information:

database_name
hostname or IP
MySQL user name
password

The scripts are also available on GitHub – https://github.com/ScriptingMySQL/PerlFiles. (For some reason, WordPress likes to change the code.)

rss_oracle.pl

#!/usr/bin/perl -w

use LWP::Simple 'get';
use Encode;
use utf8;
use Text::Unidecode;
use Net::Twitter::Lite::WithAPIv1_1;
use DBI;
use DBD::mysql;
use XML::FeedPP;

    my $source = 'https://www.oracle.com/corporate/press/rss/rss-pr.xml';
    my $feed = XML::FeedPP->new( $source );

foreach my $item ( $feed->get_item() ) {

	$post_id = $item->guid();
	$title = $item->title();

	# this is where you have to replace non-ASCII characters
	# each RSS feed will use different non-ASCII characters
	$title = decode_utf8( $title );
	$title =~ s/’/\^/g;
	$title =~ s/\&\#8217;/^/g;
	$title =~ s/\&\#8216;/^/g;
	$title =~ s/\&\#8212;/-/g;
	$title =~ s/\&\#8230;/-/g;
	$title =~ s/'/\^/g;
	$title =~ s/‘/\^/g;
	$title =~ s/’/^/g;
	$title =~ s/…/.../g;
	$title =~ s/—/-/g;
	$title =~ s/-/-/g;
	$title =~ s/–/-/g;
	$title =~ s/ 8212 /-/g;
	$title =~ s/ 8230 /-/g;
	$title =~ s/<em>//g;
	$title =~ s/<\/em>//g;
	$title =~ s/[^a-zA-Z0-9 ~,._*:?\$^-]//g;

	$link = $item->link();

	# uncomment this line to test
	#print "$post_id | $title | $link\n";
	
	# see if we already have this post in the RSS database	
	$dbh = ConnectToMySql($Database);
	$query = "select id_post FROM rss_oracle where id_post = '$post_id' limit 1";	
	$sth = $dbh->prepare($query);
	$sth->execute();

        #print "\n$query\n\n";

	# loop through our results - one user at a time
	while (@data = $sth->fetchrow_array()) {
		$id_post_found = $data[0];
	# end - while
	}

		#print "id_post_found $id_post_found \n";

		if (length($id_post_found) > 1)
			
		{
			#print "Found $id_post_found...\n";
			$id_post_found = "";
		}
			
		else
			
		{
			$dbh2 = ConnectToMySql($Database);
			$query2 = "insert into rss_oracle (id_post, post_title, post_url) values ('$post_id', '$title', '$link')";	
			$sth2 = $dbh2->prepare($query2);
			# during testing, comment this next line to prevent the data from being inserted into the database
			$sth2->execute();
			#print "$query2\n";

			$title = "";
			$link = "";
			$id_post_found = "";
			
		}

	# foreach my $item
    }

exit;

#----------------------------------------------------------------------
sub ConnectToMySql {
#----------------------------------------------------------------------

   open(PW, "<..\/accessTweets") || die "Can't access login credentials";
   my $db= <PW>;
   my $host= <PW>;
   my $userid= <PW>;
   my $passwd= <PW>;

   chomp($db);
   chomp($host);
   chomp($userid);
   chomp($passwd);
   
   my $connectionInfo="dbi:mysql:$db;$host:3306";
   close(PW);

   # make connection to database
   my $l_dbh = DBI->connect($connectionInfo,$userid,$passwd);
   return $l_dbh;

}

And here is the script to do the tweeting. You will need to add your Twitter Name to the $My_Twitter_User variable.

tweet-oracle.pl

#!/usr/bin/perl
 
use Net::Twitter::Lite::WithAPIv1_1;
use DBI;
use DBD::mysql;

$My_Twitter_User = "YourTwitterNameHere";

# ----------------------------------------------------------------------------------
# this has to be near the top - as other parts of the script rely on these figures
# ----------------------------------------------------------------------------------

my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime time;

$year = $year + 1900;
$mon = $mon + 1;

# add a zero if the value is less than 10

if ($sec < 10) { $sec = "0$sec"; }
if ($min < 10) { $min = "0$min"; }
if ($hour < 10) { $hour = "0$hour"; }
if ($mday < 10) { $mday = "0$mday"; }
if ($mon < 10) { $mon = "0$mon"; }
if ($year < 10) { $year = "0$year"; }
if ($wday < 10) { $wday = "0$wday"; }
if ($yday < 10) { $yday = "0$yday"; }
if ($isdst < 10) { $isdst = "0$isdst"; }

$DateTime = "$year-$mon-$mday $hour:$min:$sec";

# ----------------------------------------------------------------------------
# get one RSS item that has not been tweeted yet
# (the WHERE clause is an assumption - adjust it if you track sent tweets differently)
# ----------------------------------------------------------------------------
$dbh = ConnectToMySql($Database);
$query = "select id_post, post_title, post_url FROM rss_oracle where tweet_sent IS NULL order by id limit 1";
$sth = $dbh->prepare($query);
    $sth->execute();
    
    #print "$query \n";
    
			while (@data = $sth->fetchrow_array()) {
            
					$id_post = $data[0];
					$post_title = $data[1];
					$post_url = $data[2];

				$post_title =~s/  / /g;
				$post_title =~s/ ampamp / and /g;

			# while (@data = $sth->fetchrow_array()) {
			}
			
if (length($post_title) > 100)

	{
		$title_trimmed = substr($post_title,0,105);
		$title_trimmed = "$title_trimmed...";
		#print "TRIM $title_trimmed\n";

			if ($title_trimmed =~ " Oracle ")

			{
				$title_trimmed =~ s/ Oracle / \#Oracle /;
			}

			else

			{
				$add_Hashtag = " \#Oracle ";
			}		


	}
	
	else
	
	{
		$title_trimmed = "$post_title";
		$title_trimmed =~ s/ Oracle / \#Oracle /;
		#print "x $title_trimmed\n";
	}


$tweet = "$title_trimmed $add_Hashtag \n\n$post_url";

$tweet =~ s/  / /g;
$tweet =~ s/  / /g;

$add_Hashtag = "";

#print "$tweet \n";

#exit;

# ----------------------------------------------------------------------------
# find carets and replace them with single quotes
# ----------------------------------------------------------------------------

$tweet =~ s/\^/\'/g;

# ----------------------------------------------------------------------------
# send tweet
# ----------------------------------------------------------------------------


# Credentials for your twitter application
# You will need to substitute your own values for these variables
my $nt = Net::Twitter::Lite::WithAPIv1_1->new(
      traits              => [qw/API::RESTv1_1/],
      consumer_key        => "$consumer_key",
      consumer_secret     => "$consumer_secret",
      access_token        => "$access_token",
      access_token_secret => "$access_token_secret",
      apiurl => 'http://api.twitter.com/1.1',
      ssl                 => 1
);
	# send the tweet
	my $results = eval { $nt->update("$tweet") };

#		    print "---DUMPER START---\n";
#			print Dumper $results;
#		    print "---DUMPER END---\n\n";


# ----------------------------------------------------------------------------
# update mysql with new date for last_tweet date/time
# ----------------------------------------------------------------------------

$dbh = ConnectToMySql($Database);
$query = "UPDATE rss_oracle SET tweet_sent_date = '$DateTime' , tweet_sent = 'yes' where id_post = '$id_post'";
$sth = $dbh->prepare($query);
$sth->execute();

# ----------------------------------------------------------------------------
# get the status id of the last tweet
# ----------------------------------------------------------------------------

my $statuses = $nt->user_timeline({ user => "$My_Twitter_User", count=> 1 });

for my $status ( @$statuses ) {
	$tweet_id = "$status->{id}";
#	print "Tweet ID $tweet_id\n";
}


# ----------------------------------------------------------------------------
# replace special characters
# ----------------------------------------------------------------------------

$tweet =~ s/\\\n/~/g;
$tweet =~ s/\n/~/g;
$tweet =~ s/\'/^/g;

# update mysql with new date for last_tweet date/time

$dbh = ConnectToMySql($Database);	
$query = "insert into history (tweet,tweet_id,tweet_update) values ('$tweet','$tweet_id','$DateTime')";
$sth = $dbh->prepare($query);
$sth->execute();

#----------------------------------------------------------------------
sub ConnectToMySql {
#----------------------------------------------------------------------

   my ($db) = @_;

   open(PW, "<..\/accessTweets") || die "Can't access login credentials";
   my $db= <PW>;
   my $host= <PW>;
   my $userid= <PW>;
   my $passwd= <PW>;

   chomp($db);
   chomp($host);
   chomp($userid);
   chomp($passwd);
   
   my $connectionInfo="dbi:mysql:$db;$host:3306";
   close(PW);

   # make connection to database
   my $l_dbh = DBI->connect($connectionInfo,$userid,$passwd);
   return $l_dbh;

}

I am not the best Perl programmer, nor am I an expert at the Twitter API, so there may be a better/easier way to do this. Good luck with the scripts and let me know how they work for you. And follow me on Twitter at ScriptingMySQL and TonyDarnell.

 


Tony Darnell is a Principal Sales Consultant for MySQL, a division of Oracle, Inc. MySQL is the world’s most popular open-source database program. Tony may be reached at info [at] ScriptingMySQL.com and on LinkedIn.
Tony is the author of Twenty Forty-Four: The League of Patriots

 

Visit http://2044thebook.com for more information.

Use MySQL and Perl to automatically find, follow and unfollow twitter users

A friend of mine asked me how he could automatically follow and unfollow people on Twitter. But he didn’t want to follow just anyone and everyone. He had a Twitter account which he used for recruiting in a very narrow niche of the construction industry. He wanted to find people in the same industry and follow them – hoping they would follow him back and learn about his open jobs. When I joined Twitter back in 2008, I wrote a similar program to automatically follow/unfollow users, but the Twitter API has changed quite a bit since then. So I decided to re-write the program with the latest Perl-Twitter API – Net::Twitter::Lite::WithAPIv1_1.

Before you attempt to use these scripts, you will need to register your application with twitter via apps.twitter.com, and obtain the following:

consumer_key
consumer_secret
access_token
access_token_secret

When running these types of scripts on Twitter, you have to be careful to not surpass the rate limits that Twitter has established for using the API.  I have included a script named rate_limit.pl – which can give you the limits you have for each API-call, and how many calls you have remaining before the limits are reset by Twitter.  I have added pauses in the scripts to help prevent you from going over the rate limits (but you will need to check to ensure you don’t surpass them).  You are free to modify these scripts (but I won’t be able to help you figure out how to do this.)  The rate limits may be found at https://dev.twitter.com/rest/public/rate-limiting.  The key with using the Twitter API is to not be too aggressive, or your app will be banned by Twitter. For example, Twitter does not allow bulk follows and unfollows – so having patience is important.
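If you want to check a specific limit before kicking off a long run, here is a minimal sketch (it assumes you have already filled in the four credential variables listed above, and that the response follows Twitter’s documented rate_limit_status layout, with the per-call counters nested under a resources key):

#!/usr/bin/perl

use Net::Twitter::Lite::WithAPIv1_1;

# substitute your own application credentials for these variables
my $nt = Net::Twitter::Lite::WithAPIv1_1->new(
      consumer_key        => "$consumer_key",
      consumer_secret     => "$consumer_secret",
      access_token        => "$access_token",
      access_token_secret => "$access_token_secret",
      ssl                 => 1
);

# rate_limit_status returns a hash reference describing every rate-limited resource
my $limits = $nt->rate_limit_status;

# drill down to the /followers/ids call used by followers_find.pl
my $followers_ids = $limits->{resources}{followers}{'/followers/ids'};

print "limit....: $followers_ids->{limit}\n";
print "remaining: $followers_ids->{remaining}\n";
print "reset....: $followers_ids->{reset}\n";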

There are several programs involved, and all of them utilize MySQL databases to store the information.  A summary of the scripts is as follows:

followers_find.pl – To use this script, you choose a Twitter user that has an audience similar to yours, and then follow their users.  For example, if you want to follow cat lovers, you could grab the followers of the Twitter user named @Cat.  This script will pull the last 5,000 followers of @Cat, and place those user ID’s into a database named follows_other_users.  The script will also save the cursor information, so you can run this script multiple times to obtain a large pool of users, and avoid duplicates.  This script stores the account you followed (@Cat) and the user_id of the follower.

friend_lookup.pl – This script takes 100 of the user ID’s from the follows_other_users database, pulls the user’s details from Twitter and inserts this information into the twitter_users database.  The user information includes the name, the user ID, number of tweets, how many followers, how many people they are following, time zone and description.  You can modify the database and the script to include or omit other pieces of information.

follow_user.pl – This script follows users from the twitter_users database, based upon the percentage of followers/following and the number of tweets (you can change the search criteria).  For example, I didn’t want to follow someone who was following 2,000 people but only had 100 followers.  I wanted the followers/following ratio to be a little more even.  I also wanted to follow people who had posted at least 30 tweets.  Even though the followers_find.pl script downloads the information for 5,000 users (at a time), you might only follow a couple hundred of these users who fit your criteria.
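(The twitter_users table below stores this ratio in the percent_follow column. Treating it as followers divided by following, times 100: someone with 1,200 followers who follows 1,000 people works out to 120, while the 100-follower/2,000-following account above works out to only 5.)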

friends_follow_check.pl – This script will check to see if a user you followed has followed you back, and if not, then the script will unfollow that user. You will need to specify how many days to give someone to follow you back before you unfollow them.  For example: You follow a group of users on 10/05/2015 (the database stores what date you follow someone). You decide to wait five days to see if anyone from this group follows you back. Therefore, on 10/10/2015, you can run this script and change the $date_to_delete variable to 2015-10-05, and the script will unfollow anyone you followed on 2015-10-05 (or prior) who is not following you back.
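For example, if you followed a batch of users on 2015-10-05 and waited five days, a query along these lines (a sketch against the twitter_users table defined below) shows which of those follow requests the script would check on 2015-10-10:

select user_id, sent_request_datetime from twitter_users where sent_follow_request = 'yes' and followed_me IS NULL and sent_request_datetime <= '2015-10-05';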

This diagram shows you the steps for each script and to what database they connect:

More details on each script:

followers_find.pl – Use this to grab followers of a related Twitter user by providing a value for $user_to_find_followers. This value should be the Twitter user’s name without the “@” symbol – and not their description name. The script will insert 5,000 followers at a time into follows_other_users and insert cursor information in user_cursors.

This script uses the followers_ids API call, which has a limit of 15 every 15 minutes. When you run the rate_limit.pl script, you can see how many calls are remaining:

'followers' => {
'/followers/ids' => {
'limit' => 15,
'remaining' => 14,
'reset' => 1445026087,
},

friend_lookup.pl – Selects users from follows_other_users, gets the user’s details, and then inserts the information into twitter_users. The script can get information on 100 users at a time. Some users may produce an error, and the script will stop. If this happens, just re-run the script. If the script still has an error, delete that user from the database.

This script uses the lookup_users API call, which has a limit of 180 every 15 minutes. When you run the rate_limit.pl script, you can see how many calls are remaining:

'/users/lookup' => {
'limit' => 180,
'remaining' => 179,
'reset' => 1445028838,
},

For some reason, when I ran this script, I would get an error on a particular user. I spent a lot of time trying to figure out why the error occurred, but there were too many unknowns as to why it happened. So, when the script finds a bad user, it updates the follows_other_users database and marks that user with an error (err). If the script exits immediately after running without marking the user with an error, you may have to delete that user from follows_other_users by hand.

follow_user.pl – Selects users from twitter_users database and follows them if they meet certain criteria. The script also updates the twitter_users database as to whether or not they were followed, and what date/time they were followed.

This script uses the create_friend API call. The web site does not specify the limit, and the limit does not appear when you run the rate_limit.pl script. I only follow 10-20 new friends an hour – to avoid Twitter’s ban on automatic bulk follow/unfollow.

Before you run this script, you want to be sure that you have enough users in the twitter_users database that fit your search criteria. Use this SQL command to find the number of users available for you to follow: (and feel free to modify the criteria)

select user_id FROM twitter_users where sent_follow_request IS NULL and percent_follow > 90 and percent_follow < 150 and statuses_count > 30;

The number of users to follow is set with the $limit variable, and I have it set to 250. This means the script will follow 250 users before quitting. I used a sleep command (a random-length pause between six minutes ($minimum = 360;) and twelve minutes ($maximum = 720;) between following users so Twitter doesn’t think you are a robot. You may adjust these values as well.

friends_follow_check.pl – Selects users you followed from the twitter_users database and unfollows them if they haven’t followed you. It updates twitter_users with the unfollow information.

After you run the follow_user.pl script, you will need to wait a few days to give people time to follow you back. You then will need to change the variable $date_to_delete to be a few days prior to the current date. I usually give people five days to follow me.

You can always change your search criteria to be less restrictive, in order to find more followers. But I have found that a strict search criteria removes most of the spammers.

This script uses the lookup_friendships API call, which has a limit of 15 every 15 minutes. When you run the rate_limit.pl script, you can see how many calls are remaining:

'/friendships/lookup' => {
'limit' => 15,
'remaining' => 14,
'reset' => 1445031488,
},

In each of the scripts, you have an option to print out the results from the Twitter API call. You will need to uncomment (remove the #) from these lines:

#    print "---DUMPER START---\n";
#    print Dumper $followers_list;
#    print "---DUMPER END---\n\n";

Also, there are print statements that have been commented out as well. Uncomment them if you want to see the output.


Here are the CREATE TABLE statements for each database. Some fields are longer than you would think they should be, but I did this to leave room for special characters which are longer than one character (I had to use decode_utf8 on names and descriptions):

CREATE TABLE `follows_other_users` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_to_find_followers` varchar(16) DEFAULT NULL,
  `follower_id` varchar(32) DEFAULT NULL,
  `looked_up_info` char(3) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;

CREATE TABLE `user_cursors` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` varchar(16) DEFAULT NULL,
  `next_cursor` varchar(48) DEFAULT NULL,
  `previous_cursor` varchar(48) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;

CREATE TABLE `twitter_users` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(56) DEFAULT NULL,
  `screen_name` varchar(32) DEFAULT NULL,
  `user_id` varchar(16) DEFAULT NULL,
  `sent_follow_request` varchar(3) DEFAULT NULL,
  `sent_request_datetime` datetime DEFAULT NULL,
  `followed_me` varchar(3) DEFAULT NULL,
  `unfollowed_them` varchar(3) DEFAULT NULL,
  `statuses_count` int(11) DEFAULT NULL,
  `following_count` int(11) DEFAULT NULL,
  `followers_count` int(11) DEFAULT NULL,
  `percent_follow` int(11) DEFAULT NULL,
  `time_zone` varchar(256) DEFAULT NULL,
  `description` varchar(4096) DEFAULT NULL,
  `creation_datetime` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;

And here are the scripts. Remember you need to create your own keys and tokens and insert them into the script for $consumer_key, $consumer_secret, $access_token and $access_token_secret.

In the subroutine ConnectToMySql used in the Perl scripts, I store the MySQL login credentials in a text file one directory above where my Perl script is located (the scripts open it as ../accessTweets). This file – named accessTweets – contains this information:

database_name
hostname or IP
MySQL user name
password
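For example, the accessTweets file might contain something like this (placeholder values – substitute your own database name, host, MySQL user name and password, one per line, in this order):

scripting_mysql
192.168.1.2
twitter_app
MySecretPassword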

The scripts are also available on GitHub – https://github.com/ScriptingMySQL/PerlFiles.

followers_find.pl

#!/usr/bin/perl
# Updated 2015-10-25

use Net::Twitter::Lite::WithAPIv1_1;
use DBI;
use DBD::mysql;
use Data::Dumper;
use Scalar::Util 'blessed';
# ----------------------------------------------------------------------------------
# this has to be near the top - as other parts of the script rely on these figures
# ----------------------------------------------------------------------------------

my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime time;

$year = $year + 1900;
$mon = $mon + 1;

# add a zero if the value is less than 10

if ($sec < 10) { $sec = "0$sec"; }
if ($min < 10) { $min = "0$min"; }
if ($hour < 10) { $hour = "0$hour"; }
if ($mday < 10) { $mday = "0$mday"; }
if ($mon < 10) { $mon = "0$mon"; }
if ($year < 10) { $year = "0$year"; }
if ($wday < 10) { $wday = "0$wday"; }
if ($yday < 10) { $yday = "0$yday"; }
if ($isdst < 10) { $isdst = "0$isdst"; }

$DateTime = "$year-$mon-$mday $hour:$min:$sec";

# ----------------------------------------------------------------------------------

# Credentials for your twitter application - blog
my $nt = Net::Twitter::Lite::WithAPIv1_1->new(
      traits              => [qw/API::RESTv1_1/],
      consumer_key        => "$consumer_key",
      consumer_secret     => "$consumer_secret",
      access_token        => "$access_token",
      access_token_secret => "access_token_secret",
      apiurl => 'http://api.twitter.com/1.1',
      ssl                 => 1
);


$count=1;

# twitter user name without the @
$user_to_find_followers = "Cat";

$dbh = ConnectToMySql($Database);
$query = "select user_id, next_cursor FROM user_cursorswhere user_id = '$user_to_find_followers' order by id desc limit 1";	
$sth = $dbh->prepare($query);
$sth->execute();

print "\n$query\n";

$count_users = 0;

#			print "name | friends_count | followers_count | statuses_count | percent_follow\% |$time_zone | description | creation_datetime\n";
# loop through our results - one tweet at a time
while (@data = $sth->fetchrow_array()) {
$cursor = $data[1];

if ($cursor == 0)

{
	$cursor = "-1";
}

}

print "Starting at cursor: $data[1]\n";
# Use the optional cursor parameter to retrieve IDs in pages of 5000. When the cursor parameter is used, 
# the return value is a reference to a hash with keys previous_cursor, next_cursor, and ids. 
# The value of ids is a reference to an array of IDS of the user's followers. 
# Set the optional cursor parameter to -1 to get the first page of IDs. 
# Set it to the prior return's value of previous_cursor or next_cursor to page forward or backwards. 
# When there are no prior pages, the value of previous_cursor will be 0. 
# When there are no subsequent pages, the value of next_cursor will be 0.

  eval {

		my $followers_list = $nt->followers_ids({
        screen_name => "$user_to_find_followers",
        cursor      => "$cursor",
        });

#		count => 1
#    print "---DUMPER START---\n";
#	print Dumper $followers_list;
#    print "---DUMPER END---\n\n";

			$next_cursor = $followers_list->{next_cursor_str};
			$previous_cursor = $followers_list->{previous_cursor_str};
			
			print "next_cursor $next_cursor - previous_cursor $previous_cursor \n";

		for my $status2 ( @{$followers_list->{ids}} ) {
        # print "$count $status $next_cursor\n";

			$follower_id = $status2;
		
			# uncomment to watch as it prints each user
			#print "$count $user_to_find_followers $follower_id\n";

			$dbh = ConnectToMySql($Database);	
			$query = "insert into follows_other_users(user_to_find_followers, follower_id) values ('$user_to_find_followers','$follower_id')";
			#print "\n $query\n";
			$sth = $dbh->prepare($query);
			$sth->execute();

		#sleep 1;

		$count++;

	# end for my $status
    }

# end eval    
};
			if ( my $err = $@ ) {
				die $@ unless blessed $err && $err->isa('Net::Twitter::Lite::Error');

				warn "HTTP Response Code: ", $err->code, "\n",
          	 "HTTP Message......: ", $err->message, "\n",
         	  "Twitter error.....: ", $err->error, "\n";
			}
		# put this into a database in case you want to search for more of their followers
		print "\n$user_to_find_followers $next_cursor $previous_cursor\n";

		$user_id = $user_to_find_followers;

		$dbh = ConnectToMySql($Database);	
		$query = "insert into user_cursors(user_id, next_cursor, previous_cursor) values ('$user_id','$next_cursor','$previous_cursor')";
		print "\n $query\n";
		$sth = $dbh->prepare($query);
		$sth->execute();

# ----------------------------------------------------------------------------------
# this has to be near the top - as other parts of the script rely on these figures
# ----------------------------------------------------------------------------------

my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime time;

$year = $year + 1900;
$mon = $mon + 1;

# add a zero if the value is less than 10

if ($sec < 10) { $sec = "0$sec"; }
if ($min < 10) { $min = "0$min"; }
if ($hour < 10) { $hour = "0$hour"; }
if ($mday < 10) { $mday = "0$mday"; }
if ($mon < 10) { $mon = "0$mon"; }
if ($year < 10) { $year = "0$year"; }
if ($wday < 10) { $wday = "0$wday"; }
if ($yday < 10) { $yday = "0$yday"; }
if ($isdst < 10) { $isdst = "0$isdst"; }

$DateTime = "$year-$mon-$mday $hour:$min:$sec";

# ----------------------------------------------------------------------------------


print "Finished importing - $DateTime....\n";
print "\n----------------------------------------------------------------------------\n";

#----------------------------------------------------------------------
sub ConnectToMySql {
#----------------------------------------------------------------------

   my ($db) = @_;

   open(PW, "<..\/accessTweets") || die "Can't access login credentials";
   my $db= <PW>;
   my $host= <PW>;
   my $userid= <PW>;
   my $passwd= <PW>;

   chomp($db);
   chomp($host);
   chomp($userid);
   chomp($passwd);
   
   my $connectionInfo="dbi:mysql:$db;$host:3306";
   close(PW);

   # make connection to database
   my $l_dbh = DBI->connect($connectionInfo,$userid,$passwd);
   return $l_dbh;

}

exit;

friend_lookup.pl

#!/usr/bin/perl
# Updated 2015-10-25

use Net::Twitter::Lite::WithAPIv1_1;
use DBI;
use DBD::mysql;
use Data::Dumper;
use Scalar::Util 'blessed';
use Encode;
       
# ----------------------------------------------------------------------------
# get the relationship between my user name and another user name to see
# if they are following me or if I am following them
# ----------------------------------------------------------------------------

$number = 1;

# you are allowed 180 of these lookups every 15 minutes
# with a 15 second pause at the end of each one, you won't 
# go over the limit
while ($number < 180)

{

# Credentials for your twitter application - blog
my $nt = Net::Twitter::Lite::WithAPIv1_1->new(
      traits              => [qw/API::RESTv1_1/],
      consumer_key        => "$consumer_key",
      consumer_secret     => "$consumer_secret",
      access_token        => "$access_token",
      access_token_secret => "access_token_secret",
      apiurl => 'http://api.twitter.com/1.1',
      ssl                 => 1
);

open(OUT, ">dumper_out.txt") || die "Can't redirect stdout";
$dbh = ConnectToMySql($Database);
$query = "select follower_id, user_to_find_followers FROM follows_other_users where looked_up_info IS NULL limit 100";	
# run to see if you can debug why some users get an error
#$query = "select follower_id, user_to_find_followers FROM follows_other_users where looked_up_info = 'err' limit 100";	
$sth = $dbh->prepare($query);
$sth->execute();

print "\n$query\n";

$count_users = 0;

#			print "name | friends_count | followers_count | statuses_count | percent_follow\% |$time_zone | description | creation_datetime\n";
# loop through our results - one tweet at a time
while (@data = $sth->fetchrow_array()) {

	print "$data[0] ";

	$user_id = $data[0];
	$user_to_find_followers = $data[1];

	if (length($data[0]) < 1)

	{
		# no more users to look up - quit
		exit;
	}

	if ($count_users > 0)

	{

		$users_to_get = "$users_to_get, $user_id";

		push(@data2, "$user_id");

	}

	else

	{

		$users_to_get = "$user_id";
		push(@data2, "$user_id");

	}

	$count_users++;

# end - while (@data = $sth->fetchrow_array()) {
}

#print "$users_to_get\n";

$count = 1;

#while (@data2) {

			print "--------------------------------------------------------------------------------------------------\n";
	eval {

    			my $user_info = $nt->lookup_users({ 
    				user_id => [ "$users_to_get" ] 
    			});

	print OUT "---DUMPER START---\n";
	print OUT Dumper $user_info;
	print OUT "---DUMPER END---\n\n";

			if ( my $err = $@ ) {
				die $@ unless blessed $err && $err->isa('Net::Twitter::Lite::Error');

				warn "\n - HTTP Response Code: ", $err->code, "\n",
				"\n - HTTP Message......: ", $err->message, "\n",
				"\n - Twitter error.....: ", $err->error, "\n";
			}
		for my $status ( @$user_info ) {

			if (length($status->{name}) < 1)
			{
				# skip users with an empty name
				next;
			}

			$name = $status->{name};
			$name =~ s/[^[:ascii:]]//g;
			$name =~ s/[^!-~\s]//g;
			$name = decode_utf8( $name );
			$name =~ s/\'/\^/g;
			print "Working on $name - ";

			$user_id = $status->{id};
			$following_count = $status->{friends_count};
			$followers_count = $status->{followers_count};
			$statuses_count = $status->{statuses_count};
			$time_zone = $status->{time_zone};

			$screen_name = $status->{screen_name};
			$screen_name =~ s/[^[:ascii:]]//g;
			$screen_name = decode_utf8( $screen_name );
			$screen_name =~ s/[^a-zA-Z0-9 _^-]//g;
			$screen_name =~ s/[^!-~\s]//g;

			$description = $status->{description};
			
			if (length($description) < 1)
			{
				$description = "none";
			}

			$description =~ s/[^[:ascii:]]//g;
			$description = decode_utf8( $description );
			$description =~ s/\'/\^/g;

			# convert the Twitter creation date to the MySQL datetime format
			# 'created_at' => 'Wed Nov 09 19:38:46 +0000 2011',

			$created_at = $status->{created_at};
			@creation_date_array = split(" ",$created_at);
		
			$creation_date_month = $creation_date_array[1];
			
			if ($creation_date_month =~ "Jan") { $creation_date_month = "01"}
			if ($creation_date_month =~ "Feb") { $creation_date_month = "02"}
			if ($creation_date_month =~ "Mar") { $creation_date_month = "03"}
			if ($creation_date_month =~ "Apr") { $creation_date_month = "04"}
			if ($creation_date_month =~ "May") { $creation_date_month = "05"}
			if ($creation_date_month =~ "Jun") { $creation_date_month = "06"}
			if ($creation_date_month =~ "Jul") { $creation_date_month = "07"}
			if ($creation_date_month =~ "Aug") { $creation_date_month = "08"}
			if ($creation_date_month =~ "Sep") { $creation_date_month = "09"}
			if ($creation_date_month =~ "Oct") { $creation_date_month = "10"}
			if ($creation_date_month =~ "Nov") { $creation_date_month = "11"}
			if ($creation_date_month =~ "Dec") { $creation_date_month = "12"}
		
			$creation_date_day_of_month = $creation_date_array[2];
			$creation_date_year = $creation_date_array[5];
			$creation_date_time = $creation_date_array[3];
			$creation_datetime = "$creation_date_year-$creation_date_month-$creation_date_day_of_month $creation_date_time";

			# had to add this as the percentage formula below would fail
			if ($following_count < 1)
			
			{
			
				$following_count = 1;
			
			}
			if ($followers_count < 1)
			
			{
			
				$followers_count = 1;
			
			}

			# ratio of followers to following, as a percentage
			$percent_follow = int(($followers_count / $following_count) * 100);

			print "$name | $following_count | $followers_count | $statuses_count | $percent_follow\% | $time_zone | $creation_datetime\n";

			$dbh = ConnectToMySql($Database);	
			$query = "insert into twitter_users (name, screen_name, user_id, statuses_count, following_count, followers_count, percent_follow, time_zone, description, creation_datetime) values ('$name', '$screen_name', '$user_id', '$statuses_count', '$following_count', '$followers_count', '$percent_follow', '$time_zone', '$description', '$creation_datetime')";
			#print "$query\n";
			$sth = $dbh->prepare($query);
			$sth->execute();
			$dbh2 = ConnectToMySql($Database);	
			$query2 = "update follows_other_users set looked_up_info = 'yes' where follower_id = '$user_id'";
			#print " $count ----  $query2\n";
			print "--------------------------------------------------------------------------------------------------\n";
			$sth2 = $dbh2->prepare($query2);
			$sth2->execute();

#sleep 1;

$count++;
		}
# end - eval
};
#599
$number++;

	# if we didn't grab all 100 users, change the last user's status to error
	#print "Count $count\n";
	if ($count < 100)
	{
		$dbh3 = ConnectToMySql($Database);
		$query3 = "update follows_other_users set looked_up_info = 'err' where follower_id = '$user_id'";
		$sth3 = $dbh3->prepare($query3);
		$sth3->execute();
		exit;
	}

# if there aren't any more users, quit
if (length($data[0]) < 1)
{
	exit;
}

# end - while ($number < 180)
}

# end - while (@data = $sth->fetchrow_array()) {
#}
#----------------------------------------------------------------------
sub ConnectToMySql {
#----------------------------------------------------------------------

   my ($db) = @_;

   open(PW, "<..\/accessTweets") || die "Can't access login credentials";
   my $db= <PW>;
   my $host= <PW>;
   my $userid= <PW>;
   my $passwd= <PW>;

   chomp($db);
   chomp($host);
   chomp($userid);
   chomp($passwd);
   
   my $connectionInfo="dbi:mysql:$db;$host:3306";
   close(PW);

   # make connection to database
   my $l_dbh = DBI->connect($connectionInfo,$userid,$passwd);
   return $l_dbh;

}
close(OUT);
exit;

follow_user.pl

#!/usr/bin/perl
# Updated 2015-10-25

use Net::Twitter::Lite::WithAPIv1_1;
use DBI;
use DBD::mysql;
use Data::Dumper;
use Scalar::Util 'blessed';

# ----------------------------------------------------------------------------
# follow users from database
# ----------------------------------------------------------------------------
# Credentials for your twitter application - blog
my $nt = Net::Twitter::Lite::WithAPIv1_1->new(
      traits              => [qw/API::RESTv1_1/],
      consumer_key        => "$consumer_key",
      consumer_secret     => "$consumer_secret",
      access_token        => "$access_token",
      access_token_secret => "access_token_secret",
      apiurl => 'http://api.twitter.com/1.1',
      ssl                 => 1
);

# find the stats and info for the following users

$limit = 250;
$percent_follow_minimum = 80;
$percent_follow_maximum = 140;
$statuses_count_minimum = 30;

$dbh = ConnectToMySql($Database);
$query = "select user_id FROM twitter_users where sent_follow_request IS NULL and percent_follow > $percent_follow_minimum and percent_follow  $statuses_count_minimum limit $limit";	
$sth = $dbh->prepare($query);
$sth->execute();

print "\n$query\n\n";

$count_users = 1;

# 107 following 114 followers

# loop through our results - one tweet at a time
while (@data = $sth->fetchrow_array()) {

# ----------------------------------------------------------------------------------
# this has to be near the top - as other parts of the script rely on these figures
# ----------------------------------------------------------------------------------

my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime time;

$year = $year + 1900;
$mon = $mon + 1;

# add a zero if the value is less than 10

if ($sec < 10) { $sec = "0$sec"; }
if ($min < 10) { $min = "0$min"; }
if ($hour < 10) { $hour = "0$hour"; }
if ($mday < 10) { $mday = "0$mday"; }
if ($mon < 10) { $mon = "0$mon"; }
if ($year < 10) { $year = "0$year"; }
if ($wday < 10) { $wday = "0$wday"; }
if ($yday < 10) { $yday = "0$yday"; }
if ($isdst < 10) { $isdst = "0$isdst"; }

$DateTime = "$year-$mon-$mday $hour:$min:$sec";

# ----------------------------------------------------------------------------------

	$user_id = $data[0];

	print "$count_users of $limit - sending follow request for user_id $user_id";

	eval {

		my $friend = $nt->create_friend({ user_id => "$user_id" });
    
#		    print "---DUMPER START---\n";
#			print Dumper $friend;
#		    print "---DUMPER END---\n\n";

			if ( my $err = $@ ) {
				die $@ unless blessed $err && $err->isa('Net::Twitter::Lite::Error');

				warn "\n - HTTP Response Code: ", $err->code, "\n",
				"\n - HTTP Message......: ", $err->message, "\n",
				"\n - Twitter error.....: ", $err->error, "\n";
			
			# end if
			};
			
			
	# end - eval
	};

			$dbh2 = ConnectToMySql($Database);	
			$query2 = "update twitter_users SET sent_follow_request = 'yes', sent_request_datetime  = '$DateTime' where user_id = '$user_id'";
			#print " $query2\n";
			#print " database updated.\n"
			$sth2 = $dbh2->prepare($query2);
			$sth2->execute();

# pause for a random time so twitter doesn't think you are a robot
# minimum and maximum time in seconds to sleep
$minimum = 360; 
$maximum = 720;
$random_sleep = int($minimum + rand($maximum - $minimum));

print " - sleeping for $random_sleep seconds\n";
sleep $random_sleep;

$count_users++;

# end - while (@data = $sth->fetchrow_array())
};

exit;

#----------------------------------------------------------------------
sub ConnectToMySql {
#----------------------------------------------------------------------

   my ($db) = @_;

   open(PW, "<..\/accessTweets") || die "Can't access login credentials";
   my $db= <PW>;
   my $host= <PW>;
   my $userid= <PW>;
   my $passwd= <PW>;

   chomp($db);
   chomp($host);
   chomp($userid);
   chomp($passwd);
   
   my $connectionInfo="dbi:mysql:$db;$host:3306";
   close(PW);

   # make connection to database
   my $l_dbh = DBI->connect($connectionInfo,$userid,$passwd);
   return $l_dbh;

}

exit;

friends_follow_check.pl

#!/usr/bin/perl
# Updated 2015-10-25

use Net::Twitter::Lite::WithAPIv1_1;
use DBI;
use DBD::mysql;
use Data::Dumper;
use Scalar::Util 'blessed';

# ----------------------------------------------------------------------------
# see if a user follows me and/or if i follow them
# ----------------------------------------------------------------------------

# Credentials for your twitter application - blog
my $nt = Net::Twitter::Lite::WithAPIv1_1->new(
      traits              => [qw/API::RESTv1_1/],
      consumer_key        => "$consumer_key",
      consumer_secret     => "$consumer_secret",
      access_token        => "$access_token",
      access_token_secret => "access_token_secret",
      apiurl => 'http://api.twitter.com/1.1',
      ssl                 => 1
);

# limit must be 100 or less
$limit = 100;

# how many days do you want to wait until you unfriend someone?

$date_to_delete = "2015-10-21";

# find the stats and info for the following users

$dbh = ConnectToMySql($Database);
$query = "select user_id, sent_request_datetime FROM twitter_users where sent_follow_request = 'yes' and sent_request_datetime prepare($query);
$sth->execute();

print "\n$query\n\n\n";

$count_users = 0;
$count = 0;

# loop through our results - one tweet at a time
while (@data = $sth->fetchrow_array()) {

	$user_id = $data[0];
	$sent_request_datetime{$user_id} = $data[1];

	if ($count_users > 0)

	{
		$users_to_get = "$users_to_get, $user_id";
		push(@data2, "$user_id");
	}

	else

	{
		$users_to_get = "$user_id";
		push(@data2, "$user_id");
	}

$count_users++;

# end - while
}

print "$users_to_get\n\n";

#exit;

eval {

	my $friend = $nt->lookup_friendships({ user_id => "$users_to_get" });
    
#    print "---DUMPER START---\n";
#	print Dumper $friend;
#    print "---DUMPER END---\n\n";

			if ( my $err = $@ ) {
				die $@ unless blessed $err && $err->isa('Net::Twitter::Lite::Error');

				warn "\n - HTTP Response Code: ", $err->code, "\n",
				"\n - HTTP Message......: ", $err->message, "\n",
				"\n - Twitter error.....: ", $err->error, "\n";
			
			# end if
			};

	for $data_list ( @$friend ) {

	$count++;
   
		for $role ( keys %$data_list ) {
        	

        	if ($role =~ "id_str")
        	
        	{
        			# since the connection info is an array, find the values
        			$user_id_str = $data_list->{$role};
 
 					$user_id = $user_id_str;
 
 			# get user_id
 			
 			#print "$user_id_str - ";
 
 
 
 			# once you have the status of the connection and the user_id
 			# you can check to see if they are following you or not
 
				if ($status_friend =~ "followed_by")
        	
				{
        	
					print "*************\nThis person $user_id_str follows you. - $sent_request_datetime";
					print "Status: $status_connection1 $status_connection2\n*************\n";

					$dbh2 = ConnectToMySql($Database);
					$query2 = "update twitter_users set followed_me = 'yes' where user_id = '$user_id_str'";	
					$sth2 = $dbh2->prepare($query2);
					$sth2->execute();

					print "\n$query2\n";
					print "########\n $count of $limit sleeping....\n########\n";
					sleep 155;

				}
    		
				else
    		
				{
					print "This person $user_id_str DOES NOT follow you. - $sent_request_datetime{$user_id}\n";
					print "Status: $status_connection1 $status_connection2\n";

		eval {

		my $friend = $nt->destroy_friend({ user_id => "$user_id" });
    
#		    print "---DUMPER START---\n";
#			print Dumper $friend;
#		    print "---DUMPER END---\n\n";

			if ( my $err = $@ ) {
				die $@ unless blessed $err && $err->isa('Net::Twitter::Lite::Error');

				warn "\n - HTTP Response Code: ", $err->code, "\n",
				"\n - HTTP Message......: ", $err->message, "\n",
				"\n - Twitter error.....: ", $err->error, "\n";
			
			# end if
			};
			
			
	# end - eval
	};

					$dbh3 = ConnectToMySql($Database);
					$query3 = "update twitter_users set followed_me = 'no' where user_id = '$user_id_str'";	
					$sth3 = $dbh3->prepare($query3);
					$sth3->execute();

					print "\n$query3\n";

					$dbh4 = ConnectToMySql($Database);
					$query4 = "update twitter_users set unfollowed_them = 'yes' where user_id = '$user_id_str'";	
					$sth4 = $dbh4->prepare($query4);
					$sth4->execute();

					print "\n$query4\n";
					print "########\n $count of $limit sleeping ";
					# pause for a random time so twitter doesn't think you are a robot
					# minimum and maximum time in seconds to sleep
					$minimum = 60; 
					$maximum = 120;
					$random_sleep = int($minimum + rand($maximum - $minimum));

					print " for $random_sleep seconds\n";
					sleep $random_sleep;

				# unfollow this user
    		    		
				}

 			# end - if ($role =~ "id_str")
 			}

        	# check the status of the connection        	
        	if ($role =~ "connections")
        	
        	{
        			# since the connection info is an array, find the values
        			$status_connection1 = $data_list->{$role}[0];
        			$status_connection2 = $data_list->{$role}[1];
        			$status_connection3 = $data_list->{$role}[2];
        	
        	$status_friend = "$status_connection1 $status_connection2 $status_connection3";

    		# if ($role =~ "connections")
    		}
    		
    		# for $role ( keys %$data_list ) {
    		}
    		
    		# end - for $data_list ( @$friend ) {
    		}

# end - eval
};

# end - while
#}

print "\n\n";

exit;

#----------------------------------------------------------------------
sub ConnectToMySql {
#----------------------------------------------------------------------

   my ($db) = @_;

   open(PW, "<..\/accessTweets") || die "Can't access login credentials";
   my $db= <PW>;
   my $host= <PW>;
   my $userid= <PW>;
   my $passwd= <PW>;

   chomp($db);
   chomp($host);
   chomp($userid);
   chomp($passwd);
   
   my $connectionInfo="dbi:mysql:$db;$host:3306";
   close(PW);

   # make connection to database
   my $l_dbh = DBI->connect($connectionInfo,$userid,$passwd);
   return $l_dbh;

}

rate_limit.pl

#!/usr/bin/perl
# Updated 2015-10-25
 
use Net::Twitter::Lite::WithAPIv1_1;
use DBI;
use DBD::mysql;
use Data::Dumper;
use Scalar::Util 'blessed';
#use Encode;
use JSON;
  
# ----------------------------------------------------------------------------
# get the relationship between my user name and another user name to see
# if they are following me or if I am following them
# ----------------------------------------------------------------------------

# Credentials for your twitter application - blog
my $nt = Net::Twitter::Lite::WithAPIv1_1->new(
      traits              => [qw/API::RESTv1_1/],
      consumer_key        => "$consumer_key",
      consumer_secret     => "$consumer_secret",
      access_token        => "$access_token",
      access_token_secret => "access_token_secret",
      apiurl => 'http://api.twitter.com/1.1',
      ssl                 => 1
);

	eval {

    			my $user_info = $nt->rate_limit_status;

#	print "---DUMPER START---\n";
print Dumper $user_info;
#	print "---DUMPER END---\n\n";

			if ( my $err = $@ ) {
				die $@ unless blessed $err && $err->isa('Net::Twitter::Lite::Error');

				warn "\n - HTTP Response Code: ", $err->code, "\n",
				"\n - HTTP Message......: ", $err->message, "\n",
				"\n - Twitter error.....: ", $err->error, "\n";
			
			# end if
			};


	$friendships_show_remaining = $user_info->{friendships};
	print "friendships_show_remaining $friendships_show_remaining\n";

#		print Dumper $friendships_show_remaining;
			
for my $item( @{$user_info->{friendships}} ){
    print $item->{'/friendships/show'} . "\n";
};

exit;


I am not the best Perl programmer, nor am I an expert at the Twitter API, so there may be a better/easier way to do this. Good luck with the scripts and let me know how they work for you. And follow me on Twitter at ScriptingMySQL and TonyDarnell.

 


Tony Darnell is a Principal Sales Consultant for MySQL, a division of Oracle, Inc. MySQL is the world’s most popular open-source database program. Tony may be reached at info [at] ScriptingMySQL.com and on LinkedIn.
Tony is the author of Twenty Forty-Four: The League of Patriots

 

Visit http://2044thebook.com for more information.

Using Perl to retrieve direct messages from Twitter, insert messages into a MySQL database and then delete the messages

In two earlier posts, I gave some examples on how to use Perl to send tweets stored in a MySQL database to Twitter, and then how to automatically reply to your retweets with a “thanks”. In this post, I will show you how to automatically download your direct messages from Twitter, store the messages in a MySQL database, and then delete them.

I don’t like the way Twitter makes me read my direct messages. Granted, the majority of them are not real messages. The message is usually thanking me for following the sender, and then there is a personal website link or a link to a product they are selling. But if I want to delete a direct message, I have to click on the message, click the trash can, and then confirm I want to delete the message. This process is too cumbersome.

I wrote a Perl script that connects to Twitter, downloads your direct messages, inserts them into a MySQL database, and then deletes the direct message. I had a year of direct messages in my Inbox, and in a few minutes, they were gone. But I still had a copy in my MySQL database, in case I wanted to go back and read them or respond.

Just like in the first post, you will need to register your application with twitter via apps.twitter.com, and obtain the following:

consumer_key
consumer_secret
access_token
access_token_secret

One caveat: twitter has a rate limit on how often you may connect with your application – depending upon what you are trying to do. See Rate Limiting and Rate Limits for more information. So, check your limits before you try downloading a large number of direct messages.

Just like the other two apps, we will be using the Net::Twitter module, as well as a few more modules which are listed in the beginning of the script.

There are two kinds of direct messages – the messages you sent and the messages you have received. So, we will be looking at two different Perl scripts to retrieve each kind of message. In the Net::Twitter module, there are a lot of different variables you can capture. In these examples, I only grabbed what I thought I needed. If there are other data variables you want, you will have to modify the tables and scripts.
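For example, if you also wanted to keep the sender’s location – a field the Twitter API returns on the sender object – you could add a column to the table and capture it inside the for loop. A rough sketch (the column name and variable here are just placeholders):

alter table messages_received add column sender_location varchar(64) DEFAULT NULL;

and, in the script:

		$sender_location = $status->{sender}{location};
		# remove carriage returns and single tick marks
		$sender_location =~ s/\n/ /g;
		$sender_location =~ s/\'/\~/g;

You would then add sender_location (and its value) to the insert statement.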

First, you will need to create two databases to store your direct messages – one for the sent messages and one for the received messages. Here are the CREATE TABLE statements for both tables:

CREATE TABLE `MESSAGES_SENT` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `creation_datetime` datetime DEFAULT NULL,
  `message_id` bigint(20) DEFAULT NULL,
  `sender_screen_name` varchar(16) DEFAULT NULL,
  `recipient_screen_name` varchar(16) DEFAULT NULL,
  `message_text` varchar(140) DEFAULT NULL,
  `sender_friends_count` int(10) DEFAULT NULL,
  `sender_time_zone` varchar(64) DEFAULT NULL,
  `sender_description` varchar(160) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1;


CREATE TABLE `MESSAGES_RECEIVED` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `creation_datetime` datetime DEFAULT NULL,
  `message_id` bigint(20) DEFAULT NULL,
  `sender_screen_name` varchar(16) DEFAULT NULL,
  `recipient_screen_name` varchar(16) DEFAULT NULL,
  `message_text` varchar(140) DEFAULT NULL,
  `sender_friends_count` int(10) DEFAULT NULL,
  `sender_time_zone` varchar(64) DEFAULT NULL,
  `sender_description` varchar(160) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1;

I am only storing a few of the variables. In each script, if you want to see all of the variables which are available, you will need to uncomment this line:

#          print Dumper $statuses;

When using the Dumper command, be sure that you are only downloading a single message. Otherwise, the script will pull all of the variables and their values for all of the messages. You only need one message to see all of the variables. To only retrieve one message, be sure that the value $number_of_messages is equal to 1:

$number_of_messages = 1;

Afterwards, you can change this value to whatever you want – just be sure to watch your Twitter limits.

The print Dumper $statuses; line will display all of the possible variables for a single message. For the “sent” script, there are about 140 variables. For the “received” script, there are about 67 variables.

Here is the “Get direct messages I sent to other people” script, which uses the MESSAGES_SENT database:

#!/usr/bin/perl
 
use Net::Twitter::Lite::WithAPIv1_1;
use DBI;
use DBD::mysql;
use Data::Dumper;
    
# ----------------------------------------------------------------------------
# get twitter direct messages sent
# ----------------------------------------------------------------------------

# you will need to fill in this information about your application and your twitter account
my $nt = Net::Twitter::Lite::WithAPIv1_1->new(
      traits              => [qw/API::RESTv1_1/],
      consumer_key        => "$consumer_key",
      consumer_secret     => "$consumer_secret",
      access_token        => "$access_token",
      access_token_secret => "$access_token_secret",
      ssl                 => 1
);

$number_of_messages = 1;

  eval {
      my $statuses = $nt->sent_direct_messages({ count => $number_of_messages });

# uncomment this line and set the count above to 1 (one) to see all of the variables that are available
#          print Dumper $statuses;

	for my $status ( @$statuses ) {
		
		$creation_date = $status->{created_at};

		# convert $creation_date to MySQL datetime format
		#  0   1   2  3         4     5
		# Fri Sep 04 07:32:05 +0000 2015
		
		@creation_date_array = split(" ",$creation_date);
		
		$creation_date_month = $creation_date_array[1];
		
		if ($creation_date_month =~ "Jan") { $creation_date_month = "01"}
		if ($creation_date_month =~ "Feb") { $creation_date_month = "02"}
		if ($creation_date_month =~ "Mar") { $creation_date_month = "03"}
		if ($creation_date_month =~ "Apr") { $creation_date_month = "04"}
		if ($creation_date_month =~ "May") { $creation_date_month = "05"}
		if ($creation_date_month =~ "Jun") { $creation_date_month = "06"}
		if ($creation_date_month =~ "Jul") { $creation_date_month = "07"}
		if ($creation_date_month =~ "Aug") { $creation_date_month = "08"}
		if ($creation_date_month =~ "Sep") { $creation_date_month = "09"}
		if ($creation_date_month =~ "Oct") { $creation_date_month = "10"}
		if ($creation_date_month =~ "Nov") { $creation_date_month = "11"}
		if ($creation_date_month =~ "Dec") { $creation_date_month = "12"}
		
		$creation_date_day_of_month = $creation_date_array[2];
		$creation_date_year = $creation_date_array[5];
		$creation_date_time = $creation_date_array[3];
		$creation_datetime = "$creation_date_year-$creation_date_month-$creation_date_day_of_month $creation_date_time";

		$message_id = $status->{id};
		
		$sender_screen_name = $status->{sender}{screen_name};
		$sender_screen_name =~ s/\'/\~/g;

		$recipient_screen_name = $status->{recipient_screen_name};
		$recipient_screen_name =~ s/\'/\~/g;
				
		$message_text = $status->{text};
		# remove carriage returns and single tick marks
		$message_text =~ s/\n/ /g;
		$message_text =~ s/\'/\~/g;

		$sender_friends_count = $status->{recipient}{followers_count};

		$sender_time_zone = $status->{sender}{time_zone};
		
		$sender_description = $status->{sender}{description};
		$sender_description =~ s/\n/ /g;
		$sender_description =~ s/\'/\~/g;

		# uncomment this line if you want to print
		# print "$creation_date_day_of_month - $creation_datetime - $message_id - $sender_screen_name - $recipient_screen_name - $message_text - $sender_friends_count - $sender_time_zone - $sender_description\n";


$dbh = ConnectToMySql($Database);	
$query = "insert into messages_sent (creation_datetime,message_id,sender_screen_name,recipient_screen_name,message_text,sender_friends_count,sender_time_zone,sender_description) values ('$creation_datetime','$message_id','$sender_screen_name','$recipient_screen_name','$message_text','$sender_friends_count','$sender_time_zone','$sender_description')";
#print "\nquery $query\n";
$sth = $dbh->prepare($query);
$sth->execute();


# stop the program if we have an error with the database
if ( $sth->err )
{
	die "ERROR! return code:" . $sth->err . " error msg: " . $sth->errstr . "\n";
}

else

{
	 my $destroy_id = eval { $nt->destroy_direct_message("$message_id") };
}



# you can change this so you don't go over your twitter connection limits
sleep 5;


	# end for my $status
	}

# end eval
  };

#----------------------------------------------------------------------
sub ConnectToMySql {
#----------------------------------------------------------------------

   my ($db) = @_;

   open(PW, "<..\/accessTweets") || die "Can't access login credentials";
   my $db= <PW>;
   my $host= <PW>;
   my $userid= <PW>;
   my $passwd= <PW>;

   chomp($db);
   chomp($host);
   chomp($userid);
   chomp($passwd);
   
   my $connectionInfo="dbi:mysql:$db;$host:3306";
   close(PW);

   # make connection to database
   my $l_dbh = DBI->connect($connectionInfo,$userid,$passwd);
   return $l_dbh;

}

Here is the “Get direct messages I received from other people” script, which uses the MESSAGES_RECEIVED database:

#!/usr/bin/perl
 
use Net::Twitter::Lite::WithAPIv1_1;
use DBI;
use DBD::mysql;
use Data::Dumper;
       
# ----------------------------------------------------------------------------
# get twitter direct messages received
# ----------------------------------------------------------------------------

# you will need to fill in this information about your application and your twitter account
my $nt = Net::Twitter::Lite::WithAPIv1_1->new(
      traits              => [qw/API::RESTv1_1/],
      consumer_key        => "$consumer_key",
      consumer_secret     => "$consumer_secret",
      access_token        => "$access_token",
      access_token_secret => "$access_token_secret",
      ssl                 => 1
);

$number_of_messages = 1;

  eval {
      my $statuses = $nt->direct_messages({ count => $number_of_messages });

			# uncomment this line and set the $number_of_messages variable above to 1 (one) to see all of the variables that are available
#			print Dumper $statuses;

	for my $status ( @$statuses ) {
		
		$creation_date = $status->{created_at};

		# convert $creation_date to MySQL datetime format
		#  0   1   2  3         4     5
		# Fri Sep 04 07:32:05 +0000 2015
		
		@creation_date_array = split(" ",$creation_date);
		
		$creation_date_month = $creation_date_array[1];
		
		if ($creation_date_month =~ "Jan") { $creation_date_month = "01"}
		if ($creation_date_month =~ "Feb") { $creation_date_month = "02"}
		if ($creation_date_month =~ "Mar") { $creation_date_month = "03"}
		if ($creation_date_month =~ "Apr") { $creation_date_month = "04"}
		if ($creation_date_month =~ "May") { $creation_date_month = "05"}
		if ($creation_date_month =~ "Jun") { $creation_date_month = "06"}
		if ($creation_date_month =~ "Jul") { $creation_date_month = "07"}
		if ($creation_date_month =~ "Aug") { $creation_date_month = "08"}
		if ($creation_date_month =~ "Sep") { $creation_date_month = "09"}
		if ($creation_date_month =~ "Oct") { $creation_date_month = "10"}
		if ($creation_date_month =~ "Nov") { $creation_date_month = "11"}
		if ($creation_date_month =~ "Dec") { $creation_date_month = "12"}
		
		$creation_date_day_of_month = $creation_date_array[2];
		$creation_date_year = $creation_date_array[5];
		$creation_date_time = $creation_date_array[3];
		$creation_datetime = "$creation_date_year-$creation_date_month-$creation_date_day_of_month $creation_date_time";

		$message_id = $status->{id};
		
		$sender_screen_name = $status->{sender}{screen_name};
		$sender_screen_name =~ s/\'/\~/g;
		
		$message_text = $status->{text};
		# remove carriage returns and single tick marks
		$message_text =~ s/\n/ /g;
		$message_text =~ s/\'/\~/g;

		$sender_friends_count = $status->{sender}{friends_count};

		$sender_time_zone = $status->{sender}{time_zone};
		
		$sender_description = $status->{sender}{description};
		$sender_description =~ s/\n/ /g;
		$sender_description =~ s/\'/\~/g;

		$recipient_screen_name = $status->{recipient_screen_name};
		$recipient_screen_name =~ s/\'/\~/g;
		
		# uncomment this line if you want to see the output
		# print "$creation_datetime - $message_id - $sender_screen_name - $recipient_screen_name - $message_text - $sender_friends_count - $sender_time_zone - $sender_description\n";

$dbh = ConnectToMySql($Database);	
$query = "insert into messages_received (creation_datetime,message_id,sender_screen_name,recipient_screen_name,message_text,sender_friends_count,sender_time_zone,sender_description) values ('$creation_datetime','$message_id','$sender_screen_name','$recipient_screen_name','$message_text','$sender_friends_count','$sender_time_zone','$sender_description')";
#print "\nquery $query\n";
$sth = $dbh->prepare($query);
$sth->execute();

# stop the program if we have an error with the database
if ( $sth->err )
{
	die "ERROR! return code:" . $sth->err . " error msg: " . $sth->errstr . "\n";
}

else

{
	 my $destroy_id = eval { $nt->destroy_direct_message("$message_id") };
}

# you can change this so you don't go over your twitter connection limits
sleep 5;

	# end for my $status
	}

# end eval
  };

#----------------------------------------------------------------------
sub ConnectToMySql {
#----------------------------------------------------------------------

   my ($db) = @_;

   open(PW, "<..\/accessTweets") || die "Can't access login credentials";
   my $db= <PW>;
   my $host= <PW>;
   my $userid= <PW>;
   my $passwd= <PW>;

   chomp($db);
   chomp($host);
   chomp($userid);
   chomp($passwd);
   
   my $connectionInfo="dbi:mysql:$db;$host:3306";
   close(PW);

   # make connection to database
   my $l_dbh = DBI->connect($connectionInfo,$userid,$passwd);
   return $l_dbh;

}

In the subroutine ConnectToMySql, I store the MySQL login credentials in a text file one directory above where my Perl script is located (the scripts open it as ../accessTweets). This file – named accessTweets – contains this information:

database_name
hostname or IP
MySQL user name
password

You may hardcode these values into the script if you don’t want to mess with storing them in a file.

I tested these scripts on two twitter accounts, and everything worked for me – but I ran out of messages quickly. Let me know if you have problems. I am not the best Perl programmer, nor am I an expert at the Twitter API, so there may be a better/easier way to do this.

 


Tony Darnell is a Principal Sales Consultant for MySQL, a division of Oracle, Inc. MySQL is the world’s most popular open-source database program. Tony may be reached at info [at] ScriptingMySQL.com and on LinkedIn.
Tony is the author of Twenty Forty-Four: The League of Patriots

 

Visit http://2044thebook.com for more information.