| Review: MySQL Data loading tools for importing data
Feature: There are a remarkable number of ways to get data into MySQL databases
The MySQL suffers a welcome problem of an embarrassment of rich utilities for importing data into its database. But lurking in that embarrassment is a hard problems - MySQL's storage engines present database designers with a tough questions - which engine to use for their application.
Now many users will say the question is between MyISAM (used primarily for querying and data warehousing applications) and INNODB (used for high transaction systems with much table locking and updates). However, MySQL allows for a number of storage engines depending on the database task at hand. Since this will be just a demo database used primarily for queries we have chosen to use MyISAM. However, developers are encouraged to read Chapter 14 Storage Engines and Table Types to get the low down on this important topic especially given the special clustering and new high performance engines recently made available to MySQL users.
How to Do the Data Load
In some cases data loading is really data transfer - users are either replicating data between master and slave databases, using Replication procedures or they are doing a one time data migration from some existing database to MySQL. Finally, there is a whole category of adhoc data loads from one or more data sources including CSV-Comma Separated Variables in text files, Excel spreadsheets, or other desktop data sources. In this review, I am considering the latter case of creating the tables and loading up a database from desktop resources - in this case the Lahman Baseball database using CSV files. However, we also cover the database migration task as well with a review of the MySQL Migration Toolkit review here.
Of course the broad topic of creating a MySQL database from several desktop data sources can quickly reel out of hand. So what is done here is to simplify to filling in the some of the first tables of the Lahman Baseball database. In addition, I am using the GUI MySQL Administrator rather than the command line mode of operation .I think readers will appreciate the ease of use advantages that MySQL Administrator brings to the database creation and loading process.
Step by Step for BaseBall Managers
Our database load problem is actually a fun one. We are going to create the tables associated with the Lahman Baseball database which has in 21 tables all the statistical records for all baseball players going back to 1876. The tables are available in two formats: 1)an Access database and 2)as a series of 21 CSV files. Since MySQL can read CSV files directly we are going to use that method; however the Migration Toolkit does provide a means of reading Access .mdb database files so go here to see how that approach works.
I am starting from scratch so I will have to create the database (appropriately named baseball) and then the first table,awardsmanagers. All the info about the 21 tables and their fields and other properties are contained in the readme54.txt file from lahman54_csv.zip file. One problem primary keys, foreign keys and index information is not supplied in the zip (but is in the Access .mdb file) - so we are going to have to some DB intuition work.
I am going to use MySQL's GUI Administrator and Query Browser tools but I will have commentary about use of the MySQL Command Line as well. Our first step is to launch MySQL Administrator.
The first screen shows the login for MySQL Administrator:
This is the consistent look and feel of the signon and connection dialog used with all of the MySQL GUI tools. We have not supplied a predefined stored connection. Instead, we enter host, username and password and are taken to the Administrator. I click on Catalogs in the command panel on the top left of the screen - and the following info appears:
In the Administrator I will first create the database, baseball, then create the the first table, awardsmanagers. To create the database, right click anywhere in the lower left panel labeled Schemata (= Databases for all the rest of us). Choose the Create New Table menu item and a popup window asks to enter the Schema name - to which I reply baseball - and then database baseball is created.
Now all of the information that was supplied during installation of MySQL is used now to configure the database - collation sequence, character set, etc. However, Command Line experts will claim that the following step is just as easy - just enter the Command lIne Processor and then issue the command: Create Database baseball; - and you are done. Enter Help Create Database; and all the parameters for creating the database are revealed. So why use the Administrator? Bear with me, we shall find out soon enough.
The next step is to create the table awardsmangers. Click on the the baseball database in the list of Schemata/Databases and ten click on the Create Table button on the bottom of the screen. That will bring you to the Administrator's MySQL Table Editor as seen below. This is the reason why I use Administrator.
I just entered the table name awardsmanagers and the table will be created. But to get all the advanced table options right (none needed in the case of our baseball database but if you do any data warehousing or time critical transaction processing I assure you that you will get into advanced table options. And on the command line if you misspell or make a mistake it is Error 1064 and try to figure out where you went wrong. Or back up, delete and repent your sins and try, try again.
I like the explanations as seen above right next to each option - minimizes mistakes and does require me to memorize the options syntax. For this highly fallible typer, it is a god-send.
But for those who have perfect memories and infallibly fast typing skills please do Command Line away. For the rest of us I have filled in most of the fields for awardsmanagers:
Notice that you are just filling in a table about your table - name, datatype, nulls allowed, etc. But what makes the table editor so helpful is that you have a number of aids to filling in the datafields correctly. First, there is a pulldown for the datatypes - no misspellings or TIMEDATE (its DATETIME)mistakes. Next, Primary Key and Not Null columns are convenient - and you can't create a Primary Key without clicking Not Null first. When you do choose a datatype, its options popup in the Flag column of the table so you can click UNSIGNED for an integer, etc.
In short, I find it gets harder to make mistakes in creating or altering tables (same screen is used for ALTER TABLE). And even better, when you press the APPLY CHANGES button, the program shows you the CREATE TABLE (or ALTER TABLE) command as it asks you to confirm the changes. Very nice way to confirm you are getting what you think you ordered.
So when we exit from the Table Editor the Schema table shows the new table:
now we have created our first baseball database table lets see if MySQL can see it. For that operation and loading of the data I switch to the Query Browser. Again I know I could do this using the Command Line - but again there are benefits to using the Query Browser.
Click Tools | MySQL Query Browser from the Administrator menu and here is what pop up:
When we do the query, The table is empty as expected - loading up the data is the next step. Also note how easy it is to get the query started. I just dragged the table name onto the query windows and the general select statement was automatically generated.
So now that we know the table is ready - how do we load it. To get the exact syntax:
Query Browser provides detailed syntax for all the DCL-Data Control Language and DML-Data Manipulation Language of MySQL. It is in the Help box in the lower right hand panel. By clicking on LOAD DATA INFILE we have all the syntax readily copied for insertion into our Query Window. And the Help here is a notch above and much more quickly accessed than that in MySQL's Master Help file. This along with drag and drop convenience plus output formatting is the reason I find myself working in the Query Browser more often than on the Command Line.
So with our syntax HELP we build up the LOAD FILE command:
Here is where we found MySQL a little weak. First, we had to move our CSV datafiles over to where MySQL expected to find them - in this case in the same directory where the baseball database is located under the C:/Program files/MySQL/Servers5.0/data/baseball directory. Second, for reasons unknown I absolutely could not get MySQL to read in awardsmanagers.csv - I had to rename it to awardsmanagers.txt to get the load to work. Third, the error messages for the load process can be cryptic and crammed into the lower lefthand corner of the Query Browser's status bar.
In contrast both Oracle and DB2 were much more helpful in establishing the mapping between database fields and the CSV or other data input files. In addition their error messages erred on the side of verbosity - but at least that allowed a notably faster debugging. Hence, users should take a look at the MySQL Migration Toolkit or some of the commercial MySQL query and loading tools like SQLYog, Navicat, and RapidSQL if data loading is going to be a big part of your MySQL workload. The commercial tools can read from Excel, PDF, HTML, dBase as well as other JDBC/ODBC compliant database files which is the only files the MySQL Migration Toolkit supports.
So now that we have the data loaded, we go to see if the Query sees the data:
And to our satisfaction the data has been loaded properly. By the way visual inspection of all the fields entered is important when you do a text file load because of delimiters for text fields being clobbered by single or double quotes being used in a large text field. By moving the slider at the bottom of the screen it is possible to do a quick inspection - try that with the Command Line processor.
MySQL provides better facilities for loading in data than for its 3.x and 4.x databases. With the new MySQL Migration Toolkit and MySQL Query Browser the process is simpler and less error prone. Nonetheless, MySQL dories not quite have the polish of i either its commercial query counterparts or its other enterprise database competitors. As data interoperability becomes more important inter-database joins, replication, Web Services based queries and good old fashioned ETL will continue to play a more important role in defining how well a database performs. All databases are still too far away from doing inter-database joins although both IBM with its Integrator tool and Oracle with its Sunopsis purchase are moving quickly on this important front. It will be interesting to follow the MySQL response.
Top of Page Home Tutorials
Developers Weblog Graphics Reviews and Tips