Hacking a huge Drupal database import

This article describes how to use regexes, some bash tricks, and a bit of bad voodoo to split an SQL file into pieces in order to get it into MySQL in a reasonable amount of time. Parsing SQL is notoriously painful and error-prone, so if you see mistakes here (I'm sure there are some), please comment!

I recently got called in to work on a project for a customer I had never worked with before.  It is a very well-known media / retail company, and the job was to build a new interface for a media player / video library they required.  Pretty standard stuff: some views, a couple of quicktabs, a bit of ajax coding and a lot of pixel-f**king.  Anyway, near the end of the week-long project, when it came time to stage it, I realized I had a pretty big problem.  This site had 2 million users and several tables of profile data and other information.  My SQL file was over 3GB and took 6-8 hours to import.  Eek!

So, knowing that I wasn't going to write my update functions 100% perfect the first time and would likely have to import several times until I was confident everything was working okay, I needed to find a way to get the DB to import in a more reasonable amount of time.  All I had at this point was an SQL file of the production DB and 6 hours to get the code ready.

Here's what I did:

(keep in mind this is on OS X; the GNU tools may be slightly different)

split -a 5 -p '^DROP TABLE IF EXISTS [^;]+;$' db_backup.sql _db_split

What this does is split the dump whenever a line matches DROP TABLE IF EXISTS (that's the -p pattern) and write the pieces out as files called _db_splitXXXXX, where XXXXX is a unique five-letter suffix (that's the -a 5).
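If you're on Linux rather than OS X, GNU split has no -p option, but csplit does the same job. Roughly like this (a sketch I haven't run against this dump; the chunks come out numbered instead of lettered):

# start a new chunk at every DROP TABLE line, repeating for every match
csplit -z -f _db_split -b '%05d' db_backup.sql '/^DROP TABLE IF EXISTS/' '{*}'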

Okay, now I have a bunch of files that look like this:

ls -l _db_split*

-rw-r--r-- 1 jacob staff       837 Jul 16 23:24 _db_splitaaaaa
-rw-r--r-- 1 jacob staff       679 Jul 16 23:24 _db_splitaaaab
-rw-r--r-- 1 jacob staff       937 Jul 16 23:24 _db_splitaaaac
-rw-r--r-- 1 jacob staff      1666 Jul 16 23:24 _db_splitaaaad
-rw-r--r-- 1 jacob staff       601 Jul 16 23:24 _db_splitaaaae
-rw-r--r-- 1 jacob staff     99276 Jul 16 23:24 _db_splitaaaaf
 
It goes on and on.
 
Let's make those a little easier to read:
 
for f in _db_split*; do new_f=$(egrep -o 'CREATE TABLE `[a-z0-9_]+' "$f" | cut -d' ' -f3 | cut -c2-); mv "$f" "dump_$new_f.sql"; done
 
This one finds the table name from inside the file (I'm sure there is a nicer way to do this - see the sed variant after the listing below - but this works).  Then it renames the file from _db_splitXXXXX to dump_tablename.sql.  Now I have something like this:
 
ls -l dump*

-rw-r--r-- 1 jacob staff       679 Jul 16 23:24 dump_access.sql
-rw-r--r-- 1 jacob staff       937 Jul 16 23:24 dump_accesslog.sql
-rw-r--r-- 1 jacob staff      1666 Jul 16 23:24 dump_actions.sql
-rw-r--r-- 1 jacob staff       601 Jul 16 23:24 dump_actions_aid.sql
-rw-r--r-- 1 jacob staff     99276 Jul 16 23:24 dump_apachesolr_search_node.sql
-rw-r--r-- 1 jacob staff       728 Jul 16 23:24 dump_authmap.sql
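As an aside, since I said there must be a nicer way to grab the table name: plain sed can do it too. Something like this (a sketch; it assumes each chunk contains exactly one CREATE TABLE statement and skips any chunk, like the dump header, that has none):

for f in _db_split*; do
  # pull the table name out of the CREATE TABLE line
  t=$(sed -n 's/^CREATE TABLE `\([A-Za-z0-9_]*\)`.*/\1/p' "$f" | head -1)
  # leave chunks without a CREATE TABLE (e.g. the dump header) alone
  [ -n "$t" ] && mv "$f" "dump_$t.sql"
done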
 
Okay, we don't want to import the cache data, or watchdog, but we do want the table definitions.
So I defined a function to do this:
sql_remove_insert() { out=$(ruby -e 'puts STDIN.read.split("Dumping data for table")[0]' < "$1"); echo "$out" > "$1"; }
 
And I ran it against the tables whose data I don't need to import:
 
for t in dump_cache* dump_watchdog* dump_settings_audit_log_*; do sql_remove_insert "$t"; done
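If you don't have Ruby handy, the same trick works with plain sed, relying on the "Dumping data for table" comment that mysqldump writes before each table's INSERTs (a sketch, not what I actually ran):

sql_remove_insert() {
  # delete everything from mysqldump's "Dumping data for table" comment to the end of the file,
  # leaving only the DROP/CREATE TABLE statements (BSD sed; with GNU sed drop the '' after -i)
  sed -i '' -e '/Dumping data for table/,$d' "$1"
}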
 
That makes a difference, but we're still at > 2GB.  The real issue is the 2 million users.  So, assuming that the important users who create the content (i.e. administrators) are all within the first 10,000 or so accounts, I wrote a hacky (and probably fragile) regular expression to remove any insert records where the uid is 10,000 or higher, and then ran it on every table that has uid as its first column:
for t in {dump_users,dump_users_roles,dump_users_uuid,dump_password_reset_users}; do perl -pi -e 's/,\([0-9]{5,10},(?!\),).*\),/,/g' $t.sql; done;
 
That last bit brought the DB down to only 300MB or so, and it imported in a couple of hours, allowing me to stage my changes, mess it up, restore, and get it working in the end.
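Worth noting: if your dump was made with --extended-insert=FALSE (see the first comment below), every row is its own INSERT statement and you can skip the scary regex entirely. Something like this (a sketch, assuming uid really is the first value in each row) would do the same trim:

for t in dump_users dump_users_roles dump_users_uuid dump_password_reset_users; do
  # drop any INSERT whose first value (the uid) is 10,000 or higher
  perl -ni -e 'print unless /^INSERT INTO .*VALUES \((\d+),/ && $1 >= 10000;' "$t.sql"
done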
 
Bash-FU! Hope this helps someone trying to import a huge SQL file into MySQL.

 

 

Comments

mysqldump arguments that can help

I use the following arguments with mysqldump:

alias dump_mysql='mysqldump -Q --add-drop-table -c --order-by-primary --extended-insert=FALSE --no-create-db=TRUE'

There are two key parts - disabling extended inserts (--extended-insert=FALSE) and using the longer "complete" insert statements (-c); these have the result of creating a complete INSERT statement for every individual record, which makes GREP calls *much* easier to do.
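With every row as its own complete statement, for example, pulling one table's rows back out of a dump is just a grep (the table and file names here are only for illustration):

grep '^INSERT INTO `users` ' db_backup.sql > users_rows.sql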

Yes, that's true.  In fact, if I had access to the DB to make myself another dump (itself a fairly long process), I would probably use --tables and --no-data and a few other tricks.  Sadly, not the case here :)
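Something along these lines, roughly (a sketch; mydb and the table list are placeholders, and --ignore-table is one of those "other tricks"):

# schema + data for everything except the noisy tables
mysqldump --ignore-table=mydb.cache --ignore-table=mydb.watchdog mydb > db_backup.sql
# schema only for the tables skipped above
mysqldump --no-data mydb cache watchdog >> db_backup.sql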

6-8hrs for 3GB??? That's unbelievably slow. It sounds like you are running the stock my.cnf. Here's what I use for my dev machine (2GB RAM):

[mysqld]
key_buffer = 64M
join_buffer_size = 512K
max_allowed_packet = 16M
query_cache_size = 64M
table_cache = 512
sort_buffer_size = 64K
read_buffer_size = 256K
read_rnd_buffer_size = 256K
net_buffer_length = 2K
thread_stack = 128K
tmp_table_size = 96M
max_heap_table_size = 96M
innodb_buffer_pool_size = 128M
innodb_additional_mem_pool_size = 2M
innodb_flush_log_at_trx_commit = 0
innodb_flush_method = O_DIRECT

I'd recommend spending an hour finding out what these settings do. It will save you hours of waiting around for MySQL to do its thing (or writing parsing scripts). If you've got more RAM, jack up innodb_buffer_pool_size till your active data set fits in RAM.
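You can check what a server is currently running with before touching anything (assuming you can connect with the mysql client):

mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'"
mysql -e "SHOW VARIABLES LIKE 'key_buffer_size'"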
