Hacking a huge Drupal database import

This article describes how to use regexes, some bash tricks and a bit of bad voodoo to split a SQL file into pieces in order to get it into MySQL in a reasonable amount of time. Parsing SQL is notoriously painful and error-prone, so if you see mistakes here (I'm sure there are some), please comment!

I recently got called in to work on a project for a customer I had never worked with before.  It is a very well-known media/retail company, and the job was to build a new interface for a media player / video library they required.  Pretty standard stuff: some views, a couple of quicktabs, a bit of ajax coding and a lot of pixel-f**king.  Anyway, near the end of the week-long project, when it came time to stage it, I realized I had a pretty big problem.  This site had 2 million users and several tables of profile data and other information.  My SQL file was over 3GB and took 6-8 hours to import.  Eek!

So, knowing that I wasn't going to write my update functions 100% perfectly the first time, and that I would likely have to import several times until I was confident everything was working okay, I needed to find a way to get the DB to import in a more reasonable amount of time.  All I had at this point was a SQL file of the production DB and six hours to get the code ready.

Here's what I did:

(Keep in mind this is on OSX; GNU tools may be slightly different.)

split -a5 -p '^DROP TABLE IF EXISTS [^;]+;$' db_backup.sql _db_split

What this does is look for lines matching DROP TABLE IF EXISTS, split the dump at each one, and create files called _db_splitXXXXX, where XXXXX is a unique alphabetic suffix. (The -p flag is the BSD/OSX way to make split work on a pattern; GNU split doesn't have it, but csplit does roughly the same job on Linux.)
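A rough Linux equivalent would be something like this (an untested sketch; csplit numbers its pieces instead of lettering them):

csplit -z -f _db_split -b '%05d' db_backup.sql '/^DROP TABLE IF EXISTS /' '{*}'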

Okay, now I have a bunch of files that look like this:

ls -l _db_split*
-rw-r--r-- 1 jacob staff       837 Jul 16 23:24 _db_splitaaaaa
-rw-r--r-- 1 jacob staff       679 Jul 16 23:24 _db_splitaaaab
-rw-r--r-- 1 jacob staff       937 Jul 16 23:24 _db_splitaaaac
-rw-r--r-- 1 jacob staff      1666 Jul 16 23:24 _db_splitaaaad
-rw-r--r-- 1 jacob staff       601 Jul 16 23:24 _db_splitaaaae
-rw-r--r-- 1 jacob staff     99276 Jul 16 23:24 _db_splitaaaaf

It goes on and on.
Let's make those a little easier to read:

for f in _db_split*; do new_f=$(egrep 'CREATE TABLE `([a-z0-9_]+)' -o $f | cut -d' ' -f 3 | cut -c2-100); mv $f dump_$new_f.sql; done;

This one finds the table name from inside the file (I'm sure there is a nicer way to do this, probably should just use ruby, but this works; a slightly tidier sed variant is sketched right after the listing below).  Then it renames each file from _db_splitXXXXX to dump_tablename.sql.  Now I have something like this:
ls -l dump*
-rw-r--r-- 1 jacob staff       679 Jul 16 23:24 dump_access.sql
-rw-r--r-- 1 jacob staff       937 Jul 16 23:24 dump_accesslog.sql
-rw-r--r-- 1 jacob staff      1666 Jul 16 23:24 dump_actions.sql
-rw-r--r-- 1 jacob staff       601 Jul 16 23:24 dump_actions_aid.sql
-rw-r--r-- 1 jacob staff     99276 Jul 16 23:24 dump_apachesolr_search_node.sql
-rw-r--r-- 1 jacob staff       728 Jul 16 23:24 dump_authmap.sql
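Here is the tidier sed variant of that rename loop I promised above; a sketch rather than anything battle-tested. It also just skips the leading header chunk (which has no CREATE TABLE in it) instead of renaming it to dump_.sql:

for f in _db_split*; do t=$(sed -n 's/^CREATE TABLE `\([A-Za-z0-9_]*\)`.*/\1/p' "$f" | head -n 1); [ -n "$t" ] && mv "$f" "dump_$t.sql"; done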
Okay, we don't want to import the cache data, or watchdog, but we do want the table definitions.
So I defined a function to do this:
function sql_remove_insert() { out=$(cat $1 | ruby -e 'puts STDIN.read.split("Dumping data for table")[0]'); echo "$out" > $1; }
And I run it against the tables I don't need to be importing:
for t in `ls dump_cache* dump_watchdog* dump_settings_audit_log_*`; do sql_remove_insert $t; done;
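If ruby isn't handy, sed can do the same truncation in place (a sketch; BSD sed on OSX wants -i '' while GNU sed takes a bare -i):

sed -i '' '/Dumping data for table/,$d' dump_cache* dump_watchdog* dump_settings_audit_log_*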
That makes a difference, but we're still at over 2GB.  The real issue is the 2 million users.  So, assuming that the important users who create the content (i.e. the administrators) are all within the first 10,000 or so accounts, I wrote a hacky (and probably fragile) regular expression to remove any insert records where the uid is greater than 10,000, and then ran it on any table which has uid as the first column:
for t in {dump_users,dump_users_roles,dump_users_uuid,dump_password_reset_users}; do perl -pi -e 's/,\([0-9]{5,10},(?!\),).*\),/,/g' $t.sql; done;
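For the curious, here is the same regex spread out with perl's /x modifier so the pieces are visible; this is my reading of it, and it inherits all the same fragility ($t is the loop variable from the line above):

perl -pi -e 's{
    ,\(              # a comma plus the opening paren of a values tuple
    [0-9]{5,10},     # the first column (the uid): 5 to 10 digits, i.e. 10,000 and up
    (?!\),)          # skip the match if ")," comes right after that comma
    .*               # greedily swallow everything up to ...
    \),              # ... the last ")," on the line, and replace it all with a comma
}{,}gx' $t.sql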
That last bit brought the DB down to only 300MB or so, and it imported in a couple of hours, allowing me to stage my changes, mess them up, restore, and get it working in the end.
Bash-FU! Hope this helps someone trying to import a huge SQL file into MySQL.





mysqldump arguments that can help

Damien McKenna

I use the following arguments with mysqldump:

alias dump_mysql='mysqldump -Q --add-drop-table -c --order-by-primary --extended-insert=FALSE --no-create-db=TRUE'

There are two key parts: disabling extended inserts (--extended-insert=FALSE) and using the longer "complete" insert statements (-c). Together these create a complete INSERT statement for every individual record, which makes grep calls *much* easier to do.
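For example (a sketch that assumes the one-statement-per-row output those flags produce; the output filename is just a placeholder), trimming the users table down to the first 10,000 uids becomes a single grep instead of regex surgery:

grep -vE '^INSERT INTO `users` .*VALUES \([0-9]{5,},' dump_users.sql > dump_users_small.sql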

Yes, that's true.  In fact,

Jacob Singh

Yes, that's true.  In fact, if I had access to the DB to make myself another dump (itself a fairly long process), I would probably use --tables and --no-data and a few other tricks.  Sadly, not the case here :)
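Something along those lines, say (the database and table names below are just placeholders, and the usual connection flags are omitted):

# schema only for the big, noisy tables
mysqldump --no-data example_db --tables cache watchdog > noisy_schema.sql
# everything else, skipping those tables entirely
mysqldump --ignore-table=example_db.cache --ignore-table=example_db.watchdog example_db > main.sql
# and --where can cap the users table right at dump time
mysqldump --where='uid <= 10000' example_db users > users_small.sql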


6-8hrs for 3GB??? That's

6-8hrs for 3GB??? That's unbelievably slow. It sounds like you are running the stock my.cnf. Here's what I use for my dev machine (2GB RAM):

key_buffer = 64M
join_buffer_size = 512K
max_allowed_packet = 16M
query_cache_size = 64M
table_cache = 512
sort_buffer_size = 64K
read_buffer_size = 256K
read_rnd_buffer_size = 256K
net_buffer_length = 2K
thread_stack = 128K
tmp_table_size = 96M
max_heap_table_size = 96M
innodb_buffer_pool_size = 128M
innodb_additional_mem_pool_size = 2M
innodb_flush_log_at_trx_commit = 0
innodb_flush_method = O_DIRECT

I'd recommend spending an hour finding out what these settings do. It will save you hours of waiting around for MySQL to do its thing (or writing parsing scripts). If you've got more RAM, jack up innodb_buffer_pool_size until your active data set fits in RAM.
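Not part of the tuning list above, just a convenience: after restarting mysqld you can confirm a setting actually took effect with something like this.

mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'"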
