Wednesday, October 12, 2005

Version Control of Database Data
Applying the concepts of version control to databases

Background:
I've found that there is very little information overall on this subject. Moreover, I understand (correct me from wrong) that there are no complete database versioning systems out there.

I have put some thought into this semo-automatic way of accomplishing the task. Feel free to use and adapt the concepts however you like. Note: This is not a tutorial or a beginner's-guide, but more a road-map of a viable concept of database version

A way of accomplishig version control of database data:

1. Create a dbdata-directory under your trunk:

trunk
|-proc
|-www
|-conf
|-dbdata

2. Then, write a script that will perform the following operations, for each database:
a. Extract "show table status from dbname" to trunk/dbdata/dbname.status.sql
b. Use this to determine:
- Which table-schemas has been modified?
- Which tables' data has been modified?
(Of course, if it exists in db but not in dbdata/ or vice versa, both should be seen as modified)
c. Use the information from b. to extract the approriate table's schemas into trunk/dbdata/dbname.tablename.schema.sql (with CREATE TABLE, important), and the approriate table's data into trunk/dbdata/dbname.tablename.data.sql, overwriting the previous one's, if any. (Use complete inserts, and remember to prior to this make sure all tables have a primary key!)
d. Run the approriate VCS-command to delete those .sql-files belonging to tables that do not exist in the database anymore.
e. Run the approriate VCS-command to add all un-versioned .sql-files in trunk/dbdata/

3. Hook this script up on a pre-commit hook. A VCS status-command after this script's execution will have the VCS show exactly which table's schemas and data has been modified, and a commit will, because of the .sql-files' text-format, only save the diffed changes in the repository, saving lots of space compared to putting the table-binaries under version control (which in itself is not recommended).

Viola! You have automatic commits of your database schema and data, in a format that is suitable for the VCS.

4. Now, write yet another shell-script that will perform the following actions:
a. Check which table's schemas and which table's data has been modified by comparing the checked out files with the files belonging to the revision that the working-copy is updating from.

In case of conflict, see the table as not modified, in order to prevent application of conflicted files to the database. Resolve manually (see below) and commit once and then update in order to get out of the conflicted state.

b. For each modified table: Apply DROP, then the ...schema.sql, then the ...data.sql

Note: b. will handle deletion of tables completely by having no .sql-files to execute after the DROP. The creation is handled by the CREATE TABLE in ...schema.sql.

5. Hook this script up to post-update.

Voila again! You have automatic update of the database data integrated with your VCS.

Resolving conflicts / Merging database schemas and data:
Use a tool like SQLYog Enterprise's Database Schema and Data synchronization wizards to merge the the two ...schema.sql.r##-files.

This is more of a mind-game than actual practise, as I have not myself used this method yet. Feel free to comment on additions, adjustments and shortages with this method!

Other resources:
- Another blog about the subject, for a more comprehensive and complex solution
- Thread at Subversion's mailing list discussing Database Version Control
- Some guy proposing making an OOS-project of DVC

Forum posts that eventually lead up to this blog-entry:
- .sql patch-file generator - Synchronize two DBs (July 16th, 2005, 01:30 AM)
- Putting a database under version control (July 16th, 2005 10:345 AM) [No replies]

1 comment:

dalloliogm said...

Thank you for the post and the links.
I was looking for some way to put a database of mine under version control, but after realizing people are not used to do so, I abandoned the idea :(.

Now I just save the name of the input files used to fill the database in a column, and use Makefiles to store all the commands used to change it or add any field. I use a database to store research results and therefore I don't need to worry about doing versioning automatically (mine is not a web application).

I also saw this extension for elixir/sqlalchemy, in python, which seems interesting:
- http://elixir.ematia.de/trac/browser/elixir/trunk/tests/test_versioning.py