BioTeam.Net HOWTO #1

 

“BioSQL with MySQL:

Rolling your own relational sequence database”

 

Additional titles considered:

 

 

Summary:

A step-by-step diary written by a database novice describing how to use freely available tools and schemas from the Open Bioinformatics Foundation to create a MySQL-based relational database repository for biological sequence and annotation data.

 

Author:

Chris Dagdigian, dag@sonsorol.org

http://BioTeam.net

 

Version:

1.1 - Last updated May 30, 2002

 

Revision History:

1.0; Original version – May 2002

1.1; Reader feedback and error corrections; first ‘complete’ revision – May 2002

 

Acknowledgements:

·         Keith Allen, PhD, Paradigm Genetics, USA

·         Mark Lambrecht, PhD - University of Leuven, Centre of Microbial and Plant Genetics, Belgium

·         Adam Witney, St. George’s Hospital Medical School, University of London, UK

 

Additional Resources:

Open Bio Database Access website, http://obda.open-bio.org

Open-bio-l mailing list, http://open-bio.org/mailman/listinfo/open-bio-l

 

 

Contents:

            Background & purpose

            Assumptions

            About the OBDA effort

            About BioSQL

            About Bioperl-db

            Step-by-step guide

1.     Installing MySQL the lazy way (via RPM) and configuring

2.     Installing perl interfaces to MySQL

3.     Installing Bundle::BioPerl via CPAN.pm

4.     Installing BioPerl via CPAN.pm

5.     Downloading & installing bioperl-db directly from the OBF CVS repository

6.     Downloading biosql-schema directly from the OBF CVS repository

7.     Putting it all together: creating a biosql instance within MySQL

8.     Loading a test protein dataset: SwissProt release 40

9.     Observations

10.  Next steps…

 

 

 

 

Background & purpose

 

While working on an informatics/Bio-IT project at Harvard University’s new Bauer Center for Genomics Research (http://cgr.harvard.edu) I met some people who were interested in storing sequence data into a relational database so they could avoid having to parse or index big flat text files.

 

Knowing that (a) The Open Bioinformatics Foundation had started work on solution and (b) I am a complete and utter moron when it comes to relational databases I thought that it would be a fun experiment to document my efforts at building such a server starting with a bare-bones RedHat Linux 7.2 server.

 

This document is essentially a diary of my experiences. It took me roughly a day to get to the point where I could load Swissprot into a biosql database via Jason Stajich’s load_seqdatabase.pl script.

 

Another reason I wrote this is that as a member of the Open Bioinformatics Foundation board of directors I am well aware that our website(s) are horrific and that our documentation often lags behind the functionality of our code base.

 

In particular the bioperl website gives little or no mention to “bioperl-db” and the related obda.open-bio.org website is totally devoid of any real information.

 

In fact- one practically needs to be an existing developer on one of our existing projects to even know that this stuff exists.

 

So- besides being a fun way to spend a day this is also a way for me to get myself more up to speed with the OBDA website and related work so that I can eventually help contribute to the website and project documentation efforts.

 

Comments/corrections/additions/flames regarding this document are welcome. Just send them along to me at dag@sonsorol.org

 

 

 

Assumptions

 

·         This document describes installing a biosql-compliant database on an x86-based server running RedHat 7.2.

 

·         Different hardware architectures or OS’s may require different install procedures. Specifically the server used for this process was a Dell PowerEdge 6450 system with dual Xeon CPUs, 4GB physical memory and the 2.4.18 linux kernel.

 

·         This document assumes the user has root access to the server system for the purposes of installing software and perl modules.

 

·         This document assumes at least basic knowledge of RPM packages and how to install them

 

·        The process described here includes downloading code from the internet via anonymous CVS which some firewalls may not allow. Explaining the use of CVS-over-SSH is not within the scope of this document.

 

 

About the Open Bio Database Access (OBDA) effort

 

OBDA is standard developed by the Open Bioinformatics Foundation to standardize the way all of the OBF-affiliated projects access sequence data resources. This includes local sequence resources (indexed flatfiles), remote databases (web accessible sequence resources), CORBA servers (implementing the BioCORBA/BSANE specification) and relational databases using the BioSQL schema.

 

The website for the OBDA effort will eventually be http://obda.open-bio.org/ . At this time though the site is only online in template form. The OBF is actively soliciting people to take over the website and OBDA documentation efforts.

 

The primary discussion forum for the OBDA effort is the “open-bio-l” mailing list. The list signup page is located at: http://open-bio.org/mailman/listinfo/open-bio-l . The list archives are online at http://open-bio.org/pipermail/open-bio-l/.

 

 

 

About BioSQL

 

The website and primary email discussion list for all things BioSQL related will be the ones listed above in the OBDA section.

 

A BioSQL ER diagram in PDF form can be downloaded directly (PDF) from the OBF anonymous CVS server. A formatted listing of the tables described in the schema can be viewed online directly (HTML) from the same server.

 

 

About BioPerl-db

 

The bioperl-db project was started by Ewan Birney with major work contributed by Elia Stupka and continued support coming from the bioperl community.  Its purpose is a standalone sequence database with little external dependencies and tight integration with the bioperl distribution.  Support for more databases and bindings in java and python by Biojava and Biopython projects are welcomed and encouraged.  All questions and comments should be directed to the bioperl list <bioperl-l@bioperl.org> and more information can be found about the related projects at http://bio.perl.org and http://www.open-bio.org.

 

Bioperl-db is a standalone module located within the CVS code repository operated by the BioPerl Project. Anonymous CVS access and tarball downloads are available at http://cvs.bioperl.org. Look in the bioperl repository for the “bioperl-db” module.


 

 

Step by step guide

 

 

Step 1 Install MySQL the lazy way (via RPM) and configure

 

While logged in as root I installed the following RedHat RPMS:

 

mysqlclient9-3.23.22-6

mysql-3.23.41-1

mysql-server-3.23.41-1

mysql-devel-3.23.41-1

 

The install was verified by issuing the following command:

 

            # rpm –qa | grep mysql

mysqlclient9-3.23.22-6

mysql-3.23.41-1

mysql-server-3.23.41-1

mysql-devel-3.23.41-1

 

MySQL version warning: Mark Lambrecht recommends the use of MySQL version 3.23 or later. When trying to install version 3.22 on an Alpha-based system he received errors relating to the creation of the “fattribute_to_feature” table.

 

By default the RPM MySQL install puts sample configuration files into /usr/share/mysql/

 

The base directory that the RPM MySQL install uses is /var/lib/mysql

 

The global configuration file is kept at /etc/my.cnf

 

It is important to edit the configuration file at /etc/my.conf BEFORE starting the database for the first time particularly if you want (like me) to have the MySQL data files live somewhere other than /var/lib/mysql.

 

In my case I wanted to do two main things to the global config file, (a) change the default location to the much larger /usr/local/ partition and (b) incorporate some of the config suggestions found in /usr/share/mysql/ specific to large dedicated database servers.

 

WARNING; BioSQL databases can be very large. It may be wise to double check the size of your /var partition before continuing on with a totally default RPM-based install. The 1GB /var partition present on the Dell 6450 server is far too small. As an example:  The swissprot40 database is 307MB in size. When this dataset was loaded into MySQL the resulting database files and indices consumed roughly 399MB of disk space.

 

NOTE: Several BioSQL users have found that very large sequence records will greatly exceed some of the default settings that MySQL uses. In particular one may need to set the configuration value “max_allowed_packet” to a fairly high value (several megabytes at least). If you encounter any error messages that are similar to: “DBD::mysql::st execute failed: MySQL server has gone away…” than it is likely that max_allowed_packet needs to be set to a larger value. In an email post to the bioperl-l (http://bioperl.org/mailman/listinfo/bioperl-l) mailing list Keith Allen covers his experience with this problem in excellent detail. The full email message can be read at the following URL: http://bioperl.org/pipermail/bioperl-l/2002-May/007987.html

 

 

After revising the value of max_allowed_packet to ‘3M’ due to feedback from Keith Allen (see above Note) the global configuration file /etc/my.conf now looks like this:

 

[mysqld]

port=3306

datadir=/usr/local/mysql-database

socket=/usr/local/mysql-database/mysql.sock

skip-locking

set-variable= key_buffer=256M

set-variable= max_allowed_packet=3M

set-variable= table_cache=256

set-variable= sort_buffer=1M

set-variable= record_buffer=1M

set-variable= myisam_sort_buffer_size=64M

set-variable= thread_cache=8

set-variable= thread_concurrency=4

server-id= 1

tmpdir= /tmp/      

 

    

[mysql.server]

user=mysql

basedir=/usr/lib

 

[safe_mysqld]

err-log=/var/log/mysqld.log

pid-file=/var/run/mysqld/mysqld.pid

 

 

[mysqldump]

quick

set-variable= max_allowed_packet=16M

 

[mysql]

no-auto-rehash

safe-updates

 

[isamchk]

set-variable      = key_buffer=128M

set-variable      = sort_buffer=128M

set-variable      = read_buffer=2M

set-variable      = write_buffer=2M

 

[myisamchk]

set-variable      = key_buffer=128M

set-variable      = sort_buffer=128M

set-variable      = read_buffer=2M

set-variable      = write_buffer=2M

 

[mysqlhotcopy]

interactive-timeout

 

 

Just in case I missed anything I decided to add a symbolic link back to /var/lib/mysql – this turns out to be important because my configuration file still has “basedir=/var/lib” which is a total mistake.

 

 

# cd /var/lib

# ln -s /usr/local/mysql-database/ ./mysql

 

Now we can start the MySQL datbase...

 

/etc/rc.d/init.d/mysql start

 

 

Important! Protect your database by setting a root MySQL password:

 

# mysqladmin -u root password 'xxxxxxx'

 

 

After installing, configuring, starting and enabling a root password it should be possible to logoff as root and re-login as a ‘normal’ system user. 

 

To test this I made the switch from root user to my normal 'dag' user account and confirmed that I can login as root to the mysql instance:

 

[dag@pe3 scripts]$ mysql -u root -p

Enter password:

Welcome to the MySQL monitor.  Commands end with ; or \g.

Your MySQL connection id is 8 to server version: 3.23.41-log

 

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

 

mysql> exit

Bye

 

 

Success! We now have a working MySQL instance.

 

 

 

Step 2 Install the perl interfaces to MySQL

 

In future steps we are going to use perl to help us talk to the MySQL database. In order to do this we need some additional perl modules installed. We need:

 

·         DBI:: - The perl generic database interface

·         DBD::MySQL – The perl MySQL database driver module

 

Given the lazy RPM install we previously did with the MySQL database it makes sense to install the preconfigured RPM versions of these modules that RedHat provides. The specific modules installed were:

 

·         perl-DBI-1.18-1.i386.rpm

·         perl-DBD-MySQL-1.2216-4.i386.rpm

 

Installation of these modules went smoothly. This was a welcome relief as given the still traumatic memories of spending days trying to get DBD::Oracle to install and work cleanly on Compaq Alphaservers and HP-UX machines.

 

 

Step 3 Install Bundle::BioPerl via CPAN.pm

 

Note: Skip this step if BioPerl is already on your system.

 

Some of the perl scripts needed to help load the database with sequence data we will be using in future steps are themselves dependent on the BioPerl distribution (http://bioperl.org/)

for critical functions such as sequence parsing etc.

 

The BioPerl distribution itself has its own dependencies on Perl modules and other programs that may or may not be found by default on a system. These external dependencies are not required but if installed they will give the BioPerl distribution additional functionality and features.

 

The BioPerl people have created a special CPAN package called “Bundle::BioPerl” that can assist with downloading and automatically installing all of the CPAN resident modules that BioPerl likes to use. By itself Bundle::BioPerl is next to useless but when used with CPAN.pm it becomes very convenient.

 

For the purposes of building a BioSQL database we don’t really care about all of the various external BioPerl dependencies enough to give special attention to the issue. The proper ‘lazy’ way then is just to fire up CPAN.pm and let it do all of the work.

 

People unfamiliar with the CPAN.pm module can read its documentation by issuing the system command ‘perldoc CPAN’.

 

To fire up the CPAN module in shell mode issue the following command as root:

 

# perl –MCPAN –e shell

 

If this is the first time that CPAN.pm has been used there will be an initial setup and configuration phase where CPAN configures itself to the local network environment. The user is also prompted to select from a list of available mirror sites.

 

To install Bundle::BioPerl while in the CPAN shell environment:

 

cpan> install Bundle::BioPerl

 

That’s about it. The process may take some time depending on the speed of the server and available network connection.

 

NOTE:

When attempting this for the first time the installation of Bundle::BioPerl completely failed because one of the perl modules  implementing SOAP functionality listed in the Bundle is apparently only distributed as a .ZIP archive rather than the more traditional tar.gz format.  The install process failed because CPAN.pm had not been configured to know where a suitable ‘unzip’ utility could be found. The solution was to exit from the CPAN shell, manually install the unzip rpm and then restart the CPAN shell.  Within the restarted CPAN shell environment the following command will make CPAN aware of the newly installed unzip utility:

 

cpan> o conf unzip  /usr/bin/unzip

cpan> install Bundle::BioPerl

 

 

Step 4 Install BioPerl via CPAN.pm

 

Note: Skip this step if BioPerl is already on your system.

 

Now that the various helper modules and external dependencies of BioPerl have been installed it should now be possible to use the same CPAN.pm shortcut to automatically download and install the full BioPerl distribution.

 

Asking CPAN to install the ‘Bio::Seq’ module is enough to get it to download and install the full distribution:

 

cpan> install Bio::Seq

 

BioPerl is a very large distribution (possibly the single largest collection of related perl modules in the entire CPAN repository) – you may wish to download and manually install the full BioPerl distribution directly from http://bioperl.org/.

 

 

Step 5 Download and install bioperl-db directly from the OBF CVS repository

 

Getting the latest and greatest version of the bioperl-db distribution involves downloading straight from the public source code server located at http://cvs.open-bio.org/

 

Instructions for downloading code via anonymous CVS are available on the cvs.open-bio.org website. Your firewall or local network configuration may not allow anonymous CVS connections to occur. See below for a workaround:

 

How to use the web and avoid anonymous CVS altogether:

If a firewall or network gateway device blocks CVS connection attempts it should be possible to download code and full distributions directly from the http://cvs.open-bio.org/ website. The CGI gateway to the CVS repository has a nice feature entitled “Download Tarball”. When clicked the server will generate a fresh code checkout and will send it to your browser as a .tar.gz archive file which can be saved locally to disk. The files may all be named “cvs_root.tar.gz” so beware of potential file overwrite issues when downloading multiple files or codebases.

 

How to checkout via anonymous:

 

The anonymous CVS password is ‘cvs’

Full instructions are available at http://cvs.open-bio.org

Example session with output:

 

# cvs -d:pserver:cvs@cvs.open-bio.org:/home/repository/bioperl login

CVS password:

#

# cvs -d:pserver:cvs@cvs.open-bio.org:/home/repository/bioperl co bioperl-db

cvs server: Updating bioperl-db

U bioperl-db/BUGS

U bioperl-db/Changes

U bioperl-db/LICENSE

U bioperl-db/MANIFEST

U bioperl-db/MANIFEST.SKIP

U bioperl-db/Makefile.PL

U bioperl-db/README

cvs server: Updating bioperl-db/Bio

cvs server: Updating bioperl-db/Bio/DB

U bioperl-db/Bio/DB/Annotation.pm

 

<FULL OUTPUT CUT FOR BREVITY>

 

Installing bioperl-db

 

The bioperl-db distribution is in the standard CPAN-style format so installation is similar to any other external perl module:

 

Once the distribution is uncompressed and unpacked there are only three commands necessary:

 

# perl Makefile.PL

# make

# make install

 

NOTE: This install will fail if the DBD:: and DBI::Mysql modules cannot be found.

 

 

Step 6 Download and install biosql-schema directly from the OBF CVS repository

 

Reference the above discussion on anonymous CVS, firewalls and the ability to download tar archives directly from the http://cvs.open-bio.org/ website.

 

The OBDA schema for BioSQL databases lives in its own CVS code repository because its use spans many of the Open Bio projects and coding efforts.

 

The repository is called “biosql”, the module is called “biosql-schema”. The CVS root for the respository can be found at “/home/repository/biosql”.

 

 

Example anonymous CVS checkout session:

 

[root@pe3]# cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/biosql login

Logging in to :pserver:cvs@cvs.open-bio.org:2401/home/repository/biosql

CVS password:

[root@pe3 biosql]# cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/biosql checkout biosql-schema

cvs server: Updating biosql-schema

cvs server: Updating biosql-schema/doc

U biosql-schema/doc/README

U biosql-schema/doc/biosql-ERD.pdf

U biosql-schema/doc/biosql.html

cvs server: Updating biosql-schema/scripts

U biosql-schema/scripts/create_mysql_db.pl

U biosql-schema/scripts/transform_sql.pl

cvs server: Updating biosql-schema/sql

U biosql-schema/sql/README

U biosql-schema/sql/biopipelinedb-mysql.sql

U biosql-schema/sql/biosql-accelerators-pg.sql

U biosql-schema/sql/biosqldb-mysql.sql

U biosql-schema/sql/biosqldb-pg.sql

U biosql-schema/sql/biosqldb-views-pg.sql

U biosql-schema/sql/makefile

cvs server: Updating biosql-schema/sql/ontology

U biosql-schema/sql/ontology/biosqldb-ontology-mysql.sql

U biosql-schema/sql/ontology/biosqldb-ontology-pg.sql

cvs server: Updating biosql-schema/t

[root@pe3 biosql]#

 

 

The files found in the biosql-schema do not need to be installed or otherwise processed. We are going to use the scripts and documents found in this module directly.

 

 

Step 7 – Putting it all together: creating a biosql instance within MySQL

 

Inside the biosql-schema/scripts folder there is a perl script called "create_mysqldb.pl". This script is just a helper script that automates the process of connecting to the datbase and issuing  the SQL statements that cause the proper database tables to be created. The script reads the actual SQL commands from the file found in biosql-schema/sql/biosql-mysqldb.sql

 

After looking at how simple this script is I felt the need to bypass it completely and just perform the steps manually for curiosity sake.

 

Here is the output from creating an initial database named ‘biosql’:

 

 

[dag@pe3]$ mysql -u root -p

Enter password:

Welcome to the MySQL monitor.  Commands end with ; or \g.

Your MySQL connection id is 9 to server version: 3.23.41-log

 

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

 

mysql> create database biosql;

Query OK, 1 row affected (0.00 sec)

mysql> exit

Bye

 

 

Now that the database is created we need to connect to the database and issue the SQL commands contained in biosql-scheme/sql/biosqldb-mysql.sql file:

 

Within the mysql client interface we can use the command "source" to read in SQL statements from a file.

 

Here is the output from that process:

 

[dag@pe3]$ mysql -u root -p biosql

Enter password:

Welcome to the MySQL monitor.  Commands end with ; or \g.

Your MySQL connection id is 11 to server version: 3.23.41-log

 

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

 

mysql> source sql/biosqldb-mysql.sql

 

Query OK, 0 rows affected (0.00 sec)

Records: 0  Duplicates: 0  Warnings:

 

<snip - lots of boring output cut here...>

 

Query OK, 1 row affected (0.00 sec)

 

Query OK, 1 row affected (0.00 sec)

 

Query OK, 1 row affected (0.00 sec)

 

Query OK, 1 row affected (0.00 sec)

 

Query OK, 1 row affected (0.00 sec)

 

Query OK, 1 row affected (0.00 sec)

 

Query OK, 1 row affected (0.00 sec)

 

Query OK, 1 row affected (0.00 sec)

 

Query OK, 1 row affected (0.00 sec)

 

Query OK, 0 rows affected (0.00 sec)

mysql>

 

 

Now that we are still connected to datbase ‘biosql’ let us see what we created by issuing the command "show tables;"

 

 

mysql> show tables;

+----------------------------+

| Tables_in_biosql           |

+----------------------------+

| biodatabase                |

| bioentry                   |

| bioentry_direct_links      |

| bioentry_qualifier_value   |

| bioentry_reference         |

| bioentry_taxa              |

| biosequence                |

| cache_corba_support        |

| comment                    |

| dbxref                     |

| dbxref_qualifier_value     |

| location_qualifier_value   |

| ontology_term              |

| reference                  |

| remote_seqfeature_name     |

| seqfeature                 |

| seqfeature_location        |

| seqfeature_qualifier_value |

| seqfeature_relationship    |

| seqfeature_source          |

| taxa                       |

+----------------------------+

21 rows in set (0.00 sec)

 

mysql> exit

Bye

 

Excellent! We have just created a BioSQL database within MySQL !

 

 

Step 8 - Loading a test protein dataset (Swissprot) into the database

 

Before tackling all of GenBank, a smaller dataset is probably a good idea as a test case to learn about the process of loading up the new biosql database with sequence data.

 

The dataset selected was SwissProt downloaded from ftp://ftp.ebi.ac.uk/pub/databases/swissprot/release/sprot40.dat

 

·         Total uncompressed size: 307MB

·         Apparent number of sequence entries: 101,707

 

 

The database loader script used was the  "load_seqdatabase.pl" script from the /scripts/ folder inside the bioperl-db distribution.

 

The script is pretty basic. There are some internal configuration settings inside the script that can be manually configured. Each of the configurable options can also be overridden by command line arguments.

 

The basic syntax for the load_seqdatabase.pl script is as follows:

 

./load_seqdatabase.pl <options> <dataset identifier> /path/to/data/file(s)

 

Some of the more important options and arguments that can be hard-coded into the perl script or passed via the command line are:

 

-h       (hostname of the mysql server)

-sqldb (the name of the biosql database instance that was created)

-dbuser (username to connect to MySQL with)

-dbpass (MySQL user password)

-format (the format of the sequence datafile that is going to be read/loaded)

 

One of the key options needed is the “dataset identifier”. This identifier allows one to store many ‘biodatabases” inside the single BioSQL instance. 

 

Given that we are trying to load release 40 of the Swissprot dataset a logical identifier could be something like “swiss”, “swissprot”, “sprot” or “swissprot40” etc.

 

 

Here we go…

 

[dag@pe3 biosql]$ ./load_seqdatabase.pl -host localhost -sqldb biosql -dbuser root -dbpass XXX -format swiss swissprot40 /n/data/mirrors/swissprot/sprot40.dat

Reading /n/data/mirrors/swissprot/sprot40.dat

[dag@pe3 biosql]$

 

 

Wow! It worked!

 

 

 

Step 9 – Observations

 

 

 

 

 

Step 10 – What next?

 

This document is a work in progress. Among the things I’d like to do next: