=begin text # Sample Chado database Worked steps on how to create a sample chado database, load with chado.xml using XORT then view with GBrowse and BioPerl DB/Das/Chado See http://www.gmod.org/ for parts # Prerequisites PostgreSQL db and perl modules required for it Apache web server for GBrowsing. BioPerl and other parts for use with GBrowse (follow install notes) # Notes The below steps were tested on Solaris and MacOSX, in tcsh shell, using Argos genome database/web server (http://eugenes.org/argos) for the prerequisites. Real soon now, will have this installed for testing with GMOD Argos sample CentaurBase, at http://eugenes.org/argos/centaurbase/ Need -r chado_1_01 -- frozen schema revision, Fri Jun 6 19:52:34 2003 but need -r HEAD (later) for chado/modules/sequence/gff-bridge/ # Author Don Gilbert, Oct 2003, gilbertd@indiana.edu # STEPS ----------------------------------------------------------- # STEP 1 # Software: cvs checkout needed software, schema #------------------------------------------------------ Current Chado schema, XORT loader/dumper, and GBrowse viewer software are available from cvs.sourceforge.net:/cvsroot/gmod. Note that these steps are only good for the moment - updates may alter needed steps. General CVS checkout for GMOD tools (http://www.gmod.org/cvs.shtml) set cvsd='-d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/gmod' cvs $cvsd login [no password] cvs $cvsd co modulename #Note: I used cvsd='-d:ext:dongilbert@cvs.sourceforge.net:/cvsroot/gmod' # as anon cvs wasnt working Where modulenames are: schema/chado/ - for schema loading into postgres db - need frozen schema version 'chado_1_01' to match sample chado.xml data cvs $cvsd co -r chado_1_01 schema/chado/ schema/chado/modules/sequence/gff-bridge/ - need more current version of these for chado-gbrowse use cvs $cvsd co -r HEAD schema/chado/modules/sequence/gff-bridge/ schema/XMLTools/ - for software to load in chado.xml data to postgres db - dgg added various bug fixes to XORT to get this working cvs $cvsd co -r HEAD schema/XMLTools/ Generic-Genome-Browser/ - for software to view genome annotations from postgres db - I used http://www.gmod.org/ GBrowse instructions for this instead of CVS # STEP 2 # Data: locate some Chado.xml data ... #------------------------------------------------------ Sample chado.xml data cut from fruitfly small chromosome 4 (~80 genes) is available at this location: rsync rsync://flybase.net/flybaseftp/genomes/centaur/chado_1_01/ ; # list rsync -av rsync://flybase.net/flybaseftp/genomes/centaur/chado_1_01 . ;# copy to local dir # or ncftp -R ftp://flybase.net/genomes/centaur/chado_1_01 gunzip chado_1_01/*chado.xml.gz # STEP 3 # Chado1: create db #------------------------------------------------------ -- your postgres environment should be initialized set root=`pwd` set chado=$root/schema/chado/modules/ set xort=$root/schema/XMLTools/XORT/ set dbname='chadoxA' ; # use a name you like dropdb $dbname createdb $dbname 'chado_1_01 schema; chado-sample xml data' -- this is what I used, to make it in flybase database directory createdb -D FB_PGDATA $dbname 'chado_1_01 schema; chado-gadfly v7.0_0728 xml data' # STEP 4 # Chado2: load in schema #------------------------------------------------------ # Chado2: load in chado main schema /bin/rm $dbname.ddl touch $dbname.ddl set md=`cat $chado/idb-full.modules` foreach s ($md) echo $s cat $chado/$s >> $dbname.ddl end # Chado2: load in XORT functions cat $xort/Config/function_view.ddl >> $dbname.ddl # Chado2: load in extra Chado-Gbrowse bridge schema cat $chado/sequence/gff-bridge/sequence-gff-views.sql >> $dbname.ddl cat $chado/sequence/gff-bridge/sequence-gff-funcs.pgsql >> $dbname.ddl # STEP 5 # Chado3: edit schema to make it work ! #------------------------------------------------------ # Notes: dang; have to edit .ddl for these changes before using # ALTER TABLE synonym ALTER COLUMN synonym_sgml DROP NOT NULL; # No good, need to allow NULL: ALTER TABLE pub ALTER COLUMN type_id SET DEFAULT 0; # ALTER TABLE pub ALTER COLUMN type_id DROP NOT NULL; # also edit XORT/Config/function_view.ddl to fix table _appdata errors perl -pi.old -0777 \ -e 's/(create table pub [^;]+)\stype_id int not null,/${1} type_id int,/s;' \ -e 's/synonym_sgml varchar\(255\) not null/synonym_sgml varchar\(255\)/;' \ -e 's/alignment_evidence_id varchar\(50\) not null,/alignment_evidence_id serial not null,/;' \ -e 's/prediction_evidence_id varchar\(50\) not null,/prediction_evidence_id serial not null,/;' \ -e 's/_appdata_id not null/_appdata_id int not null/;' \ -e 's/\(_appdata_id\),/\(_appdata_id\)/;' \ $dbname.ddl diff $dbname.ddl* # store into Postgres cat $dbname.ddl | psql $dbname createlang plpgsql $dbname # STEP 6 # Chado4: pre-load data not in chado.xml sample #------------------------------------------------------ ## add some needed top-level table data that XORT won't create... ## tables db and contact seem to be essential to pre-exist ## should be able to use XORT to dump this info from primary chado db cat < $xort/Config/$dbname.properties < file index, and source, scaffold ranges # tab-separated-values of: # base-start | file-index OR class-name | file-index | location source 324 1..15080473 ; class-name | file-index | location 1 356 ; base-start | file-index 150805 51058 ; base-start | file-index The .idx files are a sparse binary index by ID number, which gnomap uses, but gbrowse_fb doesn't use yet. Its format is a packed record of (file-record-at,record-size). use constant RECORDSIZE => length(pack("NN", 1, 500)); sub readId { my $idnum= $_[0] =~ /0*(\d+)/; my( $record, $data); my $idloc = $idnum * RECORDSIZE; seek(IDX, $idloc, 0); read(IDX, $record, RECORDSIZE); my($at, $size) = unpack("NN", $record); seek(DATA, $at, 0); read(DATA, $data, $size); return $data; } There is a third index, idmap.tsv, which is plain text table of "ID[tab][Chromosome[tab]BaseStart..BaseEnd", and gbrowse_fb uses this to handle ID queries. Now flybase/gmod/FFFdb.pm reads the entire entire table (ugh) for each ID query call -- I need to teach it to read the .idx binary index. A gbrowse_fb.conf file looks like [GENERAL] description = D. melanogaster (Annotation Rel. 3.1; Jan03) # path settings are relative to DOCUMENT_ROOT of httpd # adaptor = textdb::fffdb datapath = /fly/features default_name = X:100000-200000 .. Your GBrowse current version is now installed as part of Argos's common tool set (along with gbrowse_fb). See e.g. http://eugenes.org/argos/gbrowse Over the weekend, I tested creating a genome database with Chado-Postgres using XORT loader and Flybase chado.xml data, then running GBrowse w/ Bio::DB::Das::Chado adapator. See these test notes at http://eugenes.org/argos/docs/chadopg-oct03-notes.txt This has steps for you to duplicate if you like. Working Gbrowse views of this sample data are at * running off sample chado postgres db http://eugenes.org/argos/flybase/common/cgi-bin/gbrowse?source=chadotest * "don's" gbrowse running off flatfile feature database http://eugenes.org/argos/flybase/common/cgi-bin/gbrowse_fb?source=chr4 -- Don #----------------- to: gmod-devel@lists.sourceforge.net gmod-gbrowse@lists.sourceforge.net flybase@morgan.harvard.edu Subject: Worked example of Chado.xml -> Chado Postgres -> Gbrowse (fly data) Dear folks-interested-in-chado-genome-db-tests, Over the weekend, I tested creating a genome database with Chado-Postgres using XORT loader and Flybase chado.xml data, then running GBrowse w/ Bio::DB::Das::Chado adapator. See these test notes at http://eugenes.org/argos/docs/chadopg-oct03-notes.txt This has steps for you to duplicate if you like. Working Gbrowse views of this sample data are at * running off sample chado postgres db http://eugenes.org/argos/flybase/common/cgi-bin/gbrowse?source=chadotest * "don's" gbrowse running off flatfile feature database http://eugenes.org/argos/flybase/common/cgi-bin/gbrowse_fb?source=chr4 (flatfile is faster :) -- Don Gilbert #----------- to gmod-schema@lists.sourceforge.net Over the weekend, I tested loading Chado-Postgres using XORT and flybase chado.xml, then running GBrowse w/ Bio::DB::Das::Chado adapator. See these test notes at http://eugenes.org/argos/docs/chadopg-oct03-notes.txt I used current chado schema (chado_1_01 freeze) from gmod at cvs.sourceforge.net Source data is the chado.xml scaffold dumps that Pinglei Zhou has been producing for flybase using XORT. There are a couple of chado schema related things I ran into, loading it back into Pg and using with GBrowse. You may consider them XORT or GBrowse/Chado adaptor problems instead of schema ones: - pub.type_id is null in chado.xml dumps (essentially a dummy value); not null in schema - alignment_evidence_id and prediction_evidence_id - why are these of type varchar? In XORT dump, they appear like other serial ids, so I changed them to serial in schema and XORT loader was happy. - synonym.synonym_sgml and synonym.name seem to be almost always the same data, but synonym_sgml is missing in chado.xml dumps. You may want to consider dropping this duplicate field, and using one name, plus an enum/cvterm_id field to indicate if the name is sgml/plain-text/other-type. Then if you have a synonym with sgml and plain text that differ, make them into two synonyms... In chado.xml dump, name is not null and synonym_sgml is null, in schema this is reversed. The SQL in Bio::DB::Das::Chado.pm looks for synonym_sgml and not for synonym.name. - the chado.xml dumps have some null feature.name, but always non null uniquename (matching schema), but the SQL in Bio::DB::Das::Chado.pm looks for feature.name (the common, more useful name) and not uniquename. Some SQL wizard may have a solution for adaptors such as Chado.pm to look for one feature.commonname which would be either (.name or .uniquename if .name is null) -- This is the perl edit I needed to do on chado.ddl before loading into Postgres, to get this to work perl -pi.old -0777 \ -e 's/(create table pub [^;]+)\stype_id int not null,/${1} type_id int,/s;' \ -e 's/synonym_sgml varchar\(255\) not null/synonym_sgml varchar\(255\)/;' \ -e 's/alignment_evidence_id varchar\(50\) not null,/alignment_evidence_id serial not null,/;' \ -e 's/prediction_evidence_id varchar\(50\) not null,/prediction_evidence_id serial not null,/;' \ -e 's/_appdata_id not null/_appdata_id int not null/;' \ -e 's/\(_appdata_id\),/\(_appdata_id\)/;' \ $dbname.ddl # =========== WORKING NOTES ==================================================== Gbrowse install failed on my MacOSX system: wanted to install stuff in /Library/Perl using perl Makefile.PL PREFIX=/bio/biodb/common/perl/gbrowse did job by-hand .. foreach p ($pods) pod2html --infile=$p --outfile=$p.html --title=$p \ --htmlroot=/gbrowse --htmldir=./ --podpath=./ end #--------- ## where do we find chromosome_arm feature length ?? ## need to get XORT dumpter and harvard folks to add chromosome_arm, scaffold feature info ## somewhere (ONCE per csome,scaff not for each gene object !) ## or get from full postgres chado db - query on csome features select f.name,f.uniquename,f.feature_id,f.seqlen,f.dbxref_id,f.type_id from feature f where f.uniquename like '4' # csome type_id = 210 select f.name,f.uniquename,f.seqlen,f.feature_id,f.dbxref_id,f.type_id,f.organism_id from feature f where f.type_id = 210 name | uniquename | seqlen | feature_id | dbxref_id | type_id | organism_id ------+------------+----------+------------+-----------+---------+------------- X | X | 21780003 | 6 | 1228 | 210 | 1 U | U | 11561901 | 7 | 1229 | 210 | 1 3L | 3L | 23352213 | 4 | 1225 | 210 | 1 2R | 2R | 20302755 | 2 | 1224 | 210 | 1 3R | 3R | 27890790 | 3 | 1226 | 210 | 1 4 | 4 | 1237870 | 5 | 1227 | 210 | 1 2L | 2L | 22217931 | 1 | 1223 | 210 | 1 name|uniquename|seqlen|feature_id|dbxref_id|type_id|organism_id X|X|21780003|6|1228|210|1 U|U|11561901|7|1229|210|1 3L|3L|23352213|4|1225|210|1 2R|2R|20302755|2|1224|210|1 3R|3R|27890790|3|1226|210|1 4|4|1237870|5|1227|210|1 2L|2L|22217931|1|1223|210|1 chado_gadfly9=# select * from dbxref where dbxref_id >= 1223 and dbxref_id <= 1229; dbxref_id | db_id | accession | version | description -----------+-------+-----------+---------+------------- 1223 | 1 | 2L | | 1224 | 1 | 2R | | 1225 | 1 | 3L | | 1226 | 1 | 3R | | 1227 | 1 | 4 | | 1228 | 1 | X | | 1229 | 1 | U | | dbxref_id | db_id | accession | version | description -----------+-------+------------+---------+------------- 182 | 2 | SO:0000105 | | chado_gadfly9=# select * from organism; organism_id | abbreviation | genus | species | common_name | comment -------------+--------------+---------------+--------------+-------------+--------- 1 | Dmel | Drosophila | melanogaster | fruit fly | 2 | Comp | Computational | result | | chado_gadfly9=# select * from cvterm where cvterm_id = 210; cvterm_id | cv_id | name | definition | dbxref_id -----------+-------+----------------+------------+----------- 210 | 6 | chromosome_arm | | 182 chado_gadfly9=# select * from cv; cv_id | name | definition -------+------------------------+------------ 1 | pub type | 2 | property type | 3 | analysis property type | 4 | relationship type | 5 | synonym type | 6 | SO | 7 | cellular_component | 8 | molecular_function | 9 | biological_process | ## GBrowse/Chado problem using 1 chado.xml scaffold - chromosome_arm has no .seqlen field ## gbrowse Chado/Segment.pm comes up with 0 length and later div by 0 error ## pg install on cricket - low memory problem (? 256mb?) IpcMemoryCreate: shmget(key=7302001, size=1499136, 03600) failed: Invalid argument This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 1499136 bytes), reduce PostgreSQL's shared_buffers parameter (currently 64) and/or its max_connections parameter (currently 32). If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for. The PostgreSQL Administrator's Guide contains more information about shared memory configuration. =end text