Parsing big files into AceDB

Richard Durbin rd at sanger.ac.uk
Tue Dec 10 04:15:30 EST 2002


Cache 1 must be big enough for every single object to fit into it.  I think there is some protection
when actually entering data, which is why you didn't crash then (though you might have done other
important, unretrievable things in your session).  I suspect your 2.7M records each take around 50 bytes
(I can't remember exactly, and there is some overhead), which comes to roughly 138 MB and pushes you
over your allocated 128 MB.  You could increase cache 1 again, but much better would be to split up the
Pep_homols by assigning them to subsequences of SUPERLINK_HU7, each restricted to a specific region.

e.g.

Sequence SUPERLINK_HU7
Subsequence HU7-1 1 110000
Subsequence HU7-2 100001 210000
Subsequence HU7-3 200001 310000
...

Sequence HU7-1
Pep_homol        "rs+nr:gi|12697147|emb|CAC28313.1|" "Blastx_rs+nr" 39 20646 20443 67 134
Pep_homol        "rs+nr:gi|18543746|ref|XP_088768.1|" "Blastx_rs+nr" 19 12346 12236 64 100
Pep_homol        "rs+nr:gi|18543746|ref|XP_088768.1|" "Blastx_rs+nr" 19 20643 20443 5 71
...

Sequence HU7-3
Pep_homol        "rs+nr:gi|7861746|gb|AAF70384.1|AF189263_1" "Blastx_rs+nr" 3 1029 889 705 751
...

// Note that the coords 1029, 889 in HU7-3 correspond to 201029, 200889 in SUPERLINK_HU7,
// because HU7-3 starts at 200001 in the parent.
// Also, the homols stored in each subsequence should fit inside that subsequence's extent as
// described in the parent.
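
The coordinate bookkeeping in the note above can be sketched in Python. This is only an illustration, not part of AceDB: the function names are made up, and the window size (110000) and step (100000) are simply read off the example extents. A hit is assigned to the last child that starts at or before its leftmost base; hits that still cross the end of that child (longer than the 10000-base overlap) are reported as not fitting, and what to do with them is left open.

```python
from typing import Optional, Tuple

WINDOW = 110_000  # extent of each child, as in "HU7-1 1 110000"
STEP = 100_000    # distance between child starts, so neighbours overlap by 10,000


def child_extent(index: int) -> Tuple[int, int]:
    """Parent coordinates (1-based, inclusive) covered by child number `index`."""
    start = (index - 1) * STEP + 1
    return start, start + WINDOW - 1


def to_child(parent_start: int, parent_end: int) -> Optional[Tuple[int, int, int]]:
    """Choose a child that contains the whole hit and rebase its coordinates.

    Returns (child_index, child_start, child_end), keeping the original
    orientation (start > end for reverse-strand hits), or None when the hit
    does not fit inside a single child.
    """
    lo, hi = min(parent_start, parent_end), max(parent_start, parent_end)
    index = (lo - 1) // STEP + 1          # last child starting at or before lo
    first, last = child_extent(index)
    if hi > last:
        return None                       # hit runs past the end of this child
    offset = first - 1
    return index, parent_start - offset, parent_end - offset


def to_parent(index: int, child_start: int, child_end: int) -> Tuple[int, int]:
    """Inverse mapping: child coordinates back to parent coordinates."""
    offset = child_extent(index)[0] - 1
    return child_start + offset, child_end + offset


if __name__ == "__main__":
    print(to_child(201029, 200889))   # hit near the start of the third child
    print(to_parent(3, 1029, 889))    # and back again
```

With these assumptions, a hit at 201029..200889 in the parent lands in child 3 at 1029..889, and a hit at 20646..20443 stays in child 1 with unchanged coordinates.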

Actually, better than this would be to use an SMAP'd class to hold the Pep_homol data, such as:

Sequence SUPERLINK_HU7
Homol_data HU7-1 1 110000
Homol_data HU7-2 100001 210000
Homol_data HU7-3 200001 310000
...

Homol_data HU7-1
Pep_homol        "rs+nr:gi|12697147|emb|CAC28313.1|" "Blastx_rs+nr" 39 20646 20443 67 134
Pep_homol        "rs+nr:gi|18543746|ref|XP_088768.1|" "Blastx_rs+nr" 19 12346 12236 64 100
Pep_homol        "rs+nr:gi|18543746|ref|XP_088768.1|" "Blastx_rs+nr" 19 20643 20443 5 71
...

Homol_data HU7-3
Pep_homol        "rs+nr:gi|7861746|gb|AAF70384.1|AF189263_1" "Blastx_rs+nr" 3 1029 889 705 751
...

But you should be using 4_9 to do this, and may need to add some new classes and models if your
models file is older.
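
The parent-side lines in either layout above can be generated mechanically. The following Python sketch is an illustration only: `parent_block` is a made-up helper, the 310000 length is a placeholder that reproduces the three example windows (it is not the real length of SUPERLINK_HU7), and clipping the last child to the parent length is my assumption. Whether the tag you pass (Subsequence or Homol_data) is accepted depends on your models file.

```python
def parent_block(parent: str, prefix: str, tag: str, length: int,
                 window: int = 110_000, step: int = 100_000) -> str:
    """Build the .ace paragraph that places overlapping children on a parent.

    `tag` is the parent-side tag to use, e.g. "Subsequence" or "Homol_data".
    Children are named <prefix>-1, <prefix>-2, ... and the last one is
    clipped so that it does not run past `length`.
    """
    lines = [f"Sequence {parent}"]
    index, start = 1, 1
    while True:
        end = min(start + window - 1, length)
        lines.append(f"{tag} {prefix}-{index} {start} {end}")
        if end >= length:
            break
        index += 1
        start += step
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder length chosen to match the three windows shown above.
    print(parent_block("SUPERLINK_HU7", "HU7", "Homol_data", 310_000))
```

Passing "Subsequence" instead of "Homol_data" gives the first layout; the per-child paragraphs with the rebased Pep_homol lines still have to be written separately.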

Richard 

Nicolas Berkowicz wrote:
> 
> Hello,
> 
> I am having some problems accessing data.
> 
> We successfully loaded two big files (251 MB and 194 MB), one containing a single Sequence object with
> 2,760,063 records attached to it:
> 
> (...)
> Sequence SUPERLINK_HU7
> Pep_homol        "rs+nr:gi|12697147|emb|CAC28313.1|" "Blastx_rs+nr" 39 20646 20443 67 134
> Pep_homol        "rs+nr:gi|18543746|ref|XP_088768.1|" "Blastx_rs+nr" 19 12346 12236 64 100
> Pep_homol        "rs+nr:gi|18543746|ref|XP_088768.1|" "Blastx_rs+nr" 19 20643 20443 5 71
> Pep_homol        "rs+nr:gi|7861746|gb|AAF70384.1|AF189263_1" "Blastx_rs+nr" 3 15029 14889 705 751
> Pep_homol        "rs+nr:gi|12831207|ref|NP_075579.1|" "Blastx_rs+nr" 4 15029 14889 784 830
> Pep_homol        "rs+nr:gi|22045991|ref|XP_171346.1|" "Blastx_rs+nr" 232 22998 23486 300 462
> Pep_homol        "rs+nr:gi|22045991|ref|XP_171346.1|" "Blastx_rs+nr" 232 22648 22917 212 301
> Pep_homol        "rs+nr:gi|22045991|ref|XP_171346.1|" "Blastx_rs+nr" 232 24406 24645 571 648
> Pep_homol        "rs+nr:gi|22045991|ref|XP_171346.1|" "Blastx_rs+nr" 232 25825 26019 728 792
> (...)
> 
> The other contains the cross-reference data:
> 
> (...)
> Protein "rs+nr:gi|19884164|sp||NU1M_DROAM_1"
> From_Database rs+nr
> DNA_homol SUPERLINK_HU7 "Blastx_rs+nr" 9 28 87 139774911 139774726
> DNA_homol SUPERLINK_HU7 "Blastx_rs+nr" 9 102 149 139774684 139774541
> 
> Protein "rs+nr:gi|17226744|gb|AAL37914.1|AF324956_1"
> From_Database rs+nr
> DNA_homol SUPERLINK_HU7 "Blastx_rs+nr" 33 25 62 50069566 50069453
> DNA_homol SUPERLINK_HU7 "Blastx_rs+nr" 33 263 297 50001770 50001666
> DNA_homol SUPERLINK_HU7 "Blastx_rs+nr" 33 63 107 50060935 50060801
> (...)
> 
> At this point if I try to access the object "SUPERLINK_HU7" (using either xace or tace) it always
> crashes.
> 
> If I use xace/tace, it tells me that cache1 is full and asks me if I want write access
> (whatever my answer is, it still crashes).
> 
> ===> My machine is a:
> 
> SunOS 5.8 Processor: i386
> Memory: 2048M
> 
> ===> limit
> cputime         unlimited
> filesize        unlimited
> datasize        unlimited
> stacksize       8480 kbytes
> coredumpsize    0 kbytes
> vmemoryuse      unlimited
> descriptors     256
> 
> ===> We are using AceDB 4_9c
> 
> ===> My cachesize.wrm uses these values:
> 
> CACHE1 = 128000
> CACHE2 = 256000
> DISK =  8000
> 
> Thanks very much
> 
> Nicolas
>




