bionet.molbio.gene-linkage FAQ under construction

Darrell Root rootd at ohsu.edu
Mon May 16 00:17:30 EST 1994


This is a PRELIMINARY ROUGH DRAFT of a FAQ (frequently asked questions list)
for the bionet.gene-linkage newsgroup.

A FAQ is a document where commonly-asked questions can be answered by the 
experts in that area.  By putting common questions (and their answers!) into
one document, researchers will waste less time searching the internet for
answers to common questions.

This is my attempt to start a FAQ for bionet.gene-linkage.  When reading
through this preliminary rough draft, please remember the following:

1) I am not an expert in these topics (except for sun speed optimization)
2) I do not even know many of the questions which should be asked/answered
3) The questions I have asked/answered are biased toward my areas of
research, so other (valid!) areas may not be represented at all

What I would like to do is create a first-class FAQ for genetic-linkage
researchers to use.

I cannot do this alone.  Below is my preliminary-rough-draft.  Read through
it.  Pick a question I have not: 1) asked, 2) answered completely, or 3)
answered correctly.  State the question clearly, answer it to the best of
your ability (even if you just add one point which I failed to mention),
and email the result to rootd at ohsu.edu.

I will incorporate new submissions and post an updated FAQ during
the May28-29 weekend.  Eventually I will arrange for the FAQ to be
archived at the normal FAQ archives (if this project is successful).

Think of all the time you waste looking for stuff on the internet, all of
which could be saved if a first-class FAQ were created!

For the record, this project is not supported by any of our grants.  I wrote
this on my own time.

Darrell Root
rootd at ohsu.edu

BIONET.GENE-LINKAGE FREQUENTLY-ASKED-QUESTIONS
1) Where can I obtain the bionet.gene-linkage FAQ?
2) What anonymous-ftp sites have programs/utilities useful for genetic 
linkage analysis?
3) I think I know the name of a program I want, but I don't know where
I can find it
4) I have an ftp site with gene-linkage programs/utilities on it.  How
do I get registered with the archie servers?
5) What gopher sites have useful genetic-linkage information?
6) What database management programs do people use for genetic-linkage data?
7) What programs are available for pedigree drawing?
8) Why are some programs used primarily for chromosome mapping, while
others are used for disease-mapping?
9) What programs are used for chromosome mapping?
10) What programs are used for disease-gene mapping?
11) How do you calculate MAXHAP?
12) What programs are available to help detect errors in linkage data?
13) What books are helpful when learning about genetic linkage analysis?
14) How can I increase the speed of the linkage/fastlink package on my
workstation?
15) I set up 300 megs of paging space on my workstation, but now I'm running
out of hard-drive.  Is there any way I can use my hard drive space more
efficiently?
16) But I don't know how to do all this optimization, and my research assistant
is spending all his/her time trying to figure it out.
17) What genetic-linkage databases are available on the internet?




1) Where can I obtain the bionet.gene-linkage FAQ?
[rootd;15may94]

It is available by anonymous-ftp from ursula.ee.pdx.edu in
/pub/users/cat/rootd.  Once it is no longer "preliminary" I will
make certain it is on all the normal FAQ archive sites.

2) What anonymous-ftp sites have programs/utilities useful for genetic 
linkage analysis?
[rootd;15may94]

corona.med.utah.edu has Jurg Ott's Linkage package for many platforms, including
	some binaries

york.ccc.columbia.edu also has Jurg Ott's linkage package, but it is on a
	platform running VMS, and is difficult for us UNIX types to 
	"look around"

softlib.cs.rice.edu has FASTLINK, the optimized C versions of linkage which
	continue to undergo massive improvements

genome1.hgen.pitt.edu has Multimap, a lisp-based expert system which uses
	an optimized version of crimap to map chromosomes

ftp.bchs.uh.edu has some useful IBM programs, including:
	peddraw (a DOS pedigree drawing program--completely different
		from the B. Dyke Macintosh peddraw 4.x)
	fastmap produces a quick approximation to multipoint lod scores
	dolink	A DOS genetic database/analysis-setup program
	easistat A simple DOS statistics package
	easigraf Draws graphs of lod scores 

prep.ai.mit.edu	is the home of GNU (the free software foundation) which
	produces free software (such as the gcc compiler, and the emacs
	editor).

wuarchive.wustl.edu is the largest anonymous ftp-site on the planet.
	They have the whole GNU/free software foundation distribution,
	and tons of other stuff. 

mendel.welch.jhu.edu has all the files for OMIM (online mendelian 
	inheritance in man) and GDB (genome-data-base).  Searching with
	the online search program (see question 17) is much easier than
	browsing the raw files.
	

[I need an ftp site for crimap]

[There are many more sites with useful stuff.  Email information to
rootd at ohsu.edu and I will add them to this list]


3) I think I know the name of a program I want, but I don't know where
I can find it
[rootd;15may94]

There is a database program called archie, which maintains a list of all
files in registered anonymous-ftp sites.  You can telnet to an archie
server, and have it search the database.  Each site is updated every 30
days, so very recently posted programs might not be listed yet.

To use archie, you need to telnet to one of the archie server sites, which
are:
archie.rutgers.edu	archie.sura.net
archie.unl.edu		archie.ans.net
archie.mcgill.ca	(thanks to O'Reilly's Internet book for this list)

Use the login name "archie" and nothing as your password.  Here is a
simple archie login and search:

bigbox% telnet archie.unl.edu
login: archie
password: 			<--just hit return, not like anonymous-ftp

unl-archie> find linkmap
# Search type: sub.
# Your queue position: 2
# Estimated time for completion: 00:24
working... -

Host gatekeeper.dec.com    (16.1.0.2)
Last updated 21:04  9 Apr 1994

    Location: /contrib/src/pa/m3-2.07/src/driver/boot-DS3100
      FILE    -rw-r--r--    4000 bytes  23:00  2 Jun 1992  M3LinkMap_i.c
      FILE    -rw-r--r--   14027 bytes  23:00  2 Jun 1992  M3LinkMap_m.c

    Location: /contrib/src/pa/m3-2.07/src/driver/linker/src
      FILE    -rw-r--r--    1307 bytes  00:00  4 Dec 1991  M3LinkMap.i3
      FILE    -rw-r--r--    3078 bytes  00:00  4 Dec 1991  M3LinkMap.m3

unl-archie> 

Unfortunately, these linkmap programs have nothing to do with
J Ott's linkage package.  Most gene-linkage programs are not on
registered ftp sites.


4) I have an ftp site with gene-linkage programs/utilities on it.  How
do I get registered with the archie servers?
[rootd;15may94]

Send email to archie-admin at bunyip.com  with the domain-name of the
ftp site and the email address of the administrator.  If you are the
administrator of the ftp-site, identify yourself as such.


5) What gopher sites have useful genetic-linkage information?
[rootd;15may94]

gopher.gdb.org has background information on the human genome project,
	and archives of the "Human Genome News" newsletter.

Editor's note: There are many more, including the genethon gopher site
	(whose address I do not know)


6) What database management programs do people use for genetic-linkage data?
[rootd;15may94]

Paradox:	This is a full database-management system available from
		Borland computer company for IBM machines.  Like most other
		"full feature" databases, it is reliable and supported on most
		IBM platforms, but not tailored specifically to the needs of
		genetic researchers.  It has a good educational discount.
		We use it, but have to repeatedly set up our report-formats
		for linkage output.  Getting liped output format is nontrivial.

Linksys:	This custom-made database program was written by J Attwood
		and S Bryant.  Although they continue to use it, Dr Attwood
		suggests using dolink instead.  Linksys is not currently
		available at any ftp site.

Dolink:		This DOS custom database program (by D Curtis I think??)
		manages genetic data and sets up input files for your
		analysis.  It is available from ftp.bchs.uh.edu.

Kindred:	This new DOS database program, distributed by Epicenter
		Software, is specifically designed for linkage analysis.
		A free demo is available by calling (818)-304-9487.  In
		addition to database duties, this program (according to
		the ad, not from personal experience) will draw pedigrees,
		haplotype marker data, and can output in linkage format.
		The demo did not work on our IBM because our monitor is
		from the stone age.  We were able to get the demo to run
		on a Power-PC Mac with SoftWindows emulation, but it crashed
		the Mac when we hit the escape-key during the demo.  Be
		forewarned: the list price is about $500.

CEPH:	This database is specifically designed for chromosome mapping
	with ceph-style-pedigrees.  It can output data in ped.out
	format or linkage format.   Our version (5.0) fails when we
	output over 90 markers, but not the entire dataset.  Santosh
	Gupta wrote a program (called mkcrigen) which converted the
	ped.out files to .gen files.  Unfortunately we only have an
	old binary which was compiled with a maximum of about 85
	markers.  If you try to convert a ped.out file to a .gen file
	with more than 85 markers, your final .gen file is messed up.
	Santosh Gupta modified the program to work with 500 markers, but
	we do not have any source code for mkcrigen (any version) and we
	do not have a binary for the improved version.
		Some other labs output the data in linkage format and
	convert that to .gen format.  We don't like that because that
	separates the marker name from the marker data, and can result
	in errors.
		I believe that the ceph database is available on the
	ceph ftp site, but I do not have the address.

[Please send comments on database programs you use]

7) What programs are available for pedigree drawing?
[rootd;15may94]

peddraw(IBM version): This program (Possibly written by Dave Curtis)
	is a pedigree drawing program for IBMs available from ftp.bchs.uh.edu
	in the /pub/gene-server/dos directory.  I have never used it.

ftree:	This is another IBM pedigree program written by Rodney C.P.(?) at
	the University of Alabama.  I have a copy, but do not know where
	this program is available.  I don't use it, but some old pedigrees
	in a notebook look very pretty.

peddraw(Mac Version): This program, written by B Dyke, P Mamelka, and 
	J MacCluer, is available from:
		Paul Mamelka
		Department of Genetics
		Southwest Foundation for Biomedical Research
		P.O. Box 28147
		San Antonio, TX 78228-0147
	An upgrade from a previous version is $10 (current version = 4.4)
	Documentation costs $10
	I THINK the program itself costs $35, but that may be too high.

8) Why are some programs used primarily for chromosome mapping, while
others are used for disease-mapping?
[rootd;15may94]

Any family can be used for chromosome mapping, so CEPH has picked
a particular family "shape" and generated a large database with these
families.  Programs designed for chromosome mapping can be optimized for
using these families, reducing the time needed for calculations.

Only families afflicted with a disease can be used for disease-gene-mapping.
As a result, programs designed for disease-gene-mapping need to be able to
deal with arbitrary pedigrees.  In addition, these programs need to be able
to handle incomplete-penetrance.

9) What programs are used for chromosome mapping?
[rootd;15may94]

crimap:	This program has been used for chromosome mapping for years.
	It has options which can generate maps, calculate order probabilities,
	and print out recombination data.  It works on .gen files with data
	from CEPH-style families.

multimap: This Lisp-based expert system uses an optimized version of crimap
	to create a chromosome map.  It is available via anonymous ftp
	from genome1.hgen.pitt.edu.  The authors (T Matise, M Perlin, and
	A Chakravarti) continue to improve the code, add new functions,
	and provide excellent support.  When used with the crimap
	chrompic option (to find double-recombinations to identify
	possible errors), it is incredibly useful.

10) What programs are used for disease-gene mapping?
[rootd;15may94]

Simlink: This fortran program (by L Ploughman and M Boehnke) simulates
	linkage analysis on a family, and lets you "estimate the
	probability, or power, of detecting linkage given family history
	information on a set of identified pedigrees."  It allows the
	researcher to determine whether a family has sufficient informativeness
	to detect linkage.  In addition, it can help the researcher to decide
	how far apart to space their genetic probes without "missing" the
	disease locus (i.e. Do I use probes separated by 30cM? or will 40cM
	be close enough given the informativeness of this family).
		This can save the researcher considerable time and money.
	The researcher won't waste money doing a genome search on an
	insufficiently-informative family.  Large families can be "trimmed"
	during the initial genome-search, and then the entire family can
	be used later during marker-localization.  Simlink data can be
	useful on grant applications (to prove that the family you propose
	to analyze is sufficiently informative).
		Simlink requires large quantities of memory.  It was
	written for IBM's, but has been ported to many platforms including
	Sequent symmetry S8000's.

Liped:	This IBM program (written by Jurg Ott) calculates probabilities
	for genetic linkage between disease-markers and genetic-markers.
	Its input file differentiates between phenotypes and genotypes.
	As a result, this program is easiest to use when your data is
	from "old-style" genetic-markers (such as blood phenotype data).

Linkage: This package of programs, written by Jurg Ott in Pascal, calculates
	genetic linkage probabilities.  It consists of several analysis
	programs (each of which does a particular type of analysis) and several
	utility programs (which make the analysis programs easy to use).
	Versions are available for IBM's and unix platforms.
	Here are some of the analysis programs:
	mlink: 2-point lod-score calculations at fixed recombination distances
	linkmap: multipoint lod-score calculations at fixed distances
	ilink: calculates the recombination distance with the highest lod-score

fastlink: This is a port of the linkage package to C (by A Schaffer, R
	Cottingham, and R Idury).  The initial port increased the speed
	by an order of magnitude.  They continue to optimize the algorithm
	and code, resulting in continued speed improvements.  In addition,
	fastlink allows you to compile in "fast" or "slow" mode (the slow
	version of fastlink is still much faster than the old linkage programs).
	The "fast" version uses a ton of memory, but uses that memory to 
	contain some of the intermediate results which are repetitively
	recalculated in the "slow" version (and the old linkage package).
	We obtain good results by setting up 300 megs of virtual memory
	on our sparc and using the fast version (at one point we ran
	a fastlink linkmap run with 700 haplotypes).
		The fastlink programs are also more portable.  Earlier
	versions of fastlink required installation of p2c (the free-software
	foundation's pascal-to-C converter).  That is no longer necessary.

emaillink: I am developing an email-server for the fastlink programs, which
	will allow users to submit linkage data for analysis via email
	(similar to NIH's "Blast" DNA homology search server).  The system
	is currently working, but needs improvement before we betatest.
	Betatesting should begin later this month.


11) How do you calculate MAXHAP?
[rootd;15may94]

Maxhap is the maximum possible number of haplotypes in your analysis.
You multiply together the number of alleles at each locus used in a particular
run (not all the loci in your dataset, just the loci you use).  Remember that
affection status counts as two alleles, regardless of the number of liability
classes.

For example, if a dataset has the following information:

affection status: 4 liability classes
marker A: 3 alleles
marker B: 4 alleles
marker C: 5 alleles

And your run includes a linkmap run between affection-status, A, and B,
then your MAXHAP must be (at least) 2*3*4 = 24.
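
The same arithmetic can be written as a short Python sketch.  This is only
an illustration of the rule above, not part of the linkage package; the
names ALLELES and maxhap are made up for this example, and the allele
counts are the ones from the example dataset.

```python
from math import prod

# Allele counts per locus, from the example dataset above.
# Affection status always counts as 2 alleles, regardless of the
# number of liability classes (4 in the example).
ALLELES = {
    "affection": 2,
    "A": 3,
    "B": 4,
    "C": 5,
}

def maxhap(loci):
    """Smallest MAXHAP that works for a run using only the given loci."""
    return prod(ALLELES[name] for name in loci)

# Only the loci in this particular run count, not the whole dataset.
print(maxhap(["affection", "A", "B"]))       # 2*3*4 = 24
print(maxhap(["affection", "A", "B", "C"]))  # 2*3*4*5 = 120
```

Note how adding marker C to the run multiplies the requirement by 5, which
is why MAXHAP grows so quickly in multipoint runs.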


12) What programs are available to help detect errors in linkage data?

By linkage data, I mean any genetic-linkage dataset, not just those for
Ott's Linkage package.  This is an important question, and I simply do
not know the answer.

I've used the crimap-chrompic option, and played with xpic/phap a little
bit, but I really hope some people send me some information on this topic.

13) What books are helpful when learning about genetic linkage analysis?
[rootd;15may94]

Jurg Ott's Analysis of Human Genetic Linkage is THE work in this area,
but it is very advanced and difficult for many people (including me) to
understand (although I haven't tried recently, I should try again...)

Rumors indicate that Jurg Ott is also writing a book on how to use
the linkage package.

Please send me other suggestions.


14) How can I increase the speed of the linkage/fastlink package on my
workstation?
[rootd;15may94] [aha, finally a question I can confidently answer!]

a. Use fastlink (it will increase your speed by an order of magnitude)
b. Set up tons of paging space (using the hard-drive as virtual memory)
	and use the "fast" versions of fastlink.  300 megs is usually plenty.
c. Use gcc (the GNU/free software foundation C compiler) to compile fastlink
	(gcc produces machine language that is about 10% faster than sun's
	C compiler).
d. Install the generic-small kernel instead of the generic kernel (the generic
	kernel has device files for almost EVERYTHING.  The generic-small
	kernel is configured for a system without many devices and without
	many users).  Installing a generic-small kernel is an option during
	system installation on sun workstations.
e. Reconfigure your kernel so it has only devices which you need. This is
	a task for an experienced system administrator.  This should give you
	a small improvement in overall system speed, but if you are already
	running the generic-small kernel, additional improvement may be so
	small that it's not worth the trouble.  If the generic-small kernel
	is insufficient for your system (so you were forced to install the
	generic kernel) this step is a MUST.  The generic kernel will slow
	down your workstation significantly, and most of the device-support
	is unnecessary.
f. Don't run your linkage analyses in the background, because running programs
	in the background gives them a lower priority (on suns it reduces the
	priority level by 3 out of a total range of 40).  Either do the runs
	in the foreground (which is fine as long as you don't plan to log out)
	or you can use the root password to renice the pedin process by -3
	to compensate (negative nice values give a higher priority).
		If you need to log out, you can use the screen command
	(distributed by GNU/free software foundation) and "detach" a 
	session so you can log out without programs terminating.  Later
	you can log back in and "reattach" the session, which continued to
	run while you were logged out.  The screen command is available at
	prep.ai.mit.edu, and is also on the O'Reilly Unix Power Tools
	CD-ROM.
		According to the sun documentation, renicing below -10
	can interfere with the operating system and actually reduce the
	process' speed.  I just run them at a priority/nice level of 0
	(the standard default level).  That gives me reasonable response with
	my other applications, but still lets fastlink run at a decent speed.
g. Run with 100% penetrance
	Runs with 100% penetrance can run faster than runs with incomplete
	penetrance.  Of course, if you have an unaffected obligate
	carrier, this won't work.  In addition, incomplete-penetrance
	runs may be necessary for your research to be "good" (decisions
	like this are why the professors make the big bucks :-)

Of course, buying more RAM will increase your speed.  I've heard that
increasing RAM from 16 to 32 megs will result in a large increase in speed.
Increasing RAM from 32-64 megs will result in a significant increase.
Increasing beyond 64megs is not particularly helpful.  Note that this data
is anecdotal in nature (I haven't seen it myself), but it makes intuitive
sense to me.  If someone sends me some SIMMs for our sparcII, I'll be 
glad to test it out :-)

[note: I run on a sun sparcII.  I'd like to hear data from people on other
platforms.  I'd especially like to hear data on the speed-RAM relationship.]


15) I set up 300 megs of paging space on my workstation, but now I'm running
out of hard-drive.  Is there any way I can use my hard drive space more
efficiently?

Paging space is hard-drive space which is used as virtual RAM.  Unix boxes
use paging space constantly, swapping processes out to the hard-drive and
back into RAM as needed.

There are two types of paging-space on sun systems (and many other types of
Unix systems as well): paging files, and paging sectors.

Paging files are actual files (you can do an ls and find them in a directory
somewhere) in the filesystem.  Paging sectors are separate disk partitions,
and as such are not in the filesystem.

A filesystem has two types of overhead.  Consider the following output:
bigbox% df
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/sd0a               7735    5471    1491    79%    /
/dev/sd0g             151399  127193    9067    93%    /usr
/dev/sd3a             306418  266644    9133    97%    /usr2
bigbox% df -i
Filesystem             iused   ifree  %iused  Mounted on
/dev/sd0a                951    3913    20%   /
/dev/sd0g              10218   66390    13%   /usr
/dev/sd3a               6278  150394     4%   /usr2

	The top df command shows the space available on "bigbox" in k.
Note that, although sd3a has 306 megs, of which 267 megs are used, only
9 megs are available.  This is because the filesystem saves a "10%" rainy
day fund, so 10% of the filesystem is unusable.  Although you can reduce this
percentage (with the root password and using an arcane command), it is not
recommended.  According to sun's documentation, when the filesystem gets more
than 90% full the speed of the filesystem will begin to rapidly drop.
	When you have a 100 meg paging file, there is a corresponding 10 megs
of "rainy-day-fund" which you cannot access, so setting up a 100 meg paging
file requires about 110 megs of disk space.
	But when you use a separate partition as a paging sector, no 10%
rainy-day fund is necessary.  100 megs of raw disk space will give you
100 megs of virtual-RAM.
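
The "rainy day fund" arithmetic can be checked against the df listing
above with a few lines of Python.  This is only a sketch: the function
name is made up, and rounding the 10% reserve down to a whole kbyte is my
assumption (it happens to reproduce the "avail" column exactly).

```python
def avail_kbytes(total_kb, used_kb, reserve_pct=10):
    """Space an ordinary user can still write to.

    Assumes the filesystem holds back reserve_pct percent of its total
    size, rounded down to a whole kbyte.
    """
    reserve_kb = total_kb * reserve_pct // 100
    return total_kb - reserve_kb - used_kb

# (mount point, kbytes, used) copied from the df listing above
listing = [
    ("/", 7735, 5471),
    ("/usr", 151399, 127193),
    ("/usr2", 306418, 266644),
]

for mount, total_kb, used_kb in listing:
    print(mount, avail_kbytes(total_kb, used_kb))
# prints 1491, 9067 and 9133, matching the "avail" column
```

With reserve_pct=0 (the paging-sector case, where there is no filesystem
and so no reserve) the same function simply returns total minus used.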

	The bottom df command shows the number of inodes available in the
filesystem.  An inode describes a single file, and is part of the filesystem
that you rarely need to look at.  By default, when you create a filesystem in a
partition, one inode is created for every 2k in the partition.  The
306 meg partition has about 157,000 inodes, but only 4% of them are used.
	I don't know how large an inode is (a quick search through my
documentation failed to find it) but I would guess that an inode is 
256 bytes.  If that's true, the 150,000 unused inodes above are wasting
37.5 megs of disk-space.  One inode for every 2k is too much.
	When you create a 100 meg paging file, you only use 1 inode, but
that 100 megs of filesystem has a corresponding 50,000 inodes!
	If you create a paging-sector, you are not using a filesystem, so
no inodes are necessary.  In addition, when you create a filesystem, you
can reduce the number of inodes to something more reasonable (like one
inode for every 10k of disk space).  I generally don't mess with the
inode count on my / and /usr partitions, since that contains the
operating system.  Make certain not to reduce the default inode number
too much: YOU DON'T WANT TO RUN OUT OF INODES.
	We converted our 350 megs of paging files to paging sectors, and
got another 70 megs of free disk space as a result (20%)!
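
The inode arithmetic above can be sketched in Python as well.  The
function names are made up for illustration, and the 256-byte inode size
is only my guess from the text above, not a documented figure, so it is
left as a parameter you can change.

```python
KBYTE = 1024  # bytes per kbyte

def inodes_created(partition_kb, kb_per_inode=2):
    """Number of inodes made when a filesystem is created in a partition."""
    return partition_kb // kb_per_inode

def inode_overhead_kb(n_inodes, inode_bytes=256):
    """Disk space taken by n_inodes; inode_bytes=256 is a GUESS."""
    return n_inodes * inode_bytes / KBYTE

# Roughly 150,000 unused inodes on the 306 meg partition
print(inode_overhead_kb(150_000))      # 37500.0 kbytes, i.e. 37.5 megs

# A 100 meg (100,000 kbyte) paging file sits on this many inodes
print(inodes_created(100_000))         # 50000 at one inode per 2k
print(inodes_created(100_000, 10))     # 10000 at one inode per 10k
```

If the real inode size on your system is smaller than 256 bytes, the
wasted space shrinks in proportion, but the comparison between one inode
per 2k and one inode per 10k stays the same.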


16) But I don't know how to do all this optimization, and my research assistant
is spending all his/her time trying to figure it out.

Unix system administration is a complex task which requires experience.
An experienced sysadmin can do in minutes what it would take you hours
(or days) to accomplish.  In addition, an experienced sysadmin won't make
stupid mistakes very often (let's see, while I was learning on-the-job
I ruined our backup tape during an upgrade {luckily the upgrade
was successful!}, moved a directory inside itself as root, botched email
service a couple times, and spent tons of time figuring out how to accomplish
simple tasks).

Most universities have small budgets for their system administrators.  Many
head sysadmins have recruited students to assist them.  Basically the students
slave away for nothing, learn tons of stuff, barely pass their classes, become
unix gods, and get hired for 40k+/year if/when they graduate/flunk out.

If your university has a sysadmin group like this, you can probably "hire"
them to support your machine for about $6/hour at about 4 hours/week*machine.
The head-sysadmin will be happy to give some money to their more-experienced
volunteers, the volunteers get another line on their resume+additional
experience, and you get experienced sysadmins to run your machine.  In
addition, most sysadmin groups have an automated nightly backup.  Just
think: your machine gets backed up EVERY NIGHT AUTOMATICALLY!

At Portland State University the Electrical Engineering sysadmin group
has been hired to maintain the unix machines of four other departments,
at an average price of $15/week*machine (no additional price for xterms!)
The quality of the service is excellent (especially since the most
experienced volunteers are usually the ones given the money), there is
no annual training-gap as people leave (since the experienced volunteers are
constantly training the new ones) and you have the entire resources
and experience of the sysadmin group to help you.

Of course, test them by deleting an unimportant file and seeing if they
can restore it from backups (the backup test is the most important in
system administration--have you tested your backups lately?).  If they
successfully restore the file from backups, give them the sun-optimization
list (above two questions) and watch as the most experienced volunteer
turns the optimization into a recruit-training session :-)  They may even
have a contest to see how small they can make your kernel-configuration file!


17) What genetic-linkage databases are available on the internet?

medline is a database for searching for articles in journals.
	If you are in the pacific-northwest, you can get to medline
	using telnet.  Just telnet to uwin.u.washington.edu and go
	into the library databases.  It can even email you the output
	if you wish!
		Many libraries and many internet service providers 
	have medline services online.  Some interfaces are better than
	others (we don't even bother using the one at OHSU--it's too
	painful...)  Your local library can probably supply you with
	information.

Victor McKusick wrote a book: Mendelian Inheritance in Man.  It is
	continuously updated online at Johns-Hopkins University 
	(making it online-MIM or OMIM).  Combined with the Genome-
	Data-Base, it is available online at welchlab.welch.jhu.edu.
		You need to get an account.  Send email to 
	help at welch.jhu.edu for information.
