A question of probability.

wijsman at max.u.washington.edu wijsman at max.u.washington.edu
Sat Oct 16 15:59:31 EST 1993

>> The situation:
>> 	A family is segregating an autosomal dominant trait. There is known 
>> genetic heterogeneity associated with this trait, but a "good proportion" of 
>> families show tight linkage to a marker on chromosome 22. The family under 
>> investigation shows a LOD of 3.0 (or whatever) at theta=0 (i.e. no 
>> recombinants) using this candidate marker on 22.
>> The question: 
>> 	BASED ON THE LOD SCORE ALONE (and not on an estimation of the 
>> proportion of families that are linked to 22 ) what is the probability 
>> that the trait in this particular family is tightly linked to the marker on 
>> 22?  How does one go about this type of calculation? Any references that 
>> might help?

I am not quite sure what the question is here - if we know that some
families are linked to this locus, then we should use that information to
compute the probability that the disease locus in this family is also
linked to this marker.  We should not consider the lod score in this family
in isolation from the information about genetic heterogeneity.  Only if we
do not know the disease location (or any of the disease locations) can we
estimate a probability of linkage for this family alone based on its lod
score of 3.0. The probability of linkage given a lod score of 3.0 is based
on a Bayesian argument which incorporates the prior probability of linkage. 
If we have some information already that there is a form of the disease
which is linked to the marker, then this information increases the prior
probability of linkage over that which is appropriate if we do not have
this extraneous information.

>> 	When searching for linkage to a previously unknown locus, a LOD of 
>> 3.0 approximates to a 95% probability of linkage (not the 1000 to 1 
>> odds frequently cited). How is this changed if we are looking for linkage to 
>> a single candidate locus as in the present example?

> That depends on the map length and marker density, hence the number
> of "independent" tests.

A lod score of 3.0 approximates a 95% probability of linkage (in humans)
for the following reason.  For lod=3 the data are 1000 times more probable
under the hypothesis of linkage than under the null hypothesis of free
recombination, but there is a very low prior probability that we would
choose 2 linked loci from a random set of markers.  This prior probability
(as Toby Bradshaw notes) is a function of the map length; longer map=lower
prior probability, shorter map=higher prior probability.  (But it isn't a
function of the marker density!)  The 1000:1 odds describes the probability
of the data GIVEN linkage divided by the probability of the data GIVEN the
null hypothesis.  Note that this involves comparing the probability of the
data under two different hypotheses.  However, what is being asked by "what
is the probability of linkage GIVEN a lod score of 3" is not the same thing
as "what is the probability of the data GIVEN linkage (or free
recombination)".  To get the probability of linkage, apply a Bayesian
argument which incorporates the prior probability of linkage, and the
probability of getting lod=3 under the null and alternative hypotheses.

>> How is this changed if we are looking for linkage to 
>> a single candidate locus as in the present example?

That depends on how strong the candidate locus is as a candidate.  A really
good candidate will substantially increase the prior probability of linkage,
thus requiring a lower lod score for the same posterior probability of
linkage.  Exactly how large this change will be might be hard to quantify.  I
tend to be pretty skeptical of "candidates" since it seems like in many
situations we know just enough to come up with some reason that many genes
could be candidates, but not enough to have really firm reasons for these
arguments.  There are, however, sometimes really good candidate loci.
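
To make the trade-off between prior and lod score concrete, here is a
rough sketch in Python.  It assumes that 10^lod can be treated as a simple
likelihood ratio (posterior odds = likelihood ratio * prior odds), which is
only an approximation for linkage data.  The function name and the prior
values are illustrative choices, not estimates for any real candidate locus.

```python
import math

def required_lod(prior, target_posterior):
    """Lod score needed to move `prior` up to `target_posterior`.

    Assumes posterior odds = 10**lod * prior odds, i.e. the lod score
    is treated as a plain log10 likelihood ratio (an approximation).
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = target_posterior / (1.0 - target_posterior)
    return math.log10(posterior_odds / prior_odds)

# Illustrative priors: a random marker versus increasingly good candidates.
for prior in (0.03, 0.2, 0.5):
    lod = required_lod(prior, 0.95)
    print(f"prior={prior:.2f}  lod needed for 95% posterior: {lod:.2f}")
```

Under these assumptions the lod score needed for a 95% posterior falls
from a little under 3 for a prior of 0.03 to a little over 1 for a prior
of 0.5.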

> In a single test, the LOD score and p value are the same, if my
> intuition is correct.  Ellen Wijsman -- are you listening?

Technically no, but in practice yes, if we take 1/10^lod as the p value. 
The p value is the probability of this strong a result or stronger under
the null hypothesis.  Call this a (the type-1 error rate).  Set b to be the
probability of failing to reject the null hypothesis when the alternative
is true (so 1-b is the power of the test).  For linkage analysis b isn't
actually a constant since it will be harder to reject the null hypothesis
for loose than for close linkage given a fixed a and critical value for the
lod score.  But for the sake of a simple argument (and because we can only
use alphanumeric characters in this forum) let us assume that b is a constant
for some critical value (e.g., lod=3).  The sequential design on which
linkage analysis in humans was originally based sets the critical lod score
to accept the hypothesis of linkage as lod(crit) = log10( [1-b]/a ).  If
b=0.01 (or something equally small) then lod(crit) =~ log10( 1/a ), where a
is the p-value of the test.
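
A quick numerical check of that approximation, as a sketch in Python.  The
value b=0.01 is the one used above; a=0.001 is chosen here purely for
illustration because it corresponds to 1/10^3.  The function name is made up
for this example.

```python
import math

def critical_lod(a, b):
    """Critical lod score of the sequential test: log10((1 - b) / a)."""
    return math.log10((1.0 - b) / a)

a = 0.001  # type-1 error rate (illustrative value)
b = 0.01   # type-2 error rate, treated as a constant

exact = critical_lod(a, b)
approx = math.log10(1.0 / a)
print(f"log10((1-b)/a) = {exact:.3f}")   # 2.996
print(f"log10(1/a)     = {approx:.3f}")  # 3.000
```

With b this small the two values differ only in the third decimal place,
which is why 1/10^lod works as a p value in practice.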

By using a & b & the prior probability of linkage (say, g), we can compute
the probability of linkage given that we have gotten the critical lod score
with the observed data.  The following is a quick & dirty approximate
argument.  To abbreviate:  L stands for linkage, r=recombination fraction,
D=observed data.  Approximate p(D | L) = p(D | r) (this isn't quite true).

    p( L | D ) = p(D | L)*p(L)/{p(D | L)*p(L) + p(D | not-L)*p(not-L)}. 
But p(D | L) = (1-b), and p(D | not-L) = a.  So, 
    p( L | D ) = (1-b)*p(L)/{ (1-b)*p(L) +a*p(not-L) }. 

We can estimate p(L) for unmapped loci:  if we need to be within 30 cM to
call the situation "linkage", then if the human genome is 3000 cM long,
p(L) is of the order of 30/3000 = 0.01.  Taking p(L) = 0.03 as a working
value (so p(not-L) = 0.97), and substituting a=0.001 and b=0.01, we get:
    p(L | D) = .99*.03/{.99*.03 + .001*.97} = .968.

But if we know that, e.g., 50% of the families have a form of the disease
which is linked to the marker, then p(L) = 0.5, not the much smaller 0.03.
This gives p(L | D) = .99/{.99 + .001} which is almost 1.0.  Likewise, a
candidate gene will also raise p(L), with the same consequences.  But it is
harder to quantify p(L) for a candidate locus than for genetic
heterogeneity.  (And even for heterogeneity, to do things right you would
have to integrate over the possible heterogeneity values).
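
The two calculations above can be reproduced with a few lines of Python.
This is only a sketch of the quick & dirty approximation, with a and b
treated as constants; the function name is invented for this example, and
it does not attempt the integration over heterogeneity values.

```python
def posterior_linkage(prior, a=0.001, b=0.01):
    """p(L | D) = (1-b)*p(L) / {(1-b)*p(L) + a*p(not-L)}.

    prior: p(L), the prior probability of linkage
    a:     type-1 error rate (p value at the critical lod score)
    b:     type-2 error rate, so 1-b is the power
    """
    power = 1.0 - b
    return power * prior / (power * prior + a * (1.0 - prior))

# Prior for an unmapped locus versus a prior based on 50% linked families.
for prior in (0.03, 0.5):
    print(f"p(L)={prior:.2f}  p(L | D)={posterior_linkage(prior):.3f}")
```

Calling posterior_linkage with other priors shows how quickly the posterior
approaches 1 once p(L) is no longer tiny.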

Good references:  to get a lucid and quick description of the Bayesian part
of the argument, look up the paper by Haldane & Smith, 1947 (or 1948).  If
I remember correctly, the title has something to do with color blindness.
I don't have the reference here at home.  J Ott's book also has a
pretty good description of the basic theory (chapter 4 in the 2nd edition,
or chapter 3 in the first edition).  This book also has the Haldane & Smith
reference in it.  Morton's original 1955 paper in Am J Hum Genet has the
information also, but is much harder to read.

Ellen Wijsman
Div of Medical Genetics, RG-25
and Dept of Biostatistics
University of Washington
Seattle, WA   98195
wijsman at u.washington.edu
