SUMMARY: comparing allele frequencies

Wed Mar 23 09:23:52 EST 1994

          To:IN%"genetic-linkage at"
          Subj:RE: comparing allele frequencies
          Don Bowden "bowden at" writes:
               This is probably a dumb question, but... we have genotyped a 
          microsatellite  in  a  large number  of  caucasian  and  
          african-american  samples and would like to compare the allele  
          frequency distribution  to  see if they are different. I did this  
          using  a contingency table to calculate chi square and there is a 
          significant difference. I seem to recall though that if any of 
          the cells has less than 5 elements in it, chi square is not the 
          appropriate way to go. What is the right way to do this, and more  
          importantly,  is  there some textbook or other source which would  
          show  a simple-minded molecular biologist how to do it?
          Following  is a summary of the many responses I received; I  left 
          off  the names to protect the innocent, but will be happy to  get 
          people in touch with each other if they want. Thanks for the many 
          helpful suggestions.....
          This is not a dumb question.  It is not easy to deal with 
          statistical  analysis  of loci with lots of alleles, as is  
          typical  of micro-satellite  repeats.  You could look at Bruce  
          Weir's  "Data Analysis"  book; there is some stuff on tests 
          involving  multiple locus markers.  Depending on the number of 
          alleles it may be easy or  hard;  there has been quite a bit 
          published in the  last  few years on statistical tests involving 
          extraordinarily  polymorphic systems, but this literature hasn't 
          made it into books yet.
               You  are  correct to be leery of tests which  are  based  on 
          large sample approximations when your samples aren't big  enough.  
          The "5" rule for the chi-square test is more a rule-of-thumb than 
          a hard-and-fast rule.  For tables with not too many cells, it  is 
          often  possible to use exact permutation tests  instead.   Rather 
          than  just consult a textbook if you are unsure of what  you  are 
          doing,  why  don't you see if your university has  a  statistical 
          consulting service?  At least they might steer you to appropriate 
          analyses,  even  if  you  have  to  carry  out  the   
          computations  yourself.....
          Bruce Weir's book, Genetic Data Analysis (Sinauer, 1990) provides 
          a thorough and expert (but not simple) discussion. Be aware  that 
          this is a hot and extremely controversial question at the  moment 
          (if  we  knew the one, or any one, definitively  correct  way  of 
          comparing  allele frequency distributions for samples drawn  from 
          two  populations of humans, typed for multiallelic  DNA  markers, 
          and  from  the comparison estimating accurately how  much  allele 
          frequencies  truly vary between populations, most of the  
          controversy  concerning  forensic DNA typing could go away).  
          What  you specifically  need  to  be aware of  is  that  several  
          competing "definitive"  solutions  exist at the moment, and 
          Weir's  is  the only one....
               You  are  right to question the accuracy of  the  Chi-square 
          result  in the case where some of the cell numbers are < 5  in  a 
          2Xn  contingency   table.  The best way to do the test  is  by  a 
          Monte-Carlo simulation, where many random datasets are  generated 
          that  all have the same marginal totals that your data have.  The 
          Chi-square  value  is calculated for all of the  tables  and  the 
          position of your table among all the tables is used as the  
          measure of significance.
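
          A minimal sketch of the randomization test described above,
          assuming the data are held as a table of allele counts with one
          row per population. The function names and the example counts
          are made up for illustration and do not come from any of the
          programs mentioned in this summary. Shuffling the pooled alleles
          and dealing them back out to the populations keeps both sets of
          marginal totals fixed.

```python
import random

def chi_square(table):
    """Pearson chi-square statistic, sum of (O-E)^2 / E, for a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def monte_carlo_p(table, n_sim=2000, seed=0):
    """Estimate a p-value by shuffling alleles among populations."""
    rng = random.Random(seed)
    n_cols = len(table[0])
    row_totals = [sum(row) for row in table]
    # One entry per sampled allele, holding its column (allele) index.
    alleles = [j for row in table for j, count in enumerate(row)
               for _ in range(count)]
    observed = chi_square(table)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(alleles)
        simulated, start = [], 0
        for total in row_totals:
            counts = [0] * n_cols
            for j in alleles[start:start + total]:
                counts[j] += 1
            simulated.append(counts)
            start += total
        # Count simulated tables at least as extreme as the observed one.
        if chi_square(simulated) >= observed - 1e-12:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_sim + 1)

if __name__ == "__main__":
    # Hypothetical 2 x 5 table: rows are populations, columns are alleles.
    counts = [[40, 31, 3, 1, 0],
              [22, 45, 1, 4, 2]]
    print(monte_carlo_p(counts))
```

          The p-value is the fraction of shuffled tables whose chi-square
          is at least as large as the observed one, so its precision
          depends on the number of simulations.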
               A biologically relevant reference for doing this is: Roff,
          D.A. and Bentzen, P. 1989. The statistical analysis of
          mitochondrial DNA polymorphisms: chi-square and the problem of
          small samples. Molecular Biology and Evolution 6:539-545
               A book that addresses this issue is: Agresti, A. 1990  
          Categorical Data Analysis, Wiley Pub Co.
               A  couple of years ago I wrote a DOS program to do the  same 
          kind of analysis.....
          Yes, Monte Carlo is a good way to do the tests, if programs
          already exist.  However, if the tables are very large and
          sparse, standard Monte Carlo (just keeping the marginals fixed)
          is very slow. Markov chain Monte Carlo methods become an option
          then, but to my knowledge they have not yet been implemented in
          programs that you could apply directly to your data.
               You should post a summary of your replies - you are 
          undoubtedly  not the only person who wants to do such tests.   
          They  are also relevant for case-control studies, e.g., when you 
          might be
          interested in linkage disequilibrium in the vicinity of a mapped
          disease locus.....
               Thanks for forwarding the message  regarding the analysis of 
          sparse,  many-celled contingency tables (by  Monte-Carlo  
          simulations).  My  small, home-brewed program called  was  
          designed  to analyze  these  kinds  of data. I've used it  to  
          compare  allele frequency  distributions of RFLP VNTR's for the 
          forensic  lab  at the  Royal Canadian Mounted Police and the FBI. 
          I've  distributed the program to anyone who requests it....
               There  is a problem with markers with large numbers  of  
          alleles.  Your approach is basically right .... and you  are
          right that chi-square is not very robust with fewer than 5 per cell.
          Most stat  people  will tell you to collapse cells until  you  
          get  at least 5 per cell... e.g. take the 110 and 112 bp alleles 
          and  put  them in one cell....
               According to my old version of Steel and Torrie, you simply
          need to apply a correction for continuity to your test. They
          quote Yates as proposing the reduction of the absolute deviation
          |observed-expected| by 0.5....
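
          For illustration, the continuity correction quoted above could
          be sketched as follows. The function name is made up, and note
          that the Yates correction is ordinarily applied to 2 x 2 tables
          (one degree of freedom), so treat this as a sketch of the
          arithmetic rather than a recommendation for larger tables.

```python
def yates_chi_square(table):
    """Chi-square with each |O-E| reduced by 0.5 before squaring."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                # Never let the correction push the deviation below zero.
                deviation = max(abs(observed - expected) - 0.5, 0.0)
                stat += deviation ** 2 / expected
    return stat

if __name__ == "__main__":
    # Hypothetical 2 x 2 table of counts.
    print(yates_chi_square([[12, 5], [6, 14]]))
```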
               You calculated the chi-square value as sum [ (O-E)^2 / E ].
          A better statistic [but with the exact same data structure] is
          the G test. It is more closely distributed as a chi-square,
          especially when class numbers are small. The rule of thumb for
          this test is to have no expected class number less than 1. If
          classes are too small, a column can be lumped with another
          column of a rare allele. The larger the table (more than 2X2),
          the less sensitive this test is to small expected numbers. See
          Sokal and Rohlf, Biometry, Second Ed., 1981, Freeman, pages
          731-747. Your design is probably model II (maybe I, but not
          III). See especially pages 744-6. This is a test for
          independence, homogeneity, or heterogeneity. The G-test might
          be called a likelihood ratio test in another text.....
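
          A small sketch of the G statistic, G = 2 * sum [ O * ln(O/E) ],
          computed from the same table of counts as the chi-square. The
          function name and the example counts are hypothetical; cells
          with an observed count of zero contribute nothing to the sum.
          Any further adjustment described in Sokal and Rohlf is not
          included here.

```python
import math

def g_statistic(table):
    """Likelihood ratio (G) statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_totals[i] * col_totals[j] / n
                total += observed * math.log(observed / expected)
    return 2.0 * total

if __name__ == "__main__":
    # Hypothetical 2 x 4 table: rows are populations, columns are alleles.
    counts = [[30, 25, 2, 1],
              [28, 30, 1, 3]]
    print(g_statistic(counts))
```

          Like the chi-square, G is compared against a chi-square
          distribution with (rows - 1) * (columns - 1) degrees of freedom.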
          You should use a Fisher exact test (because of the small cell
          sizes, as you surmised), but probably a variation of the test
          (since Fisher's test is for 2 X 2 tables) that was described by
          SW Guo and EA Thompson, "Performing the exact test of
          Hardy-Weinberg proportion for multiple alleles," Biometrics
          48:361-372, 1992. Guo and Thompson also have a more detailed
          technical report and a program, both available on request from
          Elizabeth Thompson at the University of Washington: call
          (206) 543-7237 and ask the secretary to send Technical Report
          #187. While you are not necessarily testing H-W, the methods
          are easily adapted to the comparisons of population allelic
          distributions you are doing....
               If you are interested in doing exact tests of Hardy-Weinberg 
          Equilibrium,  there is a nice program available from Sun-Wei  Guo 
          at  the University of Michigan. His programs are written  in  `C' 
          and  require compilation on your own machine. I've compiled  them 
          on a Sun Workstation and found them very easy to use.....

