Austin P. So (Hae Jin) nobody at
Mon May 29 23:54:16 EST 2006

Rex Eastbourne wrote:
> Thanks again for replying. The k-means algorithm should be a snap. But
> how do I convert the proteins, which are in the format
> "UPSP_SLDJK_HUMAN_P12182" to vectors that can be handled by the
> mathematical algorithm (i.e. what is the "distance" between two
> proteins)? Is there already a program that does this? (I understand
> there's something on the NCBI's website.)

So, if I understand the format of the data:

1. "UPSP_SLDJK_HUMAN_P12182" is just a name...say it is a row id.
2. with that name (i.e. in each row), you will have a series of data 
points, each data point corresponding to the amount of protein found in 
patient X (technically you don't have to know if they have the disease 
or not).
3. each column (i.e. patient data) will therefore be a 
(multidimensional) data vector, with each protein being an "axis".

            patient1    patient2    patient3    patient4
protein1    1           50          49          3
protein2    2           35          30          1
protein3    30          20          20          31
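
To make the "distance" question concrete: one common choice (an assumption on my part, nothing in the data dictates it) is the plain Euclidean distance between two patient columns. A minimal Python sketch using the example table above; the names `table`, `vectors` and `euclidean` are made up for illustration:

```python
import math

# Example table from above: rows are proteins, columns are patients.
table = {
    "protein1": [1, 50, 49, 3],
    "protein2": [2, 35, 30, 1],
    "protein3": [30, 20, 20, 31],
}
patients = ["patient1", "patient2", "patient3", "patient4"]

# Transpose the table: one vector per patient, one coordinate per protein.
vectors = {
    name: [table[protein][i] for protein in table]
    for i, name in enumerate(patients)
}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# patient1 and patient4 are close (about 2.45),
# patient1 and patient2 are far apart (about 59.9).
d14 = euclidean(vectors["patient1"], vectors["patient4"])
d12 = euclidean(vectors["patient1"], vectors["patient2"])
```

With real data you would probably want to normalize each protein (row) first, so that one highly abundant protein does not dominate the distance.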

In this way you can apply (hierarchical) k-means clustering on the 
column "vectors".
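
As a rough illustration, here is plain (non-hierarchical) k-means with k=2 on the example columns, in Python with no external libraries. This is only a sketch under my own assumptions: Euclidean distance, the first two patients as starting centroids, and made-up function names. A real analysis would use an established clustering package and several random starts.

```python
import math

# Patient column vectors from the example table (one coordinate per protein).
vectors = {
    "patient1": [1, 2, 30],
    "patient2": [50, 35, 20],
    "patient3": [49, 30, 20],
    "patient4": [3, 1, 31],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    # Coordinate-wise mean of a non-empty list of vectors.
    return [sum(coords) / len(points) for coords in zip(*points)]

def kmeans(points, start_centroids, max_iter=100):
    centroids = [list(c) for c in start_centroids]  # do not modify the input
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids

names = list(vectors)
points = [vectors[n] for n in names]
# Naive initialization (assumption): use the first two patients as centroids.
labels, centroids = kmeans(points, [points[0], points[1]])
clusters = dict(zip(names, labels))
# patient1 and patient4 land in one cluster, patient2 and patient3 in the other.
```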

Note that you may not get anything out of this either, since ultimately 
your analysis is only as good as your data...


