Description of apcluster

 This program has been downloaded from
 http://www.psi.toronto.edu/affinitypropagation/apcluster.m
 It is the Matlab implementation of affinity propagation
 Mateda uses in different situations in which clustering of data is needed
 See the web page for conditions of use and appropriate references

 [idx,netsim,dpsim,expref]=apcluster(s,p)

 APCLUSTER uses affinity propagation (Frey and Dueck, Science,
 2007) to identify data clusters, using a set of real-valued
 pair-wise data point similarities as input. Each cluster is
 represented by a data point called a cluster center, and the
 method searches for clusters so as to maximize a fitness
 function called net similarity. The method is iterative and
 stops after maxits iterations (default of 500 - see below for
 how to change this value) or when the cluster centers stay
 constant for convits iterations (default of 50). The command
 apcluster(s,p,'plot') can be used to plot the net similarity
 during operation of the algorithm.

 For N data points, there may be as many as N^2-N pair-wise
 similarities (note that the similarity of data point i to k
 need not be equal to the similarity of data point k to i).
 These may be passed to APCLUSTER in an NxN matrix s, where
 s(i,k) is the similarity of point i to point k. In fact, only
 a smaller number of relevant similarities are needed for
 APCLUSTER to work. If only M similarity values are known,
 where M < N^2-N, they can be passed to APCLUSTER in an Mx3
 matrix s, where each row of s contains a pair of data point
 indices and a corresponding similarity value: s(j,3) is the
 similarity of data point s(j,1) to data point s(j,2).

 APCLUSTER automatically determines the number of clusters,
 based on the input p, which is an Nx1 matrix of real numbers
 called preferences. p(i) indicates the preference that data
 point i be chosen as a cluster center. A good choice is to 
 set all preference values to the median of the similarity
 values. The number of identified clusters can be increased or
 decreased  by changing this value accordingly. If p is a
 scalar, APCLUSTER assumes all preferences are equal to p.

 The fitness function (net similarity) used to search for
 solutions equals the sum of the preferences of the the data
 centers plus the sum of the similarities of the other data
 points to their data centers.

 The identified cluster centers and the assignments of other
 data points to these centers are returned in idx. idx(j) is
 the index of the data point that is the cluster center for
 data point j. If idx(j) equals j, then point j is itself a
 cluster center. The sum of the similarities of the data
 points to their cluster centers is returned in dpsim, the
 sum of the preferences of the identified cluster centers is
 returned in expref and the net similarity (sum of the data
 point similarities and preferences) is returned in netsim.

 EXAMPLE

 N=100; x=rand(N,2); % Create N, 2-D data points
 M=N*N-N; s=zeros(M,3); % Make ALL N^2-N similarities
 j=1;
 for i=1:N
   for k=[1:i-1,i+1:N]
     s(j,1)=i; s(j,2)=k; s(j,3)=-sum((x(i,:)-x(k,:)).^2);
     j=j+1;
   end;
 end;
 p=median(s(:,3)); % Set preference to median similarity
 [idx,netsim,dpsim,expref]=apcluster(s,p,'plot');
 fprintf('Number of clusters: %d\n',length(unique(idx)));
 fprintf('Fitness (net similarity): %f\n',netsim);
 figure; % Make a figures showing the data and the clusters
 for i=unique(idx)'
   ii=find(idx==i); h=plot(x(ii,1),x(ii,2),'o'); hold on;
   col=rand(1,3); set(h,'Color',col,'MarkerFaceColor',col);
   xi1=x(i,1)*ones(size(ii)); xi2=x(i,2)*ones(size(ii)); 
   line([x(ii,1),xi1]',[x(ii,2),xi2]','Color',col);
 end;
 axis equal tight;

 PARAMETERS
 
 [idx,netsim,dpsim,expref]=apcluster(s,p,'NAME',VALUE,...)
 
 The following parameters can be set by providing name-value
 pairs, eg, apcluster(s,p,'maxits',1000):

   Parameter    Value
   'sparse'     No value needed. Use when the number of data
                points is large (eg, >3000). Normally,
                APCLUSTER passes messages between every pair
                of data points. This flag causes APCLUSTER
                to pass messages between pairs of points only
                if their input similarity is provided and
                is not equal to -Inf.
   'maxits'     Any positive integer. This specifies the
                maximum number of iterations performed by
                affinity propagation. Default: 500.
   'convits'    Any positive integer. APCLUSTER decides that
                the algorithm has converged if the estimated
                cluster centers stay fixed for convits
                iterations. Increase this value to apply a
                more stringent convergence test. Default: 50.
   'dampfact'   A real number that is less than 1 and
                greater than or equal to 0.5. This sets the
                damping level of the message-passing method,
                where values close to 1 correspond to heavy
                damping which may be needed if oscillations
                occur.
   'plot'       No value needed. This creates a figure that
                plots the net similarity after each iteration
                of the method. If the net similarity fails to
                converge, consider increasing the values of
                dampfact and maxits.
   'details'    No value needed. This causes idx, netsim,
                dpsim and expref to be stored after each
                iteration.
   'nonoise'    No value needed. Degenerate input similarities
                (eg, where the similarity of i to k equals the
                similarity of k to i) can prevent convergence.
                To avoid this, APCLUSTER adds a small amount
                of noise to the input similarities. This flag
                turns off the addition of noise.

 Copyright (c) Brendan J. Frey and Delbert Dueck (2006). This
 software may be freely used and distributed for
 non-commercial purposes.

0001 % This program has been downloaded from
0002 % http://www.psi.toronto.edu/affinitypropagation/apcluster.m
0003 % It is the Matlab implementation of affinity propagation
0004 % Mateda uses in different situations in which clustering of data is needed
0005 % See the web page for conditions of use and appropriate references
0006 %
0007 % [idx,netsim,dpsim,expref]=apcluster(s,p)
0008 %
0009 % APCLUSTER uses affinity propagation (Frey and Dueck, Science,
0010 % 2007) to identify data clusters, using a set of real-valued
0011 % pair-wise data point similarities as input. Each cluster is
0012 % represented by a data point called a cluster center, and the
0013 % method searches for clusters so as to maximize a fitness
0014 % function called net similarity. The method is iterative and
0015 % stops after maxits iterations (default of 500 - see below for
0016 % how to change this value) or when the cluster centers stay
0017 % constant for convits iterations (default of 50). The command
0018 % apcluster(s,p,'plot') can be used to plot the net similarity
0019 % during operation of the algorithm.
0020 %
0021 % For N data points, there may be as many as N^2-N pair-wise
0022 % similarities (note that the similarity of data point i to k
0023 % need not be equal to the similarity of data point k to i).
0024 % These may be passed to APCLUSTER in an NxN matrix s, where
0025 % s(i,k) is the similarity of point i to point k. In fact, only
0026 % a smaller number of relevant similarities are needed for
0027 % APCLUSTER to work. If only M similarity values are known,
0028 % where M < N^2-N, they can be passed to APCLUSTER in an Mx3
0029 % matrix s, where each row of s contains a pair of data point
0030 % indices and a corresponding similarity value: s(j,3) is the
0031 % similarity of data point s(j,1) to data point s(j,2).
0032 %
0033 % APCLUSTER automatically determines the number of clusters,
0034 % based on the input p, which is an Nx1 matrix of real numbers
0035 % called preferences. p(i) indicates the preference that data
0036 % point i be chosen as a cluster center. A good choice is to
0037 % set all preference values to the median of the similarity
0038 % values. The number of identified clusters can be increased or
0039 % decreased  by changing this value accordingly. If p is a
0040 % scalar, APCLUSTER assumes all preferences are equal to p.
0041 %
0042 % The fitness function (net similarity) used to search for
0043 % solutions equals the sum of the preferences of the the data
0044 % centers plus the sum of the similarities of the other data
0045 % points to their data centers.
0046 %
0047 % The identified cluster centers and the assignments of other
0048 % data points to these centers are returned in idx. idx(j) is
0049 % the index of the data point that is the cluster center for
0050 % data point j. If idx(j) equals j, then point j is itself a
0051 % cluster center. The sum of the similarities of the data
0052 % points to their cluster centers is returned in dpsim, the
0053 % sum of the preferences of the identified cluster centers is
0054 % returned in expref and the net similarity (sum of the data
0055 % point similarities and preferences) is returned in netsim.
0056 %
0057 % EXAMPLE
0058 %
0059 % N=100; x=rand(N,2); % Create N, 2-D data points
0060 % M=N*N-N; s=zeros(M,3); % Make ALL N^2-N similarities
0061 % j=1;
0062 % for i=1:N
0063 %   for k=[1:i-1,i+1:N]
0064 %     s(j,1)=i; s(j,2)=k; s(j,3)=-sum((x(i,:)-x(k,:)).^2);
0065 %     j=j+1;
0066 %   end;
0067 % end;
0068 % p=median(s(:,3)); % Set preference to median similarity
0069 % [idx,netsim,dpsim,expref]=apcluster(s,p,'plot');
0070 % fprintf('Number of clusters: %d\n',length(unique(idx)));
0071 % fprintf('Fitness (net similarity): %f\n',netsim);
0072 % figure; % Make a figures showing the data and the clusters
0073 % for i=unique(idx)'
0074 %   ii=find(idx==i); h=plot(x(ii,1),x(ii,2),'o'); hold on;
0075 %   col=rand(1,3); set(h,'Color',col,'MarkerFaceColor',col);
0076 %   xi1=x(i,1)*ones(size(ii)); xi2=x(i,2)*ones(size(ii));
0077 %   line([x(ii,1),xi1]',[x(ii,2),xi2]','Color',col);
0078 % end;
0079 % axis equal tight;
0080 %
0081 % PARAMETERS
0082 %
0083 % [idx,netsim,dpsim,expref]=apcluster(s,p,'NAME',VALUE,...)
0084 %
0085 % The following parameters can be set by providing name-value
0086 % pairs, eg, apcluster(s,p,'maxits',1000):
0087 %
0088 %   Parameter    Value
0089 %   'sparse'     No value needed. Use when the number of data
0090 %                points is large (eg, >3000). Normally,
0091 %                APCLUSTER passes messages between every pair
0092 %                of data points. This flag causes APCLUSTER
0093 %                to pass messages between pairs of points only
0094 %                if their input similarity is provided and
0095 %                is not equal to -Inf.
0096 %   'maxits'     Any positive integer. This specifies the
0097 %                maximum number of iterations performed by
0098 %                affinity propagation. Default: 500.
0099 %   'convits'    Any positive integer. APCLUSTER decides that
0100 %                the algorithm has converged if the estimated
0101 %                cluster centers stay fixed for convits
0102 %                iterations. Increase this value to apply a
0103 %                more stringent convergence test. Default: 50.
0104 %   'dampfact'   A real number that is less than 1 and
0105 %                greater than or equal to 0.5. This sets the
0106 %                damping level of the message-passing method,
0107 %                where values close to 1 correspond to heavy
0108 %                damping which may be needed if oscillations
0109 %                occur.
0110 %   'plot'       No value needed. This creates a figure that
0111 %                plots the net similarity after each iteration
0112 %                of the method. If the net similarity fails to
0113 %                converge, consider increasing the values of
0114 %                dampfact and maxits.
0115 %   'details'    No value needed. This causes idx, netsim,
0116 %                dpsim and expref to be stored after each
0117 %                iteration.
0118 %   'nonoise'    No value needed. Degenerate input similarities
0119 %                (eg, where the similarity of i to k equals the
0120 %                similarity of k to i) can prevent convergence.
0121 %                To avoid this, APCLUSTER adds a small amount
0122 %                of noise to the input similarities. This flag
0123 %                turns off the addition of noise.
0124 %
0125 % Copyright (c) Brendan J. Frey and Delbert Dueck (2006). This
0126 % software may be freely used and distributed for
0127 % non-commercial purposes.
0128 
0129 function [idx,netsim,dpsim,expref,unconverged]=apcluster(s,p,varargin);
0130 
0131 % Handle arguments to function
0132 if nargin<2 error('Too few input arguments');
0133 else
0134     maxits=500; convits=50; lam=0.5; plt=0; details=0; nonoise=0;
0135     i=1;
0136     while i<=length(varargin)
0137         if strcmp(varargin{i},'plot')
0138             plt=1; i=i+1;
0139         elseif strcmp(varargin{i},'details')
0140             details=1; i=i+1;
0141         elseif strcmp(varargin{i},'sparse')
0142             [idx,netsim,dpsim,expref]=apcluster_sparse(s,p,varargin{:});
0143             return;
0144         elseif strcmp(varargin{i},'nonoise')
0145             nonoise=1; i=i+1;
0146         elseif strcmp(varargin{i},'maxits')
0147             maxits=varargin{i+1};
0148             i=i+2;
0149             if maxits<=0 error('maxits must be a positive integer'); end;
0150         elseif strcmp(varargin{i},'convits')
0151             convits=varargin{i+1};
0152             i=i+2;
0153             if convits<=0 error('convits must be a positive integer'); end;
0154         elseif strcmp(varargin{i},'dampfact')
0155             lam=varargin{i+1};
0156             i=i+2;
0157             if (lam<0.5)||(lam>=1)
0158                 error('dampfact must be >= 0.5 and < 1');
0159             end;
0160         else i=i+1;
0161         end;
0162     end;
0163 end;
0164 if lam>0.9
0165     fprintf('\n*** Warning: Large damping factor in use. Turn on plotting\n');
0166     fprintf('    to monitor the net similarity. The algorithm will\n');
0167     fprintf('    change decisions slowly, so consider using a larger value\n');
0168     fprintf('    of convits.\n\n');
0169 end;
0170 
0171 % Check that standard arguments are consistent in size
0172 if length(size(s))~=2 error('s should be a 2D matrix');
0173 elseif length(size(p))>2 error('p should be a vector or a scalar');
0174 elseif size(s,2)==3
0175     tmp=max(max(s(:,1)),max(s(:,2)));
0176     if length(p)==1 N=tmp; else N=length(p); end;
0177     if tmp>N
0178         error('data point index exceeds number of data points');
0179     elseif min(min(s(:,1)),min(s(:,2)))<=0
0180         error('data point indices must be >= 1');
0181     end;
0182 elseif size(s,1)==size(s,2)
0183     N=size(s,1);
0184     if (length(p)~=N)&&(length(p)~=1)
0185         error('p should be scalar or a vector of size N');
0186     end;
0187 else error('s must have 3 columns or be square'); end;
0188 
0189 % Construct similarity matrix
0190 if N>3000
0191     fprintf('\n*** Warning: Large memory request. Consider activating\n');
0192     fprintf('    the sparse version of APCLUSTER.\n\n');
0193 end;
0194 if size(s,2)==3
0195     S=-Inf*ones(N,N); 
0196     for j=1:size(s,1) S(s(j,1),s(j,2))=s(j,3); end;
0197 else S=s;
0198 end;
0199 
0200 % In case user did not remove degeneracies from the input similarities,
0201 % avoid degenerate solutions by adding a small amount of noise to the
0202 % input similarities
0203 if ~nonoise
0204     rns=randn('state'); randn('state',0);
0205     S=S+(eps*S+realmin*100).*rand(N,N);
0206     randn('state',rns);
0207 end;
0208 
0209 % Place preferences on the diagonal of S
0210 if length(p)==1 for i=1:N S(i,i)=p; end;
0211 else for i=1:N S(i,i)=p(i); end;
0212 end;
0213 
0214 % Allocate space for messages, etc
0215 dS=diag(S); A=zeros(N,N); R=zeros(N,N); t=1;
0216 if plt netsim=zeros(1,maxits+1); end;
0217 if details
0218     idx=zeros(N,maxits+1);
0219     netsim=zeros(1,maxits+1); 
0220     dpsim=zeros(1,maxits+1); 
0221     expref=zeros(1,maxits+1); 
0222 end;
0223 
0224 % Execute parallel affinity propagation updates
0225 e=zeros(N,convits); dn=0; i=0;
0226 while ~dn
0227     i=i+1; 
0228 
0229     % Compute responsibilities
0230     Rold=R;
0231     AS=A+S; [Y,I]=max(AS,[],2); for k=1:N AS(k,I(k))=-realmax; end;
0232     [Y2,I2]=max(AS,[],2);
0233     R=S-repmat(Y,[1,N]);
0234     for k=1:N R(k,I(k))=S(k,I(k))-Y2(k); end;
0235     R=(1-lam)*R+lam*Rold; % Damping
0236 
0237     % Compute availabilities
0238     Aold=A;
0239     Rp=max(R,0);
0240     for k=1:N Rp(k,k)=R(k,k); end;
0241     A=repmat(sum(Rp,1),[N,1])-Rp;
0242     dA=diag(A); A=min(A,0); for k=1:N A(k,k)=dA(k); end;
0243     A=(1-lam)*A+lam*Aold; % Damping
0244 
0245     % Check for convergence
0246     E=((diag(A)+diag(R))>0); e(:,mod(i-1,convits)+1)=E; K=sum(E);
0247     if i>=convits || i>=maxits
0248         se=sum(e,2);
0249         unconverged=(sum((se==convits)+(se==0))~=N);
0250         if (~unconverged&&(K>0))||(i==maxits) dn=1; end;
0251     end;
0252 
0253     % Handle plotting and storage of details, if requested
0254     if plt||details
0255         if K==0
0256             tmpnetsim=nan; tmpdpsim=nan; tmpexpref=nan; tmpidx=nan;
0257         else
0258             I=find(E); [tmp c]=max(S(:,I),[],2); c(I)=1:K; tmpidx=I(c);
0259             tmpnetsim=sum(S((tmpidx-1)*N+[1:N]'));
0260             tmpexpref=sum(dS(I)); tmpdpsim=tmpnetsim-tmpexpref;
0261         end;
0262     end;
0263     if details
0264         netsim(i)=tmpnetsim; dpsim(i)=tmpdpsim; expref(i)=tmpexpref;
0265         idx(:,i)=tmpidx;
0266     end;
0267     if plt
0268         netsim(i)=tmpnetsim;
0269         figure(234); 
0270         tmp=1:i; tmpi=find(~isnan(netsim(1:i)));
0271         plot(tmp(tmpi),netsim(tmpi),'r-');
0272         xlabel('# Iterations');
0273         ylabel('Fitness (net similarity) of quantized intermediate solution');
0274         drawnow; 
0275     end;
0276 end;
0277 I=find(diag(A+R)>0); K=length(I); % Identify exemplars
0278 if K>0
0279     [tmp c]=max(S(:,I),[],2); c(I)=1:K; % Identify clusters
0280     % Refine the final set of exemplars and clusters and return results
0281     for k=1:K ii=find(c==k); [y j]=max(sum(S(ii,ii),1)); I(k)=ii(j(1)); end;
0282     [tmp c]=max(S(:,I),[],2); c(I)=1:K; tmpidx=I(c);
0283     tmpnetsim=sum(S((tmpidx-1)*N+[1:N]')); tmpexpref=sum(dS(I));
0284 else
0285     tmpidx=nan*ones(N,1); tmpnetsim=nan; tmpexpref=nan;
0286 end;
0287 if details
0288     netsim(i+1)=tmpnetsim; netsim=netsim(1:i+1);
0289     dpsim(i+1)=tmpnetsim-tmpexpref; dpsim=dpsim(1:i+1);
0290     expref(i+1)=tmpexpref; expref=expref(1:i+1);
0291     idx(:,i+1)=tmpidx; idx=idx(:,1:i+1);
0292 else
0293     netsim=tmpnetsim; dpsim=tmpnetsim-tmpexpref;
0294     expref=tmpexpref; idx=tmpidx;
0295 end;
0296 if plt||details
0297     fprintf('\nNumber of identified clusters: %d\n',K);
0298     fprintf('Fitness (net similarity): %f\n',tmpnetsim);
0299     fprintf('  Similarities of data points to exemplars: %f\n',dpsim(end));
0300     fprintf('  Preferences of selected exemplars: %f\n',tmpexpref);
0301     fprintf('Number of iterations: %d\n\n',i);
0302 end;
0303 %if unconverged
0304 %    fprintf('\n*** Warning: Algorithm did not converge. The similarities\n');
0305 %    fprintf('    may contain degeneracies - add noise to the similarities\n');
0306 %    fprintf('    to remove degeneracies. To monitor the net similarity,\n');
0307 %    fprintf('    activate plotting. Also, consider increasing maxits and\n');
0308 %    fprintf('    if necessary dampfact.\n\n');
0309 %end;
0310

apcluster

PURPOSE

SYNOPSIS

DESCRIPTION

CROSS-REFERENCE INFORMATION

SOURCE CODE