This program has been downloaded from http://www.psi.toronto.edu/affinitypropagation/apcluster.m It is the Matlab implementation of affinity propagation Mateda uses in different situations in which clustering of data is needed See the web page for conditions of use and appropriate references [idx,netsim,dpsim,expref]=apcluster(s,p) APCLUSTER uses affinity propagation (Frey and Dueck, Science, 2007) to identify data clusters, using a set of real-valued pair-wise data point similarities as input. Each cluster is represented by a data point called a cluster center, and the method searches for clusters so as to maximize a fitness function called net similarity. The method is iterative and stops after maxits iterations (default of 500 - see below for how to change this value) or when the cluster centers stay constant for convits iterations (default of 50). The command apcluster(s,p,'plot') can be used to plot the net similarity during operation of the algorithm. For N data points, there may be as many as N^2-N pair-wise similarities (note that the similarity of data point i to k need not be equal to the similarity of data point k to i). These may be passed to APCLUSTER in an NxN matrix s, where s(i,k) is the similarity of point i to point k. In fact, only a smaller number of relevant similarities are needed for APCLUSTER to work. If only M similarity values are known, where M < N^2-N, they can be passed to APCLUSTER in an Mx3 matrix s, where each row of s contains a pair of data point indices and a corresponding similarity value: s(j,3) is the similarity of data point s(j,1) to data point s(j,2). APCLUSTER automatically determines the number of clusters, based on the input p, which is an Nx1 matrix of real numbers called preferences. p(i) indicates the preference that data point i be chosen as a cluster center. A good choice is to set all preference values to the median of the similarity values. The number of identified clusters can be increased or decreased by changing this value accordingly. If p is a scalar, APCLUSTER assumes all preferences are equal to p. The fitness function (net similarity) used to search for solutions equals the sum of the preferences of the the data centers plus the sum of the similarities of the other data points to their data centers. The identified cluster centers and the assignments of other data points to these centers are returned in idx. idx(j) is the index of the data point that is the cluster center for data point j. If idx(j) equals j, then point j is itself a cluster center. The sum of the similarities of the data points to their cluster centers is returned in dpsim, the sum of the preferences of the identified cluster centers is returned in expref and the net similarity (sum of the data point similarities and preferences) is returned in netsim. EXAMPLE N=100; x=rand(N,2); % Create N, 2-D data points M=N*N-N; s=zeros(M,3); % Make ALL N^2-N similarities j=1; for i=1:N for k=[1:i-1,i+1:N] s(j,1)=i; s(j,2)=k; s(j,3)=-sum((x(i,:)-x(k,:)).^2); j=j+1; end; end; p=median(s(:,3)); % Set preference to median similarity [idx,netsim,dpsim,expref]=apcluster(s,p,'plot'); fprintf('Number of clusters: %d\n',length(unique(idx))); fprintf('Fitness (net similarity): %f\n',netsim); figure; % Make a figures showing the data and the clusters for i=unique(idx)' ii=find(idx==i); h=plot(x(ii,1),x(ii,2),'o'); hold on; col=rand(1,3); set(h,'Color',col,'MarkerFaceColor',col); xi1=x(i,1)*ones(size(ii)); xi2=x(i,2)*ones(size(ii)); line([x(ii,1),xi1]',[x(ii,2),xi2]','Color',col); end; axis equal tight; PARAMETERS [idx,netsim,dpsim,expref]=apcluster(s,p,'NAME',VALUE,...) The following parameters can be set by providing name-value pairs, eg, apcluster(s,p,'maxits',1000): Parameter Value 'sparse' No value needed. Use when the number of data points is large (eg, >3000). Normally, APCLUSTER passes messages between every pair of data points. This flag causes APCLUSTER to pass messages between pairs of points only if their input similarity is provided and is not equal to -Inf. 'maxits' Any positive integer. This specifies the maximum number of iterations performed by affinity propagation. Default: 500. 'convits' Any positive integer. APCLUSTER decides that the algorithm has converged if the estimated cluster centers stay fixed for convits iterations. Increase this value to apply a more stringent convergence test. Default: 50. 'dampfact' A real number that is less than 1 and greater than or equal to 0.5. This sets the damping level of the message-passing method, where values close to 1 correspond to heavy damping which may be needed if oscillations occur. 'plot' No value needed. This creates a figure that plots the net similarity after each iteration of the method. If the net similarity fails to converge, consider increasing the values of dampfact and maxits. 'details' No value needed. This causes idx, netsim, dpsim and expref to be stored after each iteration. 'nonoise' No value needed. Degenerate input similarities (eg, where the similarity of i to k equals the similarity of k to i) can prevent convergence. To avoid this, APCLUSTER adds a small amount of noise to the input similarities. This flag turns off the addition of noise. Copyright (c) Brendan J. Frey and Delbert Dueck (2006). This software may be freely used and distributed for non-commercial purposes.
0001 % This program has been downloaded from 0002 % http://www.psi.toronto.edu/affinitypropagation/apcluster.m 0003 % It is the Matlab implementation of affinity propagation 0004 % Mateda uses in different situations in which clustering of data is needed 0005 % See the web page for conditions of use and appropriate references 0006 % 0007 % [idx,netsim,dpsim,expref]=apcluster(s,p) 0008 % 0009 % APCLUSTER uses affinity propagation (Frey and Dueck, Science, 0010 % 2007) to identify data clusters, using a set of real-valued 0011 % pair-wise data point similarities as input. Each cluster is 0012 % represented by a data point called a cluster center, and the 0013 % method searches for clusters so as to maximize a fitness 0014 % function called net similarity. The method is iterative and 0015 % stops after maxits iterations (default of 500 - see below for 0016 % how to change this value) or when the cluster centers stay 0017 % constant for convits iterations (default of 50). The command 0018 % apcluster(s,p,'plot') can be used to plot the net similarity 0019 % during operation of the algorithm. 0020 % 0021 % For N data points, there may be as many as N^2-N pair-wise 0022 % similarities (note that the similarity of data point i to k 0023 % need not be equal to the similarity of data point k to i). 0024 % These may be passed to APCLUSTER in an NxN matrix s, where 0025 % s(i,k) is the similarity of point i to point k. In fact, only 0026 % a smaller number of relevant similarities are needed for 0027 % APCLUSTER to work. If only M similarity values are known, 0028 % where M < N^2-N, they can be passed to APCLUSTER in an Mx3 0029 % matrix s, where each row of s contains a pair of data point 0030 % indices and a corresponding similarity value: s(j,3) is the 0031 % similarity of data point s(j,1) to data point s(j,2). 0032 % 0033 % APCLUSTER automatically determines the number of clusters, 0034 % based on the input p, which is an Nx1 matrix of real numbers 0035 % called preferences. p(i) indicates the preference that data 0036 % point i be chosen as a cluster center. A good choice is to 0037 % set all preference values to the median of the similarity 0038 % values. The number of identified clusters can be increased or 0039 % decreased by changing this value accordingly. If p is a 0040 % scalar, APCLUSTER assumes all preferences are equal to p. 0041 % 0042 % The fitness function (net similarity) used to search for 0043 % solutions equals the sum of the preferences of the the data 0044 % centers plus the sum of the similarities of the other data 0045 % points to their data centers. 0046 % 0047 % The identified cluster centers and the assignments of other 0048 % data points to these centers are returned in idx. idx(j) is 0049 % the index of the data point that is the cluster center for 0050 % data point j. If idx(j) equals j, then point j is itself a 0051 % cluster center. The sum of the similarities of the data 0052 % points to their cluster centers is returned in dpsim, the 0053 % sum of the preferences of the identified cluster centers is 0054 % returned in expref and the net similarity (sum of the data 0055 % point similarities and preferences) is returned in netsim. 0056 % 0057 % EXAMPLE 0058 % 0059 % N=100; x=rand(N,2); % Create N, 2-D data points 0060 % M=N*N-N; s=zeros(M,3); % Make ALL N^2-N similarities 0061 % j=1; 0062 % for i=1:N 0063 % for k=[1:i-1,i+1:N] 0064 % s(j,1)=i; s(j,2)=k; s(j,3)=-sum((x(i,:)-x(k,:)).^2); 0065 % j=j+1; 0066 % end; 0067 % end; 0068 % p=median(s(:,3)); % Set preference to median similarity 0069 % [idx,netsim,dpsim,expref]=apcluster(s,p,'plot'); 0070 % fprintf('Number of clusters: %d\n',length(unique(idx))); 0071 % fprintf('Fitness (net similarity): %f\n',netsim); 0072 % figure; % Make a figures showing the data and the clusters 0073 % for i=unique(idx)' 0074 % ii=find(idx==i); h=plot(x(ii,1),x(ii,2),'o'); hold on; 0075 % col=rand(1,3); set(h,'Color',col,'MarkerFaceColor',col); 0076 % xi1=x(i,1)*ones(size(ii)); xi2=x(i,2)*ones(size(ii)); 0077 % line([x(ii,1),xi1]',[x(ii,2),xi2]','Color',col); 0078 % end; 0079 % axis equal tight; 0080 % 0081 % PARAMETERS 0082 % 0083 % [idx,netsim,dpsim,expref]=apcluster(s,p,'NAME',VALUE,...) 0084 % 0085 % The following parameters can be set by providing name-value 0086 % pairs, eg, apcluster(s,p,'maxits',1000): 0087 % 0088 % Parameter Value 0089 % 'sparse' No value needed. Use when the number of data 0090 % points is large (eg, >3000). Normally, 0091 % APCLUSTER passes messages between every pair 0092 % of data points. This flag causes APCLUSTER 0093 % to pass messages between pairs of points only 0094 % if their input similarity is provided and 0095 % is not equal to -Inf. 0096 % 'maxits' Any positive integer. This specifies the 0097 % maximum number of iterations performed by 0098 % affinity propagation. Default: 500. 0099 % 'convits' Any positive integer. APCLUSTER decides that 0100 % the algorithm has converged if the estimated 0101 % cluster centers stay fixed for convits 0102 % iterations. Increase this value to apply a 0103 % more stringent convergence test. Default: 50. 0104 % 'dampfact' A real number that is less than 1 and 0105 % greater than or equal to 0.5. This sets the 0106 % damping level of the message-passing method, 0107 % where values close to 1 correspond to heavy 0108 % damping which may be needed if oscillations 0109 % occur. 0110 % 'plot' No value needed. This creates a figure that 0111 % plots the net similarity after each iteration 0112 % of the method. If the net similarity fails to 0113 % converge, consider increasing the values of 0114 % dampfact and maxits. 0115 % 'details' No value needed. This causes idx, netsim, 0116 % dpsim and expref to be stored after each 0117 % iteration. 0118 % 'nonoise' No value needed. Degenerate input similarities 0119 % (eg, where the similarity of i to k equals the 0120 % similarity of k to i) can prevent convergence. 0121 % To avoid this, APCLUSTER adds a small amount 0122 % of noise to the input similarities. This flag 0123 % turns off the addition of noise. 0124 % 0125 % Copyright (c) Brendan J. Frey and Delbert Dueck (2006). This 0126 % software may be freely used and distributed for 0127 % non-commercial purposes. 0128 0129 function [idx,netsim,dpsim,expref,unconverged]=apcluster(s,p,varargin); 0130 0131 % Handle arguments to function 0132 if nargin<2 error('Too few input arguments'); 0133 else 0134 maxits=500; convits=50; lam=0.5; plt=0; details=0; nonoise=0; 0135 i=1; 0136 while i<=length(varargin) 0137 if strcmp(varargin{i},'plot') 0138 plt=1; i=i+1; 0139 elseif strcmp(varargin{i},'details') 0140 details=1; i=i+1; 0141 elseif strcmp(varargin{i},'sparse') 0142 [idx,netsim,dpsim,expref]=apcluster_sparse(s,p,varargin{:}); 0143 return; 0144 elseif strcmp(varargin{i},'nonoise') 0145 nonoise=1; i=i+1; 0146 elseif strcmp(varargin{i},'maxits') 0147 maxits=varargin{i+1}; 0148 i=i+2; 0149 if maxits<=0 error('maxits must be a positive integer'); end; 0150 elseif strcmp(varargin{i},'convits') 0151 convits=varargin{i+1}; 0152 i=i+2; 0153 if convits<=0 error('convits must be a positive integer'); end; 0154 elseif strcmp(varargin{i},'dampfact') 0155 lam=varargin{i+1}; 0156 i=i+2; 0157 if (lam<0.5)||(lam>=1) 0158 error('dampfact must be >= 0.5 and < 1'); 0159 end; 0160 else i=i+1; 0161 end; 0162 end; 0163 end; 0164 if lam>0.9 0165 fprintf('\n*** Warning: Large damping factor in use. Turn on plotting\n'); 0166 fprintf(' to monitor the net similarity. The algorithm will\n'); 0167 fprintf(' change decisions slowly, so consider using a larger value\n'); 0168 fprintf(' of convits.\n\n'); 0169 end; 0170 0171 % Check that standard arguments are consistent in size 0172 if length(size(s))~=2 error('s should be a 2D matrix'); 0173 elseif length(size(p))>2 error('p should be a vector or a scalar'); 0174 elseif size(s,2)==3 0175 tmp=max(max(s(:,1)),max(s(:,2))); 0176 if length(p)==1 N=tmp; else N=length(p); end; 0177 if tmp>N 0178 error('data point index exceeds number of data points'); 0179 elseif min(min(s(:,1)),min(s(:,2)))<=0 0180 error('data point indices must be >= 1'); 0181 end; 0182 elseif size(s,1)==size(s,2) 0183 N=size(s,1); 0184 if (length(p)~=N)&&(length(p)~=1) 0185 error('p should be scalar or a vector of size N'); 0186 end; 0187 else error('s must have 3 columns or be square'); end; 0188 0189 % Construct similarity matrix 0190 if N>3000 0191 fprintf('\n*** Warning: Large memory request. Consider activating\n'); 0192 fprintf(' the sparse version of APCLUSTER.\n\n'); 0193 end; 0194 if size(s,2)==3 0195 S=-Inf*ones(N,N); 0196 for j=1:size(s,1) S(s(j,1),s(j,2))=s(j,3); end; 0197 else S=s; 0198 end; 0199 0200 % In case user did not remove degeneracies from the input similarities, 0201 % avoid degenerate solutions by adding a small amount of noise to the 0202 % input similarities 0203 if ~nonoise 0204 rns=randn('state'); randn('state',0); 0205 S=S+(eps*S+realmin*100).*rand(N,N); 0206 randn('state',rns); 0207 end; 0208 0209 % Place preferences on the diagonal of S 0210 if length(p)==1 for i=1:N S(i,i)=p; end; 0211 else for i=1:N S(i,i)=p(i); end; 0212 end; 0213 0214 % Allocate space for messages, etc 0215 dS=diag(S); A=zeros(N,N); R=zeros(N,N); t=1; 0216 if plt netsim=zeros(1,maxits+1); end; 0217 if details 0218 idx=zeros(N,maxits+1); 0219 netsim=zeros(1,maxits+1); 0220 dpsim=zeros(1,maxits+1); 0221 expref=zeros(1,maxits+1); 0222 end; 0223 0224 % Execute parallel affinity propagation updates 0225 e=zeros(N,convits); dn=0; i=0; 0226 while ~dn 0227 i=i+1; 0228 0229 % Compute responsibilities 0230 Rold=R; 0231 AS=A+S; [Y,I]=max(AS,[],2); for k=1:N AS(k,I(k))=-realmax; end; 0232 [Y2,I2]=max(AS,[],2); 0233 R=S-repmat(Y,[1,N]); 0234 for k=1:N R(k,I(k))=S(k,I(k))-Y2(k); end; 0235 R=(1-lam)*R+lam*Rold; % Damping 0236 0237 % Compute availabilities 0238 Aold=A; 0239 Rp=max(R,0); 0240 for k=1:N Rp(k,k)=R(k,k); end; 0241 A=repmat(sum(Rp,1),[N,1])-Rp; 0242 dA=diag(A); A=min(A,0); for k=1:N A(k,k)=dA(k); end; 0243 A=(1-lam)*A+lam*Aold; % Damping 0244 0245 % Check for convergence 0246 E=((diag(A)+diag(R))>0); e(:,mod(i-1,convits)+1)=E; K=sum(E); 0247 if i>=convits || i>=maxits 0248 se=sum(e,2); 0249 unconverged=(sum((se==convits)+(se==0))~=N); 0250 if (~unconverged&&(K>0))||(i==maxits) dn=1; end; 0251 end; 0252 0253 % Handle plotting and storage of details, if requested 0254 if plt||details 0255 if K==0 0256 tmpnetsim=nan; tmpdpsim=nan; tmpexpref=nan; tmpidx=nan; 0257 else 0258 I=find(E); [tmp c]=max(S(:,I),[],2); c(I)=1:K; tmpidx=I(c); 0259 tmpnetsim=sum(S((tmpidx-1)*N+[1:N]')); 0260 tmpexpref=sum(dS(I)); tmpdpsim=tmpnetsim-tmpexpref; 0261 end; 0262 end; 0263 if details 0264 netsim(i)=tmpnetsim; dpsim(i)=tmpdpsim; expref(i)=tmpexpref; 0265 idx(:,i)=tmpidx; 0266 end; 0267 if plt 0268 netsim(i)=tmpnetsim; 0269 figure(234); 0270 tmp=1:i; tmpi=find(~isnan(netsim(1:i))); 0271 plot(tmp(tmpi),netsim(tmpi),'r-'); 0272 xlabel('# Iterations'); 0273 ylabel('Fitness (net similarity) of quantized intermediate solution'); 0274 drawnow; 0275 end; 0276 end; 0277 I=find(diag(A+R)>0); K=length(I); % Identify exemplars 0278 if K>0 0279 [tmp c]=max(S(:,I),[],2); c(I)=1:K; % Identify clusters 0280 % Refine the final set of exemplars and clusters and return results 0281 for k=1:K ii=find(c==k); [y j]=max(sum(S(ii,ii),1)); I(k)=ii(j(1)); end; 0282 [tmp c]=max(S(:,I),[],2); c(I)=1:K; tmpidx=I(c); 0283 tmpnetsim=sum(S((tmpidx-1)*N+[1:N]')); tmpexpref=sum(dS(I)); 0284 else 0285 tmpidx=nan*ones(N,1); tmpnetsim=nan; tmpexpref=nan; 0286 end; 0287 if details 0288 netsim(i+1)=tmpnetsim; netsim=netsim(1:i+1); 0289 dpsim(i+1)=tmpnetsim-tmpexpref; dpsim=dpsim(1:i+1); 0290 expref(i+1)=tmpexpref; expref=expref(1:i+1); 0291 idx(:,i+1)=tmpidx; idx=idx(:,1:i+1); 0292 else 0293 netsim=tmpnetsim; dpsim=tmpnetsim-tmpexpref; 0294 expref=tmpexpref; idx=tmpidx; 0295 end; 0296 if plt||details 0297 fprintf('\nNumber of identified clusters: %d\n',K); 0298 fprintf('Fitness (net similarity): %f\n',tmpnetsim); 0299 fprintf(' Similarities of data points to exemplars: %f\n',dpsim(end)); 0300 fprintf(' Preferences of selected exemplars: %f\n',tmpexpref); 0301 fprintf('Number of iterations: %d\n\n',i); 0302 end; 0303 %if unconverged 0304 % fprintf('\n*** Warning: Algorithm did not converge. The similarities\n'); 0305 % fprintf(' may contain degeneracies - add noise to the similarities\n'); 0306 % fprintf(' to remove degeneracies. To monitor the net similarity,\n'); 0307 % fprintf(' activate plotting. Also, consider increasing maxits and\n'); 0308 % fprintf(' if necessary dampfact.\n\n'); 0309 %end; 0310