Mittwoch, 19. August 2015

K-means Clustering


       Clustering can be defined as the grouping of data points based on some commonality or similarity between the points.

                                                                              
         One of the simplest methods is K-means clustering. In this method, the number of clusters is initialized and the center of each of the cluster is randomly chosen. The Euclidean distance between each data point and all the center of the clusters is computed and based on the minimum distance each data point is assigned to certain cluster. The new center for the cluster is defined and the Euclidean distance is calculated. This procedure iterates till convergence is reached.
Let’s see how to generate some random data points and some random cluster points

MATLAB CODE:

%Generate sample data points and assign random centre for each cluster
%Number of data points
sz=100;
sz1=250;

X = random('unid',sz1,[sz 1]); %Value
Xp = random('unid',sz1,[sz 1]); %Position

%Number of clusters
c=6;
V=random('unid',sz1,[c 1]); %Value
Vp=random('unid',sz1,[c 1]); %Position


figure,plot(Xp,X,'*',Vp,V,'r+');title('Data points and the initial Cluster centers');


Explanation:
100 data points are generated and the number of clusters are assumed to be 6 and 6 random cluster points are generated.

Group the data points:
MATLAB CODE:

%Preallocate the vectors
V1=zeros(size(V));
Vp1=zeros(size(Vp));
flag=1;
while flag==1
%Find the euclidean distance between the data points and all the center of
%the clusters
J=sqrt(abs(repmat(Xp,[1 c])-repmat(Vp,[1 sz])').^2+abs(repmat(X,[1 c])-repmat(V,[1 sz])').^2);

%Find the minimum distance between the data point and the cluster
[mv,Gpos]=min(J,[],2);
CGroup=zeros([sz c]);
colr=colormap(jet(c));
figure(3),
for i = 1:c
    Temp = find(Gpos==i);
    CGroup(1:numel(Temp),i)=Temp;
   
    V1(i,:)=mean(X(Temp));
    Vp1(i,:)=mean(Xp(Temp));
    Pos=ones(numel(Temp)*2,1)*Vp1(i);
    Pos(2:2:end)=Xp(Temp);
    Value=ones(numel(Temp)*2,1)*V1(i);
    Value(2:2:end)=X(Temp);
    %Define the new centre for each cluster
    plot(Pos,Value,'Color',colr(i,:),'LineStyle','-','Marker','o');hold on;
    plot(Vp1(i),V1(i),'k+');
end
hold off;
Diffv=abs(V-V1);
DiffVp=abs(Vp-Vp1);

%Iterate the process till there is no change in the cluster position
if(Diffv < 1)
    flag=0;
else
    V=V1;
    Vp=Vp1;
end
end

figure,plot(Xp,X,'*',Vp,V,'g+');title('Data points and the Final Cluster centers');


Fig. The new position(red circle) of the clusters after the final iteration.

Explanation:


In the above figure, the data points are represented in blue color stars and the cluster centers are represented in red color cross shape.

Let’s consider one particular data point and all the cluster centers.



Data point position X = 13, Y = 20
Cluster 1 position  X = 8,  Y = 19
Cluster 2 position  X = 13, Y = 15
Step 1: Find the Euclidean Distance:
Find the Euclidean distance(D1) between data point and the cluster 1 similarly, find the Euclidean distance(D2) between data point and the cluster 2

Distance D1 = sqrt((13-8).^2+(20-19).^2)) = 5.0990
Distance D2 = sqrt((13-13).^2+(20-15).^2))= 5.0000

Step 2: Find the minimum and assign the data point to a cluster

Now the minimum distance among the two results is for the cluster 2.
So the data point with (X,Y)=(13,20) is assigned to the cluster/group 2.

Step 3: Perform the step 1 and step 2 for all the data points and assign group accordingly.

Step 4: Assign a new position for the clusters based on the clustering done.
       
        Find the average position of the newly assigned data points to a particular cluster and use that average as the new position for the cluster.       

Step 5: Iterate this procedure till the position of the clusters are unchanged.

Number of clusters = 3

 Number of clusters = 6

Number of clusters = 15


Keine Kommentare:

Kommentar veröffentlichen