I have several distributions (10 distributions in the figure below).
They are in fact histograms: the x-axis has 70 values, which are the sizes of certain particles in a solution, and for each x value the corresponding y value is the proportion of particles whose size is close to x.
I would like to cluster these distributions. At the moment I use hierarchical clustering with, for example, the Euclidean distance. I am not satisfied with the choice of distance. I tried an information-theoretic distance such as Kullback-Leibler, but the data contain many zeros, which causes difficulties. Can you suggest an appropriate distance and/or another clustering method?
If your data are histograms, you might want to look into distance functions designed for them, such as the "histogram intersection distance".
There is a tool called ELKI that has a wide variety of clustering algorithms (much more modern ones than k-means and hierarchical clustering) and it even has a version of histogram intersection distance included, that you can use in most algorithms. You might want to try out a few of the algorithms available in it. From the plot you gave above, it is unclear to me what you want to do. Group the individual histograms, right? Judging from the 10 you showed above, there might be no clusters.
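The histogram intersection distance mentioned above can be sketched in a few lines. This is a minimal illustration, not ELKI's implementation: it assumes the common definition (one minus the summed bin-wise minimum of the two normalised histograms), and the function names are made up for this example. Because it only takes minima and sums, empty bins cause no trouble, unlike Kullback-Leibler.

```python
def histogram_intersection_distance(p, q):
    """Return 1 - sum_i min(p_i, q_i) after normalising p and q to sum to 1.

    Assumed definition: 0 means identical shapes, 1 means no overlap at all.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")
    overlap = sum(min(a / total_p, b / total_q) for a, b in zip(p, q))
    return 1.0 - overlap


def pairwise_distances(histograms):
    """Build the symmetric distance matrix for a list of histograms."""
    n = len(histograms)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = histogram_intersection_distance(histograms[i], histograms[j])
            dist[i][j] = dist[j][i] = d
    return dist
```

The resulting matrix can be passed to any clustering routine that accepts precomputed distances, for instance hierarchical clustering.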
You may want to use some feature extraction technique to derive descriptors for a k-means or other type of clustering.
A basic approach would be to fit a certain distribution to your histograms and use its parameters as descriptors. For instance, you seem to have bimodal distributions, that you can describe with 2 means and 2 standard deviations.
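One way to obtain such descriptors is to fit a two-component Gaussian mixture to each histogram. The sketch below uses NumPy and a plain EM loop weighted by the bin proportions; the function name, the quartile-based initialisation and the `min_std` floor are assumptions made for this illustration, and it presumes the bin centres `x` are sorted in increasing order.

```python
import numpy as np


def fit_two_gaussians(x, y, n_iter=200, min_std=1e-3):
    """Fit a 2-component Gaussian mixture to a histogram with weighted EM.

    x: bin centres (sorted ascending), y: non-negative bin proportions.
    Returns (weights, means, stds), ordered by increasing mean.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(y, dtype=float)
    w = w / w.sum()

    # Initialise the two means at the weighted quartiles (an arbitrary choice).
    cdf = np.cumsum(w)
    means = np.array([x[np.searchsorted(cdf, 0.25)],
                      x[np.searchsorted(cdf, 0.75)]])
    overall_mean = np.sum(w * x)
    overall_std = np.sqrt(np.sum(w * (x - overall_mean) ** 2))
    stds = np.full(2, max(overall_std, min_std))
    weights = np.array([0.5, 0.5])

    for _ in range(n_iter):
        # E-step: responsibility of each component for each bin.
        z = (x[None, :] - means[:, None]) / stds[:, None]
        dens = weights[:, None] * np.exp(-0.5 * z ** 2) / (
            stds[:, None] * np.sqrt(2.0 * np.pi))
        resp = dens / np.maximum(dens.sum(axis=0, keepdims=True), 1e-300)

        # M-step: update parameters, weighting each bin by its proportion.
        mass = resp * w[None, :]
        weights = np.maximum(mass.sum(axis=1), 1e-12)
        means = (mass * x[None, :]).sum(axis=1) / weights
        var = (mass * (x[None, :] - means[:, None]) ** 2).sum(axis=1) / weights
        stds = np.sqrt(np.maximum(var, min_std ** 2))

    order = np.argsort(means)
    return weights[order], means[order], stds[order]
```

Concatenating the returned means and standard deviations (and optionally the weights) gives a short feature vector per histogram, which can then be clustered with k-means or a similar method.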
Another possibility is to cluster over the first two or three principal components of the counts of the histograms.
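The principal-component route can be sketched as follows, here in Python with NumPy rather than R. This is only an illustration under stated assumptions: the rows of `H` are the histograms (for example 10 rows of 70 bins), the helper names are invented for this example, and the k-means loop is a bare-bones Lloyd iteration with random initial centres, not a tuned implementation.

```python
import numpy as np


def pca_scores(H, n_components=2):
    """Project histograms (rows of H) onto their first principal components."""
    H = np.asarray(H, dtype=float)
    centred = H - H.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ Vt[:n_components].T


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centres)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centre (squared Euclidean).
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute centres; an empty cluster keeps its previous centre.
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```

A typical call would be `labels, _ = kmeans(pca_scores(H, 2), k)`, where the number of clusters `k` has to be chosen by the user.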
Alternatively, wavelet approaches can be used.
This page explains how to do that when dealing with extracellular spikes. The data is different, but the idea should be applicable to your case. You will also find many references at the bottom.
http://www.scholarpedia.org/article/Spike_sorting
In R you can calculate the principal components of your peaks using either the `princomp` or the `prcomp` function. Here you'll find a tutorial on PCA in R. For wavelets you may look at the `wavelets` package. k-means clustering can be achieved using the `kmeans` function.