Many students have consulted me about symptom-cluster analyses for their master's theses, usually using factor analysis or latent class analysis. I have written about those methods before; today I will introduce an unsupervised machine learning method for symptom clusters: hierarchical clustering. If this method interests you, read on and think it over. I hope that after reading this article you will have a basic understanding of hierarchical clustering and can consider whether your own work could expand in this direction.
The result is shown in the figure below, a screenshot from a published article. The author clustered the symptoms of a disease into 3 clusters, discussed the characteristics of each cluster, and offered suggestions for treatment and care.
Interested students can read the original article: Sethares, Kristen, & Chin, Elizabeth. (2021). Age and gender differences in physical heart failure symptom clusters. Heart & Lung, 50, 832-837.
Today I will show you how to produce this kind of symptom-cluster analysis with hierarchical clustering.
The principle of hierarchical clustering
The result of hierarchical clustering looks like a tree that grows layer by layer. This tree is entirely data-driven, which makes hierarchical clustering particularly suitable for exploratory research in unfamiliar areas, such as symptom clusters.
This tree is called a dendrogram in English. How is it formed? It can grow from the top down, by repeatedly splitting (Method 1, called divisive), or from the root up, by repeatedly merging (Method 2, called agglomerative).
Here I will only cover Method 2, because it is the more commonly used of the two; the article in the screenshot above was also done with agglomerative clustering.
The basic idea of Method 2 is:
1. Calculate the distance between every pair of clusters.
2. Merge the two clusters with the smallest distance.
3. Repeat steps 1 and 2 until everything has been merged into a single cluster.
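The merge process above can be sketched on a toy dataset; `hclust()` records every merge in its `$merge` and `$height` components (the toy data and variable names here are my own, for illustration only):

```r
# toy data: 5 one-dimensional points, so the merge order is easy to predict
x <- data.frame(v = c(1, 2, 10, 11, 30))
d <- dist(x)                        # step 1: pairwise distances
hc <- hclust(d, method = "single")  # steps 2-3: repeatedly merge the closest pair
hc$merge   # one row per merge; negative numbers refer to the original cases
hc$height  # the distance at which each merge happened
```

With single linkage the merge heights come out as 1, 1, 8, 19: cases 1-2 and 3-4 merge first, then the two pairs join, and the outlier at 30 joins last.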
After the above steps are completed, a tree will grow out. The intuitive diagram is as follows:
The figure above assumes we have only two variables and 9 cases, each starting as its own cluster; moving from the upper left corner to the lower right corner, the 9 cases end up merged into a single cluster.
The steps mention distance, which raises the question of how distance is calculated. There are many calculation methods; this article will not go into them, and interested students can consult me separately. The common distance measures are as follows:
The method used in the screenshot article is Ward's method.
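As a quick illustration of how distance measures differ, here is a comparison on two points; the method names are the ones base R's `dist()` accepts:

```r
# two points: (0, 0) and (3, 4)
p <- rbind(c(0, 0), c(3, 4))
dist(p, method = "euclidean")  # sqrt(3^2 + 4^2) = 5
dist(p, method = "manhattan")  # |3| + |4| = 7
dist(p, method = "maximum")    # max(3, 4) = 4
```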
The function for bottom-up (agglomerative) clustering is hclust(). hclust() takes a distance matrix as its input. You can enter the following code directly in R to experience the joy of plotting:
hc <- hclust(dist(mtcars))
plot(hc)
In the code above, the dist() function computes the distances between cases. Before this step, all numeric variables should be standardized (for example with scale()); otherwise variables on larger scales dominate the distances and the clustering will be misleading. The distance measure is chosen through the method argument of dist(), while the linkage method, such as the Ward's method used in the screenshot paper, is chosen through the method argument of hclust() by setting it to "ward.D" or "ward.D2".
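Putting the standardization and the linkage choice together, here is a minimal sketch on the built-in `mtcars` data; note that the Ward linkage is set on `hclust()`, while `dist()` keeps its default Euclidean measure:

```r
d  <- dist(scale(mtcars))            # standardize each column, then Euclidean distances
hc <- hclust(d, method = "ward.D2")  # Ward's linkage, as in the screenshot paper
plot(hc)                             # draw the dendrogram
```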
Notice, however, that what we have clustered so far are the cases. For symptom clusters, what we actually want to cluster are the symptoms, that is, the variables in our database. Let's work through a practical example.
For example, suppose I currently have a database in the following form, with one column for each symptom collected from every patient:
I want to see which "symptom clusters" these symptoms form in patients. First we need to transpose the data frame, and then cluster. The code can be written as follows:
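Since the original code is not reproduced here, the following is a minimal sketch of that transpose-then-cluster step. The simulated data frame is only a stand-in for the real symptom database; replace it with your own data:

```r
# simulated stand-in: 20 patients, 6 symptom scores (replace with your own data)
set.seed(1)
data2 <- as.data.frame(matrix(rnorm(20 * 6), nrow = 20,
                              dimnames = list(NULL, paste0("symptom", 1:6))))
data_t <- t(scale(data2))    # standardize each symptom, then transpose: rows = symptoms
mycluster <- hclust(dist(data_t), method = "ward.D2")
plot(mycluster, hang = -1)   # hang = -1 aligns all symptom labels at the bottom
```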
After running the code, the picture can be drawn:
In fact, you can see intuitively that there should be two symptom clusters: symptom cluster 1 includes symptoms 3 and 4, and symptom cluster 2 includes all the other symptoms. We can mark them as in the screenshot paper:
In other words, symptoms 3 and 4 form one cluster, and the remaining symptoms form another.
You can also color the branches of the different symptom clusters to highlight them further. The code is as follows:
library(dendextend)  # color_branches() comes from the dendextend package
hc_dend_obj <- as.dendrogram(mycluster)
hc_col_dend <- color_branches(hc_dend_obj, h = 6)
plot(hc_col_dend, hang = -1)
Another very important step is generating a symptom-cluster label for each case. Only with these labels can we compare the demographic characteristics of cases in different symptom-cluster groups, as the paper does. The labels can be obtained with the following code:
library(dplyr)  # for mutate()
hc <- hclust(dist(scale(data2)))
cut_avg <- cutree(hc, k = 2)
data_cl <- mutate(data1, cluster = cut_avg)
After running the code above, check the resulting data: the newly added last column gives each case's symptom-cluster group. We can then compare the groups on the various variables of interest, and a paper like this one is essentially complete.
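For the group comparison, base R already suffices. Here is a sketch with simulated labels; the column names `age` and `cluster` are my own illustration, standing in for the real database:

```r
# simulated: 40 cases already labelled with their symptom-cluster group
set.seed(2)
data_cl <- data.frame(age     = rnorm(40, mean = 60, sd = 10),
                      cluster = rep(1:2, each = 20))
table(data_cl$cluster)                                # cases per group
aggregate(age ~ cluster, data = data_cl, FUN = mean)  # mean age by group
t.test(age ~ cluster, data = data_cl)                 # test the between-group difference
```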
Finally, a reminder: the authors of the paper did this in SPSS, and you can try it there too, under the Analyze - Classify - Hierarchical Cluster option. I tried it and it works perfectly well.
Today I showed you how to use hierarchical clustering to explore symptom clusters. Thank you for your patience.