Clustering – How to Split Data Into Multiple Normal Distributions or Clusters

clustering, multivariate analysis, multivariate normal distribution, normal distribution

I am trying to train a machine learning model to predict yield on fields from multiple parameters, such as elevation, humidity, and nitrogen content. Looking at the historical harvest data, I suspect there are additional (unknown) factors that make a field non-homogeneous, so it would be better to split each field into multiple zones and determine their biases.

Below are images of the fields; on the left, the yield is shown spatially, and on the right are the yield histograms.
[Image: yield maps (left) and yield histograms (right) for the fields]

Assuming that most natural phenomena follow a Gaussian distribution, I believe it would make sense to split every field into multiple clusters, each with its own mean value and normally distributed points around it.
Does anyone know of methods to detect the number of peaks (cluster centroids) and optimally split the observations around them? I also suspect there are multiple ways to define the optimization criterion.

I would be thankful for your suggestions about how to split the data into multiple zones for these observations.

Best Answer

I think what you're looking for is a Gaussian mixture model (GMM). The number of Gaussians can be chosen with an information criterion such as AIC or BIC.
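To make that concrete, here is a minimal sketch, not a definitive implementation. It assumes a single variable (yield only), fits 1-D Gaussian mixtures with plain EM for several candidate numbers of components, and keeps the one with the lowest BIC. The function names (`component_densities`, `fit_gmm_1d`, `bic`) and the synthetic `yields` data are hypothetical and only serve the illustration; for multivariate data a library implementation such as scikit-learn's `GaussianMixture`, which exposes `aic` and `bic` methods, is probably the more practical choice.

```python
import math
import random


def component_densities(v, weights, means, variances):
    """Weighted Gaussian density of value v under each mixture component."""
    return [
        w * math.exp(-0.5 * (v - m) ** 2 / s) / math.sqrt(2 * math.pi * s)
        for w, m, s in zip(weights, means, variances)
    ]


def log_likelihood(x, weights, means, variances):
    """Total log-likelihood of the data under the mixture."""
    return sum(
        math.log(max(sum(component_densities(v, weights, means, variances)), 1e-300))
        for v in x
    )


def fit_gmm_1d(x, k, n_iter=300, tol=1e-8):
    """Fit a k-component 1-D Gaussian mixture with EM (illustrative helper)."""
    n = len(x)
    xs = sorted(x)
    grand_mean = sum(x) / n
    grand_var = max(sum((v - grand_mean) ** 2 for v in x) / n, 1e-12)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    var_floor = 1e-6 * grand_var  # keeps a component from collapsing onto one point
    prev_ll = -math.inf
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for v in x:
            d = component_densities(v, weights, means, variances)
            total = max(sum(d), 1e-300)
            resp.append([dj / total for dj in d])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, x)) / nj
            variances[j] = max(
                sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, x)) / nj,
                var_floor,
            )
        ll = log_likelihood(x, weights, means, variances)
        if ll - prev_ll < tol:  # stop once the likelihood no longer improves
            break
        prev_ll = ll
    return weights, means, variances, ll


def bic(ll, k, n):
    """BIC = p * ln(n) - 2 * log-likelihood; lower is better."""
    n_params = 3 * k - 1  # k means + k variances + (k - 1) free weights
    return n_params * math.log(n) - 2.0 * ll


# Synthetic stand-in for one field's yield values (assumed, not real data):
# two zones with different mean yield.
random.seed(0)
yields = [random.gauss(2.0, 0.7) for _ in range(300)]
yields += [random.gauss(8.0, 0.7) for _ in range(300)]

scores = {}
for k in range(1, 5):
    *_, ll = fit_gmm_1d(yields, k)
    scores[k] = bic(ll, k, len(yields))

best_k = min(scores, key=scores.get)
print("BIC per number of components:", scores)
print("Selected number of components:", best_k)
```

Once the number of components is chosen, each observation can be assigned to the component with the highest responsibility, which gives the zones. Swapping the penalty `n_params * math.log(n)` for `2 * n_params` turns the score into AIC, which tends to favour more components.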
