"

In weeks 6 and 7, I coded BayesianGaussianMixture for the full covariance type. It now runs smoothly on synthetic data and the old-faithful data. Take a peek at the demo.

from sklearn.mixture.bayesianmixture import BayesianGaussianMixture as BGM
bgm = BGM(n_init=1, n_iter=100, n_components=7, verbose=2, init_params='random',
          precision_type='full')
bgm.fit(X)

The demo repeats the experiment in PRML, page 480, Figure 10.6. VB on the BGMM shows that it can infer the number of components automatically. It converged in 47 iterations.

The ELBO looks a little weird. It does not always go up. When some clusters disappear, the ELBO goes down a little, then goes straight back up. I think this is because the estimation of the parameters is ill-posed when these clusters have fewer data samples than the number of features.
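To pin down where the bound misbehaves, one option is to scan the per-iteration lower-bound values for decreases. The helper below is a minimal sketch: the name find_elbo_drops and the trace values are made up for illustration and are not output from the actual run.

```python
def find_elbo_drops(lower_bounds, tol=0.0):
    """Return (iteration, change) pairs where the bound fell by more than tol."""
    drops = []
    for i in range(1, len(lower_bounds)):
        change = lower_bounds[i] - lower_bounds[i - 1]
        if change < -tol:
            drops.append((i, change))
    return drops


# Synthetic trace for illustration only (not real ELBO values).
trace = [-520.0, -480.0, -481.5, -450.0, -449.0]
print(find_elbo_drops(trace))
```

Cross-checking the reported iterations against the moments when a component's effective sample count falls below the number of features would be one way to test the explanation above.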

BayesianGaussianMixture has many more parameters than GaussianMixture: there are six parameters per component. I find it hard to keep so many functions and parameters under control. The initial design of BaseMixture is also not very good. I took a look at bnpy, which is a more complicated implementation of VB on various mixture models. Although I don't need such a complicated implementation, its decoupling of the observation model, i.e. $X$, $\\mu$, $\\Lambda$, from the mixture model, i.e. $Z$, $\\pi$, is quite nice. So I tried to use mixin classes to represent these two models. I split MixtureBase into three abstract classes: ObsMixin, HiddenMixin and MixtureBase(ObsMixin, HiddenMixin). I also implemented subclasses for the Gaussian mixture, ObsGaussianMixin(ObsMixin), MixtureMixin(HiddenMixin) and GaussianMixture(MixtureBase, ObsGaussianMixin, MixtureMixin), but Python does not allow me to do this because there is no correct MRO. :-| I changed them back, but this unsuccessful experiment gave me a nice base class, MixtureBase.
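For readers who have not hit an MRO error before, here is a generic sketch of how one can arise with mixins. The two classes reuse names from the paragraph above, but the "Broken" class and its base order are hypothetical. This shows the kind of error involved and is not necessarily the exact hierarchy that failed.

```python
class ObsMixin:
    """Hypothetical base for the observation model (X, mu, Lambda)."""


class ObsGaussianMixin(ObsMixin):
    """Hypothetical Gaussian observation model."""


def build_conflicting_class():
    # Listing a base class *before* its own subclass asks Python to put
    # ObsMixin both before and after ObsGaussianMixin; C3 linearization
    # cannot satisfy both constraints and raises TypeError.
    return type("Broken", (ObsMixin, ObsGaussianMixin), {})


try:
    build_conflicting_class()
    mro_error = None
except TypeError as exc:
    mro_error = str(exc)

print(mro_error)
```

Python builds the resolution order when the class is created, so an inconsistent layout fails at definition time instead of later at method lookup.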

I also tried to use cached_property to store intermediate variables such as $\\ln \\pi$, $\\ln \\Lambda$, and the Cholesky-decomposed $W^{-1}$, but didn't get much benefit. It is almost the same as saving these variables as private attributes on the instance.
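As a rough illustration of the caching idea, the sketch below uses functools.cached_property from the standard library (Python 3.8+). This is an assumption, since the post does not say which cached_property implementation was used. The MixtureWeights class and its counter are hypothetical.

```python
import math
from functools import cached_property


class MixtureWeights:
    """Hypothetical holder for mixture weights pi."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.n_log_calls = 0  # counts how often the logs are recomputed

    @cached_property
    def log_weights(self):
        # Computed on first access, then stored in the instance __dict__.
        self.n_log_calls += 1
        return [math.log(w) for w in self.weights]


w = MixtureWeights([0.2, 0.3, 0.5])
first = w.log_weights
second = w.log_weights      # served from the cache, no recomputation
del w.log_weights           # manual invalidation, e.g. after an M-step update
third = w.log_weights       # recomputed
print(w.n_log_calls)
```

Because the cache has to be invalidated by hand whenever the parameters change, the bookkeeping ends up much like keeping a private attribute, which fits the observation above.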

The numerical issue arises when responsibilities are extremely small. Evaluating resp * log resp then gives NaN. I simply skip the computation when resp < 10*EPS. Still, the ELBO seems suspicious.
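The masking trick can be sketched with NumPy as follows. It assumes EPS means the machine epsilon for float64. The function name and the toy responsibility matrix are illustrative and are not taken from the actual implementation.

```python
import numpy as np

EPS = np.finfo(float).eps  # assumption: EPS is float64 machine epsilon


def safe_resp_log_resp(resp, threshold=10 * EPS):
    """Sum resp * log(resp), skipping entries below the threshold."""
    resp = np.asarray(resp, dtype=float)
    mask = resp >= threshold
    out = np.zeros_like(resp)
    out[mask] = resp[mask] * np.log(resp[mask])
    return out.sum()


# Toy responsibilities: the first sample belongs entirely to component 0.
resp = np.array([[1.0, 0.0], [0.5, 0.5]])

with np.errstate(divide="ignore", invalid="ignore"):
    naive = np.sum(resp * np.log(resp))  # 0 * -inf evaluates to nan

print(np.isnan(naive))
print(safe_resp_log_resp(resp))
```

Skipping those entries matches the limit of x log x as x goes to 0, which is 0, so the masked sum drops only terms that contribute nothing in exact arithmetic.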

The current implementation of VBGMM in scikit-learn cannot learn the correct parameters on old-faithful data.

VBGMM(alpha=0.0001, covariance_type='full', init_params='wmc',
      min_covar=None, n_components=6, n_iter=100, params='wmc',
      random_state=None, thresh=None, tol=0.001, verbose=0)

It yields only one component. The weights_ attribute is

array([  7.31951611e-07,   7.31951611e-07,   7.31951611e-07,
         7.31951611e-07,   7.31951611e-07,   9.99996340e-01])
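One quick way to read such a weight vector is to count the components whose weight exceeds a small cutoff. The sketch below applies this to the values printed above. The helper name and the 1e-3 threshold are arbitrary choices for illustration.

```python
import numpy as np

# Weights copied from the VBGMM output above.
weights = np.array([7.31951611e-07] * 5 + [9.99996340e-01])


def count_active_components(weights, threshold=1e-3):
    """Count components whose mixing weight exceeds the threshold."""
    return int(np.sum(np.asarray(weights) > threshold))


print(count_active_components(weights))
```

With this cutoff only the last component counts as active, so five of the six components have effectively collapsed.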

I also implemented DirichletProcessGaussianMixture, but at the moment it behaves the same as BayesianGaussianMixture. Both of them can infer the best number of components. DirichletProcessGaussianMixture took slightly more iterations than BayesianGaussianMixture. If we infer a Dirichlet process mixture by Gibbs sampling, we don't need to specify the truncation level; alpha, the concentration parameter, is enough. But with variational inference, we still need to give the model the maximal possible number of components, i.e., the truncation level $T$.
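To make the role of the truncation level concrete, here is a minimal sketch of the standard truncated stick-breaking construction of Dirichlet process weights. It illustrates the general idea and is not the code in DirichletProcessGaussianMixture. The function name and the values of alpha and T are assumptions.

```python
import numpy as np


def truncated_stick_breaking(alpha, T, rng):
    """Draw T mixing weights from a stick-breaking prior truncated at T."""
    v = rng.beta(1.0, alpha, size=T)  # stick proportions v_k ~ Beta(1, alpha)
    v[-1] = 1.0                       # truncation: last piece takes what is left
    # Length of stick remaining before each break: prod_{j<k} (1 - v_j).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining


rng = np.random.default_rng(0)
pi = truncated_stick_breaking(alpha=1.0, T=7, rng=rng)
print(pi, pi.sum())
```

Forcing the last stick proportion to 1 makes the T weights sum to one, which is why the variational approach needs T fixed up front even though the underlying process has infinitely many components.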

"