A TUTORIAL ON CHANGEPOINT MODELS¶
The derivations and the code in this incomplete report are from my Master's work. It describes a Bayesian change point model for multivariate data.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
from matplotlib import cm
import scipy.special as sp
# import scipy as sp
import pandas as pd
from operator import attrgetter
np.set_printoptions(precision=8)
import time
import pickle
import operator
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
from IPython.display import Image
%matplotlib inline
def randgen(pr, N=1):
L = len(pr)
return np.random.choice(range(L), size=N, replace=True, p=pr/np.sum(pr))
def log_sum_exp(l, axis=0):
l_star = np.max(l, axis=axis, keepdims=True)
return l_star + np.log(np.sum(np.exp(l - l_star),axis=axis,keepdims=True))
def safe_log_sum_exp(x, axis=0):
return log_sum_exp(x,axis)
def normalize_exp(log_P, axis=0):
a = np.max(log_P, keepdims=True, axis=axis)
P = normalize(np.exp(log_P - a), axis=axis)
return P
def normalize(A, axis=0):
Z = np.sum(A, axis=axis, keepdims=True)
idx = np.where(Z == 0)
Z[idx] = 1
return A/Z
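As a quick, illustrative check of these helpers (a small example added here, not part of the original experiments): log_sum_exp computes the log of a sum of exponentials without the underflow that the naive formula suffers from, and normalize_exp turns log weights into proper probabilities.
# Illustrative check of the helpers above: log-domain values this small underflow
# if exponentiated directly, but log_sum_exp handles them without problems.
l = np.array([-1000.0, -1001.0, -1002.0])
print(np.log(np.sum(np.exp(l))))  # naive computation underflows to -inf
print(log_sum_exp(l))             # stable computation, approximately -999.59
print(normalize_exp(l))           # normalized weights, approximately [0.665, 0.245, 0.090]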
def load_array(filename):
X = np.loadtxt(filename)
dim = int(X[0]);
size = []
for i in range(dim):
size.append(int(X[i+1]));
X = np.reshape(X[dim+1:], size, order='F')
return X;
def save_array(filename, X, format = '%.6f'):
with open(filename, 'w') as f:
dim = len(X.shape)
f.write('%d\n' % dim)
for i in range(dim):
f.write('%d\n' % X.shape[i])
        temp = X.reshape(np.prod(X.shape), order='F')
for num in temp:
f.write(str(num)+"\n")
# np.savetxt(f, temp, fmt = format)
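As a usage sketch of the two helpers above (illustrative; it assumes a writable /tmp directory, and the variable names are chosen here only for the example), an array can be round-tripped through this text format:
# Round-trip an array through save_array / load_array (illustrative check).
A_rt = np.arange(12, dtype=float).reshape(3, 4)
save_array('/tmp/roundtrip_check.txt', A_rt)
B_rt = load_array('/tmp/roundtrip_check.txt')
print(np.allclose(A_rt, B_rt))  # expected: True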
def plot_hist(data,xlines,title="",xlabel="",ylabel="",label_='changepoints'):
(K,T) = data.shape
fig = plt.figure(figsize=(30,4))
ax = fig.gca()
y,x = np.mgrid[slice(0, K+1, 1),slice(0,T+1,1)]
im = ax.pcolormesh(x, y, data, cmap=cm.gray)
fig.colorbar(im)
    # (ax.hold was removed in matplotlib 2.x and is not needed here)
    plt1 = ax.vlines(np.arange(0,T), 0, xlines*K, colors='r', linestyles='-', label=label_, linewidth=3)
ax.set_title(title)
ax.set_xlabel(xlabel)
ax.set_ylabel(ylabel)
ax.legend(handles=[plt1])
fig.canvas.draw()
def plot_matrix(X, title='Title', xlabel='xlabel', ylabel='ylabel', figsize=None):
if figsize is None:
plt.figure(figsize=(25,6))
else:
plt.figure(figsize=figsize)
plt.imshow(X, interpolation='none', vmax=np.max(X), vmin=0, aspect='auto')
plt.colorbar()
plt.set_cmap('gray_r')
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.title(title)
Graphical Model¶
This tutorial describes the implementation of multiple change point models (CPMs). CPMs are hierarchical hidden Markov models with the following generative model:
\begin{align} \pi_0 &\sim p(\pi_0) \\ r_1 &\sim p(r_1) \\ r_t &\sim p(r_t|r_{t-1}) \\ \pi_t &\sim p(\pi_t|r_t,\pi_{t-1}) \\ x_t &\sim p(x_t|r_t,\pi_t) \end{align}
Here, $r_t$ and $\pi_t$ are two Markov chains where $\pi_t$ is conditioned on $r_t$ (we say that $r_t$ controls $\pi_t$), and the observations $x_t$ are conditioned only on $\pi_t$. When the state space of $\pi_t$ is discrete, this model reduces to a hidden Markov model whose state space is the Cartesian product of the state spaces of $r_t$ and $\pi_t$. When $\pi_t$ is continuous, however, the forward-backward algorithm cannot be implemented exactly in general.
In this tutorial, we will focus on a special kind of changepoint model called a reset model. In such models, $r_t$ is a binary variable. When $r_t=0$, $\pi_t$ stays the same as $\pi_{t-1}$. When $r_t=1$, the model is said to reset itself: $\pi_t$ becomes independent of $\pi_{t-1}$ and is drawn afresh from its prior (governed by the hyperparameter). As we will see below, linear-time algorithms can be derived thanks to this property. We also remove the dependency between consecutive $r_t$'s and assume they are independent of one another. So, the graphical model is
Image(filename='etc/graphical_model.png', width=600, height=200)
Inference in CPM's¶
We can calculate the posterior marginals of the hidden variables $r_{1:T}$ and $\pi_{0:T}$, and therefore detect the *change points*, via the forward-backward algorithm, provided that the model parameters $a$ (the reset hyperparameter) and $c$ (the change probability) are known.
Filtering (Forward Recursion)¶
We can infer the latent states $(r_t, \pi_t)$ at time $t$ from the observations up to that time, $x_{1:t}$, via filtering. The filtering message $\alpha_{t|t}$ is defined as:
\begin{align} \alpha_{t|t}(r_t, \pi_t) &= p(r_t, \pi_t, x_{1:t}) \end{align}
and the filtering posterior at time $t$ is proportional to $\alpha_{t|t}(r_t, \pi_t)$:
\begin{align} p(r_t, \pi_t | x_{1:t}) \propto \alpha_{t|t}(r_t, \pi_t) \end{align}
We calculate the $\alpha$ messages recursively as follows:
\begin{align} \alpha_{t|t}(r_t, \pi_t) &= p(r_t, \pi_t, x_{1:t}) \\ &= p(x_t | r_t, \pi_t) \underbrace{ p(r_t, \pi_t, x_{1:t-1})}_{\alpha_{t|t-1}(r_{t}, \pi_{t})} \\ &= p(x_t | \pi_t) \sum_{r_{t-1}} \int_{\pi_{t-1}} p(r_{t-1}, \pi_{t-1}, r_t, \pi_t, x_{1:t-1}) d \pi_{t-1} \\ &= p(x_t | \pi_t) \sum_{r_{t-1}} \int_{\pi_{t-1}} p(r_{t-1}, \pi_{t-1}, x_{1:{t-1}}) p(r_t, \pi_t | r_{t-1}, \pi_{t-1}) d \pi_{t-1}\\ &= p(x_t | \pi_t) \sum_{r_{t-1}} \int_{\pi_{t-1}} \alpha_{t-1|t-1}(r_{t-1}, \pi_{t-1}) p(\pi_t | r_{t}, \pi_{t-1})p(r_t) d \pi_{t-1}\\ &= p(x_t | \pi_t) \sum_{r_{t-1}} \int_{\pi_{t-1}} \alpha_{t-1|t-1}(r_{t-1}, \pi_{t-1}) \left(\delta(\pi_{t}-\pi_{t-1})p(r_t=0) + p(\pi_t)p(r_t=1) \right) d \pi_{t-1} \end{align}
We call $\alpha_{t|t-1}(r_t, \pi_t) = p(r_t, \pi_t, x_{1:t-1})$ the forward prediction message. We therefore move forward by alternately calculating the prediction messages $\alpha_{t|t-1}$ and the update messages $\alpha_{t|t}$, thereby propagating the posterior probabilities of the latent states. Normalizing $\alpha_{t|t}$ gives the filtering posterior $p(r_t, \pi_t | x_{1:t})$ at time $t$, from which a change point decision can be made.
Backward Recursion¶
Let's define the backward messages:
\begin{align} \beta_{t|t+1}(r_t, \pi_t) &= p(x_{t+1:T} | r_t, \pi_t) \\ \beta_{t|t}(r_t, \pi_t) &= p(x_{t:T} | r_t, \pi_t) = p(x_t | r_t, \pi_t) \beta_{t|t+1}(r_t, \pi_t) \end{align}
\begin{align} \beta_{t|t+1}(r_t, \pi_t) &= \sum_{r_{t+1}} \int_{\pi_{t+1}} p(x_{t+1:T}, \pi_{t+1}, r_{t+1} | r_t, \pi_t) d \pi_{t+1} \\ &= \sum_{r_{t+1}} \int_{\pi_{t+1}} p(x_{t+1:T} | \pi_{t+1}, r_{t+1}) p(r_{t+1}, \pi_{t+1} | \pi_{t}, r_t) d \pi_{t+1} \\ &= \sum_{r_{t+1}} \int_{\pi_{t+1}} p(x_{t+1} | \pi_{t+1}) p(x_{t+2:T} | \pi_{t+1}, r_{t+1}) p(\pi_{t+1} | r_{t+1}, \pi_{t}, r_t) p(r_{t+1}) d \pi_{t+1} \\ &= \sum_{r_{t+1}} \int_{\pi_{t+1}} p(x_{t+1} | \pi_{t+1}) \beta_{t+1|t+2}(r_{t+1}, \pi_{t+1}) \left( [r_{t+1}=0]\delta(\pi_{t+1}- \pi_{t}) + [r_{t+1}=1]p(\pi_{t+1})\right) p(r_{t+1}) d \pi_{t+1} \\ &= \sum_{r_{t+1}} \int_{\pi_{t+1}} p(x_{t+1} | \pi_{t+1}) \beta_{t+1|t+2}(r_{t+1}, \pi_{t+1})\left( p(r_{t+1}=0)\delta(\pi_{t+1}- \pi_{t}) + p(r_{t+1}=1)p(\pi_{t+1})\right) d \pi_{t+1} \\ \end{align}
Smoothing¶
The posterior (smoothing) probabilities can be obtained by multiplying the forward and backward messages:
\begin{align} p(r_t, \pi_t|x_{1:T}) &\propto p(r_t, \pi_t,x_{1:T}) \\ &= p(r_t, \pi_t, x_{1:t}) p(x_{t+1:T} | r_t, \pi_t, x_{1:t}) \\ &= p(r_t, \pi_t, x_{1:t}) p(x_{t+1:T} | r_t, \pi_t) \\ &= \alpha_{t|t} \beta_{t|t+1} \end{align}
If the inference task is online, calculating posterior probabilities conditioned on the full history becomes infeasible: the cost of the backward recursion grows linearly with time, and it must be re-run at each new observation. For online problems we therefore typically use fixed-lag smoothers, in which only a fixed number of backward recursions are performed so that the computational cost per time step stays constant. This fixed number is referred to as the lag. If the lag is $L$, the posterior probabilities at time $t$ are calculated conditioned on the observations up to time $t+L$ as follows:
\begin{align} p(r_t, \pi_t, x_{1:t+L}) &= p(r_t, \pi_t, x_{1:t}) p(x_{t+1:t+L} | r_t, \pi_t) \\ &= \alpha_{t|t}(r_t, \pi_t) \beta_{t|t+1} (r_t, \pi_t) \end{align} where the backward message $\beta_{t|t+1}$ is here computed from the $L$ future observations $x_{t+1:t+L}$ only.
Example: Dirichlet-Multinomial CPM¶
Generative Model¶
In the example application, the observations are $K>1$ dimensional. So we set $\pi_0$ to a $K$-dimensional Dirichlet random variable with parameter $a$. We further assume that each observation $x_t$ is drawn from a Multinomial distribution parametrized by $\pi_t$. Because the Dirichlet distribution is the conjugate prior of the Multinomial, we can integrate out $\pi_t$ easily while calculating the posterior probability of a change. Overall, the generative model becomes
\begin{align} \pi_0 &\sim \text{Dir}(\pi_0; a) \\ r_t &\sim \mathcal{BE}(r_t; c) \\ \pi_t | r_t, \pi_{t-1} &\sim [r_{t}=0] \delta(\pi_{t} - \pi_{t-1}) + [r_{t}=1] \text{Dir}(\pi_t; a)\\ x_t | \pi_t &\sim \mathcal{M}(x_t; \pi_t) \end{align}
and the Bernoulli, Dirichlet and Multinomial distributions are defined as
\begin{align} \mathcal{BE}(r; p) &= \exp \left( r \log p + (1-r) \log (1-p) \right) \\ \text{Dir}(\pi; a) &= \exp \left( \log \Gamma \left( \sum_{k=1}^K a_k \right) - \sum_{k=1}^K \log \Gamma(a_k) + \sum_{k=1}^K (a_k-1) \log \pi_{k} \right) \\ \mathcal{M}(x;\pi) &= \exp\left( \log (x_s!) - \sum_i \log(x_i!) + \sum_i x_i \log \pi_i \right), \quad\quad x_s = \sum_i x_i \end{align}
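As a quick numerical sanity check of these densities (an illustrative sketch using scipy.stats; the variables a_chk, pi_chk and x_chk are introduced here only for the example), the exponential-family expressions above agree with scipy's implementations:
# Check the log-density expressions above against scipy.stats (illustrative).
from scipy.stats import dirichlet, multinomial
a_chk = np.array([2.0, 3.0, 4.0])
pi_chk = np.array([0.2, 0.3, 0.5])
x_chk = np.array([1, 2, 3])
log_dir = sp.gammaln(a_chk.sum()) - np.sum(sp.gammaln(a_chk)) + np.sum((a_chk - 1) * np.log(pi_chk))
log_mult = sp.gammaln(x_chk.sum() + 1) - np.sum(sp.gammaln(x_chk + 1)) + np.sum(x_chk * np.log(pi_chk))
print(log_dir, dirichlet.logpdf(pi_chk, a_chk))                  # should agree
print(log_mult, multinomial.logpmf(x_chk, x_chk.sum(), pi_chk))  # should agree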
The data generating code is given below:
T = 80 # number of time steps
K = 10 # input dimension
c = 0.05 # prior probability of a change point
# a = np.ones(K)/K # Dirichlet hyperparameter (reset parameter)
a = np.ones(K) # Dirichlet hyperparameter (reset parameter)
cps = np.random.binomial(1,c,size=T)
actual_changepoints = np.where(cps==1)[0]
pi_0 = np.random.dirichlet(a)
pi_ = np.zeros((K,T))
data = np.zeros((K,T))
for t in range(T):
if cps[t]:
pi_[:,t] = np.random.dirichlet(a)
elif t==0:
pi_[:,t] = pi_0
else:
pi_[:,t] = pi_[:,t-1]
data[:,t] = np.random.multinomial(np.random.randint(10,50),pi_[:,t])
def plot_hist(data,xlines,title="",xlabel="",ylabel="",label_='changepoints'):
(K,T) = data.shape
fig = plt.figure(figsize=(18,4))
ax = fig.gca()
y,x = np.mgrid[slice(0, K+1, 1),slice(0,T+1, 1)]
ax.pcolormesh(x, y, data)
    # (ax.hold was removed in matplotlib 2.x and is not needed here)
    plt1 = ax.vlines(np.arange(0,T), 0, xlines*K, colors='r', linestyles='-', label=label_, linewidth=3)
ax.set_title(title)
ax.set_xlabel(xlabel)
ax.set_ylabel(ylabel)
ax.legend(handles=[plt1])
fig.canvas.draw()
plot_hist(data,cps,title="Data generated using the above generative model", xlabel="Time",ylabel="Data Vectors")
Implementational Details¶
Dirichlet Potential¶
Implementing the multiple change point model with Multinomial observations and Dirichlet priors requires working in the continuous domain. Thus, we represent the posterior distribution over $\pi_t$ and $r_t$ using Dirichlet potentials. Also, because the forward/backward recursions involve marginalization over $r_t$, the $\alpha$ and $\beta$ messages are mixtures of Dirichlet potentials. At first glance, the number of mixture components may seem to grow as $2^t$ (one component for each possible trajectory of $r_{1:t}$). However, as we will see below, in reset models inference can be carried out with $O(t)$ time and space complexity.
A Dirichlet potential is defined with a tuple $(a, c)$ where $a$ is a Dirichlet parameter and $c$ is the logarithm of the normalizing constant.
\begin{align*} \phi(x) &= e^c \text{Dir}(x; a) \\ c &= \log \int_x \phi(x) dx \end{align*}
The data structure below stores a generic mixture component. For the Dirichlet-Multinomial CPM, a mixture component corresponds to a Dirichlet potential: $\alpha$ is a vector (the Dirichlet parameter) and $c$ is a real number (the logarithm of the normalizing constant).
class MixtureComponent:
def __init__(self,alpha_,c_):
self.alpha = alpha_
self.c = c_
def copy(self):
return MixtureComponent(self.alpha.copy(), self.c)
Let us also define shorthands for the logarithms of the change point priors:
\begin{align} l_0 &= \log p(r_t=0)\\ l_1 &= \log p(r_t=1) \end{align}
We need two operations for the implementation: multiplication of two Dirichlet potentials, and multiplication of a Dirichlet potential with a Multinomial density, both of which result in another Dirichlet potential. The former is needed when we multiply forward and backward messages (which are Dirichlet potentials) for smoothing; the latter is used when we marginalize out $\pi_t$ in the forward/backward recursions. We will first show that the product of two Dirichlet potentials is another Dirichlet potential. Then we will write the Multinomial distribution as a Dirichlet potential, which in turn gives the second operation.
Multiplication of Dirichlet Potentials¶
Multiplication of two Dirichlet potentials $(a_1, c_1)$ and $(a_2, c_2)$ is another Dirichlet potential:
\begin{align*} (a_1, c_1) (a_2, c_2) &= e^{c_1} \mathcal{D}ir(x; a_1) e^{c_2} \mathcal{D}ir(x; a_2) \\ &= \exp \left (c_1 + c_2 + \log \Gamma \left( \sum_{k=1}^K a_{1,k} \right) - \sum_{k=1}^K \log \Gamma(a_{1,k}) + \sum_{k=1}^K (a_{1,k} - 1 ) \log x_{k} \right . \\ &\qquad \left . + \log \Gamma \left( \sum_{k=1}^K a_{2,k} \right) - \sum_{k=1}^K \log \Gamma(a_{2,k}) + \sum_{k=1}^K (a_{2,k}-1) \log x_{k} \right ) \\ &= \exp \left (c_1 + c_2 + \log \Gamma \left( \sum_{k=1}^K a_{1,k} \right) - \sum_{k=1}^K \log \Gamma(a_{1,k}) + \log \Gamma \left( \sum_{k=1}^K a_{2,k} \right) - \sum_{k=1}^K \log \Gamma(a_{2,k}) \right . \\ & \qquad \left . + \sum_{k=1}^K (a_{1,k} + a_{2,k} - 2) \log x_{k} \right ) \\ &= (a_1 + a_2 - 1, g(a_1, c_1, a_2, c_2)) \end{align*}
where the new normalizing constant is
\begin{align*} g(a_1, c_1, a_2, c_2) &= c_1 + c_2 + \log \Gamma \left( \sum_{k=1}^K a_{1,k} \right) - \sum_{k=1}^K \log \Gamma(a_{1,k}) + \log \Gamma \left( \sum_{k=1}^K a_{2,k} \right) - \sum_{k=1}^K \log \Gamma(a_{2,k}) \\ &\qquad - \log \Gamma \left( \sum_{k=1}^K (a_{1,k} + a_{2,k} - 1) \right) + \sum_{k=1}^K \log \Gamma(a_{1,k} + a_{2,k} - 1) \end{align*}
The code that calculates the normalizing constant resulting from the multiplication is given below:
def dir_dir_norm_const(alpha1,alpha2):
return sp.gammaln(np.sum(alpha1)) - np.sum(sp.gammaln(alpha1)) + \
sp.gammaln(np.sum(alpha2)) - np.sum(sp.gammaln(alpha2)) + \
np.sum(sp.gammaln(alpha1+alpha2-1)) - sp.gammaln(np.sum(alpha1+alpha2-1))
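This constant can be verified with a small Monte Carlo estimate (an illustrative check; a1_chk and a2_chk are example values): since the product of the two normalized densities integrates to $e^{g}$, we have $\mathrm{E}_{x \sim \text{Dir}(a_1)}\left[\text{Dir}(x; a_2)\right] = e^{g(a_1, 0, a_2, 0)}$.
# Monte Carlo check of dir_dir_norm_const (illustrative):
# E_{x ~ Dir(a1)}[ Dir(x; a2) ] should equal exp(dir_dir_norm_const(a1, a2)).
a1_chk = np.array([2.0, 3.0, 4.0])
a2_chk = np.array([1.5, 2.5, 2.0])
samples = np.random.dirichlet(a1_chk, size=200000)
log_dir_a2 = sp.gammaln(a2_chk.sum()) - np.sum(sp.gammaln(a2_chk)) \
             + np.sum((a2_chk - 1) * np.log(samples), axis=1)
mc_log = log_sum_exp(log_dir_a2) - np.log(len(samples))
print(mc_log, dir_dir_norm_const(a1_chk, a2_chk))  # should be close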
Multinomial Density as a Dirichlet Potential¶
We can write the Multinomial distribution as a Dirichlet potential as follows:
\begin{align*} \mathcal{M}(x; \pi) &= \exp \left( \log \Gamma \left (1 + \sum_{k=1}^K x_k \right) - \sum_{k=1}^K \log \Gamma (x_k +1) + \sum_{k=1}^K x_k \log \pi_k \right) \\ \mathcal{D}ir(\pi; x+1) &= \exp \left( \log \Gamma \left( \sum_{k=1}^K (x_k +1) \right) - \sum_{k=1}^K \log \Gamma(x_k + 1) + \sum_{k=1}^K (x_k) \log \pi_{k} \right) \end{align*}
then, we can write $\mathcal{M}(x; \pi)$ as the Dirichlet potential $(x+1, f(x))$ where $f(x)$ is the constant
\begin{align*} f(x) = \log \Gamma \left( 1 + \sum_{k=1}^K x_k \right) - \log \Gamma \left( \sum_{k=1}^K (x_k +1) \right) \end{align*}
def dir_mult_norm_const(alpha1,alpha2):
return sp.gammaln(np.sum(alpha1)+1) - np.sum(sp.gammaln(alpha1+1)) + \
sp.gammaln(np.sum(alpha2))- np.sum(sp.gammaln(alpha2)) +\
np.sum(sp.gammaln(alpha1+alpha2)) - sp.gammaln(np.sum(alpha1+alpha2))
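The constant computed by dir_mult_norm_const(x, a) is the log marginal likelihood $\log \int \mathcal{M}(x;\pi)\,\text{Dir}(\pi;a)\, d\pi$ of the Dirichlet-Multinomial model, which is the quantity added to the components' constants in the update step later on. A Monte Carlo sanity check (illustrative; x_chk and a_chk are example values):
# Monte Carlo check of dir_mult_norm_const (illustrative):
# averaging Mult(x; pi) over pi ~ Dir(a) approximates the Dirichlet-Multinomial
# marginal likelihood, whose log should match dir_mult_norm_const(x, a).
x_chk = np.array([3, 1, 2])
a_chk = np.array([2.0, 3.0, 4.0])
pis = np.random.dirichlet(a_chk, size=200000)
log_mult = sp.gammaln(x_chk.sum() + 1) - np.sum(sp.gammaln(x_chk + 1)) \
           + np.sum(x_chk * np.log(pis), axis=1)
mc_log = log_sum_exp(log_mult) - np.log(len(pis))
print(mc_log, dir_mult_norm_const(x_chk, a_chk))  # should be close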
Calculation of $p(r_t)$ and $p(\pi_t)$ from a Mixture of Dirichlet Potentials¶
To track the density of $\pi_t$ and the probability of a change, we need to calculate $p(r_t)$ and $p(\pi_t)$. If we express a forward message as the summation of components with the following notation:
\begin{align*} \alpha_{t|t}(r_t, \pi_t) &= \sum_m \alpha_{t|t}^m(r_t, \pi_t) \end{align*}
where the superscript $m$ indexes the components, then the expectation of the filtered $\pi_t$ becomes
\begin{align*} \langle p(\pi_t, x_{1:t}) \rangle &= \langle \sum_{r_t} p(\pi_t, r_t, x_{1:t}) \rangle \\ &= \langle \sum_m \alpha_{t|t}^m(r_t, \pi_t) \rangle \\ &= \langle \sum_m \text{Dir}(\pi_t^m;a_t^m) c_t^m \rangle \\ &= \sum_m c_t^m \langle \text{Dir}(\pi_t^m;a_t^m) \rangle \\ \end{align*}
noting that if $\pi \sim \mathcal{D}ir(\pi;a)$, then $E[\pi] = \frac{a}{\sum_j a_j}$, $a_j$ denoting the $j$'th element of the vector $a$.
Similarly, the posterior change point probabilities correspond to sums of normalizing constants:
\begin{align} p(r_t, x_{1:t}) &= \int_{\pi_t} p(\pi_t, r_t, x_{1:t}) \\ &= \int_{\pi_t} \sum_m \alpha_{t|t}^m(r_t, \pi_t) \\ &= \int_{\pi_t} \sum_m \text{Dir}(\pi_t^m;a_t^m) c_t^m \\ &= \sum_m c_t^m \int_{\pi_t} \text{Dir}(\pi_t^m;a_t^m) \\ &= \sum_m c_t^m \end{align}
def multiply(comp1, comp2):
norm_const = comp1.c + comp2.c + dir_dir_norm_const(comp1.alpha, comp2.alpha)
alpha = comp1.alpha + comp2.alpha - 1
return MixtureComponent(alpha, norm_const)
K = 5
a = MixtureComponent(np.ones(K),-sp.gammaln(K))
b = MixtureComponent(np.random.random(K),-71.2)
print(b.alpha)
print(b.c)
m = multiply(a,b)
print(m.alpha)
print(m.c)
[ 0.68853789 0.74759069 0.59645085 0.88967388 0.78955634] -71.2 [ 0.68853789 0.74759069 0.59645085 0.88967388 0.78955634] -71.2
Overall, a message class (representing an $\alpha$ or $\beta$ message) is implemented as below:
class Message:
def __init__(self):
self.components = []
def copy(self):
new_msg = Message()
new_components = []
for v in self.components:
new_components.append(v.copy())
new_msg.components = new_components
return new_msg
def eval_mean_and_cpp(self):
params = []
consts = []
for prt in self.components:
consts.append(prt.c)
params.append(prt.alpha/np.sum(prt.alpha))
params = np.array(params).T
consts = np.array(consts)
# re-weight components for normalization
tmp = np.exp(consts - np.max(consts))
norm_consts = tmp/tmp.sum()
mean = (params*norm_consts).sum(1)#*np.exp(max_const)
return mean, norm_consts[-1], consts
def log_lhood(self):
consts = []
for prt in self.components:
consts.append(prt.c)
consts = np.array(consts)
return safe_log_sum_exp(consts)
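A small usage sketch of the message data structure (illustrative; the component parameters below are arbitrary example values): for a message with two components whose log constants are $\log 0.7$ and $\log 0.3$, the mean is the weighted average of the component means, the change probability is the weight of the last component, and the log-likelihood is $\log(0.7+0.3)=0$.
# Quick usage sketch of Message / MixtureComponent (illustrative).
msg = Message()
msg.components.append(MixtureComponent(np.array([2.0, 1.0, 1.0]), np.log(0.7)))  # "no change" component
msg.components.append(MixtureComponent(np.array([1.0, 1.0, 3.0]), np.log(0.3)))  # "change" component (kept last)
mean, cpp, consts = msg.eval_mean_and_cpp()
print(mean)             # 0.7*[0.5, 0.25, 0.25] + 0.3*[0.2, 0.2, 0.6] = [0.41, 0.235, 0.355]
print(cpp)              # weight of the last component: 0.3
print(msg.log_lhood())  # log(0.7 + 0.3) = 0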
Forward-Backward Recursions¶
At each time $t$, $r_t$ can take 2 values, corresponding to whether there is a change ($r_t=1$) or not ($r_t=0$). Therefore, the state space may seem to grow exponentially with $t$. However, because there is only one reset parameter (the parameter of the new Dirichlet potential in case of a change), the state space in fact grows only linearly. The reasoning is given below.
First let's see that the first predicted and updated forward variables, which are $\alpha_{1|0}(r_1, \pi_1)$ and $\alpha_{1|1}(r_1, \pi_1)$, are nothing but Dirichlet potentials. If we denote
\begin{align} \alpha_{0|0}(r_0, \pi_0) = \mathcal{D}ir(\pi_0; a) \end{align}
Then, the alpha recursion starts as
\begin{align*} \alpha_{1|0}(r_1, \pi_1) & = \sum_{r_{0}} \int_{\pi_{0}} \! \alpha_{0|0}(\pi_{0}) p(r_1, \pi_1 | \pi_{0})\, \mathrm{d}\pi_{0} \\ & = \int_{\pi_{0}} \! p(\pi_{0}) p(r_1, \pi_1 | \pi_{0})\, \mathrm{d}\pi_{0} \end{align*}
Let's open up $\alpha_{1|0}(r_1, \pi_1)$ message as follows:
\begin{align*} \alpha_{1|0}(r_1, \pi_1) &= \alpha_{1|0}(r_1=1, \pi_1) \cup \alpha_{1|0}(r_1=0, \pi_1) \end{align*}
When we first examine the reset case, that is, $r_1=1$:
\begin{align*} \alpha_{1|0;1} & = \int_{\pi_{0}} \! \alpha_{0|0}(\pi_{0}) p(\pi_1 | r_1 = 1, \pi_{0}) p(r_1 = 1)\, \mathrm{d}\pi_{0} \\ & = \int_{\pi_{0}} \! p(\pi_{0}) p(\pi_{1}) p(r_1 = 1)\, \mathrm{d}\pi_{0} \\ & = p(\pi_{1}) p(r_1 = 1) \int_{\pi_{0}} \! p(\pi_{0})\, \mathrm{d}\pi_{0}\\ & = p(\pi_{1}) p(r_1 = 1)\\ & = (a,l_1) \end{align*}
where $p(\pi_{1}) = \mathcal{D}ir(\pi_1; a)$. This is because the model is reset and there is only a single hyperparameter used to generate the reset parameters. When $r_1=0$,
\begin{align*} \alpha_{1|0;0} & = \int_{\pi_{0}} \! \alpha_{0|0}(\pi_{0}) p(\pi_1 | r_1 = 0, \pi_{0}) p(r_1 = 0)\, \mathrm{d}\pi_{0} \\ & = \int_{\pi_{0}} \! p(\pi_{0}) \delta(\pi_{1} - \pi_{0}) p(r_1 = 0)\, \mathrm{d}\pi_{0} \\ & = p(r_1 = 0) \int_{\pi_{0}} \! p(\pi_{0}) \delta(\pi_{1} - \pi_{0})\, \mathrm{d}\pi_{0}\\ & = p(\pi_{1}) p(r_1 = 0)\\ & = (a,l_0) \end{align*}
where $p(\pi_{1}) = \mathcal{D}ir(\pi_1; a)$; this is because the model is not reset and $p(\pi_{0}) = \mathcal{D}ir(\pi_0; a)$. The crucial observation here is that, at $t=1$, the parameter of the Dirichlet distribution is the same whether or not the model is reset. Therefore,
\begin{align*} \alpha_{1|0}(r_1, \pi_1) &= \alpha_{1|0}(r_1=1, \pi_1) \cup \alpha_{1|0}(r_1=0, \pi_1) \\ &= p(\pi_{1}) p(r_1 = 1) \cup p(\pi_{1}) p(r_1 = 0)\\ &= (a,l_1) \cup (a,l_0) \end{align*}
The next step is to apply update equations. In update step, we have the multiplication of a Multinomial density with a Dirichlet potential:
\begin{align*} \alpha_{1|1}(r_1, \pi_1) &= \alpha_{1|0}(r_1, \pi_1) p(x_1|\pi_1) \\ &= \mathcal{D}ir(\pi_1; a) \mathcal{M}(x_1; \pi_1) \\ &= (a+x_1,\mathcal{C}) \\ \end{align*}
where $\mathcal{C}$ is the normalizing constant derived above for the product of a Dirichlet potential and a Multinomial density.
When we calculate $\alpha_{2|1}(r_2, \pi_2)$, we again consider the cases $r_2=0$ and $r_2=1$ separately. When $r_2=0$, the parameter of the Dirichlet distribution stays the same, equal to $a+x_1$ (only the normalizing constant is updated). When the model is reset ($r_2=1$), however, the parameter becomes $a$. Therefore, $\alpha_{2|1}(r_2, \pi_2)$ is a mixture of two Dirichlet potentials. In the update step, the only thing to do is to update the parameters of each component with the observation at $t=2$.
The calculation of $\alpha_{3|2}(r_3, \pi_3)$ is similar: in case of a change, all mixture components (propagated from $\alpha_{2|2}(r_2, \pi_2)$) are reset. Because there is only one reset parameter, we can sum these components up and end up with a single Dirichlet potential. In the case where there is no change, we just update the normalizing constants of the two Dirichlet potentials propagated from $\alpha_{2|2}(r_2, \pi_2)$. So at $t=3$ the updated forward message has 3 components. Generalizing this logic, we see that the number of components (and hence the state space) grows only linearly with $t$.
Implementation of the Model¶
class DirichletCPModel:
def __init__(self, _c, _a):
self.a = _a # default Dir hyperparam
self.K = len(_a) # input dimension
self.c = _c # prob. of change
self.log_p1 = np.log(self.c) # log prob. of change
self.log_p0 = np.log(1-self.c) # log prob. no change
def init_forward_msg(self):
no_change_component = MixtureComponent(self.a, self.log_p0)
change_component = MixtureComponent(self.a, self.log_p1)
msg = Message()
msg.components.append(no_change_component)
msg.components.append(change_component)
return msg
def init_backward_msg(self):
component = MixtureComponent(np.ones(self.K), -sp.gammaln(self.K)) # initial component
msg = Message()
msg.components.append(component)
msg.mean = np.zeros(self.K)
return msg
def multiply(self, a_upd, b_post):
no_change_norm_consts = np.array([])
change_norm_consts = np.array([])
smt_msg = Message()
for comp_a in a_upd.components[:-1]:
for comp_b in b_post.components:
new_comp = multiply(comp_a, comp_b)
                # new_comp.c += self.log_p0  # (commented out: the prior p(r_t) is already included in the forward message)
no_change_norm_consts = np.hstack((no_change_norm_consts, new_comp.c))
smt_msg.components.append(new_comp)
for comp_b in b_post.components:
new_comp = multiply(a_upd.components[-1], comp_b)
            # new_comp.c += self.log_p1  # (commented out: the prior p(r_t) is already included in the forward message)
change_norm_consts = np.hstack((change_norm_consts, new_comp.c))
smt_msg.components.append(new_comp)
[mean,_,_] = smt_msg.eval_mean_and_cpp()
log_p_no_change = safe_log_sum_exp(no_change_norm_consts)
log_p_change = safe_log_sum_exp(change_norm_consts)
mx = np.maximum(log_p_no_change,log_p_change)
cpp = np.exp( log_p_change - (mx + np.log(np.exp(log_p_change-mx)+np.exp(log_p_no_change-mx)) ) )
return mean, cpp, smt_msg
# above line returns np.exp(log_p_change) / ( np.exp(log_p_no_change) + np.exp(log_p_change) )
def predict(self,msg):
consts = []
for prt in msg.components:
consts.append(prt.c)
consts = np.array(consts)
max_const = np.max(consts)
running_sum = 0 # norm. const of the new message
for comp in msg.components:
running_sum += np.exp(comp.c - max_const)
comp.c += self.log_p0
new_prt = MixtureComponent(self.a, self.log_p1 + max_const + np.log(running_sum))
msg.components.append(new_prt)
def postdict(self,msg):
consts = []
reset = MixtureComponent(self.a, -sp.gammaln(self.K)) # placeholder for postdict.
for comp in msg.components:
tmp = multiply(reset, comp)
consts.append(tmp.c)
comp.c += self.log_p0
new_comp = MixtureComponent(np.ones(self.K), self.log_p1+safe_log_sum_exp(np.array(consts)))
msg.components.append(new_comp)
def update(self, msg, data):
for prt in msg.components:
change = dir_mult_norm_const(data,prt.alpha)
prt.alpha += data
prt.c += change
def print_vars(self,message):
for prt in message.components:
print("constant={:f}, vector:{}".format(prt.c,prt.alpha))
@staticmethod
def gen_sequence(c, a, T=100):
# T (duration)
# K (dimensionality)
# c (prior of change point prob.)
# a (reset parameter)
K = len(a)
cps = np.random.binomial(1,c,size=T)
actual_changepoints = np.where(cps==1)[0]
pi_0 = np.random.dirichlet(a)
pi_ = np.zeros((K,T))
data = np.zeros((K,T))
for t in range(T):
if cps[t]:
pi_[:,t] = np.random.dirichlet(a)
elif t==0:
pi_[:,t] = pi_0
else:
pi_[:,t] = pi_[:,t-1]
data[:,t] = np.random.multinomial(np.random.randint(10,50),pi_[:,t])
return [cps, pi_, data]
Implementation of the Forward-Backward Algorithm¶
class ForwardBackward:
def __init__(self, model, L=0, max_components=100):
self.K = len(model.a) # dimension of the data
self.model = model # change point model instance
self.L = L # lag of fixed lag of smoothing
self.max_components = max_components # upper limit on the number of components stored
def forward(self,data):
T = data.shape[1]; # number of observations
alpha_predict = [] # alpha_{t|t-1} messages
alpha_update = [] # alpha_{t|t} messages
p_change = np.zeros(T) # zero-indexing
mean = np.zeros((self.K,T)) # zero-indexing
for t in range(T):
# predict
if t==0:
init_msg = self.model.init_forward_msg()
alpha_predict.append(init_msg)
else:
alpha_predict.append(alpha_update[-1].copy())
self.predict(alpha_predict[-1])
# update
alpha_update.append(alpha_predict[-1].copy())
self.update(alpha_update[-1],data[:,t])
# posterior calculation
[mean_vec, cpp, _] = alpha_update[-1].eval_mean_and_cpp()
p_change[t] = cpp
mean[:,t] = mean_vec
            # fixed-lag smoothing (note: a fixed_lag_smoothing method is not implemented in this report; all examples use L=0, so this branch is never taken)
if self.L > 0 and t > self.L:
self.fixed_lag_smoothing(t,p_change,mean)
# pruning
self.prun(alpha_update[-1])
return [alpha_predict, alpha_update, p_change, mean]
def backward(self,data):
T = data.shape[1]; # number of observations
beta_postdict = [] # beta_{t|t+1} messages
beta_update = [] # beta_{t|t} messages
p_change = np.zeros(T+1) # zero-indexing
mean = np.zeros((self.K,T)) # zero-indexing
for t in range(T-1,-1,-1):
# postdict
if t==T-1:
init_msg = self.model.init_backward_msg()
beta_postdict.append(init_msg)
else:
beta_postdict.append(beta_update[-1].copy())
self.postdict(beta_postdict[-1])
# update
beta_update.append(beta_postdict[-1].copy())
self.update(beta_update[-1],data[:,t])
# posterior calculation
[mean_vec, cpp, _] = beta_update[-1].eval_mean_and_cpp()
p_change[t+1] = cpp
mean[:,t] = mean_vec
# pruning
self.prun(beta_update[-1])
return [beta_postdict, beta_update, p_change, mean]
def smoothing(self,data):
[alpha_predict, alpha_update, _, _] = self.forward(data)
[beta_postdict, beta_update, _, _] = self.backward(data)
T = data.shape[1];
mean = np.zeros((self.K,T))
p_change = np.zeros(T)
smt_msgs = []
for t in range(0,T,1):
mean_vec, cpp, smt_msg = self.model.multiply(alpha_update[t],beta_postdict[T-1-t])
mean[:,t] = mean_vec
p_change[t] = cpp
smt_msgs.append(smt_msg)
return [mean, p_change, smt_msgs]
@staticmethod
def loglhood(c, a, data, L=0):
model = DirichletCPModel(c,a)
fb = ForwardBackward(model, L=0, max_components=1000)
[_, alpha_update, _, _] = fb.forward(data)
norm_consts = [cmp.c for cmp in alpha_update[-1].components]
return safe_log_sum_exp(norm_consts)
def predict(self,msg):
self.model.predict(msg)
def postdict(self,msg):
self.model.postdict(msg)
def update(self, msg, data):
self.model.update(msg,data)
def prun(self,msg):
if len(msg.components) > self.max_components:
no_ch_components = msg.components[:-1]
min_part = min(no_ch_components,key=attrgetter('c'))
msg.components.remove(min_part)
T = 50
K = 10
[cps, pi_, data] = DirichletCPModel.gen_sequence(0.05, np.ones(K), T=T)
save_array("/tmp/data.txt",data)
model = DirichletCPModel(0.1 ,np.ones(K))
fb = ForwardBackward(model,L=0)
[alpha_predict, alpha_update, p_change_f, mean_f] = fb.forward(data)
tmp = []
for comp in alpha_update[10].components:
tmp.append(comp.c)
print(safe_log_sum_exp(np.array(tmp)))
[beta_postdict, beta_update, p_change_b, mean_b] = fb.backward(data)
model_output_forward = [i for i in range(T) if p_change_f[i]>0.5]
model_output_backward = [i for i in range(T) if p_change_b[i]>0.5]
print("actual_changepoints:\t {}".format(np.where(cps==1)[0]))
print("model_output_forward:\t {}".format(model_output_forward))
print("model_output_backward:\t {}".format(model_output_backward))
plot_hist(data,cps,title="Data generated using the above generative model", xlabel="Time",ylabel="Data Vectors")
plot_hist(pi_,cps,title="Params of Multinomial Distribution (pi_t)", xlabel="Time",ylabel="Parameter Vectors")
plot_hist(mean_f,p_change_f,title="Filtered Density on Forward Direction", xlabel="Time",ylabel="Density")
plot_hist(mean_b,p_change_b,title="Filtered Density on Backward Direction", xlabel="Time",ylabel="Density")
[mean, p_change, smt_msgs] = fb.smoothing(data)
print(mean[0,:])
plot_hist(mean,p_change,title="Smoothed Density on Full Observation", xlabel="Time",ylabel="Density")
[-153.00450549] actual_changepoints: [] model_output_forward: [] model_output_backward: [] [ 0.1183844 0.11838368 0.11838368 0.11838366 0.11838366 0.11838366 0.11838366 0.11838366 0.11838366 0.11838366 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838365 0.11838366 0.11838366 0.11838367 0.1183838 0.11836577]
Parameter Learning¶
The EM Algorithm¶
[See http://www.cs.princeton.edu/courses/archive/spr08/cos424/scribe_notes/0311.pdf for a more detailed sketch of this derivation.]
The EM algorithm is a general method that is widely used to find the parameters of a state space model that maximize the likelihood. If $\theta$, $z$ and $x$ denote the model parameters, latent variables and observations, respectively, our goal is to maximize the log-likelihood $\mathcal{L}(\theta)$:
\begin{align} \hat{\theta} &= \underset{\theta}{\text{argmax}}\mathcal{L}(\theta)\\ &=\underset{\theta}{\text{argmax}} \log p(x|\theta) \\ &= \underset{\theta}{\text{argmax}} \log \sum_{z} \underbrace{p(z|\theta) p(x|z,\theta)}_{\text{complete data likelihood}} \end{align}
The expression on the right is hard to evaluate, as the summation over the latent variables is almost always intractable. For this reason, the classical EM algorithm instead maximizes a lower bound on the likelihood. Let's first write the log-likelihood:
\begin{align} \log p(x|\theta) &= \log \sum_{z} p(x,z|\theta) \\ &= \log \sum_{z} p(x,z|\theta) \frac{q(z)}{q(z)} \\ &= \log E \left[ \frac{ p(x,z|\theta)}{q(z)} \right]_{q(z)} \\ &\geq \underbrace{E \left[ \log \frac{p(x,z|\theta)}{q(z)} \right]_{q(z)}}_{\text{free energy term}} \\ &= E \left[ \log p(z|\theta) + \log p(x|z,\theta) - \log q(z) \right]_{q(z)} \\ \mathcal{L}(\theta, q(z)) &= \underbrace{E \left[ \log p(z|\theta) \right]_{q(z)} + E \left[ \log p(x|z,\theta) \right]_{q(z)}}_{\text{energy}} - \underbrace{E \left[ \log q(z) \right]_{q(z)}}_{\text{entropy}} \\ \end{align}
Here, $q(z)$ is an arbitrary distribution on $z$. Also note that in the fourth line we used Jensen's inequality, which states that $f(E[x]) \geq E[f(x)]$ when $f$ is a concave function (such as $\log$). In each iteration of the algorithm, $q(z)$ and $\theta$ are updated to maximize the current value of $\mathcal{L}(\theta, q(z))$. The algorithm terminates when the likelihood converges.
E-Step¶
In classical EM, the recipe for the E-step is fixed. At the $t$'th iteration, \begin{equation} q(z)^{(t)} = p(z|x,\theta^{(t)}) \end{equation}
This choice makes the bound tight:
\begin{align} \mathcal{L}(p(z|x,\theta), \theta) &= E \left[ \log p(z|\theta) + \log p(x|z,\theta) - \log p(z|x,\theta) \right]_{p(z|x,\theta)} \\ &= \sum_{z} p(z|x,\theta) \log \frac{p(x,z|\theta)}{p(z|x,\theta)} \\ &= \sum_{z} p(z|x,\theta) \log p(x|\theta) \\ &= \log p(x|\theta) \sum_{z} p(z|x,\theta) \\ &= \log p(x|\theta) \end{align}
From now on, we will replace $\mathcal{L}(\theta, q(z))$ with $\mathcal{L}(\theta, \theta^{(t)})$, as $q(z)$ is a distribution conditioned on $\theta^{(t)}$.
M-Step¶
The M-step finds the model parameters $\theta$ that maximize the expected complete data log-likelihood: \begin{equation} \theta^{(t+1)} = \underset{\theta}{\text{argmax}}\, \mathcal{L}(\theta, \theta^{(t)}) \end{equation}
Relation to KL Divergence¶
When we re-write the free energy term, \begin{align} \mathcal{L}(\theta, q(z)) &= E \left[ \log \frac{p(x,z|\theta)}{q(z)} \right]_{q(z)} \\ &= E \left[ \log \frac{p(x|\theta)p(z|x,\theta)}{q(z)} \right]_{q(z)} \\ &= E \left[ \log p(x|\theta) \right]_{q(z)} + E\left[ \log \frac{p(z|x,\theta)}{q(z)} \right]_{q(z)} \\ &= \mathcal{L}(\theta) - \mathbf{KL}[q(z)||p(z|x,\theta)] \end{align}
Thus, maximizing the lower bound corresponds to minimizing KL divergence between the variational distribution and the posterior.
EM Applied to CPM¶
In the model, we have two parameters $\theta = (c,a)$ to learn: the prior probability of a change, $p(r=1)=c$ and the reset parameter $a$.
Let's first write down the log of complete data likelihood:
\begin{align} \log p(r_{1:T},\pi_{0:T}, x_{1:T} | \theta) &= \log p(r_{1:T} | \theta) + \log p(\pi_{0:T}|r_{1:T}, \theta) + \log p(x_{1:T}|r_{1:T},\pi_{0:T}, \theta) \\ &= \left( \sum_{t=1}^T \log p(r_t|\theta) \right) + \left( \log p(\pi_0|\theta) + \sum_{t=1}^T \log p(\pi_t|\pi_{t-1}, r_t, \theta) \right) + \left( \sum_{t=1}^T \log p(x_t|\pi_t) \right) \\ &= \left( \sum_{t=1}^T \log \left( c^{[r_t=1]} (1-c)^{[r_t=0]} \right) \right) + \left( \log \text{Dir}(\pi_0;a) + \sum_{t=1}^T \log \left( \delta(\pi_t-\pi_{t-1})^{[r_t=0]} \text{Dir}(\pi_t;a)^{[r_t=1]} \right) \right) + \left( \sum_{t=1}^T \log \text{Mult}(x_t; \pi_t) \right) \\ &= \left( \sum_{t=1}^T [r_t=1]\log c + [r_t=0]\log(1-c) \right) + \left( \log \text{Dir}(\pi_0;a) + \sum_{t=1}^T [r_t=0] \log \delta(\pi_t-\pi_{t-1}) + [r_t=1]\log \text{Dir}(\pi_t;a) \right) + \left( \sum_{t=1}^T \log \text{Mult}(x_t; \pi_t) \right) \\ \end{align}
Then, we calculate expected complete data log-likelihood:
\begin{align} &\mathrm{E} \left[ \log p(r_{1:T},\pi_{0:T}, x_{1:T} | \theta) \right]_{p(r_{1:T},\pi_{0:T}|x_{1:T},\theta^{(t)})} \\ &= \mathrm{E} \left[ \left( \sum_{t=1}^T [r_t=1]\log c + [r_t=0]\log(1-c) \right) + \left( \log \text{Dir}(\pi_0;a) + \sum_{t=1}^T [r_t=0] \log \delta(\pi_t-\pi_{t-1}) + [r_t=1]\log \text{Dir}(\pi_t;a) \right) + \left( \sum_{t=1}^T \log \text{Mult}(x_t; \pi_t) \right) \right]_{p(r_{1:T},\pi_{0:T}|x_{1:T},\theta^{(t)})} \\ \end{align}
Now we consider each summation in the expectation separately. Note that the last summation is omitted, as it does not involve any model parameter to be optimized.
Here is the first term:
\begin{align} \mathrm{E} \left[ \sum_{t=1}^T [r_t=1]\log c + [r_t=0]\log(1-c) \right]_{p(r_{1:T},\pi_{0:T}|x_{1:T},\theta^{(t)})} &= \sum_{t=1}^T \mathrm{E} \left[ [r_t=1]\log c + [r_t=0]\log(1-c) \right]_{p(r_{1:T},\pi_{0:T}|x_{1:T},\theta^{(t)})} \\ &= \sum_{t=1}^T \mathrm{E} \left[ [r_t=1]\log c + [r_t=0]\log(1-c) \right]_{p(r_t|x_{1:T},\theta^{(t)})} \\ &= \sum_{t=1}^T p(r_t=1|x_{1:T},\theta^{(t)})\log c + p(r_t=0|x_{1:T},\theta^{(t)})\log(1-c) \\ &= \log c \sum_{t=1}^T p(r_t=1|x_{1:T},\theta^{(t)})+ \log (1-c) \sum_{t=1}^T p(r_t=0|x_{1:T},\theta^{(t)}) \\ \end{align}
We then take the derivative with respect to $c$ and set it to zero:
\begin{align} \frac{\partial \left( \log c \sum_{t=1}^T p(r_t=1|x_{1:T},\theta^{(t)})+ \log (1-c) \sum_{t=1}^T p(r_t=0|x_{1:T},\theta^{(t)}) \right) }{\partial c} &= 0 \\ (1-c)\sum_{t=1}^T p(r_t=1|x_{1:T},\theta^{(t)}) - c \sum_{t=1}^T p(r_t=0|x_{1:T},\theta^{(t)}) &= 0 \\ c = \frac{\sum_{t=1}^T p(r_t=1|x_{1:T},\theta^{(t)})}{T} \end{align}
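In code, this M-step update is simply the average of the smoothed change probabilities. A minimal sketch (assuming a hypothetical array cpp_example standing in for $p(r_t=1|x_{1:T},\theta^{(t)})$, which the smoother below provides):
# M-step update for c (illustrative): the mean of the smoothed change probabilities.
cpp_example = np.array([0.02, 0.01, 0.90, 0.03, 0.05])  # hypothetical p(r_t=1 | x_{1:T}) values
c_new = np.sum(cpp_example) / len(cpp_example)
print(c_new)  # 0.202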
[http://www.msr-waypoint.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf]
Concentrating on the second term, we have (in the third line below, marked $=^+$, the $\delta$ term is dropped since it does not depend on the parameter $a$)
\begin{align} \mathrm{E} \left[ \log p(\pi_{0:T}|r_{1:T}, \theta) \right]_{p(r_{1:T},\pi_{0:T}|x_{1:T},\theta^{(t)})} &= \mathrm{E} \left[ \log \text{Dir}(\pi_0;a) + \sum_{t=1}^T [r_t=0] \log \delta(\pi_t-\pi_{t-1}) + [r_t=1]\log \text{Dir}(\pi_t;a) \right]_{p(r_{1:T},\pi_{0:T}|x_{1:T},\theta^{(t)})} \\ &= \mathrm{E} \left[ \log \text{Dir}(\pi_0;a) \right]_{p(r_{1:T},\pi_{0:T}|x_{1:T},\theta^{(t)})} + \mathrm{E} \left[ \sum_{t=1}^T [r_t=0] \log \delta(\pi_t-\pi_{t-1}) \right]_{p(r_{1:T},\pi_{0:T}|x_{1:T},\theta^{(t)})} + \mathrm{E} \left[ \sum_{t=1}^T [r_t=1]\log \text{Dir}(\pi_t;a) \right]_{p(r_{1:T},\pi_{0:T}|x_{1:T},\theta^{(t)})} \\ &=^+ \mathrm{E} \left[ \log \text{Dir}(\pi_0;a) \right]_{p(\pi_{0}|x_{1:T},\theta^{(t)})} + \sum_{t=1}^T \mathrm{E} \left[ [r_t=1]\log \text{Dir}(\pi_t;a) \right]_{p(r_{1:T},\pi_{0:T}|x_{1:T},\theta^{(t)})} \\ &= \mathrm{E} \left[ \log \text{Dir}(\pi_0;a) \right]_{p(\pi_{0}|x_{1:T},\theta^{(t)})} + \sum_{t=1}^T p(r_t=1|x_{1:T},\theta^{(t)}) \mathrm{E} \left[ \log \text{Dir}(\pi_t;a) \right]_{p(\pi_t|x_{1:T},\theta^{(t)})} \end{align}
To be more compact, let's define $p(r_0=1|x_{1:T},\theta^{(t)}) := 1$ and write the last line above as
\begin{equation} = \sum_{t=0}^T p(r_t=1|x_{1:T},\theta^{(t)}) \mathrm{E} \left[ \log \text{Dir}(\pi_t;a) \right]_{p(\pi_t|x_{1:T},\theta^{(t)})} \end{equation}
Let's also define $\pi_{t,k}$ as the $k$'th component of a vector $\pi_t \in \mathbf{R}^K$. In our context, $p(\pi_t|x_{1:T},\theta)$ corresponds to the posterior probability computed in the E-step.
Now, take the derivative of the above expression with respect to $a_k$, the $k$'th component of the vector $a \in \mathbf{R}^K$:
\begin{align} \frac{\partial}{\partial a_{k}} \left( \mathrm{E} \left[ \log p(\pi_{0:T}|r_{1:T}, \theta) \right]_{p(r_{1:T},\pi_{0:T}|x_{1:T},\theta^{(t)})}\right) &= \frac{\partial} {\partial a_{k}} \left\{ \sum_{t=0} p(r_t=1|x_{1:T},\theta^{(t)}) \mathrm{E} \left[ \log \text{Dir}(\pi_t;a) \right]_{p(\pi_t|x_{1:T},\theta^{(t)})} \right\} \\ &= \frac{\partial}{\partial a_{k}} \left\{ \sum_t p(r_t=1|x_{1:T},\theta^{(t)}) \mathrm{E} \left[ \sum_k (a_k-1) \log \pi_{t,k} + \log \Gamma \left( \sum_k a_k \right) - \sum_k \log \Gamma(a_k) \right]_{p(\pi_t|x_{1:T},\theta^{(t)})} \right\} \\ &= \frac{\partial}{\partial a_{k}} \left\{ \sum_t p(r_t=1|x_{1:T},\theta^{(t)}) \left( \int_{\pi_t} p(\pi_t|x_{1:T},\theta^{(t)}) \sum_k (a_k-1) \log \pi_{t,k} \text{d} \pi_t + \log \Gamma \left( \sum_k a_k \right) - \sum_k\log \Gamma(a_k) \right) \right\}\\ &= \sum_t p(r_t=1|x_{1:T},\theta^{(t)}) \left\{ \int_{\pi_t} p(\pi_t|x_{1:T},\theta^{(t)}) \frac{\partial \left( \sum_k (a_k-1) \log \pi_{t,k} \right)}{\partial a_{k}} \text{d} \pi_t + \frac{\partial \log \Gamma \left( \sum_k a_k \right)}{\partial a_{k}} - \frac{\partial \sum_k\log \Gamma(a_k)}{\partial a_{k}} \right\} \\ 0 &= \sum_t p(r_t=1|x_{1:T},\theta^{(t)}) \left\{ \int_{\pi_t} p(\pi_t|x_{1:T},\theta^{(t)}) \log \pi_{t,k} \text{d} \pi_t + \psi \left( \sum_{k=1}^K a_k \right) - \psi (a_k) \right\} \\ \sum_t p(r_t=1|x_{1:T},\theta^{(t)}) \int_{\pi_t} p(\pi_t|x_{1:T},\theta^{(t)}) \log \pi_{t,k} \text{d} \pi_t &= \sum_t p(r_t=1|x_{1:T},\theta^{(t)}) \left( \psi (a_k) - \psi \left( \sum_{k=1}^K a_k \right) \right) \\ \sum_t p(r_t=1|x_{1:T},\theta^{(t)}) \mathrm{E} [\log \pi_{t,k}]_{p(\pi_t|x_{1:T},\theta^{(t)})} &= \left( \psi (a_k) - \psi \left( \sum_{k=1}^K a_k \right) \right) \sum_t p(r_t=1|x_{1:T},\theta^{(t)}) \end{align}
If we set $\mathbf{C} = \sum_t p(r_t=1|x_{1:T},\theta^{(t)})$, the above expression becomes
\begin{equation} \frac{1}{\mathbf{C}} \sum_t p(r_t=1|x_{1:T},\theta^{(t)}) \mathrm{E} [\log \pi_{t,k}]_{p(\pi_t|x_{1:T},\theta^{(t)})} = \psi (a_k) - \psi \left( \sum_{k=1}^K a_k \right) \end{equation}
Calculating $\mathrm{E} [\log \pi_{t,k}]_{p(\pi_t|x_{1:T},\theta^{(t)})}$¶
Above, we need to compute the expectation within the integral, $\mathrm{E} [\log \pi_{t,k}]_{p(\pi_t|x_{1:T},\theta^{(t)})}$. As the posterior is a mixture of Dirichlet potentials, the computation is a little tricky. Say $p(\pi_t|x_{1:T},\theta^{(t)})$ is made up of $M$ potentials, indexed by the superscript $m$. (Remember, a Dirichlet potential is defined by a tuple $(a, c)$. Here, we use $c^m$ for the normalization constant and $a^m$ for the Dirichlet parameter of the $m$'th component.)
\begin{align} p(\pi_t|x_{1:T},\theta^{(t)}) = \frac{1}{\mathbf{Z}} \sum_{m=1}^M c^m \text{Dir}(\pi_t|a^m) \end{align}
Observe that $\mathbf{Z} = \sum_m c^m$ is needed for normalization.
\begin{align} \mathrm{E} [\log \pi_{t,k}]_{p(\pi_t|x_{1:T},\theta^{(t)})} &= \int_{\pi_t} p(\pi_t|x_{1:T},\theta^{(t)}) \log \pi_{t,k} \text{d} \pi_t \\ &= \int_{\pi_t} \frac{1}{\mathbf{Z}} \sum_{m=1}^M c^m \text{Dir}(\pi_t|a^m) \log \pi_{t,k} \text{d} \pi_t \\ &= \frac{1}{\mathbf{Z}} \sum_{m=1}^M c^m \int_{\pi_t} \text{Dir}(\pi_t|a^m) \log \pi_{t,k} \text{d} \pi_t \\ &= \frac{1}{\mathbf{Z}} \sum_{m=1}^M c^m \mathrm{E} [\log \pi_{t,k}]_{p(\pi_t|a^m)} \\ &= \frac{1}{\mathbf{Z}} \sum_{m=1}^M c^m \left( \psi(a^m_k) - \psi(a^m_0) \right), \qquad\qquad a^m_0 = \sum_{k} a^m_k \end{align}
The closed-form formula for the expectation in the last step is from https://www.hiit.fi/u/ahonkela/dippa/node95.html.
'''
computes the expectation of log of each element of a Dirichlet random variable
with respect to a mixture of Dirichlet potentials
- msg is a message, made up of components
'''
def compute_ss(msg):
K = len(msg.components[0].alpha)
M = len(msg.components)
tmp = np.zeros((K, M)) # K-by-M matrix
log_norm_consts = np.zeros(M) # log constants, as many as components
for i in range(M):
comp = msg.components[i]
log_norm_consts[i] = comp.c
tmp[:,i] = sp.digamma(comp.alpha) - sp.digamma(comp.alpha.sum())
# numerically stable computation
log_max_norm_const = np.max(log_norm_consts)
norm_consts = np.exp(log_norm_consts-log_max_norm_const)
return (tmp*norm_consts).sum(1) / np.sum(norm_consts)
Inverse Digamma Function¶
We also need to find an update equation for $a_k$. Luckily, Minka's paper derives such an equation for estimating Dirichlet parameters. Applying the same idea, we arrive at the iterative procedure:
\begin{align} \psi (a_k^{\text{new}}) = \frac{1}{\mathbf{C}} \sum_t p(r_t=1|x_{1:T},\theta^{(t)}) \mathrm{E} [\log \pi_{t,k}]_{p(\pi_t|x_{1:T},\theta^{(t)})} + \psi \left( \sum_{k=1}^K a_k^{\text{old}} \right) \end{align}
Finally, we need to invert the digamma function ($\psi$). A Newton iteration for inverting it is again given in Minka's paper. The method inv_digamma below performs the inversion very quickly, as it starts the iterations from good initial values. (See Appendix A for the maximum likelihood estimation of a Dirichlet distribution.)
# given y, returns x, where y = \psi(x)
def inv_digamma(y, eps_=1e-3):
# here is the good initial start
x = (y>-2.22)*(np.exp(y)+0.5) + (y<=-2.22)*(-1/(y-sp.digamma(1)))
# iterate until convergence
while np.sum( np.abs(sp.digamma(x)-y) > eps_ ) > 0:
x = x - (sp.digamma(x)-y)/sp.polygamma(1,x)
# check for a numerical issue
if pd.isnull(x).any():
raise ValueError("inv_digamma() output contains a nan value for the input:", y)
return x
a = np.array([0.1,1,2,3,4])
inv_a = inv_digamma(a)
print(sp.digamma(inv_a))
x = np.arange(-1e2,10,1e-1)
plt.plot(x,inv_digamma(x))
plt.title("inv_digamma")
plt.figure()
y = np.arange(1e-6,1e-4,1e-6)
plt.plot(y,sp.digamma(y))
plt.title("digamma")
plt.figure()
z = np.arange(0,1.5,1e-4)
plt.plot(z,sp.polygamma(1,z))
plt.title("polygamma")
[ 0.09954798 0.9999849 1.99999971 2.99999999 4. ]
Implementation of the EM Algorithm¶
###################################### DATA GENERATION ######################################
T = 50
K = 10
c_true = 0.08
a_true = np.ones(K)*1
[cps, pi_, data] = DirichletCPModel.gen_sequence(c_true, a_true, T=T)
data += 1
save_array("/tmp/data.txt",data)
print(np.sum(cps), "change points")
print("a:", a_true)
plot_hist(data,cps,title="Data generated using the above generative model", xlabel="Time",ylabel="Data Vectors")
###################################### EM ALGORITHM ######################################
# constants
INV_DIG_ITER = 1000
MAX_ITER = 15
EM_EPS = 1e-5
c = 0.2
a = np.random.random(K)*20
save_array("/tmp/alpha.txt", a)
fb = ForwardBackward(DirichletCPModel(c, a))
loglhoods = []
means = []
cpps = []
for i in range(MAX_ITER):
############ log-likelihood calculation ###########
ll = ForwardBackward.loglhood(c, a, data)
loglhoods.append(ll)
print("\nloglhood is", loglhoods[i])
# print("current a is", a)
################ check convergence ################
if i>0:
if loglhoods[i]-loglhoods[i-1]<0:
print("Likelihood decreased by " + str(loglhoods[i]-loglhoods[i-1]))
elif loglhoods[i]-loglhoods[i-1]<EM_EPS:
print("Converged")
break
############ E step and sufficient stats ############
[mean, cpp, smt_msgs] = fb.smoothing(data)
means.append(mean)
cpps.append(cpp)
# plot_hist(mean,cpp,title="Smoothed Density on Full Observation at time " + str(i), xlabel="Time",ylabel="Density")
E_log_pi = np.zeros((K,T))
for j in range(T):
E_log_pi[:,j] = compute_ss(smt_msgs[j])
E_log_pi_weighted = (E_log_pi*cpp)
ss = E_log_pi_weighted.sum(1) / np.sum(cpp)
print(ss)
###################### M step ######################
c = np.sum(cpp)/len(cpp)
for it in range(INV_DIG_ITER):
a = inv_digamma( ss + sp.digamma(a.sum()) )
fb = ForwardBackward(DirichletCPModel(c, a))
# results
print("a_true:", a_true)
print("c_true:", c_true)
print("a_est :",a)
print("c_est :",c)
plt.figure()
plt.plot(loglhoods)
plt.ylabel("Log-likelihood")
plt.xlabel("EM iteration #")
plt.show()
plot_hist(means[-1],cpps[-1],title="Smoothed Density on Full Observation at time " + str(i), xlabel="Time",ylabel="Density")
6 change points a: [ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] loglhood is [-939.07038346] [-2.47287431 -2.22034842 -2.78350253 -2.43092495 -2.23998384 -2.75133308 -2.28655976 -2.23407869 -2.30429278 -1.95167939] loglhood is [-846.67370545] [-2.50367676 -2.34336322 -2.92768113 -2.2679198 -2.31628241 -2.50298268 -2.20852926 -2.1526982 -2.34811096 -1.95487376] loglhood is [-877.21242905] Likelihood decreased by [-30.5387236] [-2.4832466 -2.36835061 -2.96090852 -2.2308746 -2.32765483 -2.44589215 -2.19168089 -2.15202555 -2.37651918 -1.95290847] loglhood is [-892.19764995] Likelihood decreased by [-14.9852209] [-2.46825501 -2.37638791 -2.9724551 -2.21662539 -2.32710888 -2.42291411 -2.18833482 -2.15385733 -2.39210794 -1.95233123] loglhood is [-899.37786465] Likelihood decreased by [-7.1802147] [-2.46072859 -2.37975257 -2.97679425 -2.2099541 -2.32526306 -2.4137207 -2.18645661 -2.15570309 -2.39966461 -1.95108414] loglhood is [-903.14247164] Likelihood decreased by [-3.76460699]
Appendix A¶
Estimating Dirichlet Distribution¶
Say we have $N$ random vectors $\{p^1, p^2, \dots, p^N\}$ drawn from a Dirichlet distribution with unknown parameter $a$, where each vector is $K$-dimensional with $p_k^n \geq 0$ and $\sum_k p_k^n=1$, for $n \in 1 \dots N$ and $k \in 1 \dots K$.
The unknown parameter $a$ can be estimated using EM-like updates (as noted in Minka's paper). Given an initial guess for $a$, each component of $a$ is estimated iteratively as follows:
\begin{align} \psi (a_k^{\text{new}}) = \mathrm{E} [\log \bar p_k] + \psi \left( \sum_{k=1}^K a_k^{\text{old}} \right) \end{align}
where $\log \bar p_k$ is the sufficient statistic
\begin{align} \log \bar p_k = \frac{1}{N} \sum_n \log p_k^n \end{align}
To calculate $a_k^{\text{new}}$, we need to invert the digamma function.
Below is the code to estimate Dirichlet parameters:
# method-2: https://www.hiit.fi/u/ahonkela/dippa/node95.html
K = 5
alpha = np.random.random(K)
N = 100000
# exact sufficient statistics
E_log_pi_exact = sp.digamma(alpha) - sp.digamma(alpha.sum())
# monte carlo
samples = np.random.dirichlet(alpha,N)
logsamples = np.log(samples)
E_log_pi_emp = logsamples.sum(0)/N
print("E_log_pi_exact:", E_log_pi_exact)
print("E_log_pi_empir:", E_log_pi_emp)
a = np.random.random(K)*25
a_old = np.zeros(K)
while np.sum( np.abs( a-alpha ) ) > 1e-3 and np.sum(np.abs(a-a_old))>1e-10:
a_old = a.copy()
a = inv_digamma( E_log_pi_emp + sp.digamma(a.sum()) )
print("alpha: ", alpha)
print("estimation:", a)
E_log_pi_exact: [-1.63156869 -4.89641487 -8.48324734 -1.7261939 -1.93025423] E_log_pi_empir: [-1.62652558 -4.87088593 -8.51355972 -1.72979668 -1.92054335] alpha: [ 0.91697848 0.26471757 0.13879428 0.86838363 0.77603482] estimation: [ 0.92298556 0.2667529 0.13834213 0.86962401 0.78261646]