*data and R code for this post are available on github*

In this post I will cover three ways to estimate parameters for regression models: least squares, gradient descent and Monte Carlo methods. The aim is to introduce methods widely used in machine learning, such as gradient descent and Monte Carlo, by linking them to a "use case" common across the different data science communities: linear regression.

Regression is one of the first prediction methods invented, and the three approaches I will be discussing are typically used by three different communities. Least squares is probably the most common method, mainly employed by frequentist statisticians and used as the default in many easy-to-use packages in R or Python (e.g. the *lm* function in R).

Gradient descent is the machine learning approach to the problem. My favorite resource on the topic is the famous machine learning course by Andrew Ng on Coursera. In machine learning, the overall approach to problem solving and prediction is rather different from more classical statistics, even though it relies heavily on statistics (check out this answer by Sebastian Raschka on the origins of machine learning; I think it makes my point clear). One of the reasons why Andrew Ng uses gradient descent in the first place, instead of other methods like least squares, is probably that he wants to stress the importance of the method in the machine learning community. By using gradient descent, he highlights how in machine learning it is often more important to approximate a solution with an iterative procedure that can efficiently explore the parameter space than to obtain an exact analytical solution.

Finally, there is the Bayesian way of doing things. Monte Carlo methods are powerful tools to explore the parameter space and obtain the full posterior distribution, instead of just point estimates. This requires a bit more introduction and explanation, and while I try to do so in the remainder of this blog post, it is certainly far from a comprehensive resource on Bayesian modeling. For those who want to dig deeper into regression and Bayesian approaches I would suggest reading Gelman's book on hierarchical modeling, one of my favorite resources in the field.

## dataset
As a dataset for this post I used some real life data I've collected using HRV4Training. In particular, I am using heart rate variability (HRV) and age data, since there is a known relation between the two, with HRV decreasing as we age. Thus, I am using as single predictor $X$ the user's age, and as outcome variable HRV, in particular rMSSD in milliseconds (ms). rMSSD is a time-domain feature representative of parasympathetic activity (not really relevant for this post). In this dataset, I have $n = 581$ samples. Pearson's correlation between age and HRV is $r = -0.23$. Note that this post is not about finding the best fit to this data, which might not be linear, but about showing different methods to estimate the linear model's parameters, with the aim of introducing important techniques widely used in machine learning.

## least squares regression

The linear regression model can be written in matrix notation as:

\[

y_i = X_i \beta + \epsilon_i

\]

where $y$ (the outcome variable) and $\epsilon$ are vectors of $n$ elements, $X$ is a matrix with $n$ rows (the samples) and $k$ columns (the predictors), and $\beta$ is a vector of length $k$, with one element per predictor.

The regression problem, starting from a set of predictors $X$ and an outcome variable $y$, consists in finding the $\hat{\beta}$ minimizing the sum of squared residuals. By residuals we mean $r_i = y_i - X_i \hat{\beta}$, which are different from the errors $\epsilon_i = y_i - X_i \beta$.

Basically, the sum of squared residuals is a way to quantify the accuracy of our prediction model. Since we have an outcome variable, we can determine how well our model (i.e. the parameters $\hat{\beta}$) predicts it from our predictors, by computing:

\[

sum\:of\:squares = \sum_{i=1}^n (y_i - X_i \hat{\beta})^2

\]

What we want to achieve, is to find the $\hat{\beta}$ that makes the sum of the values $(y_i - X_i \hat{\beta})^2$ the smallest possible given our data. The solution to this problem, in matrix notation, is the following:

\[

\hat{\beta} = (X^t X)^{-1}X^ty

\]

Here is the R code for the closed form solution as well as for the *lm* function:
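
The original script is on github; a minimal sketch of both approaches might look like the following (the data frame, its column names and the simulated data are my own stand-ins for the real dataset, used only for illustration):

```r
# Hypothetical stand-in for the real HRV4Training data:
# a data frame with an 'age' predictor and an 'HRV' outcome (rMSSD in ms)
set.seed(1)
df <- data.frame(age = round(runif(581, 18, 60)))
df$HRV <- 105.86 - 0.65 * df$age + rnorm(581, 0, 25)

# Closed form solution: beta_hat = (X'X)^-1 X'y
X <- cbind(1, df$age)   # add a column of ones for the intercept
y <- df$HRV
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat

# Same estimate using lm(), which adds the intercept automatically
fit <- lm(HRV ~ age, data = df)
coef(fit)
```

Both calls return the same intercept and slope, since *lm* solves the same least squares problem internally.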

The solution is, as expected, exactly the same for both methods (matrix form and the *lm* function), with intercept $105.86$ (supposedly the HRV of a newborn) and slope $-0.65$, i.e. according to this model we lose one *ms* in rMSSD every year. Note that *lm* takes care of adding the intercept for us, and we need to specify only the *age* predictor. Here are the data and the regression line:

So, if least squares gives us an exact solution, why bother with other methods?

First of all, problems are typically not this simple. For most non-linear regression problems, there is no closed form solution. Also, the closed form solution scales poorly as the data grows, especially in the number of parameters, since it requires inverting $X^t X$. So, in general, it is better to have a simple approach that can solve optimization problems in a more generic way.

Secondly, **gradient descent** is not only an alternative to least squares for parameter estimation; most importantly, it is one of the algorithms **at the foundations of machine learning**. As such, gradient descent is an important algorithm to learn, one that recurs many times in other methods beyond linear regression (e.g. neural networks).

Finally, when it comes to Monte Carlo methods, an entire new world opens up due to the flexibility of using **Bayesian modeling and probabilistic graphical models** to model relations between variables and perform inference on the parameters of interest. However, many of these models are analytically intractable, and here is where Monte Carlo methods come into play. For such complex models we typically do not write our own samplers (i.e. our own tools to estimate the parameters), but rely on tools such as JAGS in R or PyMC in Python. However, it can be useful to understand how Monte Carlo methods work in the context of a simple linear regression example.

## gradient descent

Following the machine learning convention, we define the *hypothesis* for univariate linear regression as:

\[

h_{\theta}(x) = \theta_0 + \theta_1 x

\]

Here our $\beta$ parameters have been replaced by $\theta$, but the problem is exactly the same as before. Even the idea behind our gradient descent algorithm is basically the same, i.e. to minimize the sum of squared errors. This time, we call the function to minimize the *cost function*:

\[

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)} )^2

\]

Why do we talk about a hypothesis and a cost function this time? Here we think about the problem differently. We make a hypothesis about the parameter values, assigning values to $\theta_0$ and $\theta_1$, and then evaluate how good the hypothesis is by computing the *cost function*. Then, we optimize the parameter values iteratively. This optimization is what the gradient descent algorithm does.

Why do we need it? Simply put, we need an efficient way to explore the parameter space, especially when we have many parameters, and find the optimal values quickly. To minimize the cost function, we use its derivative with respect to $\theta$. As very well explained by Andrew Ng, who also provides great graphical representations of the intuition behind gradient descent, **by taking the derivative, i.e. the line tangent to the cost function, we obtain a direction in which we can move to reduce the cost function**. The only additional parameter is the learning rate, $\alpha$, which controls how big our steps towards the minimum are.

So, the generic gradient descent equation is defined as:

\[

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)

\]

Replacing our equations and computing the derivative for our linear regression problem (while skipping the math):

\[

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)} )

\]

\[

\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)} ) x^{(i)}

\]
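
The full script is on github; a minimal sketch of the two update rules above, assuming vectors *x* (age) and *y* (HRV) and illustrative values for the learning rate and number of iterations, might look like this:

```r
# Minimal gradient descent sketch for univariate linear regression.
# alpha and n_iter are illustrative values, not tuned for this dataset.
gradient_descent <- function(x, y, alpha = 0.0005, n_iter = 50000) {
  m <- length(y)
  theta <- c(0, 0)             # theta_0 (intercept), theta_1 (slope)
  history <- numeric(n_iter)   # cost J at each iteration
  for (i in 1:n_iter) {
    h <- theta[1] + theta[2] * x               # hypothesis h_theta(x)
    # simultaneous update of both parameters
    new_theta0 <- theta[1] - alpha * mean(h - y)
    new_theta1 <- theta[2] - alpha * mean((h - y) * x)
    theta <- c(new_theta0, new_theta1)
    history[i] <- sum((h - y)^2) / (2 * m)     # cost function J
  }
  list(theta = theta, history = history)
}
```

Note that both parameters are updated simultaneously at each iteration, and the *history* vector stores the cost so we can check that it decreases over time.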

We can check that the cost function decreases at each iteration by plotting the *history* variable in the script above.

## monte carlo methods

## quick intro to bayesian statistics

Bayesian statistics builds on Bayes' rule:

\[ p(\theta | y) = \frac{p(y | \theta)p(\theta) }{p(y)} \]

where we call $p(y | \theta)$ the likelihood, $p(\theta)$ the prior and $p(\theta | y)$ the posterior (let's not worry about $p(y)$ for now, since this term does not depend on $\theta$ and we can often ignore it). A prior is a density or mass function (probability distribution) on the parameters expressing our beliefs about them. The likelihood is the component that depends on the data, and the factor by which our prior beliefs are updated once we've seen some data. All inferences are performed on the distribution of the parameters given the data, $p(\theta | y)$, the posterior, our main quantity of interest.

## parameter estimates in bayesian statistics

- **Exact inference**: in certain simple situations we have an analytical solution to the problem. The solution is obtained using conjugate priors, i.e. priors such that once you multiply them with the likelihood, the posterior distribution is of the same family as the prior (e.g. if the prior was a beta distribution, the posterior will also be a beta distribution). Unfortunately this is usually not possible, since the target posterior distribution is not analytically tractable.
- **Point estimates of $\theta$**: two of the simplest methods to estimate the parameters are the maximum likelihood estimate (MLE) and the maximum a posteriori (MAP). Especially in the context of regression, these are rather simple to compute. MLE finds the parameters that are most likely to have generated the data (optimizing the likelihood function only), while MAP returns the parameters that are most likely given the data; the main difference is that in MAP we introduce a prior and apply Bayes' rule. However, one important aspect of Bayesian statistics is that we can obtain the full posterior distribution, and not only a point estimate such as MLE or MAP. For this reason I won't spend additional time on MLE and MAP, but move directly to Monte Carlo methods.
- **Approximations**: Bayesian analysis often requires integration over multiple dimensions that is intractable via both analytic methods and standard numerical integration. However, it is often possible to compute those integrals by simulating from posterior distributions, which is the core of Monte Carlo methods. Additionally, we also have deterministic approximations such as the Laplace approximation, variational methods and expectation propagation.

Due to the limitations of exact inference (not always applicable) and point estimates (we lack information about the full posterior distribution, and in certain cases, e.g. MLE, we might even overfit), we will only consider approximations and in particular Monte Carlo methods.

## monte carlo methods

**Intuition**

MCMC methods assume that the prior distribution is specified by a function that can be easily evaluated at any point, and that the value of the likelihood can also be computed for any value of $\theta$ and the data at hand. The method produces an approximation of the posterior distribution in the form of a large number of samples of $\theta$ drawn from that distribution. So the basic idea is that even if we cannot compute the integral, we can evaluate the function at any point. Then, **we need a way to explore the parameter space so that the probability of being at a given point is proportional to the function we sample from** (our so-called target distribution, i.e. the posterior). This is why it works: by moving proportionally to the posterior probabilities, the chain of samples we eventually gather will be representative of our $\theta$. To do so, we start at a random point and explore the parameter space using a *proposal function*. Since we start randomly, we typically discard the first samples (burn-in). The remaining component to put all of this together is the *proposal function*. Let's see how this works for one of the most commonly used MCMC sampling algorithms, the Metropolis-Hastings algorithm.

**Metropolis-Hastings**

The Metropolis-Hastings algorithm requires only that we can evaluate our unnormalized posterior, i.e. the product of the prior and the likelihood, which we call the *target distribution* $\pi$. This works because the *acceptance rule* used is a ratio of densities, and therefore we do not need to know the normalizing constant. The acceptance rule defines whether we move to the next location $x_j$, based on the target distribution evaluated at the current location $x_i$ and at $x_j$. While discussing the intuition, we stopped at the proposal function. So, what can we use as proposal function to decide where and when to move? The proposal can have different forms; we can consider for example a normal distribution, so that the proposed move will typically be near the current position. Using a normal, i.e. a symmetric distribution, as proposal makes things simpler, because the acceptance rate for a new value $x_j$ from the current position $x_i$ is defined as:

\[

a_{ij} = \min\left(1, \frac{\pi(x_j)}{\pi(x_i)}\right)

\]

Therefore we only need to evaluate the target distribution at these two points (things are slightly more complicated with a non-symmetric proposal distribution, but I won't cover that here). By computing the value of the target distribution at $x_i$ and $x_j$ we decide whether to move to the new point. We always make the move if the value at the new point is higher. However, if the value is lower, we make the move only with probability $a_{ij}$: we draw from a uniform distribution between $0$ and $1$ to determine whether to accept the new value when the target at $x_j$ is not higher than at $x_i$. We'll see this when we code the algorithm. After many steps we will obtain the target posterior distribution $\pi$.

Now that we have some background on Bayesian statistics and MCMC, let's implement the Metropolis-Hastings algorithm for our univariate regression example. We first introduce our new notation. We assume independent normal errors $\epsilon_i$ with mean $0$ and standard deviation $\sigma$, and write the model in compact form as:

\[

y_i = X_{i} \beta + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad i = 1, \dots, n

\]

Equivalently, we can express the relation between predictors and predicted variables as:

\[

y_i \sim N( X_{i} \beta, \sigma^2), \quad i = 1, \dots, n

\]

which is the form I typically use, and one of the notations you'll find in Gelman's books.
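
The original sampler is on github; a minimal sketch of Metropolis-Hastings for this model (function names, priors, proposal widths and starting values below are my own assumptions, not the post's original code) could look like this:

```r
# Unnormalized log-posterior: log-likelihood plus log-priors.
# theta = (intercept beta0, slope beta1, noise sd sigma).
log_posterior <- function(theta, x, y) {
  b0 <- theta[1]; b1 <- theta[2]; sigma <- theta[3]
  if (sigma <= 0) return(-Inf)   # sd must be positive
  log_lik <- sum(dnorm(y, mean = b0 + b1 * x, sd = sigma, log = TRUE))
  # vague normal priors on the coefficients (flat prior on sigma)
  log_prior <- dnorm(b0, 0, 100, log = TRUE) + dnorm(b1, 0, 100, log = TRUE)
  log_lik + log_prior
}

metropolis <- function(x, y, n_iter = 20000, start = c(0, 0, 10)) {
  chain <- matrix(NA, nrow = n_iter, ncol = 3)
  chain[1, ] <- start
  for (i in 2:n_iter) {
    # symmetric normal proposal around the current position
    proposal <- rnorm(3, mean = chain[i - 1, ], sd = c(1, 0.05, 0.5))
    # acceptance: ratio of unnormalized posteriors, on the log scale
    log_a <- log_posterior(proposal, x, y) -
             log_posterior(chain[i - 1, ], x, y)
    if (log(runif(1)) < log_a) {
      chain[i, ] <- proposal        # accept: move to the new point
    } else {
      chain[i, ] <- chain[i - 1, ]  # reject: stay where we are
    }
  }
  chain
}
```

After discarding a burn-in period, the rows of the chain are (approximate) samples from the posterior of the three parameters.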

In the *acceptance* step, we use R's *runif* function to decide if we are moving to the next location or staying. **In the long run, we will be exploring the parameter space with probability proportional to the target distribution**, which is exactly what we want. Note that we are summing the log-likelihood and the log-priors, i.e. taking the logarithms of the values and summing instead of multiplying. This is necessary because otherwise we would have numerical issues due to the tiny values we are trying to multiply. Let's look at the parameter estimates for the slope and intercept:

We obtained almost the same results we had before using least squares or gradient descent; however, this seemed like a lot of extra complexity, even for the simple case of univariate regression. As I was saying above, we typically use software packages that do the sampling for us. This means we only need to worry about specifying the likelihood and the priors, and nothing else. Some of the most commonly used packages are JAGS for R, Stan (created by Gelman's group) and PyMC for Python. I will be using JAGS. JAGS stands for Just Another Gibbs Sampler and, as the name states, uses a sampling technique called Gibbs sampling, which is slightly different from the Metropolis-Hastings algorithm we just implemented.

**Gibbs sampling**

Gibbs sampling can be seen as a special case of the Metropolis-Hastings algorithm which is not always applicable, but has some advantages. For example, one of the main limitations of the Metropolis-Hastings algorithm is that we need to take good care in choosing the proposal distribution: if the proposal is too narrow and the target distribution very spread out, it will not work well. Gibbs sampling, on the other hand, does not require tuning a proposal distribution. In Gibbs sampling, a proposed move affects only one parameter at a time, and the proposed move is never rejected. The main catch is that we must be able to generate samples from each parameter's full conditional distribution, i.e. the posterior of one parameter given all the other parameters and the data.

Hopefully we gained an understanding of sampling methods from the previous section; without entering into the details of the Gibbs sampler, we can have a look at how things work in JAGS. In general, it can be very hard to define a proper proposal function (for Metropolis-Hastings) or to specify all the conditional distributions (for Gibbs sampling), especially when we work in a high dimensional parameter space. Thus, we rely on software packages that do the job for us, and focus only on specifying the model and the priors. Here is the JAGS code for univariate regression:
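
The original model is on github; a sketch of a JAGS model specification for this regression (variable names are my own assumptions) looks like this:

```r
# JAGS model as an R string: likelihood plus priors, nothing else.
# Note that JAGS parameterizes the normal with precision tau = 1/sigma^2,
# not with the standard deviation.
model_string <- "
model {
  for (i in 1:n) {
    y[i] ~ dnorm(mu[i], tau)        # likelihood
    mu[i] <- beta0 + beta1 * x[i]
  }
  beta0 ~ dnorm(0, 0.0001)          # vague priors on the coefficients
  beta1 ~ dnorm(0, 0.0001)
  tau ~ dgamma(0.01, 0.01)          # vague prior on the precision
  sigma <- 1 / sqrt(tau)            # recover the sd for convenience
}
"
```

Note how close the model block is to the compact notation $y_i \sim N(X_i \beta, \sigma^2)$ introduced above: we specify only the distributions, and the sampler takes care of the rest.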

The last step is to call the sampler passing our data (outcome and predictors), and specifying a couple of parameters such as the number of iterations and the burn-in period:
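
With the rjags package, the call typically looks like the sketch below, assuming the model specification is stored in a *model_string* variable and the data in vectors *age* and *hrv* (all names are assumptions for illustration):

```r
# Compile the model, run the burn-in, then draw posterior samples.
library(rjags)

jags <- jags.model(textConnection(model_string),
                   data = list(y = hrv, x = age, n = length(hrv)),
                   n.chains = 3)
update(jags, n.iter = 1000)               # burn-in period
samples <- coda.samples(jags,
                        variable.names = c("beta0", "beta1", "sigma"),
                        n.iter = 10000)
summary(samples)                          # posterior means, quantiles, etc.
```

The returned object holds one chain per `n.chains`, which also makes convergence diagnostics (e.g. comparing chains) straightforward.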

Our results using JAGS and Gibbs sampling are $105.6$ for the intercept and $-0.65$ for the slope, when taking the mean of the chains:

## summary

We've seen, as expected, that regardless of the method (exact analytical solution or approximation), results are basically the same for this very simple univariate regression example. There are some clear advantages in using gradient descent or Monte Carlo methods over least squares, such as better scalability or the derivation of the full posterior distribution. However, my main point here is that seeing how the same results can be obtained in many different ways, and manually implementing most of these methods, can provide a unified perspective on a well known problem such as regression to those coming from different backgrounds and used to only one approach or the other.

For those interested in learning more about other ways in which the gradient descent algorithm is used to solve parameter optimization problems, I would suggest going through Andrew Ng's machine learning course.

In my work I became more and more interested in Bayesian modeling when I first approached hierarchical models. In addition to the conventional points in favor of Bayesian statistics, such as full posterior distributions instead of point estimates and the use of priors (yes, they are a plus), the flexibility of hierarchical modeling in a Bayesian framework allows us to specify relations between variables very clearly. Especially for "small data" problems, generalized linear models and Bayesian inference provide a much better alternative to standard complete-pooling or no-pooling models. By using probabilistic programming languages like JAGS, Stan or PyMC we can work on specifying the model and the relations between variables, and then let the sampler do the magic. For those interested in this topic, and in seeing how hierarchical modeling can better capture the hierarchical relations very often encountered in real life applications (e.g. samples within people, cities within counties, etc.), I would suggest reading Data Analysis Using Regression and Multilevel/Hierarchical Models. For a lighter introduction, still showing the power of Bayesian modeling and hierarchical regression, start from this ipython notebook by Chris Fonnesbeck.

Bayesian statistics is rather complex and obviously this post is far from being comprehensive. If all of this still seems very confusing, do as I did, go to London and visit Bayes' grave: