Minimum Norm Estimates


This post is a summary of the paper: Interpreting magnetic fields of the brain: minimum norm estimates (Hämäläinen and Ilmoniemi, 1994).

Minimum Norm Estimates (MNE): One Way of Getting to the Source

To work in the source space of MEG data, we need a good understanding of how source models work and how the signal measured in sensor space is projected onto the source space. One of the basic algorithms for solving this inverse problem is Minimum Norm Estimates (MNE) (Hämäläinen and Ilmoniemi, 1994).

The interesting question is how we can get from the sensor space to the source space. This classical inverse problem, unfortunately, does not have a unique solution; “there are infinite current distributions that can explain the measurements”, as the paper puts it.

One potential method is to regularize or constrain the current distribution under particular assumptions, such as confining the source current to a small area (single-dipole model) or to several sites (multi-dipole models). If these assumptions do not hold, the estimated source current distribution may be incorrect.

The MNE method gives the best estimate for the current when minimal a priori information about the source is available. The mathematical details of how the final current distribution is derived are too complicated for me to understand at this point, but the main idea is to construct an equation relating the magnetometer measurements to the primary-current distribution. To recover the primary-current distribution, one needs to find the lead field, which describes the sensitivity distribution of the magnetometers. The lead field depends on the conductivity, which is determined by the head model (spherically symmetric or horizontally layered conductor).

The equation shows that there are infinitely many current distributions consistent with the measurements; “any primary-current distribution (together with volume currents) to which the measuring instrument is not sensitive may be added to the solution.” Regularization is also applied to improve the stability of the estimate.
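To make the idea concrete, here is a toy MATLAB/Octave sketch of a regularized minimum-norm estimate (my own illustration, not the paper's implementation); a random matrix stands in for the real lead field and the measurements are made up:

    % Toy minimum-norm estimate (illustrative sketch only).
    % L stands in for the lead field (n_sensors x n_sources), b for the sensor measurements.
    n_sensors = 32; n_sources = 500;
    L = randn(n_sensors, n_sources);   % placeholder lead field
    b = randn(n_sensors, 1);           % placeholder measurements
    lambda = 0.1;                      % regularization parameter
    % Regularized minimum-norm solution of b = L*j:
    j_hat = L' * ((L * L' + lambda * eye(n_sensors)) \ b);

Here j_hat is the source-current estimate with the smallest norm (up to regularization) among those compatible with the measurements.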

[Figure: the estimated primary-current distribution, shown as arrows indicating direction and magnitude.]

The estimated primary-current distribution has both direction and magnitude, as shown in the figure above. It appears that the more magnetometers we have, the more precisely we can estimate the primary-current distribution.

(Open question) What is still unclear to me is how dipoles differ from the primary-current distribution. My guess is that the latter is what we really want to estimate, and that the single-dipole assumption is one constraint we can use to arrive at a unique solution.

Term Definitions

Current Dipole: “The electrical current dipole can be thought of as an infinitesimally short length of wire which carries a current. The strength of the source is defined by its dipole moment (p).” (“Defining the Electrical Current Dipole — Electromagnetic Geophysics,” n.d.)

Lead Field: “The lead field is a vector field that describes the sensitivity pattern of a magnetometer to the primary current” (Hämäläinen and Ilmoniemi, 1994)

Position Vector: A vector that starts from the origin (O) is called a position vector (“Position Vector (solutions, examples, videos),” n.d.).

References:

Defining the Electrical Current Dipole — Electromagnetic Geophysics [WWW Document], n.d. URL https://em.geosci.xyz/content/maxwell1_fundamentals/dipole_sources_in_homogeneous_media/electric_dipole_definition/index.html (accessed 4.19.20).

Hämäläinen, M.S., Ilmoniemi, R.J., 1994. Interpreting magnetic fields of the brain: minimum norm estimates. Medical & Biological Engineering & Computing 32, 35–42. https://doi.org/10.1007/BF02512476

Position Vector (solutions, examples, videos) [WWW Document], n.d. www.onlinemathlearning.com. URL https://www.onlinemathlearning.com/position-vector.html (accessed 4.19.20).


The Bayesian Brain


This post is a review of the paper: Bayes in the Brain – On Bayesian Modelling in Neuroscience (Colombo and Seriès, 2012).

How much Bayes is in our brain?

I have been entertaining the thought that the neural networks in our brain might encode information using a Bayesian framework. There is a paper (Colombo and Seriès, 2012) discussing Bayesian modelling in neuroscience. Before writing down my own thoughts, it is important to see what has already been studied.

…, a growing trend in theoretical neuroscience considers that the human perceptual system is akin to a Bayesian machine. The function of this machine would be to infer the causes of sensory inputs in an ‘optimal’ way.

To justify this point, the paper discusses the noise and ambiguity of sensory inputs. Our perception requires some mechanism to represent and handle uncertainty. The nervous system would therefore need to encode probabilistic models through neural processing, and sensory input would be used to update these neurally encoded probabilistic (Bayesian) models.

The paper addresses two questions:

(1) How are Bayesian models used in theoretical neuroscience?

(2) From the use of Bayesian models in theoretical neuroscience, have we learned, or can we hope to learn, that perception is Bayesian inference or that the brain is a Bayesian machine?

An important caveat about these two questions, as Colombo and Seriès note in the paper:

Given the increasing influence on the neuroscience of ideas and tools from such fields, it’s not surprising that Bayesian modeling has a lot to offer to study the brain. Yet, that the brain is a Bayesian machine does not follow from the fact that Bayesian models are used to study the brain and the behavior it generates.

We need to be careful in the sense that what we are discussing here is not in an instrumentalist point of view towards Bayesian modeling. We are trying to look at it from a theoretical neuroscience perspective - the ‘fundamental implication for neuroscience’:

… the Bayesian coding hypothesis: ‘the brain represents information probabilistically, by coding and computing with probability density functions or approximations to probability density functions’

The hypothesis is two-fold:

(1) The brain performs Bayesian inference to enable us to make judgments and guide action in the world.

(2) The brain represents sensory information in the form of probability distributions.

The paper describes an experiment that fits a Bayesian model predicting the stimulus from combined haptic and visual information and compares it with models fitted separately to each sensory input. The behavioral results are accurately predicted by the combined model, indicating that “humans integrate visual and haptic information in a statistically optimal fashion”.
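For reference, the standard formulation of statistically optimal cue combination (my own addition, assuming independent Gaussian noise on the visual estimate \(\hat{s}_v\) and the haptic estimate \(\hat{s}_h\)) weights each cue by its reliability:

[\hat{s} = w_v \hat{s}_v + w_h \hat{s}_h, \qquad w_v = \frac{1/\sigma_v^2}{1/\sigma_v^2 + 1/\sigma_h^2}, \qquad w_h = \frac{1/\sigma_h^2}{1/\sigma_v^2 + 1/\sigma_h^2}]

and the variance of the combined estimate, \(\sigma^2 = \frac{\sigma_v^2\sigma_h^2}{\sigma_v^2+\sigma_h^2}\), is smaller than that of either cue alone.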

It is worth noting, though, that even if the model can compute a belief about the state of the world (the S) from sensory inputs, it does not specify any decision-making or motor-response mechanism. For the model to account for decision making or motor responses, a loss function has to be optimized. It is also questionable whether this kind of exact Bayesian computation is feasible for neural processing, given its computational complexity. In that case, an approximation (for example, a linear function approximation) might be used instead of computing the exact probability distribution.

Another important point about Bayesian modeling is that, even though it makes good predictions about behavioral data, this does not mean the mechanism of our perception and neural encoding is Bayesian. There is a difference between realism and instrumentalism. Instrumentalist models can make good predictions but do not care about the actual mechanism, while realist models try to capture the components of a system and describe what is actually happening in it. There are a few reasons why the Bayesian experiment mentioned in section 2 is only an instrumental model: the prior and the loss function are chosen not for their biological plausibility but for the convenience of mathematical calculation. How we should understand the inference from psychophysics to brain mechanisms remains a question, which is addressed in section 4.

In the last two sections, 4 and 5, the issues that need to be addressed in order for the brain to be Bayesian are the following:

  1. How might neurons represent uncertainty?
  2. How might they represent probability distribution?
  3. How might they implement different approximations to Bayesian inference?

Multiple efforts have been made to explain the neural mechanisms that could encode probability density functions, in particular the Poisson firing patterns of visual neurons.

In the last section, the authors mention the Hodgkin–Huxley computational model of the neuron as an example, though that model is not necessarily a claim about the real neural mechanism.

The conclusions of the paper are:

  1. Bayesian models currently do not provide mechanistic explanations; instead, they are predictive instruments.
  2. The inference typically drawn from psychophysical performance to the Bayesian coding hypothesis should be understood within Marr’s framework.
  3. Within Marr’s framework, we can hope to learn that perception is Bayesian inference or that the brain is a Bayesian machine to the extent that Bayesian models prove successful in yielding secure and informative predictions.

Reference

Colombo, M., Seriès, P., 2012. Bayes in the Brain—On Bayesian Modelling in Neuroscience. Br. J. Philos. Sci. 63, 697–723. https://doi.org/10.1093/bjps/axr043


Stanford UFLDL Tutorial Work Through (4) - Softmax Regression


Link to the Tutorial: Softmax Regression

Problem

Logistic regression has two output classes, \(y_i\in\{0,1\}\). A question arises when we want to do classification with more than 2 classes, \(y_i\in\{1,\dots,k\}\). Softmax regression is an extension of logistic regression that handles multi-class classification.

Definition of the Cost Function

The softmax function is given by:

[h_{\Theta}(x^{(i)})= \left[\begin{array}{c}P(y_i=1\vert x^{(i)},\Theta)\\
P(y_i=2\vert x^{(i)},\Theta)\\
\vdots\\
P(y_i=k\vert x^{(i)},\Theta) \end{array}\right]=\frac{1}{\sum_{j=1}^k\exp{(x^{(i)}\Theta^{(j)})}}\left[ \begin{array}{c}\exp{(x^{(i)}\Theta^{(1)})}\\
\exp{(x^{(i)}\Theta^{(2)})}\\
\vdots\\
\exp{(x^{(i)}\Theta^{(k)})} \end{array} \right]]

where

[\Theta = \left(\begin{array}{cccc} |&|&|&|\\
\Theta^{(1)} & \Theta^{(2)} & \dots &\Theta^{(k)}\\
|&|&|&| \end{array}\right)]

Similar to logistic regression, we want to get the log likelihood function for \(\Theta\):

[\ell(\Theta) = \sum_{i=1}^{m}\sum_{k=1}^{K}1\{y_i=k\}\log\frac{\exp(x^{(i)}\Theta^{(k)})}{\sum_{j=1}^{K}\exp(x^{(i)}\Theta^{(j)})}]

Note that \(1\{\}\) is an indicator function so that \(1\{true\} = 1,1\{false\}=0\).

Let \(P(y_i=k \vert x^{(i)},\Theta)=\frac{\exp(x^{(i)}\Theta^{(k)})}{\sum_{j=1}^{K}\exp(x^{(i)}\Theta^{(j)})}\); the cost function is then defined as:

[J(\Theta) = - \left[\sum_{i=1}^{m}\sum_{k=1}^{K}1\{y_i=k\}\log P(y_i=k\vert x^{(i)},\Theta)\right]]

Taking the derivatives, we get:

[\nabla J_{\Theta^{(k)}}(\Theta) = -\sum_{i=1}^m\left[(x^{(i)})^T\left(1\{y_i=k\}-P(y_i=k\vert x^{(i)},\Theta)\right)\right]]

In order to vectorize the gradient calculation, we define an output vector for each \(y_i\):

[Y_i=\left(\begin{array}{c} 0\\
\vdots\\
1\\
\vdots\\
0 \end{array}\right)]

where the \(k\)th element (the one corresponding to the class \(y_i\)) is 1 and the others are 0. Combining the output vectors of all the records, we get an output matrix:

[Y_{k\times m}=\left(\begin{array}{ccc} 0&0&\\
\vdots&1&\\
1&\vdots&\dots\\
\vdots&\vdots&\\
0&0& \end{array}\right)]

Using this matrix in the calculation, we get:

[\nabla J_{\Theta}(\Theta) = -X^T\left[Y-H_{\Theta}(X)\right]^T= \left(\begin{array}{cccc}|&|&|&|\\
\nabla J_{\Theta^{(1)}}(\Theta)&\nabla J_{\Theta^{(2)}}(\Theta)&\dots&\nabla J_{\Theta^{(k)}}(\Theta)\\
|&|&|&| \end{array}\right)]

\(H_{\Theta}(X)\) is a matrix with each column corresponding to the softmax function output of each record, \(h_{\Theta}(x^{(i)})\).

Then we update \(\Theta\) with \(\Theta:=\Theta-\alpha\nabla J_{\Theta}(\Theta)\) in each iteration to move toward a minimum of the cost function \(J(\Theta)\).
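As a concrete illustration, here is a minimal MATLAB/Octave sketch of one gradient-descent step (my own code, not part of the tutorial), using the shapes defined above:

    % One softmax gradient-descent step (sketch; names follow the notation above).
    % X:     m-by-(n+1) design matrix (one record per row)
    % Theta: (n+1)-by-k parameter matrix (one column per class)
    % Y:     k-by-m matrix of one-hot label vectors
    Z = X * Theta;                          % m-by-k class scores
    Z = bsxfun(@minus, Z, max(Z, [], 2));   % subtract row max for numerical stability
    H = exp(Z);
    H = bsxfun(@rdivide, H, sum(H, 2));     % m-by-k probabilities; H' corresponds to H_Theta(X)
    gradTheta = -X' * (Y' - H);             % (n+1)-by-k gradient of J(Theta)
    Theta = Theta - alpha * gradTheta;      % gradient-descent update

Subtracting the row maximum does not change the probabilities (the same invariance discussed in the next section) but prevents overflow in the exponentials.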

Properties of softmax regression parameterization

Consider the following calculation, for any vector \(\psi\) with the same dimensions as \(\Theta^{(k)}\):

[\begin{array}{ccl}P(y_i=k\vert x^{(i)},\Theta)&=&\frac{\exp(x^{(i)}(\Theta^{(k)}-\psi))}{\sum_{j=1}^{K}\exp(x^{(i)}(\Theta^{(j)}-\psi))}\\
&=&\frac{\exp(x^{(i)}\Theta^{(k)})\exp(-x^{(i)}\psi)}{\sum_{j=1}^{K}\exp(x^{(i)}\Theta^{(j)})\exp(-x^{(i)}\psi)}\\
&=&\frac{\exp(x^{(i)}\Theta^{(k)})}{\sum_{j=1}^{K}\exp(x^{(i)}\Theta^{(j)})} \end{array}]

From the above, we can see that no matter what \(\psi\) is, subtracting it from every column of \(\Theta\) gives the same probabilities as the original parameters \(\Theta\). This also tells us that more than one set of parameters can minimize the cost function.

If we set \(\psi=\Theta^{(k)}\), then \(\Theta^{(k)}:=\Theta^{(k)}-\psi=\vec0\), so we only need to train the other \((k-1)\) parameter vectors. From my experience with statistics, this probably relates to the concept of degrees of freedom, but more study is needed before I understand it fully.
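A quick numerical check of this invariance (my own toy example with random values):

    % Shifting every column of Theta by the same vector psi leaves the
    % softmax probabilities unchanged.
    x = randn(1, 4); Theta = randn(4, 3); psi = randn(4, 1);
    p_original = exp(x * Theta) / sum(exp(x * Theta));
    Theta_shift = bsxfun(@minus, Theta, psi);
    p_shifted = exp(x * Theta_shift) / sum(exp(x * Theta_shift));
    max(abs(p_original - p_shifted))    % on the order of machine precision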

Relationship between softmax regression and logistic regression

In the 2-class case of softmax regression, we get the following:

[\begin{array}{ccl}h_{\Theta}(x^{(i)})&=&\frac{1}{\exp{(x^{(i)}\Theta^{(1)})}+\exp{(x^{(i)}\Theta^{(2)})}}\cdot\left[ \begin{array}{c}\exp{(x^{(i)}\Theta^{(1)})}\\
\exp{(x^{(i)}\Theta^{(2)})} \end{array} \right]\\
&=&\frac{1}{\exp{(x^{(i)}(\Theta^{(1)}-\Theta^{(2)}))}+\exp{(x^{(i)}\vec0)}}\cdot\left[ \begin{array}{c}\exp{(x^{(i)}(\Theta^{(1)}-\Theta^{(2)}))}\\
\exp{(x^{(i)}\vec0)} \end{array} \right]
\end{array}]

Let \(\Theta'=\Theta^{(1)}-\Theta^{(2)}\); simplifying the above equation, we get:

[h_{\Theta}(x^{(i)}) = \left[\begin{array}{c} \frac{1}{1+\exp(-x^{(i)}\Theta')}\\
1-\frac{1}{1+\exp(-x^{(i)}\Theta')} \end{array}\right]]

From the above, we can see that logistic regression is actually a special case of softmax regression with \(k=2\) classes.

Stanford UFLDL Tutorial Work Through (3) - Gradient Check


Link to the Tutorial: Gradient Check

Problem

Linear regression and logistic regression both use gradient descent to optimize their cost functions. They are relatively simple models whose gradients are easy to compute. In later sections, when we start working with more complicated calculations, for example back-propagation in neural networks, where it is sometimes hard to see what is going on inside the black box, we need a way to check whether the gradient is correct. In this section, a numerical method for checking gradients is introduced so we can be more confident that our model is working as expected.

Numerical Approximation for Derivatives

Recall the definition of a derivative:

[\frac{dJ(\theta)}{d\theta} = \lim_{\epsilon\to0}\frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon}]

Using this definition we can approximate any derivative by:

[\frac{dJ(\theta)}{d\theta}\approx\frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon}]

It follows that the smaller \(\epsilon\) is, the better the approximation. An \(\epsilon\) of about \(10^{-4}\) is acceptable much of the time; it usually gives at least 4 significant digits of accuracy.
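For example (my own toy check), approximating the derivative of \(J(\theta)=\theta^2\) at \(\theta=3\):

    % Central-difference approximation of dJ/dtheta for J(theta) = theta^2 at theta = 3.
    J = @(theta) theta.^2;
    epsilon = 1e-4;
    theta = 3;
    g_est = (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon);
    g_true = 2 * theta;
    fprintf('estimate = %.10f, analytic = %.10f\n', g_est, g_true);   % both ~6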

The gradient in our case is a vector over all the parameters \(\theta_j\) that we want to train. For example, the gradient for linear regression is given by:

[\nabla J_{\Theta}(\Theta)=\left( \begin{array}{c} \frac{\partial {J(\Theta)}}{\partial\theta_0}\\
\frac{\partial {J(\Theta)}}{\partial\theta_1}\\
\vdots\\
\frac{\partial {J(\Theta)}}{\partial\theta_n} \end{array} \right)]

If we want to check \(\frac{\partial J(\Theta)}{\partial\theta_j}\) in this vector, we define a helper vector \(\vec{e}_j=[0,0,...,1,...,0]^T\) whose \(j\)th element is 1, and calculate \(\Theta^{(j\pm)} = \Theta \pm \epsilon \vec{e}_j\). Using the numerical approximation formula introduced above, we get:

[\frac{\partial J(\Theta)}{\partial\theta_j}\approx \frac{J(\Theta^{(j+)})-J(\Theta^{(j-)})}{2\epsilon}]

Gradient Checker Code

Now let’s review the code that the tutorial gives to check the gradients.

function average_error = grad_check(fun, theta0, num_checks, varargin)

  delta=1e-3;
  sum_error=0;

  fprintf(' Iter       i             err');
  fprintf('           g_est               g               f\n')

  for i=1:num_checks
    T = theta0;
    j = randsample(numel(T),1);
    T0=T; T0(j) = T0(j)-delta;
    T1=T; T1(j) = T1(j)+delta;

    [f,g] = fun(T, varargin{:});
    f0 = fun(T0, varargin{:});
    f1 = fun(T1, varargin{:});

    g_est = (f1-f0) / (2*delta);
    error = abs(g(j) - g_est);

    fprintf('% 5d  % 6d % 15g % 15f % 15f % 15f\n', ...
            i,j,error,g(j),g_est,f);

    sum_error = sum_error + error;
  end

  average_error=sum_error/num_checks;

This implementation randomly chooses which gradient components to check and averages the error between the numerical approximations and the analytic gradients returned by the cost function fun.

The number of components to check is specified by the parameter num_checks, and the line j = randsample(numel(T),1) randomly picks an index into the parameter vector. In this way the run time of the gradient check can be controlled easily; the program would run significantly slower if every gradient component were checked in every iteration.
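A hypothetical call might look like the following (my own example; linear_regression_cost is an assumed function handle that returns the cost and its analytic gradient in the form [J, grad] = fun(theta, X, y)):

    % Hypothetical usage of grad_check (the cost-function name is a placeholder).
    X = [ones(100, 1), randn(100, 3)];    % toy design matrix with an intercept column
    y = randn(100, 1);                    % toy targets
    theta0 = randn(size(X, 2), 1);        % random starting parameters
    avg_err = grad_check(@linear_regression_cost, theta0, 20, X, y);
    % avg_err should be very small (around 1e-6 or below) if the analytic gradient is correct.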

Refresh on Trig Identities, Integrals and Substitutions


Link to the lecture: Lec 27 MIT 18.01 Single Variable Calculus, Fall 2007

Trig Identities

Suppose we have a circle with radius 1 centered at the origin \((0,0)\). If we connect a point on the circle to the origin, forming an angle \(\theta\) with the \(x\)-axis, the point has coordinates \((\cos(\theta),\sin(\theta))\), and so \(\sin^2(\theta)+\cos^2(\theta)=1\).

So then we have the double and half angle formulas:

[\begin{array}{ccl} \cos(2\theta) &= &\cos^2(\theta)-\sin^2(\theta)\\
&=&\cos^2(\theta)-(1-\cos^2(\theta))\\
&=&2\cos^2(\theta)-1\\
\cos^2(\theta) &=& \frac{1+\cos(2\theta)}{2}\\
\sin^2(\theta) &=& \frac{1-\cos(2\theta)}{2}\\
\sin(2\theta)&=&2\sin(\theta)\cos(\theta) \end{array}]

Trig Integrals

Consider integrals of the form \(\int \sin^n(x)\cos^m(x)\,dx\), where \(m,n=0,1,2,\dots\)

case 1: \(m=1\)

[\int \sin^n(x)\cos(x)\,dx]

substitute \(u=\sin(x)\), \(du = \cos(x)\,dx\):

[\begin{array}{ccl} \int \sin^n(x)\cos(x)\,dx &= &\int u^n\,du\\
&=&\frac{u^{n+1}}{n+1}+C\\
&=&\frac{\sin^{n+1}(x)}{n+1}+C \end{array}]

case 2: \(m=2, n=3\)

[\int \sin^3(x)\cos^2(x)\,dx]

use \(\sin^2(x)=1-\cos^2(x)\), substitute \(u=\cos(x)\), \(du=-\sin(x)\,dx\):

[\begin{array}{ccl} \int \sin^3(x)\cos^2(x)\,dx &= &\int (1-\cos^2(x))\sin(x)\cos^2(x)\,dx\\
&=&\int (\cos^2(x)-\cos^4(x))\sin(x)\,dx\\
&=&\int (u^2-u^4)(-du)\\
&=&-\frac{u^3}{3}+\frac{u^5}{5}+C\\
&=&-\frac{\cos^3(x)}{3}+\frac{\cos^5(x)}{5}+C \end{array}]

case 3: \(m=0, n=3\)

[\int \sin^3(x)\,dx]

use \(\sin^2(x)=1-\cos^2(x)\), substitute \(u=\cos(x)\), \(du=-\sin(x)\,dx\):

[\begin{array}{ccl} \int \sin^3(x)\,dx &= &\int (1-\cos^2(x))\sin(x)\,dx\\
&=&\int (1-u^2)(-du)\\
&=&-u+\frac{u^3}{3}+C\\
&=&-\cos(x)+\frac{\cos^3(x)}{3}+C \end{array}]

rule 1: If at least one exponent is odd, we can split off one factor, rewrite the rest using \(\sin^2(x)+\cos^2(x)=1\), and use substitution to find the integral.

case 4: \(m=2,n=0\)

[\int \cos^2(x)\,dx]

use \(\cos^2(x)=\frac{1+\cos(2x)}{2}\):

[\begin{array}{ccl} \int \cos^2(x)\,dx &= &\int \frac{1+\cos(2x)}{2}\,dx\\
&=&\frac{x}{2}+\frac{\sin(2x)}{4}+C \end{array}]

case 5: \(m=2,n=2\)

[\int \cos^2(x)\sin^2(x)\,dx]

use \(\cos^2(x)=\frac{1+\cos(2x)}{2}\), \(\sin^2(x)=\frac{1-\cos(2x)}{2}\):

[\begin{array}{ccl} \int \cos^2(x)\sin^2(x)\,dx &= &\int \frac{1+\cos(2x)}{2}\cdot\frac{1-\cos(2x)}{2}\,dx\\
&= &\int \frac{1-\cos^2(2x)}{4}\,dx\\
&= &\int \frac{1}{4}-\frac{1+\cos(4x)}{8}\,dx\\
&=&\frac{x}{8}-\frac{\sin(4x)}{32} + C \end{array}]

Or we can use \(\sin^2(x)\cos^2(x) = (\sin(x)\cos(x))^2 = \frac{\sin^2(2x)}{4}\) and then apply the half-angle identity to \(\sin^2(2x)\).

rule 2: If we have even exponents, use the half angle identities to reduce the power of the trig functions.

Introduction to the polar coordinates

Any point \((x,y)\) on the circle \(x^2+y^2=a^2\) can be written in polar form as \((a\cos(\theta),a\sin(\theta))\). This is useful for calculating integrals involving functions of the form \(f(y)=\sqrt{a^2-y^2}\). For example, we can substitute \(y=a\sin(\theta)\), \(dy=a\cos(\theta)\,d\theta\):

[\begin{array}{ccl} \int \sqrt{a^2-y^2}\,dy &= &\int \sqrt{a^2-a^2\sin^2(\theta)}\cdot a\cos(\theta)\,d\theta\\
&= &\int a^2\cos^2(\theta)\,d\theta\\
&=& a^2\left(\frac{\theta}{2}+\frac{\sin(\theta)\cos(\theta)}{2}\right) + C\\
&=& \frac{a^2\arcsin(\frac{y}{a})}{2} + \frac{(a\sin(\theta))(a\cos(\theta))}{2} + C\\
&=& \frac{a^2\arcsin(\frac{y}{a})}{2} + \frac{y\sqrt{a^2-y^2}}{2} + C
\end{array}]

The substitution turns the integral into something easier to solve.
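As a quick numerical sanity check of this result (my own example), we can compare the closed form (with \(C=0\)) against a numeric integral for \(a=2\) over \([0,1]\):

    % Compare the antiderivative derived above against numerical integration.
    a = 2; b = 1;
    F = @(y) (a^2 * asin(y / a)) / 2 + y .* sqrt(a^2 - y.^2) / 2;   % closed form, C = 0
    numeric = integral(@(y) sqrt(a^2 - y.^2), 0, b);                % numeric integral on [0, b]
    fprintf('closed form: %.6f, numeric: %.6f\n', F(b) - F(0), numeric);   % both ~1.9132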

Other Trig Identities and Integration

[\begin{array}{cl}\sec(x)=\frac{1}{\cos(x)}&\csc(x)=\frac{1}{\sin(x)}\\
\tan(x)=\frac{\sin(x)}{\cos(x)}&\cot(x) = \frac{\cos(x)}{\sin(x)} \end{array}]

[\sec^2(x) = 1+\tan^2(x)]

[\begin{array}{cl}\frac{d}{dx}\tan(x)=\sec^2(x)&\frac{d}{dx}\sec(x)=\sec(x)\tan(x)\\
\int\tan(x)\,dx=-\ln\vert\cos(x)\vert+c&\int\sec(x)\,dx = \ln\vert\sec(x)+\tan(x)\vert+c \end{array}]

Example 1:

using \(u=\tan(x)\), \(du=\sec^2(x)\,dx\):

[\begin{array}{ccl} \int\sec^4(x)\,dx&=&\int(1+\tan^2(x))\sec^2(x)\,dx\\
&=&\int(1+u^2)\,du\\
&=&u+\frac{u^3}{3}+c\\
&=&\tan(x)+\frac{\tan^3(x)}{3}+c \end{array}]

Example 2:

using \(x=\tan(\theta)\), \(dx=\sec^2(\theta)\,d\theta\), and then \(u=\sin(\theta)\), \(du=\cos(\theta)\,d\theta\):

[\begin{array}{ccl} \int\frac{dx}{x^2\sqrt{1+x^2}}&=&\int\frac{\sec^2(\theta)\,d\theta}{\tan^2(\theta)\sec(\theta)}\\
&=&\int\frac{\cos(\theta)}{\sin^2(\theta)}\,d\theta\\
&=&\int\frac{du}{u^2}\\
&=&-\frac{1}{u}+c\\
&=&-\frac{1}{\sin(\theta)}+c\\
&=&-\csc(\theta)+c\\
&=&-\csc(\arctan(x))+c \end{array}]

rule 3: substitute the following square roots using trig functions:

[\begin{array}{ccl}\sqrt{a^2-x^2}&\to&x=a\sin(\theta)\ \text{or}\ x=a\cos(\theta)\\
\sqrt{a^2+x^2}&\to&x=a\tan(\theta)\\
\sqrt{x^2-a^2}&\to&x=a\sec(\theta) \end{array}]

These substitutions can be derived using a right triangle, assigning \(x\) and \(a\) to the hypotenuse, opposite, and adjacent sides as appropriate.
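For example (my own worked line), with \(x=a\tan(\theta)\):

[\sqrt{a^2+x^2}=\sqrt{a^2+a^2\tan^2(\theta)}=a\sqrt{1+\tan^2(\theta)}=a\sec(\theta)]

which corresponds to a right triangle with opposite side \(x\), adjacent side \(a\), and hypotenuse \(\sqrt{a^2+x^2}\).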