Monte Carlo Approximation Methods: Which one should you choose and when?


We will generate 1,000 samples using the inverse CDF method:

import numpy as np
import matplotlib.pyplot as plt

# Inverse CDF for the exponential distribution with lambda = 1:
# F(x) = 1 - exp(-x), so F^-1(p) = -ln(1 - p)
def inverse_cdf(p):
    return -np.log(1 - p)

# Generate 1,000 samples by pushing uniform draws through the inverse CDF
uniform_samples = np.random.uniform(0, 1, 1000)
transformed_samples = inverse_cdf(uniform_samples)

# Generate x values and the true PDF for comparison
x_values = np.linspace(0, np.max(transformed_samples), 1000)
pdf_values = np.exp(-x_values)
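To sanity-check the result, we can overlay a histogram of the transformed samples on the true exponential PDF (a minimal sketch reusing the arrays defined above):

# Compare the empirical distribution of the samples to the true PDF
plt.hist(transformed_samples, bins=50, density=True, alpha=0.5, label="samples via inverse CDF")
plt.plot(x_values, pdf_values, "r-", label="true PDF: exp(-x)")
plt.xlabel("x")
plt.ylabel("density")
plt.legend()
plt.show()

The histogram should closely trace the exponential curve, confirming that the transformed uniform samples follow the target distribution.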

Requirements for the inverse CDF algorithm to work

The inverse CDF method makes a key assumption:

  • CDF F is invertible: the CDF F must be one-to-one, so that each output of F corresponds to a unique input and F⁻¹ is well defined.

This constraint rules out a number of functions. For example, below are a few function types that are common but non-invertible (and thus would not work with inverse CDF):

  1. Constant functions: any function of the form f(x) = c is not invertible, since every input maps to the same output and the function is therefore not one-to-one.
The red dots show two of the many x values that map to the same y value in f(x) = 5, making it non-invertible

2. Certain quadratic functions: For example f(x) = x^2 is non-invertible since it’s many-to-one (consider f(x) = 1, x could be 1 or -1).

The red dots show one of the many pairs of x values that map to the same y value in f(x) = x²

3. Certain trigonometric functions: for example, f(x) = sin(x) is not invertible over its entire domain since it is periodic, although it can be made invertible by restricting the domain (e.g. to [-π/2, π/2]).

The red dots show one of the many sets of x values that map to the same y value in f(x) = sin(x) due to its periodicity over the given domain

Why does inverse CDF work?

The key idea is that a random variable uniformly distributed between 0 and 1 can be transformed into a random variable with a certain distribution by applying the inverse of the target distribution’s CDF, which is assumed to be known and tractable.
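This is easy to verify directly. If U ∼ Uniform(0, 1) and we define X = F⁻¹(U), then for any x:

$$P(X \le x) = P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x)$$

The second equality uses the monotonicity of F, and the third uses the fact that P(U ≤ u) = u for a uniform variable on [0, 1]. So X has exactly the CDF F, which is what we wanted.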

Advantages

  1. Algorithmic simplicity: it is very easy to implement for low-dimensional distributions, giving it wide applicability across fields and tasks.
  2. Sample accuracy: assuming the CDF and its inverse represent the exact target distribution, the method yields relatively high accuracy compared to approximate methods such as MCMC, which we will see soon.

Disadvantages

  1. Computational complexity: for some distributions, the inverse CDF has no closed-form expression (the Gaussian is a standard example), making it expensive or impractical to compute.
  2. Difficulty with high dimensionality: It can be difficult to apply in high-dimensional spaces, especially with dependencies between variables.
  3. Invertibility constraint: Anytime a CDF is non-invertible, this method becomes invalid. This excludes a number of functions as we saw above.
  4. Limited to known distributions: Inverse CDF requires the exact form of the CDF, limiting its application to known distributions only.

Taking all these limitations into account, inverse CDF sampling applies to only a few categories of distributions. In practice, with large datasets and unknown distributions, the method quickly becomes impractical.

With these advantages and disadvantages in mind, let’s now look at another random sampling framework that tackles these limitations: Markov Chain Monte Carlo (MCMC).

As we saw just now, the inverse CDF transformation method is highly limited, especially with high dimensional sample spaces. Markov Chain Monte Carlo (MCMC), on the other hand, scales well with dimensionality, enabling us to sample from a much larger family of distributions.

An example of Metropolis-Hastings exploring a Gaussian mixture (left) and generating samples (right)

How does the Metropolis-Hastings algorithm work?

In intuitive terms, the algorithm works as follows. As with inverse CDF, we have a target distribution that we're sampling from. However, we now need an additional ingredient: a proposal distribution q(z | z^τ) that depends on the current state z^τ, creating a Markov chain of samples z¹, z², z³, …. Each proposed sample is accepted into the chain only if it satisfies a certain criterion, which we will define below since this criterion differs across variations of the algorithm.

Let’s formalize it into a more algorithmic structure.

The algorithm runs in cycles, and each cycle follows these steps:

  1. Generate a candidate sample z* from the proposal distribution q(z | z^τ), where z^τ is the current state.
  2. Accept the candidate with an acceptance probability, which in Metropolis-Hastings is defined as (PRML¹ Eq. 11.44):

$$A(z^*, z^{(\tau)}) = \min\left(1,\; \frac{\tilde{p}(z^*)\, q(z^{(\tau)} \mid z^*)}{\tilde{p}(z^{(\tau)})\, q(z^* \mid z^{(\tau)})}\right)$$

where

  • z^τ is the current state;
  • z* is the proposed new state;
  • p̃(z*) is the (possibly unnormalized) probability of state z* according to the desired distribution;
  • p̃(z^τ) is the same quantity for the current state z^τ.

The logic behind this acceptance threshold is that it ensures states that are more probable (according to the desired distribution) are visited more often.

Now, this is the most generalized version of the algorithm; if the proposal distribution is known to be symmetric, meaning that the probability of proposing a move from state x to x′ is the same as proposing a move from x′ to x, i.e. q(x′|x) = q(x|x′), then we can use a special case of Metropolis-Hastings that requires a simpler acceptance threshold.

Metropolis algorithm for Symmetric Distributions

This is a specific MCMC algorithm that we choose to use when the proposal distribution is symmetric, i.e. q(z⁰ | z¹) = q(z¹ | z⁰) for all states z⁰ and z¹, interpreted as "the probability of transitioning from any state A to state B is equal to the probability of transitioning from B to A". So, each step of the algorithm becomes:

  1. Generate a sample z* from the proposal distribution.
  2. Accept the sample with probability (Metropolis acceptance threshold, PRML¹ Eq. 11.33):

$$A(z^*, z^{(\tau)}) = \min\left(1,\; \frac{\tilde{p}(z^*)}{\tilde{p}(z^{(\tau)})}\right)$$

Metropolis-Hastings and Metropolis Algorithms

Let’s look at the algorithms side by side. As we saw before, the only difference is the acceptance threshold; all other steps of the algorithms run identically:

Metropolis vs Metropolis-Hastings Algorithm
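To make the two variants concrete, here is a minimal random-walk Metropolis sketch in Python. The target (a two-component Gaussian mixture known only up to a constant) and the Gaussian proposal are assumptions chosen for illustration; because the proposal is symmetric, the q terms of the Metropolis-Hastings ratio cancel and we are left with the simpler Metropolis threshold:

import numpy as np

# Unnormalized target p-tilde: a two-component Gaussian mixture
def p_tilde(z):
    return np.exp(-0.5 * (z - 2) ** 2) + 0.5 * np.exp(-0.5 * (z + 2) ** 2)

def metropolis(n_samples, step_size=1.0, z_init=0.0, seed=0):
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    z = z_init
    for t in range(n_samples):
        # Symmetric proposal: q(z* | z) = q(z | z*), so the q terms cancel
        z_star = z + rng.normal(scale=step_size)
        # Metropolis acceptance: accept with probability min(1, p_tilde(z*) / p_tilde(z))
        if rng.uniform() < p_tilde(z_star) / p_tilde(z):
            z = z_star  # accept the proposal; otherwise keep the current state
        samples[t] = z
    return samples

samples = metropolis(10_000)

With an asymmetric proposal, the only change would be multiplying the ratio by q(z^τ | z*) / q(z* | z^τ), recovering the full Metropolis-Hastings acceptance probability.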

Advantages

  1. Convergence to equilibrium distribution: under mild conditions, the random walk converges to the desired equilibrium distribution, although convergence can take a long time in high-dimensional spaces.
  2. Low computational cost: the random walk often requires fewer computational resources per step than more complex sampling methods, making it suitable for problems where computational efficiency is a priority.
  3. Versatility of application: due to its similarity to naturally occurring processes, the random walk has applications in a wide variety of fields:
    • Physics: Brownian motion of molecules in liquids and gases.
    • Network Analysis
    • Financial Markets: to model stock price movements
    • Population Genetics

Disadvantages:

  1. Sensitive to initialization: The algorithm’s performance can be sensitive to the choice of the starting values, especially if the initialized values are far from the high-density areas.
  2. Locality traps: depending on the shape of the target distribution and the scale of the proposal, the algorithm can get stuck in local modes and have difficulty traversing to other regions of the distribution.

Now, keeping the Metropolis-Hastings algorithm in mind, let’s look at another special instance of it: Gibbs Sampling.

Gibbs Sampling is a special instance of Metropolis-Hastings where each step is always accepted. Let’s first look at the Gibbs sampling algorithm itself.

How does the Gibbs algorithm work?

The idea is relatively simple and is best illustrated by first zooming in on a micro example involving sampling from a distribution p(z1, z2, z3) over 3 variables. The algorithm would run in the following steps:

  1. At timestep τ, start from the current values of the three variables: z₁^(τ), z₂^(τ), z₃^(τ) (PRML¹).

2. Draw a new value for z₁ from its conditional given the other two variables (PRML¹ Eq. 11.46):

$$z_1^{(\tau+1)} \sim p(z_1 \mid z_2^{(\tau)}, z_3^{(\tau)})$$

3. Draw a new value for the second position, z₂, from the conditional, using the freshly updated z₁ (PRML¹ Eq. 11.47):

$$z_2^{(\tau+1)} \sim p(z_2 \mid z_1^{(\tau+1)}, z_3^{(\tau)})$$

4. Finally, draw a new value for the last position, z₃ (PRML¹ Eq. 11.48):

$$z_3^{(\tau+1)} \sim p(z_3 \mid z_1^{(\tau+1)}, z_2^{(\tau+1)})$$

5. Repeat this process, cycling through the three variables z₁, z₂, z₃, until the chain reaches a satisfactory threshold.

Generalized algorithm

Formally, the algorithm is represented by first initializing the starting positions, then taking T consecutive steps, where each step samples one variable z_j from its conditional given the most recent values of all the other variables.

Image source: PRML¹ Ch. 11.3, Gibbs Sampling
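As a concrete illustration, here is a minimal Gibbs sketch for a toy target, a standard bivariate Gaussian with correlation ρ (an assumed example, chosen because both conditionals are known in closed form as univariate Gaussians):

import numpy as np

def gibbs_bivariate_gaussian(n_samples, rho=0.8, seed=0):
    # Gibbs sampling from a standard bivariate Gaussian with correlation rho
    rng = np.random.default_rng(seed)
    z1, z2 = 0.0, 0.0  # arbitrary initialization
    cond_std = np.sqrt(1 - rho ** 2)  # std dev of each conditional
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        # z1 | z2 ~ N(rho * z2, 1 - rho^2)
        z1 = rng.normal(rho * z2, cond_std)
        # z2 | z1 ~ N(rho * z1, 1 - rho^2)
        z2 = rng.normal(rho * z1, cond_std)
        samples[t] = z1, z2
    return samples

samples = gibbs_bivariate_gaussian(10_000)

Note that there is no accept/reject step: every draw is kept, which is exactly the property that distinguishes Gibbs from the samplers above. In practice we would also discard an initial burn-in portion of the chain, for the reasons listed below.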

Conditions for Gibbs to sample from a target distribution correctly

  1. Invariance. The target distribution p(z) is invariant under each individual Gibbs step, and therefore under the entire Markov chain.
  2. Ergodicity. If the conditional distributions are all non-zero everywhere, ergodicity follows, since any point in z-space is then reachable in a finite number of steps.
  3. Sufficient burn-in. As with any method that requires random initialization, the first few samples depend on the initialization, and this dependence weakens only after many iterations.

How does this relate to Metropolis-Hastings?

In Metropolis-Hastings, we defined the acceptance probability as the ratio in Eq. 11.44. A Gibbs step that updates variable z_k can be viewed as a Metropolis-Hastings step with proposal q(z* | z) = p(z_k^* | z_{\k}), where z_{\k} denotes all variables other than z_k, which the proposal leaves unchanged (z_{\k}^* = z_{\k}). Substituting p(z) = p(z_k | z_{\k}) p(z_{\k}) into the acceptance ratio gives (PRML¹ Eq. 11.49):

$$A(z^*, z) = \frac{p(z^*)\, q(z \mid z^*)}{p(z)\, q(z^* \mid z)} = \frac{p(z_k^* \mid z_{\setminus k}^*)\, p(z_{\setminus k}^*)\, p(z_k \mid z_{\setminus k}^*)}{p(z_k \mid z_{\setminus k})\, p(z_{\setminus k})\, p(z_k^* \mid z_{\setminus k})} = 1$$

since z_{\k}^* = z_{\k} makes every factor in the numerator cancel with one in the denominator. Thus the Metropolis-Hastings proposal steps are always accepted, as we saw in the Gibbs algorithm.

Variations

Since the Gibbs method updates one variable at a time, consecutive samples are strongly dependent. To overcome this, we can use an intermediate strategy that samples groups of variables jointly instead of individual variables, known as blocking (blocked) Gibbs.

Similarly, by the nature of Markov chains, successively drawn samples will be correlated. To obtain approximately independent samples, we can sub-sample (thin) the sequence, as sketched below.
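Thinning is a one-line operation on a chain like the one produced in the Gibbs sketch above (the interval k is a hypothetical choice that should be tuned to the chain's observed autocorrelation):

k = 10  # keep every 10th sample; choose k based on the chain's autocorrelation
thinned_samples = samples[::k]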

Now that we’ve walked through how each algorithm works and its application areas, let’s summarize the defining characteristic of each method.

1. Inverse Transformation Sampling

  • Data Size: Best for moderate-sized datasets.
  • Time: Generally efficient for univariate distributions.
  • Data Complexity: Use for simple distributions where the cumulative distribution function (CDF) and its inverse are known and easy to compute.
  • Consider avoiding if: Sampling high-dimensional variables/distributions.
  • Biggest Advantage: High accuracy if the CDF accurately reflects the target distribution.
  • Requirement: The CDF must be known and invertible.

2. Metropolis-Hastings (MCMC)

  • Data Size: Scalable and suitable for large datasets.
  • Time: Can be computationally intensive, depending on the complexity of the target distribution.
  • Data Complexity: Ideal for complex or multi-modal distributions.
  • Biggest Advantages:
    – Can sample from a distribution known only up to its normalization constant
    – Great for exploring the global structure of a distribution, with guaranteed asymptotic convergence to the target distribution
  • Disadvantage: may suffer from very slow convergence when
    – the target distribution is complex or multimodal, since the algorithm may get stuck in local modes and have difficulty transitioning between them;
    – the variables are highly correlated;
    – the space is high-dimensional;
    – the initial values or step sizes are poorly chosen

3. Gibbs Sampling

  • Data Size: Suitable for both small and large datasets.
  • Time: Often more efficient than random-walk Metropolis-Hastings, since every proposal is accepted and no computation is wasted on rejected samples.
  • Data Complexity: Best used when dealing with high-dimensional distributions where you can sample from the conditional distribution of each variable.
  • Biggest Advantages:
    – Works whenever the conditional distributions are easy to compute and sample from;
    – Less prone to local-mode traps than random-walk Metropolis.
  • Requirements:
    – Markov chain ergodicity
    – The full conditional distributions must be known and tractable

In summary: use inverse transformation sampling when the CDF is known and invertible, Metropolis-Hastings when the target is complex or known only up to a normalization constant, and Gibbs sampling when the full conditional distributions are tractable.