Frequentist vs Bayesian Definitions of Probability (vidhya)

What

What is Bayes' theorem?

  • A formula for combining prior beliefs with observed evidence to obtain a "posterior" distribution (Metaa)
    • It is central to Bayesian statistics, where one infers a posterior over the parameters of a statistical model given the observed data.

Why

Why do we need Bayes' theorem?

  • To update the probability of a hypothesis, H, in light of some body of data. (TB p.23)
  • It is diachronic: something is happening over time; in this case, the probability of the hypotheses changes over time as we see new data. (TB p.23)

Why do we need Posterior probability?

  • In finance, Bayes' theorem can be used to update a previous belief once new information is obtained.
    • Prior probability represents what is originally believed before new evidence is introduced, and posterior probability takes this new information into account. (Investopedia)

How

  • We use the product of the prior and the likelihood to arrive at a posterior via P(w | x) ∝ P(x | w) P(w). (Dive into Deep Learning)
A few components constitute the formula. The images below deliver the same meaning in different wording:

notion image
notion image
notion image
notion image
notion image
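For reference, here is the theorem the images express, written with hypothesis H and data D to match the diachronic reading below:

```latex
P(H \mid D) = \frac{P(H)\, P(D \mid H)}{P(D)},
\qquad
P(D) = \sum_{i} P(H_i)\, P(D \mid H_i)
```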

But in terms of the diachronic interpretation (TB):
Posterior : The probability of the hypothesis after we see the data
  • Something we want to compute
Prior : The probability of the hypothesis before we see the data
  • (It is subjective.) Sometimes it can be computed, but often it cannot, because reasonable people use different background information or interpret the same information differently.
    • That is why it is called the prior: it stands for knowledge held prior to seeing the data.
Likelihood : The probability of the data under the hypothesis
  • Easiest part to compute
The normalizing constant : The probability of the data under any hypothesis
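As a minimal sketch of how these four components fit together for a binary hypothesis (the function name and signature are my own):

```python
def bayes_update(prior, likelihood, likelihood_if_false):
    """Posterior P(H | D) for a binary hypothesis H.

    prior               -- P(H): probability of the hypothesis before seeing the data
    likelihood          -- P(D | H): probability of the data under the hypothesis
    likelihood_if_false -- P(D | not H): probability of the data under the alternative
    """
    # Normalizing constant: the probability of the data under any hypothesis.
    evidence = prior * likelihood + (1 - prior) * likelihood_if_false
    return prior * likelihood / evidence
```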

Example : Cancer prediction (vidhya)

Scenario

The patient was tested three times before the oncologist concluded that she had cancer. The general belief is that 1.48 out of 1,000 people in the US had breast cancer at the time this test was conducted. Three rounds of the test were done, and the patient was only diagnosed with cancer if she tested positive in all three of them.

Identify the components

Posterior : The probability of having cancer given that she tested positive on the first test (this is what we want to compute)
Prior : 0.00148 (the general belief that 1.48 out of 1,000 people have cancer; something we know before we observe the data, i.e. our prior knowledge)
Likelihood : 0.93 (the probability of testing positive given that the person has cancer, i.e. the sensitivity of the test)
The normalizing constant : 0.0113616 (the probability of testing positive regardless of whether the person has cancer: 0.00148 × 0.93 + 0.99852 × 0.01)

Let’s examine the test in detail :
  • Sensitivity of the test (93%): the true positive rate, P(+ | cancer)
  • Specificity of the test (99%): the true negative rate, P(− | no cancer)
notion image

Q1. The probability of having cancer given that she tested positive on the first test

So, let's start with a conditional probability. We want to calculate P(cancer | +).
To calculate the probability of testing positive, note that a person can have cancer and test positive, or not have cancer and still test positive:
P(cancer | +) = 0.0013764 / [0.0013764 + (0.99852 × 0.01)] ≈ 0.1211
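Using the bayes_update sketch from above, the same arithmetic reads:

```python
p1 = bayes_update(prior=0.00148, likelihood=0.93, likelihood_if_false=0.01)
print(round(p1, 4))  # 0.1211
```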

Q2. The probability of having cancer given that the patient tested positive in the second test (as we see new data, we update via Bayes' rule)

Now remember, we only do the second test if she tested positive in the first one. The person is therefore no longer a randomly sampled person but a specific case: we know something about her.

Hence:
  • (changed) The prior probability should change: we update the prior with the posterior from the previous test.
  • (unchanged) Nothing changes in the sensitivity and specificity of the test, since we are doing the same test again. Look at the probability tree below.
The probability tree depicts the description above.

So, let’s calculate again the probability of having cancer given she tested positive in the second test.
notion image
notion image
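A sketch of the sequential updating described above: the posterior of each test becomes the prior of the next (reusing the hypothetical bayes_update from earlier):

```python
p = 0.00148  # initial prior: the general belief about prevalence
for test in (1, 2, 3):
    # The previous posterior becomes the new prior; the test itself is unchanged.
    p = bayes_update(prior=p, likelihood=0.93, likelihood_if_false=0.01)
    print(f"posterior after test {test}: {p:.4f}")
# posterior after test 1: 0.1211
# posterior after test 2: 0.9276
# posterior after test 3: 0.9992
```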

Example 2 : Sci-fi

(from here)
Chinese Version (Here)

What is a prior-posterior conflict? (34:05 in ETC2420 lecture 11)

Bayesian models are predicated on your choice of prior. The data updates the prior distribution to give a posterior distribution.
  • If you set your prior particularly badly, you can end up with really bad values, which can lead to problems.
When your prior information (i.e. the parameter values previously thought of as reasonable) does not contain any values that could plausibly produce the data you actually see,
this phenomenon is called a prior-posterior conflict.

(How) How do we know whether the prior is good?

Method 1 : Do the prior and posterior overlap?
By looking at whether or not your prior and posterior overlap, you can see whether your prior is reasonable.
  • If they do not overlap, this is a bad sign. It indicates that your prior reasoning is wrong, or it could mean that your data is wrong.
  • "Wrong" in the sense that the data can be corrupted in a variety of ways, e.g. data-entry staff keying in values with errors.
The right one is called the Cauchy distribution.
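A rough numeric version of this check, as a sketch: the Beta/Binomial model and all numbers below are my own illustration, not from the lecture.

```python
import numpy as np
from scipy import stats

# Hypothetical example: a Beta prior for a proportion, updated after
# observing 95 successes in 100 trials (Beta-Binomial conjugacy).
prior = stats.beta(2, 8)               # prior belief centred near 0.2
posterior = stats.beta(2 + 95, 8 + 5)  # conjugate update with the data

# Compare central 95% intervals; disjoint intervals signal a prior-posterior conflict.
lo_pr, hi_pr = prior.ppf([0.025, 0.975])
lo_po, hi_po = posterior.ppf([0.025, 0.975])
print("prior 95% interval:    ", np.round([lo_pr, hi_pr], 3))
print("posterior 95% interval:", np.round([lo_po, hi_po], 3))
print("conflict!" if min(hi_pr, hi_po) < max(lo_pr, lo_po) else "intervals overlap")
```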
Method 2 : Posterior predictive checking (45:00 in ETC2420 lecture 11)
  • It is also a graphical model evaluation, or so-called visual inspection.
  • You use the posterior samples to conduct graphical checks of whether the predictive distributions (e.g. from Bayesian models) fit the observed data.
Check whether the big features of the data are appropriately captured.
  • It is informal inference, meaning you cannot do hypothesis testing; however, you can still understand what is good or bad about your model.

Another example is from here p.31.
  • We simulate predicted data for many, many credible parameter values to create representative distributions of what the data would look like according to the model.
  • The predicted weight values are summarized by vertical bars that show the range of the 95% most credible predicted weight values. The dot at the middle of each bar shows the mean of the predicted weight values.
  • By visual inspection of the graph, we can see that the actual data appear to be well described by the predicted data. The actual data do not appear to deviate systematically from the trend or band predicted from the model.
notion image

  • If the actual data did appear to deviate systematically from the predicted form, then we could contemplate alternative descriptive models.
    • For example, the actual data might appear to have a nonlinear trend. In that case, we could expand the model to include nonlinear trends. It is straightforward to do this in Bayesian software, and easy to estimate the parameters that describe nonlinear trends.
    • We could also examine the distributional properties of the data. For example, if the data appear to have outliers relative to what is predicted by a normal distribution, we could change the model to use a heavy-tailed distribution, which again is straightforward in Bayesian software.
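A minimal sketch of such a posterior predictive check, assuming a simple normal model and pretending we already have posterior draws in hand (all numbers here are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data (illustrative).
y = rng.normal(loc=5.0, scale=2.0, size=50)

# Stand-ins for posterior draws of (mu, sigma) from a fitted normal model.
mu = rng.normal(5.0, 0.3, size=1000)
sigma = np.abs(rng.normal(2.0, 0.2, size=1000))

# For many credible parameter values, simulate replicated datasets.
y_rep = rng.normal(mu[:, None], sigma[:, None], size=(1000, y.size))

# Check whether big features of the data fall inside the predictive 95% band.
for name, stat in [("mean", np.mean), ("std", np.std)]:
    lo, hi = np.quantile(stat(y_rep, axis=1), [0.025, 0.975])
    print(f"{name}: observed={stat(y):.2f}, predictive 95% band=({lo:.2f}, {hi:.2f})")
```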

Difference between an OLS model and Bayesian regression

here p.31
Instead of showing only a point estimate, we can draw a range of lines, each representing a different estimate of the model parameters (see the sketch below).
As the number of data points increases, the lines begin to overlap because there is less uncertainty in the model parameters.
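A sketch of the contrast: under a flat prior and a plug-in noise variance, the posterior over a linear model's coefficients is approximately N(beta_hat, sigma² (XᵀX)⁻¹), so we can draw many credible lines around the single OLS estimate (the data and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: y = 1.5 + 0.8 x + noise.
n = 30
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, n)

# OLS: a single point estimate of (intercept, slope).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Bayesian view: draw many credible (intercept, slope) pairs from the posterior.
sigma2 = np.sum((y - X @ beta_hat) ** 2) / (n - 2)   # plug-in noise variance
cov = sigma2 * np.linalg.inv(X.T @ X)
lines = rng.multivariate_normal(beta_hat, cov, size=200)

print("OLS point estimate:", np.round(beta_hat, 2))
print("95% spread of 200 posterior slopes:",
      np.round(np.percentile(lines[:, 1], [2.5, 97.5]), 2))
# With more data points, this spread shrinks and the drawn lines overlap more.
```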
 


Supplementary Questions

 
Base Rate Fallacy
Bias-Variance tradeoff (BV)