type
Post
Created date
Aug 18, 2021 11:43 PM
category
Data Science
tags
Machine Learning
Machine Learning
status
Published
Language
From
summary
slug
password
Author
Priority
Featured
Featured
Cover
Origin
Type
URL
Youtube
Youtube
icon
Table of content

Week 1 CDF


How to draw random number with distribution

 
Poisson :
rpois(n = 5, lambda = 1)
T-dist :
rt(20, df = 10)
ND :
rnorm(1, mean = 3, sd = 2)
Note : the mean of these draws is not exactly the same as declared. So, you cannot guarantee that the mean is exactly 3.

Simulation

Q1 :
Supposed that we want to explore how variable the sample mean of 100 normally distributed samples is. So, first, we Simulate 100 variables. Then Compute the sample mean.
Run the experiment 100,000 times and make a histogram of the sampling distribution of the sample mean.

# 1. Run the experiment 100,000 times N <- 100000 draws_combined <- tibble( experiment = rep(1:N, each = 100), draw = rnorm(N * 100, 3, 1) ) # 2. sampling distribution of the sample mean. sampling_distribution <- draws_combined %>% group_by(experiment) %>% summarise(sample_mean = mean(draw))
Step 1, 2
# make a histogram sampling_distribution %>% ggplot(aes(sample_mean)) + geom_histogram(fill = "purple", bins = 30) + theme_bw() + xlab("Sample mean") # The version of making a bell curve. sampling_distribution %>% ggplot(aes(sample_mean)) + geom_histogram(aes(y = after_stat(density)), fill = "purple", bins = 50) + stat_function(fun = function(x) dnorm(x, 3,sqrt(1/100)), colour = "red", linetype = "dashed", size = 1.2) + theme_bw() + xlab("Sample mean")
Step 3 : Histogram

Q1.1
For the same experiment, plot the sampling distribution of the largest observation in the sample.
sampling_distribution <- draws_combined %>% group_by(experiment) %>% summarise(sample_mean = max(draw)) #take the largest observation
#make a histogram
sampling_distribution %>% ggplot(aes(sample_mean)) + geom_histogram(fill = "purple", bins = 30) + theme_bw() + xlab("Sample standard deviation")

CDF & eCDF - Visualise more complex uncertainty

Q1 . Test if the data is following a ND, using comparison with CDF
tibble(sample = data) %>% ggplot(aes(x = sample)) + stat_ecdf() + stat_function(fun = pnorm, colour = "blue", linetype = "dashed")
The empirical CDF is not close enough to the theoretical one to say that it's from the same distribution.
notion image

Q2. Now, working the same thing for 100 times.
#Data
set.seed(301) tibble(draw = rt(20,df= 10)) %>% ggplot(aes(x = draw)) + stat_ecdf() + stat_function(fun = pnorm, colour = "blue", linetype = "dashed")
#Visualisation
ggplot(experiments, aes(x = draw)) + stat_ecdf(aes(group = experiment), colour = "lightgrey", alpha = 10/100) + stat_ecdf(data = tibble(real_data = data), aes(x = real_data)) + stat_function(fun = pnorm, colour = "blue", linetype = "dashed") + theme_classic() + xlab("x")
Conclusion : What we have shown is that we cannot distinguish between the sample and 20 draws from a standard ND.
Conclusion : What we have shown is that we cannot distinguish between the sample and 20 draws from a standard ND.
Notes on visualisation code
  • Making the empirical CDFs a lighter colour is useful to make them less prominent.
  • The alpha parameter makes the lines semi-transparent, helping reduce their intensity.
    • This parameter is difficult to set and you should try a bunch of different values. (alpha = 1 is the default value, and alpha = 0 is fully transparent.)

Q3 . The difference when the experiment has 2000 observations instead of 20
#data
N <- 100 data <- rt(2000, df = 10) experiments <- tibble(experiment = rep(1:N, each = 2000), draw = rnorm(N * 2000, 0, 1))
#Visualisation
ggplot(experiments, aes(x = draw)) + stat_ecdf(aes(group = experiment), colour = "lightgrey", alpha = 10/100) + stat_ecdf(data = tibble(real_data = data), aes(x = real_data)) + stat_function(fun = pnorm, colour = "blue", linetype = "dashed") + theme_classic() + xlab("x")
notion image

Q4. The difference between 20 samples from a T-distribution with 5 degrees of freedom and a N(0,1) distribution
#data
N <- 100
data <- rt(20, df = 5) experiments <- tibble(experiment = rep(1:N, each = 20), draw = rnorm(N * 20, 0, 1))
#Visualisation
ggplot(experiments, aes(x = draw)) + stat_ecdf(aes(group = experiment), colour = "lightgrey", alpha = 10/100) + stat_ecdf(data = tibble(real_data = data), aes(x = real_data)) + stat_function(fun = pnorm, colour = "blue", linetype = "dashed") + theme_classic() + xlab("x")
notion image

Week 2 Hypo testing

Wilcoxon test - nonparametric


If we do not assume anything about the underlying distribution
( that is we do not know if it is a ND ), test if we could reject the null hypo.
null hypo :
1. the diff of marks is a ND;
2. quiz1 and quiz2 shares the same dist ;
3. diff = 0


It turns out p (≤ 0.05), so There is strong evidence that the null hypothesis can be rejected.
That is, quiz1 and quiz2 does not share the same dist.

Shapiro Wilks test - Test if the population is ND


So, if the result of shapiro.test ≤ 0.5 , we assume it is ND and can use T-test(), else Wilcoxon.Test()

Q1 . Overall test.

Although her full time work is focused on strapping ice skates to tiny kittens, she is pretty good at statistics and immediately sees a problem:
she doesn’t know what type of null distributions to check against. So she decides to check 3 different distributions that all have the same mean.
  1. Both test scores come independently from N(50, 252) distributions
  1. Both test scores come independently from Uniform(10, 90) distributions
  1. Both test scores come independently from Exp(1/50) distributions
Her reasoning is that the first one is the best case for the tests, the second corresponds to students who are at all levels, and the third corresponds to more students getting low marks than high marks.
Q1.1 : Make a new function hypothesis_test() that implements Ambrose's proposal.
hypothesis_test <- function(x) { shapiro <- shapiro.test(x) is_normal <- (shapiro$p.value > 0.05) if(is_normal) { test <- t.test(x) } else { test <- wilcox.test(x) } return(test$p.value < 0.05) }

Q1.2 Assess the type 1 error for Ambrose's procedure for each of Eunice's null distributions and class sizes of $n=10, 100, 400$.
# make a method called type1_prob type1_prob <- function(n, n_simulations = 10000){ experiments <- tibble( experiment = rep(1:n_simulations, each = n), quiz1 = rnorm(n * n_simulations, mean = 50, sd = 25), quiz2 = rnorm(n * n_simulations, mean = 50, sd = 25), difference = quiz2 - quiz1 ) tests <- experiments %>% group_by(experiment) %>% summarise(reject = hypothesis_test(difference)) return(mean(tests$reject)) } #Check the null hypo of the first one : # Both test scores come independently from N(50, 252) distributions tibble(n = c(10, 100, 400), map_dbl(n, type1_prob))
Test the 2nd and 3rd
unif_quiz <- function(n) { return(runif(n, 10, 90)) } exp_quiz <- function(n) { return(rexp(n, 1 / 50)) } type1_prob <- function(n, rand_quiz, n_simulations = 10000){ experiments <- tibble( experiment = rep(1:n_simulations, each = n), quiz1 = rand_quiz(n * n_simulations), quiz2 = rand_quiz(n * n_simulations), difference = quiz2 - quiz1 ) tests <- experiments %>% group_by(experiment) %>% summarise(reject = hypothesis_test(difference)) return(mean(tests$reject)) } #Check the null hypo of the second one : tibble(n = c(10, 100, 400), map_dbl(n, ~type1_prob(.x, unif_quiz))) #Check the null hypo of the third one : tibble(n = c(10, 100, 400), map_dbl(n, ~type1_prob(.x, exp_quiz)))

 

Week 3

Violin plot

plot_data %>% ggplot(aes(x = score, y = test)) + geom_violin()
Violin p = boxplot + histogram
notion image
 

Q1 : Are there any apparent similarities or differences between the two intervals or the appropriateness of the methods (T.test and Bootstrap) for this application?
  1. plot a histogram of the diff
    1. cbt %>% ggplot(aes(x = diff)) + geom_histogram(aes(y = after_stat(density))) + geom_density(colour = "blue", linetype = "dashed")
notion image

CLT confidence interval

For the t.test
  1. Compute the traditional 95% confidence interval for $\mu$ assuming it is (approximately) normal^[This is often called the "CLT confidence interval" because by the central limit theorem, the sampling distribution for $\mu$ is approximately Gaussian (aka normal).
    1. clt_ci <- t.test(cbt$diff)$conf.int
For the Bootstrap
Generate $5000$ Bootstrap samples.
  1. Do the bootstrap
    1. bootstrap <- tibble(experiment = rep(1:5000, each = 60), index = sample(1:60, size = 60*5000, replace = TRUE), ystar = cbt$diff[index])
  1. Compute and report the 95% Bootstrap confidence interval for $\mu$.
    1. y_bar <- mean(cbt$diff)
      bias <- bootstrap %>% group_by(experiment) %>% summarise(delta_star = y_bar - mean(ystar))
      boot_ci <- y_bar + quantile(bias$delta_star, c(0.025, 0.975))
All in all,
Plot the density estimate (overlayed histogram and kernel density plot) corresponding to the empirical Bootstrap distribution (that is the distribution of y_bar + delta_star) and the position of the 2.5% and 97.5% quantiles of this distribution. In addition, indicate on the graph each of the following items:
  • the position of the sample mean shown by a solid vertical line,
  • the lower 2.5% and 97.5% quantiles of the empirical Bootstrap distribution as vertical dashed lines,
  • the lower and upper end points of the CLT-based confidence interval (from part i) as vertical dotted lines,
  • an appropriate title, and
  • appropriately modified axis labels.

colours <- ggthemes::colorblind_pal()(4) bias %>% ggplot(aes(x = delta_star + y_bar)) + geom_histogram(aes(y = after_stat(density)), fill = colours[1], alpha = 0.6) + geom_density(colour = "blue", linetype = "dashed") + geom_vline(xintercept = y_bar, colour = colours[2]) + geom_vline(xintercept = boot_ci[1], linetype = "solid", colour = "Red") + geom_vline(xintercept = boot_ci[2], linetype = "solid", colour = "Red") + geom_vline(xintercept = clt_ci[1], linetype = "dotdash", colour = "Green") + geom_vline(xintercept = clt_ci[2], linetype = "dotdash", colour = "Green") + ggtitle("The estimated distribution of the difference") + xlab("Average difference between scores")
notion image

Q2 .
Compute the bootstrap confidence interval for the population skewness for
  • A Gaussian distribution, which has skewness zero (regardless of the mean and variance parameters)
  • An Exponential distribution, which has skewness 2 (regardless of the rate parameter)
  1. Write a function that computes the coverage for the bootstrap confidence intervals. Ideally it should let you specify:
      • The size of the sample
      • The distribution the sample is drawn from
      • The the true value of the skewness
      • The number of bootstrap simulations
Note 1 : For both distributions, compute the confidence intervals for $n=10$ and $n= 100$ and comment on the coverage. (Hint: Expect it to be ok for one of them and bad for the other)
  1. Make a function of skewness
skewness <- function(y) { numerator <- mean((y - mean(y))^3) denom <- (mean((y - mean(y))^2))^(3/2) return(numerator / denom) }
  1. Is it in the confident interval?
is_in_ci <- function(n, simulate = function(n) rexp(n,1), skew_true = 2, n_sim = 1000) { y <- simulate(n) bs <- tibble(experiment = rep(1:n_sim, each = n), ind = sample(1:n, n*n_sim, replace = TRUE), ystar = y[ind]) bias <- bs %>% group_by(experiment) %>% summarise(delta = skewness(y) - skewness(ystar)) int <- skewness(y) + quantile(bias$delta, c(0.025, 0.975)) return((skew_true > int[1]) & (skew_true < int[2])) }
Let's just do this for n = 10 and the exponential.
is_in = map_dbl(1:1000, ~is_in_ci(10))
mean(is_in)
  • I get about 0.56 - this is bad!
    • It turns out that this measure of skewness can be quite biased when the distribution is skewed, which doesn't actually make it very useful! This combined with the small sample size makes the bootstrap fail.
  • The gaussian case is a bit better, with 88% coverage for n=10 and
    • 92% coverage for n = 100. But it's still short of the true coverage!
  • It turns out that skewness is moderately difficult to compute!
 

Week 5

Code
--- title: "Statistical Thinking Tutorial" author: "Week 6" date: "29/08/2021" output: pdf_document: default html_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## What's on deck for this week This week we are going to look at censored data. We will - A quick check in with QQ-plots - Learn how to fit survival data with the `survival` package - A closer look at the Kaplan-Meier curve - A better look at medians - Fitting parametric survival models # An amuse-bouche a. Simulate samples of size $n=$ 30, 100, 500 from these distributions i. Lognormal(4, 2) ii. Gamma(3, 3) b. Make a QQ-plot of each these samples comparing to . Explain how closely the samples, of different sizes, appears to match the theoretical distribution. c. By drawing 100 samples of size 30 from the Gamma(3,3), visualise how variable the QQ-plot can be. ## Dealing with right-censored data in R Right censored data is a common type of "time to event" data where the event is only observed for some of the observations. An example would be when the event is a death due to acute myelogenous leukemia. Some patients in the sample will die before the end of the study, while some (hopefully most!) will still have not died when the study ends. In class, we talked about how important^[We talked about the statistical reasons for doing this, but when dealing with serious data about death, we are bound both morally and ethically to use our data as best we can.] it is to correctly account for both the events that we have observed and the ones that will occur, unobserved, in the future. This data is so common that there is a packing in R that deals with it _that you you don't even have to install_. It's part of the basic R distribution. This package is, creatively, names `survival`^[Survival can deal with more complex censoring than just right censoring. It does a bunch of stuff that we are not going to look at in this class. It's a real industrial-strength thing.]. In order to deal with survival data, we need to keep track of two things: 1. The observed time 2. An indicator of whether this was an event or if we just stopped measuring (a censoring). You can check the help for other options, but this is usally coded as a `1` if it's an event and a `0` if it is not. We can see this data structure in the `aml` dataset (the aforementioned leukemia) from the `survival` package. ```{r, echo = TRUE, eval = TRUE, message = FALSE} library(survival) library(tidyverse) head(aml) dim(aml) ``` This data contains an extra factor (`x`) that indicates what type of chemotherapy regime was used. We are going to use that to filter the data into subpopulations. One of the key functions in the survival package is `Surv`, which lets us put the time and event information together. ```{r, echo = TRUE, eval = TRUE} Surv(aml$time, aml$status) %>% head() ## If you don't understand it, you can view on the next code view(cbind(aml, Surv(aml$time, aml$status))) ``` Notice that this separates the times into numbers (9, 13) and cenored times (13+), where the + indicates that the event occurs at some point after 13. ## Making a survival curve To fit survival data, we use the `survival::survfit()` function. ```{r, echo = TRUE, eval = TRUE} fit <- survfit(Surv(time, status) ~ 1, data = aml) ``` Let's break that down: - The `Surv(time, status)` term is there to link the time to event with the event indicator - The tilde (`~`) is a special character in R that we will see more of when we start moving into regression, but it exists to separate the data (the left of `~`) from the "predictors" on the right. For today, we are just using `~ 1`, which says "there are no predictors". - The `data = aml` tells the function which data to fit the survival curve on (and where to look for the columns `time` and `status`) The fit object contains a bunch of stuff, but the easiest thing we can do is look at what it prints out as a default ```{r, echo = TRUE, eval = TRUE} fit ``` This gives the number of observations, the number of observed events and an estimate with 95% confidence interval for the median survival time. We could've also gotten this median with the commnad ```{r, echo = TRUE, eval = TRUE} median(Surv(aml$time, aml$status)) ``` (The `50` is labelling the median as the 50% quantile. Is it terrible output? Yes.) We can use the `fit` object to plot the Kaplan-Meier estimate of the survival curve. There are a few ways to do this, but we will use the `survminer::ggsurvplot()` function. To use this, you will need to install the `survminer` package. ```{r, echo = TRUE, eval = FALSE} install.packages("survminer") ``` We can then make the plot. ```{r echo = TRUE, eval = TRUE} library(survminer) ggsurvplot(fit) ``` By default, this plots a confidence interval and you can turn this off with the option `conf.int = FALSE`. ### Question 1. You can plot the Kaplan-Meier curves for the sub-populations defined by the covariate `x`, by replacing `~ 1` with `~ x` in the `survfit` call. Look at the resulting medians and Kaplan-Meier curves (use the same call to `ggsurvplot`). Write a few sentences commenting on the similarities and differences. ## A closer look at the Kaplan-Meier curve So how did `ggsurvplot` make that plot? If we look a little closer at the `fit` object, we can work it out . ```{r echo = TRUE, eval =TRUE} glimpse(fit) ``` These are all the things that are computed when you call `survfit`. - `time` is the set of unique times in the data ($t_j$ in the lectures) - `n.event` is the number of events that were observed at each time ($d_j$ in the lectures) - `n.censor` is the number of censored observations at each time - `n.risk` is the number at risk at each time ($n_j$ in the lectures) - `surv` is the Kaplan-Meier estimate of the survival time. We can check that we can compute `surv` from `n.event` and `n.risk`: ```{r, echo = TRUE, eval = TRUE} cumprod(1 - fit$n.event / fit$n.risk) - fit$surv ``` ### Questions - What does it mean that the above expression is (almost) zero? - Why isn't it _exactly_ zero? - What does `cumprod` do? (Also look up what `cumsum` does) - How can you compute `n.censor` from this data? ## Computing the median Finally, we can use the estimated survival curve to find the median. This is not unique except for in the very unusual situation where `fit$surv` is exactly `0.5` at one time. When there is no unique median, we have a number of options, two popular ones are: - Select the average of the two closest times with `fit$surv` on either side of 0.5 - Select the time with `fit$surv` closest to 0.5. For big enough data sets, both of these should give about the same value, but for small data sets the second option (which is what R uses) is slightly more reliable as it's possible that one of the two closest times has a survival estimate that is quite far from 0.5. Let's look at this! ```{r, echo = TRUE, eval = TRUE} summary(fit) ``` The "closest" median is `27`, while the "average" median is `r (23+27)/2`. ### Questions - Write functions to find the median using both methods. They should take a `fit` as an input. - Consider the `aml` data when `x == "Maintained"`. Compute the Kaplan-Meier curve for this subset of the data. Which of the two medians is most appropriate? - Repeat the exercise for `x == Nonmaintained`. - Look at the `diabetic` data (load it with `data(diabetic)`). Why is the median of this data not defined? ## Fitting a parametric survival distribution So far, we have only talked about non-parametric estimates of the survival curve. What if we think that the uncensored data came from a particular distribution? Not every distribution can be used to fit survival data in R, but three of the most useful options are - Exponential - Weibull (_WARNING_ the parameterisation of the Weibull that `survival` uses is different to the standard one in R. Don't even ask.) - Log-normal. We fit these models using the `survival::survreg` function ```{r, echo = TRUE, eval = TRUE} fit2 <- survreg(Surv(time, status) ~ 1, data = aml, dist = "exponential") fit2 ``` What do the paraemters mean? The `(Intercept)` term is the _log_ mean of the survival distribution (in this case the exponential). So the single unknown parameter in the exponential distribution can be computed as ```{r eval = TRUE, echo = TRUE} lambda <- exp(-coef(fit2)) ``` We can then look up what the median of an exponentially distributed random variable is. ![]("exponential.png") Using this we can compare the parametric estimate of the median to the Kaplan-Meier estimate. ```{r, eval = TRUE, echo = TRUE} log(2) / lambda median(Surv(aml$time, aml$status))$quantile ``` We can also compare the survival curves. ```{r, echo = TRUE, eval = TRUE} km_curve <- ggsurvplot(fit, conf.int = FALSE) km_curve$plot + stat_function(fun = pexp, colour = "blue", line_type = "dashed", args = list(rate = lambda, lower.tail = FALSE) ) ``` The resulting survival curve is a prettty good match with the empirical estimate for early times, but after the median, the two estimates diverge wildly. ### Question - Fit the data using a Weibull distribution. And compare the median and the survival curve and comment on your findings. The following code will be useful for translating between the `survival` package's parameterisation of the Weibull distribution and the standard R parameterisation. ```{r, echo = TRUE, eval = FALSE} ## fit3 is the weibull fit pweibull_shape <- 1 / fit3$scale pweibull_scale <- exp(coef(fit3)) ```
 
ETC2420 - Tutorial solution (W4-6) bootstrap confidence intervals; MLE; censored data八大框架 NLP