ETC2420 - Tutorial solution (W7-12) DAG ; Bayesian A/B Testing; Bayesian Regression; TidyModels

type

Post

Created date

Sep 28, 2021 01:30 AM

Week 7 DAG & more about Regression

Part 1 DAG


## You're such a DAG

This week we are looking at using regression to estimate causal effects. Remembering, always, that correlation is not causation, we need a little bit more than straight linear regression to estimate the _causal effect_ of $X$ on $Y$.

The _causal effect_ of $X$ on $Y$ is how much $Y$ would change if we re-did the experiment but with $X$ set to a new value.  This would possibly also change the _other_ covariates in the system _as well as_ the outcome variable $Y$.  This is _different_ from the regression coefficient in most cases, as the regression coefficients tell us our much $Y$ would change if we changed $X$ but kept everything else fixed. 

There are a lot of ways to reason through this process, but one of the more popular ones is using a directed acyclic graph (DAG) to visualise the direct and indirect relationships between variables.

DAGs allow us to visualise a nice condition under which we can estimate a causal effect from a correlative model (like linear regression). Some theory, which is definitely beyond the scope of the course is that if the DAG faithfully represents the system under consideration (in that there ar no undirected paths from $X$ to $Y$ that aren't in the DAG^[This is a _big_ assumption!]), then as long as there are no open back doors between $X$ and $Y$, the estimated effect of $X$ on $Y$ can be interpreted causally!

_This doesn't mean that linear regression is the right tool for finding this effect. But we well get to that soon enough._

This leads to a procedure to find all the of the variables that should be in our model (regression or otherwise):

1. Find your DAG (this is actually the hardest part and will involve subject-matter experts, and probably a great deal of hope and optimism.)

1. Identify all backdoor paths between $X$ and $Y$ (aka paths regardless of arrow direction that go from $X$ to $Y$ that start with an arrow pointing into $X$)

1. Close off all backdoor paths by conditioning (aka adding to the model, aka adjusting) appropriate variables. Typically you add Fork variables. You never add Collider or Pipe variables.

1. Identify all causal (aka directed aka following the arrows) paths from $X$ to $Y$ and _make sure they are open_! (This is usually fine, but if you block off a causal path while closing a back door, you need to find another way to close that particular back door^[This might not exist. In that case you've got to do something more fancy, but that's not usually the case.)

In R, we can use the libraries `dagitty` and `ggdag` to reason with and visualise DAGs.

The `dagify()` function is great because it uses regression notation. `X -> Y` is written as `Y ~ X` just as we would in regression. You can build a graph edge by edge by chaining together these statements.

For example, the classical _fork_ configuration is visualised as follows.

```{r, message = FALSE}
library(tidyverse)
library(dagitty)
library(ggdag)
dagify(Y ~ X, 
       X ~ Z,
       Y ~ Z) %>% ggdag() + theme_dag()

```

You can use R to find all of the paths between $X$ and $Y$ and find a conditioning set (called here the _adjustment set_). To do this you need to specify the variable of interest, which they call the _outcome_ (Y), and the variable we want to change, which they call the _exposure_ (X).


```{r, message = FALSE}

dagify(Y ~ X, 
       X ~ Z,
       Y ~ Z) %>% 
  ggdag_adjustment_set(exposure = "X",
                       outcome = "Y")

```

To interpret this graph, the _adjusted_ variables need to be added to the regression (between $X$ and $Y$), while the unadjusted ones don't.

We can repeat the exercise with a collider triangle. In this case we don't expect to be told to condition on (or adjust for) the collier variable $Z$.


```{r, message = FALSE}

dag <- dagify(Y ~ X, 
       Z ~ Y,
       Z ~ X)

dag %>% ggdag() + theme_dag()
  
dag %>%   ggdag_adjustment_set(exposure = "X",
                       outcome = "Y")

```



We can also use the `dagitty` package to do non-graphical stuff. The `paths()` function gives you a list of all paths between the exposure and the outcome, while the `adjustmentSets()` function gives you a list (or multiple lists if there are options) of covariates you need to add to your model to close all of the back doors.

```{r}
dagify(Y ~ X, 
       X ~ Z,
       Y ~ Z,
       outcome = "Y",
       exposure = "X") %>% paths()


dagify(Y ~ X, 
       X ~ Z,
       Y ~ Z,
       outcome = "Y",
       exposure = "X") %>% adjustmentSets()

```


### Questions



```{r, echo = FALSE, message = FALSE}

dagify(Y ~ X, 
       X ~ Z,
       Y ~ Z,
       Y ~ W,
       W ~ X) %>% 
  ggdag() + theme_dag()


```


1. Before you use R to get the answer, answer the following:
  1. What would happen if you add $Z$ to your regression model? Is this a good idea?
  
<!--
Z is a fork variable so if you don't add it you will see bias in your causal estimate.
1-->
  
  1. What would happen if you add $W$ to your regression model? Is this a good idea?
  
  <!--
W is a pipe variable so if you add it you will see bias in your causal estimate. In particular, you will only see the _direct_ effect of X on Y and you will not see the indirect effect (the one that is mediated by W). This means that you will potentially only see part of the effect X has on Y.
1-->

1. Use `ggdag` to viusalise the adjustment set.

```{r eval = FALSE, echo = FALSE}
dagify(Y ~ X, 
       X ~ Z,
       Y ~ Z,
       Y ~ W,
       W ~ X,
       exposure = "X",
       outcome = "Y") %>% 
  ggdag_adjustment_set()
```

1. Describe the linear regression you would do to estimate the _causal_ effect of $X$ on $Y$.

<!--
The regression should be

fit <- lm(Y ~ X + Z)


-->

Part 2 Regression using Foxes dataset


## Foxes^[This is example is taken from Richard McElraith's Statistical Rethinking (2nd Ed.)]

In this question, we will be considering a data set about 30 different groups of urban foxes in England. Foxes are territorial and each group has a different number of foxes (`groupsize`), and their territory an area (`area`) and an amount of food (`avgfood`).

```{r, message = FALSE}
# devtools::install_github("rmcelreath/rethinking")
library(rethinking)
data(foxes)
head(foxes) %>% knitr::kable()
```

Heavier foxes are, usually, healthier foxes, so we are interested in modelling the weight of each fox (`weight`) and estimate the causal effect that interventions involving changing various covariates has on the weight of the foxes.

We will use the following DAG.

```{r, echo = FALSE}
dagify(weight ~ groupsize,
       weight ~ avgfood,
       groupsize ~ avgfood,
       avgfood ~ area) %>% 
  ggdag_classic() + theme_dag()
```

1. Use regression to model the _causal effect_ of changing the area size on the weight of a Fox. You will need to analyse the DAG to work out which variables need to be included in the model. Comment on the results _and_ the appropriateness of the linear regression model.

<!--
Looking at the graph, avgfood and groupsize are pipe variables, so we should not condition on them! We get the following. 
- All of the diagnostic plots look good.
- The effect of area is not easily distinguishable from zero (roughly +/- 0.1). 

This means that changing the foxes territory area does not change the weight of a fox.
-->
```{r echo = FALSE, eval = FALSE}

fit1 <- lm(weight ~ area, data = foxes)
summary(fit1)
plot(fit1)
```

1. Now estimate the causal effect of adding food to a territory. Comment on your results and the appropriateness of a linear regression model.

<!--
Once again, everything is a pipe so we should not add any other variables to the model.
- All of the diagnostic plots look fine.
- There does not seem to be an obviously non-zero effect of changing the food.

-->

```{r, eval =FALSE, echo = FALSE}
fit2 <- lm(weight ~ avgfood, data = foxes)
summary(fit2)
plot(fit2)
```

1. Finally estimate the causal effect of changing group size. Comment on your results and the appropriateness of a linear regression model.

<!--
This time, there is an open backdoor path groupsize <- avgfood -> weight that we need to close off. We do this by adding avgfood to the regression.

- All of the diagnostic plots look good.
- There is a negative effect of group size on weight (when we adjust for /condition on avgfood, which is appropriate here!)
- Unlike the previous model, avgfood has a positive effect when groupsize is fixed. (This is the difference between a correlative effect and a causal effect!)
-->

```{r echo= FALSE, eval = FALSE}
fit3 <- lm(weight ~ avgfood + groupsize, data = foxes)
summary(fit3)
plot(fit3)
```

1. Write a short paragraph synthesising the results of the three regressions you have done.

<!--
The most interesting thing is that we see conflicting information about the effect of food availability depending on whether we adjust for groupsize or not. There is a very plausible reason for this: the net effect of adjusting for food is zero because when there is more food in an area the group in that area becomes bigger! This corresponds to the concept of an `ideal-free distribution' in population ecology.
-->

Part 3 Some more regression practice (VIF)


## Some more regression practice 

Simulate $n=50$ observations of each of the following variables:

+ $z_{j,i}\overset{i.i.d}{\sim}N(0,1)$ for $j = 1, 2, 3$ and $i=1, 2, \ldots,n$. This produces ***three*** independent sequences of $n ~i.i.d.$ standard normal random variables.

+ $x_{1,i}\overset{i.i.d.}{\sim}Uniform(min=5,~max=15)$, for $i=1, 2, \ldots, n$.

+ $x_{2,i}\overset{i.i.d.}{\sim}Student-t$ with $\nu=4$ degrees of freedom, for $i=1, 2, \ldots, n$.

+ $x_{3,i}\overset{i.i.d.}{\sim}Gamma(shape=3, ~scale=5)$, for $i=1, 2, \ldots, n$.

+ $x_{4,i} = 3+ 2\,z_{1,i}$, for $i=1, 2, \ldots, n$.

+ $x_{5,1} = -1 - 0.96z_{1,i} + 0.28*z_{2,i}$, for $i=1, 2, \ldots, n$.

Then simulate $n$ observation of each of the response variable $y$ according to:
$$y_{i} = \beta_0 + \beta_1 \, x_{1,i} + \beta_2 \, x_{2,i} + \beta_3 \, x_{3,i} + \beta_4 \, x_{4,i} + \beta_5 \, x_{5,i} + \sigma\,z_{i},$$
where $\beta_0=12,~\beta_1=2.5, ~\beta_2=4,~ \beta_3 =1, ~ \beta_4 =9, ~\beta_5=5$ and $\sigma=20$. 

Simulation of these variables are completed in the code chunk below^[Note that as **R** does not use a zero value to index an element of a vector, the indices of the individual $\beta$ coefficients is increased by one.].

```{r}
n <- 50
set.seed(345820)

z1 <- rnorm(n)
z2 <- rnorm(n)
z3 <- rnorm(n)

x1 <- runif(n, min = 5, max = 15)
x2 <- rt(n, df = 4)
x3 <- rgamma(n, shape = 3, scale = 5)
x4 <- 3 + 2*z1
x5 <- -1 - 0.96*z1 + 0.28*z2

tb <- c(12, 2.5, 4, 1, 9, 5 ) #tb for "true beta"
ts <- c(20) #ts for "true sigma"

y <- tb[1] + tb[2]*x1 + tb[3]*x2 + tb[4]*x3 + tb[5]*x4 + tb[6]*x5 + ts[1]*z3
```

Once you are satisfied that your data has been simulated correctly, put all variables in a *tibble* named *df*.

```{r}
df <- tibble(x1=x1, x2=x2, x3=x3, x4=x4, x5=x5, y=y)
```


Produce a scatterplot matrix of all columns of $df$ (including $y$) using the **GGally**::*ggscatmat*() function.

```{r fig.height=8, fig.width=8}
library(GGally)
ggscatmat(df)
```

**Question** Which pairs of variables in the scatterplot matrix appear to show a relationship between variables?Do these seem to align with how the data were simulated? Are there an other features you can detect from these pairwise scatterplots? 


Next we'll pretend we didn't simulate the data, and instead try to identify a suitable multiple linear regression relationship between the response variable $y$ and the five different regressors, including an intercept term. We'll start with fitting the largest model (the one we used to generate $y$), using OLS. We'll call this model "Model 1". 

```{r}
library(broom)
M1 <- lm(formula=y~x1+x2+x3+x4+x5, data=df) # Model 1
tidy.M1 <- tidy(M1)
glance.M1 <- glance(M1)
```

Consider the output from the fit of Model 1. 

```{r eval=FALSE, echo=FALSE}
tidy.M1 # not included for students
glance.M1 # not included for students
```

**Question**  According to the model output, which regression coefficients are (individually) significant at the 5% level? Do the estimated coefficients seem "close" to the true values used to simulate the data? 

<!-- 

Answer to Q2. Need to look at the p-values in *tidy.M1*. The $\hat{\beta}_0$, $\hat{\beta}_2$, $\hat{\beta}_4$ and $\hat{\beta}_5$ all have p-values greater than 0.05. The estimated values are also a bit off - only $\beta_2$ and $\beta_4$ seem "close" to the true values (though this is subjective!) 

-->

**Question**  What can you say about the estimated standard errors of the estimated regression coefficients?

<!-- 

Answer to Q3. The size of the standard errors are pretty large! Of course, the sample size is pretty small for this size of a model. We also have the issue (to be discovered below if not already noticed) that there is multi-collinearity present.

-->

**Question**  What are the $R^2$ and the adjusted $R^2$ values for the fitted Model 1?

<!-- 

Answer to Q4. Need to look at *glance.M1* to see the result: r.squared = 0.5265and adjusted r.squared = 0.4727. The adjusted $R^2$ penalises (reduces) the R^2 because there are many regressors in the model. 

-->

One thing that we can look at to see if there is a lot of correlation between the variances in a model is the _variance inflation factor_ or VIF. This is a nifty way to notice linear correlation between covariates. The basic idea is that if I can estimate the jth covariate $x_j$ extremely well from all of the other covariates ($x_1, \ldots, x_p$, $x \neq j$) then it won't add very much to the regression. The VIF is large when a covariate can be estimated very well using the other covariates, and a rule of thumb is that if the VIF is bigger than 5, we should think carefully about including the variable in the model. (One reason to include it would be a DAG-based reason, but the presence of a high VIF definitely suggests that we should think more carefully about the presence of a covariate in our model.)

The VIF can be computed by using the function `vif()` from the `car` package (which you may need to install). We can calculate the variance inflation factors associated with the regressors in Model 1. 


```{r}
library(car)
vif(M1)
```

Part 4 More about VIF & cook distance


**Question**  Find all variables associated with an excessively large VIF. How should the VIFs be interpreted? (Should you remove *all* of these variables from your model? Discuss possible strategies to decide on the variables to exclude from your model due to the large VIF values.)

<!--

Answer Q5. x4 and x5 have VIFs over 10 (the threshold). Values over 10 mean than if we fit the linear regression like x4 = a0 + a1*x1 + a2*x2 + a3*x3 + a4*x5, we'll get an R^2 of over 90%. (Since $VIF=\frac{1}{1-R^2_4}$ where the R^2_4 is for this special regression for x4. Solve for R^2_4 when VIF=10 and you'll get R^2=90%.)

It makes sense to remove at least one variable since at least one must be close to redundant given the other regressors. (i.e. there is multi-collinearity present). Given there are two values with nearly the same large VIF, it makes sense to take only one out at a time. (We also know this makes sense since we simulated the data...but we can always check the VIFs after one regressor is removed and if the VIF for the other variable is still large we can then take out the second variable.)

-->

The code chunk below fits a *modified* version of Model 1 and saves the result in an object named *MM1*.

```{r}
MM1 <- lm(formula=y~x1+x2+x3+x4, data=df) # "Modified" Model 1
tidy.MM1 <- tidy(MM1)
glance.MM1 <- glance(MM1)
```

Look at the output from the Modified Model 1. 

```{r echo=FALSE, eval=FALSE}
tidy.MM1  # not included for students
glance.MM1  # not included for students
```

**Question**  What is the value of $p$ associated with Modified Model 1?

<!--

Answer Q6. $p$ is the number of regression coefficients (number of regressors), including the intercept (including a constant regressor). So here $p=5$.

-->

**Question**  According to the Modified Model 1 output, which regression coefficients are significant at the 5% level? Do the estimated coefficients seem "close" to the true values used to simulate the data? 

<!--

Answer Q7. The estimated intercept and coefficient on x2 are not (individually) significant. The estimated values are not spot on, but they seem closer to the corresponding true values than under Model 1. 

-->

**Question**  Note the size of the standard errors associated with each estimated coefficient. How do these compare with the coefficients that were not significant in Model 1? (See Q2.) Given the way the data were simulated, do you have any ideas about why there might still be insignificant coefficients, even for the Modified Model 1? 

<!--

See the *glance.MM1* output. The standard errors for the remaining coefficients haven't changed all that much, but we have removed the coefficient that had the very large standard error (x5). Multi-collinearity will tend to inflate the standard errors.

Why still relatively large standard errors? Well, there is still a small sample size here! It seems that the "signal" for the intercept, and for x2, are too small to be detected amid all of the noise in the data. 

-->

**Question**  What are the values of: $R^2$ and the Adjusted $R^2$ values for the fitted Modified Model 1? Compare these values to those from the original Model 1. (See Q4.) Do you think removing regressor $x_5$ has improved or diminished the model?

<!-- 

Answer to Q9. Need to look at *glance.MM1* to see the result: r.squared = 0.5255 or 52.55% and adjusted r.squared = 0.4833 or 48.33%. We really can only compare the adjusted r.squared since the ordinary r.squared doesn't account for the number of regressors. We can see that the adjusted r.squared actually improves with the removal of x5, so the models seems better without x5 in it (on this measure, at least). 

-->
    
Next, calculate the variance inflation factors (VIF) associated with the regressors in Modified Model 1.

```{r}
vif(MM1)
```



Change the value of $p$ in the code chunk below to match the correct value for Modified Model 1. Then run the code chunk to extract the *leverage* and *Cook's D* measures associated with the fit of Modified Model 1 that exceed the "rule of thumb" threshold of $2*p/n$. 

```{r fig.height=5, fig.width=8, eval=FALSE}
df <- df %>% mutate(rowid=1:n)
aug.MM1 <- augment(MM1)
aug.MM1 <- aug.MM1 %>% mutate(rowid=1:n)

p <- 1 # change this once you know the correct value of p

threshold <- 2*p/n # threshold for .hat and .cooksd

exceed_hat <- aug.MM1 %>% 
  arrange(desc(.hat)) %>% 
  select(rowid, .hat) %>% 
  filter(.hat > threshold) 

exceed_cooksd <- aug.MM1 %>% 
  arrange(desc(.cooksd)) %>% 
  select(rowid, .cooksd) %>% 
  filter(.cooksd > threshold) 

df_longer <- df %>% 
  pivot_longer(-c(y, rowid), 
               names_to="regressor", 
               values_to="factor")  

df_longer %>% 
  ggplot(aes(x=factor, y=y, colour=regressor)) + 
  geom_point() + 
  facet_wrap(~regressor, nrow=2) + 
  theme_bw() + 
  geom_text(label=df_longer$rowid, 
            nudge_x = 3, 
            colour="black", 
            size=2,
            check_overlap = T) + 
  ggtitle("y against each regressor") 
```

<!-- code chunk below has the correct value of p (not shown to students) -->

```{r fig.height=5, fig.width=8, eval=FALSE, echo=FALSE}
df <- df %>% mutate(rowid=1:n)
aug.MM1 <- augment(MM1)
aug.MM1 <- aug.MM1 %>% mutate(rowid=1:n)

p <- 5 # correct value here

threshold <- 2*p/n 

exceed_hat <- aug.MM1 %>% arrange(desc(.hat)) %>% dplyr::select(rowid, .hat) %>% filter(.hat > threshold) 

exceed_cooksd <- aug.MM1 %>% arrange(desc(.cooksd)) %>% dplyr::select(rowid, .cooksd) %>% filter(.cooksd > threshold) 

df_longer <- df %>% pivot_longer(-c(y, rowid), names_to="regressor", values_to="factor")  

df_longer %>% ggplot(aes(x=factor, y=y, colour=regressor))+ geom_point() + facet_wrap(~regressor, nrow=2) + theme_bw() + geom_text(label=df_longer$rowid, nudge_x = 3, colour="black", size=2, check_overlap = T) + ggtitle("y against each regressor") 
```

**Question**  How many points exceed the leverage threshold? Find these points using their *rowid* value which is also shown on the faceted plot. (Some will be easier to find than others! Try filtering the data points to see the values of the regressors to make it easier to find the points.)

<!-- 

Answer Q11. Need to view the *exceed_hat* object. Then look at the plot produced above. 

-- code chunk below not shown to students -->

```{r echo=FALSE, eval=FALSE}
exceed_hat

df %>% dplyr::filter(rowid == 21 | rowid == 16 | rowid == 44)
```

**Question**  How many points exceed the Cook's D threshold? 

<!-- 

Answer Q11. Need to view the *exceed_cooksd* object. No points are flagged as influential according to this measure. 

-- code chunk below not shown to students -->

```{r eval=FALSE, echo=FALSE}
exceed_cooksd
```

Week 8 bootstrap to correctly estimate the uncertainty in a prediction & TidyModels

Part 1 : Prediction intervals for linear regression

The distribution of β − ˆβ

In regression, we call the confidence interval for the observed value the prediction interval. We can compute a normality-based interval using R.


data(airquality)
fit <- lm(Ozone ~ ., data = airquality)
new_data <- tibble(Solar.R = 110, Wind = 10,
                   Temp = 74, Month = 5, Day =3)
predict(fit, new_data, interval = "prediction")

We can compare that to the confidence interval for the mean of the new value



predict(fit, new_data, interval = "confidence")

We can see that the fitted value is the same, but the prediction interval is much wider than the confidence interval.

We can see this with a plot.


library(tidyverse)
library(tidymodels)

airquality_pred <- augment(fit, interval = "prediction") %>%
  mutate(interval = "prediction")
airquality_ci <- augment(fit, interval = "confidence") %>%
  mutate(interval = "confidence")

bind_rows(airquality_ci, airquality_pred) %>%
  ggplot(aes(x = Temp, y = Ozone)) +
  geom_point() +
  geom_line(aes(y = .lower), linetype = "dashed") +
  geom_line(aes(y = .upper), linetype = "dashed") +
  facet_wrap(.~interval)

Estimating

Putting it together


## Step 1: Fit the model and compute residuals

fit <- lm(Ozone ~ ., data = airquality)
boot_dat <- augment(fit)

new_data <- tibble(Solar.R = c(90, 110), Wind = c(10,5),
                   Temp = c(100, 74), Month = c(5,5), Day = c(3,4))

new_data <- new_data %>%
  mutate(.pred = predict(fit, new_data))

Next, we need to compute the variance-adjusted residuals.



## Step 2: Make the variance-adjusted residuals
boot_dat <- boot_dat %>%
  mutate(epsilon = .resid / sqrt(1 - .hat)) %>%
  rename(y = Ozone, mu = .fitted) %>%
  select(-starts_with(".")) # remove all the augment stuff

We are going to break with our pattern for doing a bootstrap and instead build a function that computes a single bootstrap resample and computes a single sample of $e_\text{new}$. We will then use map_dbl to repeat this over and over again.


## Step 3: Make a function to build a bootstrap sample
## and compute things from it

boot <- function(dat, new_dat) {
  # assume residuals are called epsilon
  # assumes response is called y
  # assumes predictions are called mu
  # assumes all columns except y and epsilon are covariates
  # assumes new_dat has a column called .pred, which is
  # the prediction for the original data
  n <- length(dat$epsilon)
  n_new <- dim(new_dat)[1]

  dat <- dat %>%
    mutate(epsilon_star = sample(epsilon, replace = TRUE),
           y_star = mu + epsilon_star) %>%
    select(-y, -mu, -epsilon, -epsilon_star)

  fit_star <- lm(y_star ~ ., data = dat)
  dat_star <- augment(fit_star) %>%
    mutate(epsilon = .resid / sqrt(1 - .hat))

  new_dat <- new_dat %>%
    mutate(pred_star = predict(fit_star, new_dat),
           bias = .pred - pred_star,
           epsilon = sample(dat_star$epsilon, n_new, replace = TRUE),
           e_star = bias + epsilon) %>%
    select(-pred_star, -bias, -epsilon)


  return(new_dat)
}

# always check that it doesn't throw an error!
boot(boot_dat, new_data)

Now we can compute the bootstrap estimate of the distribution of $e_\text{new}$



## Step 4: Run this a lot of times to get the distribution of e_new

n_new <- dim(new_data)[1]

error <- tibble(experiment = rep(1:1000),
                e_star = map(experiment,
                                 ~boot(boot_dat, new_data))) %>%
  unnest(e_star) %>%
  mutate(y_new_star = .pred + e_star) %>%
  select(-e_star, -experiment)

intervals <- error %>%
  group_by(across(-c(y_new_star, .pred))) %>%
  summarise(lower = quantile(y_new_star, 0.025),
            pred = first(.pred),
            upper = quantile(y_new_star, 0.975)) %>%
  bind_cols(
    predict(fit, new_data, interval = "prediction") %>%
      as_tibble()) %>%
  select(-fit) %>%
  rename(lower_clt = lwr,
         upper_clt = upr)


intervals %>% knitr::kable(digits = 2)

These are quite different intervals!

Part 2 : Using tidymodels to fit more complex models

1. split the data

Because we are going to be doing complicated things later, we need to split our data into a test and training set. If we do this now we won't be tempted to forget, or to cheat.

The rsample package, which is part of tidymodels makes this a breeze.


library(tidymodels)
office_split <- initial_split(office, strata = season)
office_train <- training(office_split)
office_test <- testing(office_split)

The initial_split() function has a prop argument that controls how much data is in the training set. The default value prop = 0.75 is fine for our purposes.

The strata argument makes sure this 75/25 split is carried out for each season. Typically, the variable you want to stratify by is any variable that will have uneven sampling.

The training() and testing() functions extract the test and training data sets.

2. build a recipe

Now! We can build a recipe!

The first step in any pipeline is always building a recipe to make our data analysis standard and to allow us to document the transformations we made. (And to apply them to future data.)

For predictive modelling, this is a bit more interesting, because we need to specify which column is being predicted. The recipe() function takes a formula argument, which controls this. For the moment, we are just going to regress imdb_rating against everything else, so our formula is imdb_rating ~ .. We also need to do this to the training data, so let's do that.



office_rec <- recipe(imdb_rating ~ ., data = office_train) %>%
  update_role(episode_name, new_role = "ID") %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes())

So there are three more steps.

The first one labels the column episode_name as an "ID" column, which is basically just a column that we keep around for fun and plotting, but we don't want to be in our model.

step_zv removes any column that has zero variance (ie is all the same value). We are applying this to all numeric columns, but not to the outcome. (In this case, the outcome is imdb_rating.)

We are normalizing all of the numeric columns that are the outcome.

3. Build the model

Now! We can talk about the model!

Ok, so how do we fit this linear regression? We will do this way way we saw in class. Later, we will modify this and see what happens.

In order to facilitate us using other method in the future, we are going to use the nice tidymodel bindings so we don't have to work too hard to understand how to use it. One of the best things that tidymodels does is provide a clean and common interface to these functions.

We specify a model with two things: A specification and an engine.


lm_spec <- linear_reg() %>%
  set_engine("lm")
lm_spec

So what does that do? Well it tells us that

we are doing a linear regression-type problem (that tells us what loss function we are using).

we are going to fit the model using the lm function.

Part 3 : other models using tidymodel Workflows

Random Forest & Cross Validation

When we are doing real data analyses, there are often multiple models and multiple data sets lying around. tidymodels has a nice concept to ensure that a particular recipe and a particular model specification can stay linked. This is a workflow.

We don't need anything advanced here, so for this one moment we are going to just add the data recipe to the workflow.


wf_office <- workflow() %>%
  add_recipe(office_rec)
wf_office

Now! Let's get fitting!

To fit the model, we add its specification to the workflow and then call the fit function.


lm_fit <- wf_office %>%
  add_model(lm_spec) %>%
  fit(data = office_train)

We can then view the fit by "pulling" it out of the workflow.


lm_fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  knitr::kable(digits = 2)

We can then compute the metrics on the training data.


office_test<- office_test %>%
  bind_cols(predict(lm_fit, office_test)) %>%
  rename(lm_pred = .pred)

office_test %>% metrics(imdb_rating, lm_pred) %>%
  knitr::kable(digits =2 )

Moving beyond `lm`

The previous fit gave us a warning. It said that the fit was rank deficient (which basically is maths-speak for "something has gone very very wrong"). If we compare the metrics on the training set with those on the test set we can see that they are very different.


office_train %>%
  bind_cols(predict(lm_fit, office_train)) %>%
  metrics(imdb_rating, .pred) %>%
  knitr::kable(digits = 2)

One way that we can make things better is to fit a different model!

The first thing we will look at is a random forest^[These are covered in detail in Chapter 8 of ISLR (the book refered to in the extra reading for week 8) and, while they are cool, they are beyond the scope of the course.], which is an advanced prediction method that tries to find a good non-linear combination of the covariates.

To do this we need to install the ranger package, which fits random forest models.


install.packages("ranger")

Now we just need to change the engine and everything else stays the same!


rf_spec <- rand_forest(mode = "regression") %>%
  set_engine("ranger")

Then we work as above:


rf_fit <- wf_office %>%
  add_model(rf_spec) %>%
  fit(data = office_train)

office_test %>%
  bind_cols(predict(rf_fit, office_test)) %>%
  metrics(imdb_rating, .pred) %>%
  knitr::kable(digits = 2)

We see that this is slightly better than the lm run.

Question

Use cross validation to estimate the variability in the RMSE estimates and comment of whether the complex random forest method did better than linear regression.


folds <- vfold_cv(office_train, v = 20, strata = season)

lm_fit_cv <-  wf_office %>%
  add_model(lm_spec) %>%
  fit_resamples(resamples = folds)

collect_metrics(lm_fit_cv) %>% knitr::kable(digits = 2)

rf_fit_cv <-  wf_office %>%
  add_model(rf_spec) %>%
  fit_resamples(resamples = folds)

collect_metrics(rf_fit_cv) %>% knitr::kable(digits = 2)

Now we can actually tune our model to find a good value of $\lambda$.



wf_rr <- wf_office %>% add_model(rr_tune_spec)

fit_rr <- wf_rr %>%
  tune_grid(resamples = folds,
            grid = lambda_grid
)

Let's see how we did! we collected a bunch of metrics that we can now look at.


fit_rr %>% collect_metrics() %>%
  ggplot(aes(x = penalty, y = mean,
             colour = .metric)) +
  geom_errorbar(aes(ymin = mean - std_err,
                    ymax = mean + std_err),
                alpha = 0.5) +
  geom_line() +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  scale_x_log10() +
  theme(legend.position = "none")

We see that a relatively large penalty is good. We can select the best one!


lowest_rmse <- fit_rr %>% select_best("rmse") ## We want it small!
lowest_rmse

final_rr <- finalize_workflow(wf_rr, lowest_rmse)
final_rr

Now with our workflow finalized, we can finally fit this model on the whole training set, and then evaluate it on the test set.


fit_rr <- final_rr %>% fit(office_train)

office_test %>%
  bind_cols(predict(fit_rr, office_test)) %>%
  metrics(imdb_rating, .pred) %>%
  knitr::kable(digits = 2)

Regularised regression

We can see that this didn't do much better than just with lm. But in a lot of situations, it does a better job.

Question

Try the exercise again with mixture = 1 in the model specification. This is called a Lasso and automatically includes only the most relevant variables in a model. Does it do better than ridge regression and linear regression in this case?


lasso_tune_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

wf_lasso <- wf_office %>% add_model(lasso_tune_spec)

fit_lasso <- wf_lasso %>%
  tune_grid(resamples = folds,
            grid = lambda_grid
)

fit_lasso %>% collect_metrics() %>%
  ggplot(aes(x = penalty, y = mean,
             colour = .metric)) +
  geom_errorbar(aes(ymin = mean - std_err,
                    ymax = mean + std_err),
                alpha = 0.5) +
  geom_line() +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  scale_x_log10() +
  theme(legend.position = "none")

lowest_rmse <- fit_lasso %>% select_best("rmse") ## We want it small!
lowest_rmse

final_lasso <- finalize_workflow(wf_lasso, lowest_rmse)
final_lasso

fit_lasso <- final_lasso %>% fit(office_train)

office_test %>%
  bind_cols(predict(fit_rr, office_test)) %>%
  metrics(imdb_rating, .pred) %>%
  knitr::kable(digits = 2)

Week 9 : Model construction using tidymodel

Part 1 : Prepare dataset

As always we will begin by loading our two friends:


library(tidymodels)
library(tidyverse)

An artificial dataset

Consider the following data set.


set.seed(24601)

n <- 500
dat <- tibble(r = runif(n, max = 10),
              z1 = rnorm(n),
              z2 = rnorm(n),
              x1 = r * z1 / sqrt(z1^2 + z2^2),
              x2 = r * z2 / sqrt(z1^2 + z2^2),
              y = if_else(r < 5, "inner", "outer")) %>%
  mutate(y = factor(y)) %>%
  select(x1, x2, y)

split <- initial_split(dat)
train <- training(split)
test <- testing(split)

train %>% ggplot(aes(x1,x2, colour = y)) +
  geom_point() +
  coord_fixed()

The code looks a bit weird, but it helps to know that points $(x_1,x_2)$ that are simulated according to

$z_1, z_2 \sim N(0,1)$

$x_j = z_j / (z_1^2 + z_2^2)^{1/2}$

are uniformly distributed on the unit circle^[If you add more components to $z_j$ and $x_j$, you can simulate points uniformly on the unit sphere!]. This means that I've chosen a random radius uniformly on $[0,10]$ and sampled from a point on the circle with radius. If the radius is less than $5$ the point is an inner point, otherwise it's an outer point

Part 2 : A failure of linear classifiers

A failure of linear classifiers

This is a nice 2D data set where linear classification will not do a very good job.

Question: Use tidymodels' logistic_reg() model with the "glm" engine to fit a linear classifier to the training data using x1 and x2 as features. Plot the classification regions (aka which parts of the space are classified as "inner" and "outer).


spec_linear <- logistic_reg() %>%
  set_engine("glm")

wf_linear <- workflow() %>%
  add_formula(y ~ x1 + x2) %>%
  add_model(spec_linear)

fit_linear <- wf_linear %>%
  fit(data = train)

plot_grid <- expand_grid(x1 = seq(-10, 10, length.out = 100),
                         x2 = seq(-10, 10, length.out = 100))

plot_grid %>% bind_cols(predict(fit_linear, plot_grid)) %>%
  ggplot(aes(x1, x2)) +
  geom_tile(aes(fill = .pred_class),
            alpha = 0.5) +
  geom_point(data = train, aes(colour = y)) +
  theme_bw()

One way to assess how well a classifier did on a binary problem is to use the confusion matrix, which cross tabulates the true inner/outer and the predicted inner/out values on the test set.

This often gives good insight into how classifiers differ. It can be computed using the yardstick package (which is part of tidymodels)


#this is an example - the code chunk won't run without the proper
#stuff - see next chunk
conf_mat(data, ## A data.frame with y and .pred
         truth, ## y
         estimate ## .pred (from predict)
         )

Question: Compute the confusion matrix for the linear classifier using the test data and comment on the fit.


train %>% bind_cols(predict(fit_linear, train)) %>%
  conf_mat(truth = y, estimate = .pred_class)

`It did quite badly!

Part 3 : Nonlinear classification with k-nearest neighbours

Now let's try to fit a non-linear classifier to the data. The first thing we can try is k-Nearest Neighbours. We talked in the lectures about how to fit these models using the tidymodels framework.

Question: Fit a k-nearest neighbours classifier to the training data, using cross validation to choose $k$. Compute the confusion matrix on the test data. Plot the prediction region. Did it do a better job?


spec_knn <- nearest_neighbor(mode = "classification",
                             neighbors = tune()) %>%
  set_engine("kknn")

wf_knn <- workflow() %>%
  add_formula(y ~ x1 + x2) %>%
  add_model(spec_knn)

grid <- tibble(neighbors = 1:30)

folds <- vfold_cv(train, v = 10)

fits <- wf_knn %>% tune_grid(resamples = folds,
                             grid = grid)

neighbors_best <- fits %>%
  select_by_one_std_err(desc(neighbors),
                        metric = "accuracy")

wf_knn_final <- wf_knn %>%
  finalize_workflow(neighbors_best)

fit_knn <- wf_knn_final %>%
  fit(data = train)


plot_grid %>% bind_cols(predict(fit_knn, plot_grid)) %>%
  ggplot(aes(x1, x2)) +
  geom_tile(aes(fill = .pred_class),
            alpha = 0.5) +
  geom_point(data = train, aes(colour = y)) +
  theme_bw()

train %>% bind_cols(predict(fit_knn, train)) %>%
  conf_mat(truth = y, estimate = .pred_class)


The fit is not perfect, but it is significantly better than logistic
regression!

Part 4 : logisitc regression fit

Trying to get a better logisitc regression fit

Let's try enriching our covariate space so that it can better capture nonlinear structures.

Question: Look up the help for the recipes::step_poly()`. Modify your recipe to include polynomials of x1 and x2. Does it include all possible polynomial terms? Does this improve the fit?

A useful way to look at the result of your recipe is to prepare and bake it. The following R commands will output the first 6 rows of a dataset defined by the recipe rec.


#THIS IS AN EXAMPLE CHUNK TOO
rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  head()

The steps are:

Prepare the recipe with the prep function. This does any background work needed to actually make the recipe

Bake the recipe (aka make the data). The new_data = NULL option just uses the dataset defined in the recipe. If you change it to new_data = test you would get the baked test set!


## We are going to need a recipe for this!

rec_quad <- recipe(y ~ x1 + x2, data = train ) %>%
  step_poly(x1, x2, degree = 2)

rec_quad %>%
  prep() %>%
  bake(new_data = NULL) %>%
  head()


wf_quad <- workflow() %>%
  add_recipe(rec_quad) %>%
  add_model(spec_linear)

fit_quad <- wf_quad %>%
  fit(data = train)

## We need to _bake_ the plotting data!

plot_data_quad <- rec_quad %>%
  prep() %>%
  bake(new_data = plot_grid) %>%
  bind_cols(predict(fit_quad, plot_grid))

## We also need to bake the training data if we want
## to plot it on! because otherwise it doesn't get
## scaled appropriately
plot_data_train <- rec_quad %>%
  prep() %>%
  bake(new_data = train)


## We also need to change the dimension names
## as x1 is now x1_poly_1
plot_data_quad %>%
  ggplot(aes(x1_poly_1, x2_poly_1)) +
  geom_tile(aes(fill = .pred_class),
            alpha = 0.5) +
  geom_point(data = plot_data_train,
             aes(colour = y)) +
  theme_bw()

train %>% bind_cols(predict(fit_quad, train)) %>%
  conf_mat(truth = y, estimate = .pred_class)

The fit is perfect, but then again the model is exactly correct!

Part 5 : SVM

Moving further into the tidymodels space

While the previous method worked, it relied on the rather unrealistic situation where we knew exactly how the data was generated. In this section, we are going to look at some more complex methods for solving this problem.

One of the advantages of tidymodels is that we can use the same specifications to fit some quite complex classifiers to data. We do not need to know anything about how they fit the classifier, just some information about how to tune them.

In this section we are going to try to fit a support vector machine classifier and a random forest classifier. These are both advanced techniques for finding non-linear classification boundaries.

First up, let's look at the support vector machine, or SVM. This is an extension of logistic regression^[well, it's very similar to an extension of linear regression] that automatically enriches the covariates with a clever set of non-linear functions.

Because it doesn't know exactly which function to use, it will not work as well as the quadratic polynomial in the previous attempt, but it should do pretty well. We do, however, have to tune it. This is because the method needs to know just how wiggly the boundaries should be. This is difficult to specify manually, so we have to use tune_grid().

The model specification^[the rbf stands for radial basis function, which is the technical component that lets this method find non-linear classification boundaries.] is


spec_svm <- svm_rbf(mode = "classification",
                    rbf_sigma = tune()) %>%
  set_engine("kernlab")

## Show the code
spec_svm %>% translate()

The translate() function tells you how the background package is called. In this case the model is fit using the ksvm function from the kernlab package (you will be prompted to install it when you try to fit the model!).

There is one tuning parameter called rbf_sigma. It controls how wiggly the classification boundary can be. A small value indicates that the function can be very wiggly (or very complicated if you prefer), while a larger value ties to make the function as linear as possible. (It is very similar to the penalty argument we saw last time.)

You can make a good grid of potential values for rbf_sigma using the helper function


rbf_sigma_grid <- grid_regular(rbf_sigma(), levels = 10)

Question: Fit a SVM classifier to the training data, plot the classification boundaries and compute its confusion matrix on the test data. This will take a moment to run! It is also a good idea to normalize all of your predictors for this type of model, as the rbf_sigma_grid kinda assumes this!


spec_svm <- svm_rbf(mode = "classification",
                    rbf_sigma = tune()) %>%
  set_engine("kernlab")

rec_basic <- recipe(y ~  x1 + x2, data = train) %>%
  step_normalize(all_predictors())

wf_svm <- workflow() %>%
  add_recipe(rec_basic) %>%
  add_model(spec_svm)

grid <- grid_regular(rbf_sigma(), levels = 10)

fits <- wf_svm %>% tune_grid(resamples = folds,
                             grid = grid)

sigma_best <- fits %>%
  select_by_one_std_err(desc(rbf_sigma),
                        metric = "accuracy")

wf_svm_final <- wf_svm %>%
  finalize_workflow(sigma_best)

fit_svm <- wf_svm_final %>%
  fit(data = train)


plot_grid %>% bind_cols(predict(fit_svm, plot_grid)) %>%
  ggplot(aes(x1, x2)) +
  geom_tile(aes(fill = .pred_class),
            alpha = 0.5) +
  geom_point(data = train, aes(colour = y)) +
  theme_bw()

train %>% bind_cols(predict(fit_svm, train)) %>%
  conf_mat(truth = y, estimate = .pred_class)

SVMs are a nice tool, but they don't tend to scale well for big datasets, so we will look at a different option, called a random forest. Unlike SVMs, random forests are entirely unrelated to logistic regression and use a different method to find the classifier (they are also unrelated to k-nearest neighbours).

Random forests work by partitioning the covariate space into a sequence of boxes (kinda like one of those famous paintings by Mondrian). For each box, the model predicts which value of the outcome is the best. The magical trick with random forests is it builds multiple such partitions and then combines the prediction to get a better one.

Part 6 :

In tidymodels, we can specify a random forest with the command


spec_rf <- rand_forest(mode = "classification") %>%
  set_engine("ranger")

spec_rf %>% translate()

This will need the ranger package installed. There are some tuning parameters that we can change for random forests, but the default values in ranger are pretty robust so we usually don't need to tune it.

Question: Fit a random forest to the data. Plot the classification boundaries. Compute the confusion matrix for the test set.


spec_rf <- rand_forest( mode = "classification") %>%
  set_engine("ranger")

wf_rf <- workflow() %>%
  add_formula(y ~ x1 + x2) %>%
  add_model(spec_rf)

fit_rf <- wf_rf %>%
  fit(data = train)

plot_grid <- expand_grid(x1 = seq(-10, 10, length.out = 100),
                         x2 = seq(-10, 10, length.out = 100))

plot_grid %>% bind_cols(predict(fit_rf, plot_grid)) %>%
  ggplot(aes(x1, x2)) +
  geom_tile(aes(fill = .pred_class),
            alpha = 0.5) +
  geom_point(data = train, aes(colour = y)) +
  theme_bw()


train %>% bind_cols(predict(fit_rf, train)) %>%
  conf_mat(truth = y, estimate = .pred_class)


This worked about as well as the SVM and was much faster!
There are some artifacts, though. The rectangular partioning
is very evident in the prediction reigions!

Question: Of the four methods you have looked at, which one did the best job? Which one did the best job without already knowing the answer?


Obviously, the quadratic logistic regression did the best.
But of the other options, the SVM probably made the most
aesthetically pleasing classifier, but I prefer random
forests, which are as accurate in this case at the
minor cost of some rectangling.

Week 10 : Bayesian A/B Testing

Part A : Preparation for A/B Testing (part B)

Q1. (hyper-)parameters of the posterior distribution.

Step 1 : Create a function like the one below, named beta_binomial(),


beta_binomial <- function(n,y,alpha=1, beta=1){

atil <- alpha + y
btil <- beta + n - y

cf <- n/(alpha + beta + n)

out <- list(alpha_tilde = atil, beta_tilde = btil, credibility_factor = cf)
return(out)
}

Step 2 : Using beta_binomial ( ) to ﬁnd the (hyper-)parameters of the posterior distribution.

(Where θ ∼ Beta(α = 4, β = 4) when Y is modelled with a Binomial(n = 40, θ) distribution, and y = 12 is observed. )


n <- 40
yobs <- 12
alpha <- 4
beta <- 4
Q1out <- beta_binomial(n=n, y=yobs, alpha=alpha, beta=beta)
Q1o

Step 3 : visualisation


#### Code to visualise components of Bayes theorem ####

# A colourblind-friendly palette with black:
cbbPal <- c(black="#000000", orange="#E69F00", ltblue="#56B4E9",
            "#009E73",green="#009E73", yellow="#F0E442", blue="#0072B2",
            red="#D55E00", pink="#CC79A7")

cbbP <- cbbPal[c("orange","blue","pink")] #choose colours for p1

thetayy <- seq(0.001,0.999,length.out=100)
prioryy <- dbeta(thetayy,alpha,beta)
postyy <- dbeta(thetayy, Q1out$alpha_tilde, Q1out$beta_tilde)
likeyy <- dbinom(yobs,size=n,prob=thetayy)
nlikeyy <- 100*likeyy/sum(likeyy)

df <- tibble(theta=thetayy, `prior pdf`=prioryy,
             `normalised likelihood`=nlikeyy, `posterior pdf`=postyy)

df_longer <- df%>% pivot_longer(-theta, names_to = "distribution",
                                values_to="density" )

p1 <- df_longer %>% ggplot(aes(x=theta, y=density, colour=distribution, fill=distribution)) +
  geom_line() +
  scale_fill_manual(values=cbbP) +
  theme_bw()

p1

Q2. ﬁnd both the prior probability Pr(0.2 ≤ θ ≤ 0.4), and the posterior probability Pr(0.2 ≤ θ ≤ 0.4 | y).


int <- c(0.2, 0.4)
#prior probability
prior_prob <- pbeta(int[2], alpha, beta) - pbeta(int[1], alpha, beta)
#posterior probability
posterior_prob <- pbeta(int[2], Q1out$alpha_tilde, Q1out$beta_tilde) - pbeta(int[1], Q1out$alpha_tilde, Q1out$beta_tilde)

Q3. Find a list containing the mean (named mean) and variance (named var) of the Beta(α, β) distributions

Write a function, named beta_meanvar, that takes as input the parameter values alpha and beta, corre- sponding to α and β, respectively, and returns a list containing the mean (named mean) and variance (named var) of the Beta(α, β) distributions


beta_meanvar <- function(alpha, beta){

  mean <- alpha/(alpha+beta)
  var <- mean*beta/((alpha+beta)*(alpha+beta+1))
  out <- list(mean=mean, var=var)
  return(out)
}

use your function to calculate the mean and variance for both the θ ∼ Beta(α = 4, α = 4) prior distribution, and for the corresponding posterior distribution for θ when y = 12, obtained in Question 1.


prior_meanvar <- beta_meanvar(alpha=4, beta=4)
prior_meanvar
posterior_meanvar <- beta_meanvar(alpha=Q1out$alpha_tilde, beta=Q1out$beta_tilde)
posterior_meanvar

Thus, under the prior distribution θ ∼Beta(α = 4,β = 4), the prior mean of θ is 0.5 and the prior variance of θ is 0.0278. Then, after observing y = 12 out of n = 40 i.i.d. Bernoulli(n,θ) outcomes, the posterior mean for θ is 0.3333 and the posterior variance of θ is 0.0045

Q4. confirm that the posterior mean satisfies the relation

Given the prior distribution, values of y and n, the output of your beta_binomial() and beta_meanvar() functions from Question 1 and Question 3, respectively, confirm that the posterior mean satisfies the relation

We can just numerically check that the relation holds given all of the calculations. If it holds, we are done. If not, you might want to check to see if there has been some sort of rounding error.


beta_meanvar(Q1out$alpha_tilde,Q1out$beta_tilde)$mean ==
(Q1out$credibility_factor)*yobs/n +
(1-Q1out$credibility_factor)*beta_meanvar(alpha=4, beta=4)$mean

Part B : A/B Testing

What is A/B Testing

A/B testing (bucket tests or split-run testing) is a randomized experiment with two variants, A and B. It includes application of statistical hypothesis testing or “two-sample hypothesis testing” as used in the field of statistics. A/B testing is a way to compare two versions of a single variable, typically by testing a subject’s response to variant A against variant B, and determining which of the two variants is more effective.

As the name implies, two versions (A and B) are compared, which are identical except for one variation that might affect a user’s behavior. Version A might be the currently used version (control), while version B is modified in some respect (treatment).

Use case : A promotional email campaign

Given information

We want to answer

1. Which of the two Calls to Action, A or B, is more effective? i.e., which Call to Action strategy has the higher effectiveness rate?

2. What is the distribution of possible values for the difference between θA and θB?

Q6 . According to the prior distribution, what is the expected number of people in group A who will take up their promotional offer?


nA <- 1000
nB <- 1000
alphaA <- 2
betaA <- 26

priorA <- beta_meanvar(alphaA, betaA)
priorA

LUstar <- qbeta(c(0.025, 0.975), alphaA, betaA) # prior quantiles for parameter
LU <- nA*LUstar # prior credible interval for expected number of people

Q7 . According to the prior distributions for effectiveness rates θA and θB, and given nA = nB = 1000, what is the difference between the expected number of people from group A who will take up their promotional offer and the expected number of people from group B who will take up their promotional offer? Explain why.

Answer The difference between the expected number of people from group A who will take up their pro- motional offer and the expected number of people from group B who will take up their promotional offer is zero. This is because nA = nB, and θA and θB are independent with the same (marginal) prior distribution, the two experiments have the same a priori expected number of successes.

Q8 . what are the posterior distributions for the effectiveness rates for group A and for group B, respectively?

Given that :

After running the A/B testing protocol, yA = 63 customers purchased the product using discount code A and yB = 45 customers purchased the product using discount code B.

Code:


yobsA <- 63
yobsB <- 45
postA <- beta_binomial(n=nA, y=yobsA, alpha=alphaA, beta=betaA)
postA
postB <- beta_binomial(n=nB, y=yobsB, alpha=alphaA, beta=betaA)
postB

Q9 . what is the (posterior) mean and variance of ∆?

Given that :

Let ∆ = θB − θA denote the difference between the effectiveness rates associated with Call to Action B and Call to Action A.

Answer The two posterior distributions are independent, because θA and θB are independent at the outset before we see any data, and all of the individual Bernoulli trials are independent too, so we can update each posterior without worrying about the other and the posterior distributions themselves are independent. This makes calculating the posterior mean and variance of ∆ easy!

The code chunk below completes the calculations, with the full answer again shown further below.



postA_meanvar <- beta_meanvar(postA$alpha_tilde, postA$beta_tilde)
postB_meanvar <- beta_meanvar(postB$alpha_tilde, postB$beta_tilde)
Delta_mean <- postB_meanvar$mean - postA_meanvar$mean
Delta_mean

## [1] -0.01751

Delta_var <- postB_meanvar$var + postA_meanvar$var
Delta_var

## [1] 9.996e-05

Q10 . Using simulation to approximate the posterior of ∆

Simulate a sample of R = 10000 replicated values of ∆ from p(∆ | yA = 63, yB = 45).

Use this sample to approximate each of the following items:

a) the posterior mean of ∆,

b) the standard deviation of ∆, and

c) the probability that the two effectiveness rates will differ by more than 0.01, i.e. Pr(|θB − θA|> 0.01 | yA = 63, yB = 45).


R <- 10000
set.seed(20208)
sampleA <- rbeta(R, postA$alpha_tilde, postA$beta_tilde)
sampleB <- rbeta(R, postB$alpha_tilde, postB$beta_tilde)
sampleD <- sampleB - sampleA

ds <- tibble(Delta = sampleD)
ds %>% ggplot(aes(x=Delta, y=..density..)) +
geom_histogram(colour="blue",fill="blue",alpha=0.4,bins=100) +
geom_vline(xintercept=c(0.01,-0.01), colour="red") +
geom_density(colour="blue",fill="blue",alpha=0.4) +
theme_bw() +
ggtitle(expression(paste("Simulated posterior pdf of ", Delta)))


indic <- rep(0,R)
indic[abs(sampleD)>0.01] <- 1
diffeffective <- mean(indic)
# Thus our approximate value is Pr(|∆|> 0.01 | yA = 63, yB = 45) ≈ 0.7746.