# Initialize Otter
import otter
grader = otter.Notebook("hw07.ipynb")
Homework 7: Confidence Intervals
Helpful Resource:
Python Reference: Cheat sheet of helpful array & table methods used in Data 8!
Recommended Reading:
Please complete this notebook by filling in the cells provided. Before you begin, execute the cell below to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again.
For any problem that asks for a written explanation, provide your answer in the designated space. Throughout this homework and all future ones, do not reassign variables within the notebook! For example, if you use max_temperature in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!
Deadline:
This assignment is due Wednesday, 4/1 at 11:00am PT. Submissions after this time will be accepted for 24 hours and will incur a 20% penalty. Any submissions later than this 24 hour period will not be accepted unless an extension has been granted as per the policies page. Turn it in by Tuesday, 3/31 at 11:00am PT for 5 extra credit points.
Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.
You should start early so that you have time to get help if you’re stuck. Office hours are held Monday through Friday in Warren Hall 101B or online. The office hours schedule appears here.
The point breakdown for this assignment is given in the table below:
| Category | Points |
|---|---|
| Autograder (Coding questions) | 92 |
| Written (Visualization questions) | 8 |
| Total | 100 |
# Don't change this cell; just run it.
import numpy as np
from datascience import *
# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
Andrew and Marissa are trying to see what the best Thai restaurant in Berkeley is. They survey 1,500 UC Berkeley students selected uniformly at random and ask each student which Thai restaurant is the best. (Note: This data is fabricated for the purposes of this homework.) The choices of Thai restaurants are Lucky House, Imm Thai, Thai Temple, and Thai Basil. After compiling the results, Andrew and Marissa release the following percentages of votes that each restaurant received, from their sample:
| Thai Restaurant | Percentage |
|---|---|
| Lucky House | 8% |
| Imm Thai | 53% |
| Thai Temple | 25% |
| Thai Basil | 14% |
These percentages represent a uniform random sample of the population of UC Berkeley students. We will attempt to estimate the corresponding parameters, or the percentage of the votes that each restaurant will receive from the population (i.e. all UC Berkeley students). We will use confidence intervals to compute a range of values that reflects the uncertainty of our estimates.
The table votes contains the results of Andrew and Marissa’s survey.
# Just run this cell
votes = Table.read_table('votes.csv')
votes
Question 1.1. Complete the function one_resampled_percentage below. It should return Imm Thai’s percentage of votes after taking the original table (tbl) and performing one bootstrap sample of it. Remember that a percentage is between 0 and 100. (8 Points)
Note 1: tbl will always be in the same format as votes.
Note 2: This function should be completed without .group or .pivot. Using these functions will cause your code to time out.
Hint: Given a table of votes, how can you figure out what percentage of the votes are for a certain restaurant? Be sure to use percentages, not proportions, for this question!
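As a rough illustration of the idea behind a single bootstrap resample, here is a minimal sketch that uses only the Python standard library rather than the Table methods this assignment expects. The names toy_votes and resampled_percentage, and the two-option data, are hypothetical and not part of the homework.

```python
import random

# Hypothetical toy data: one entry per surveyed person (not the real votes table).
toy_votes = ["A"] * 60 + ["B"] * 40

def resampled_percentage(votes, choice, rng):
    """Percentage of `choice` in one bootstrap resample of `votes`."""
    # Draw len(votes) entries uniformly at random, with replacement.
    resample = rng.choices(votes, k=len(votes))
    # A proportion times 100 gives a percentage between 0 and 100.
    return 100 * resample.count(choice) / len(resample)

rng = random.Random(0)  # seeded only so the sketch is repeatable
pct = resampled_percentage(toy_votes, "A", rng)
print(pct)
```

The resample has the same size as the original sample and is drawn with replacement, which is what makes it a bootstrap sample.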
def one_resampled_percentage(tbl):
    ...
one_resampled_percentage(votes)
grader.check("q1_1")
Question 1.2. Complete the percentages_in_resamples function such that it simulates and returns an array of 2025 elements, where each element represents a bootstrapped estimate of the percentage of voters who will vote for Imm Thai. You should use the one_resampled_percentage function you wrote above. (8 Points)
Note: We perform our simulation with only 2025 trials in this problem to reduce the runtime, but we should generally use more repetitions.
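The repeat-and-collect pattern can be sketched as follows, again with the standard library and hypothetical toy data. The function bootstrap_percentages and its parameters are assumptions made for illustration; the homework itself builds a datascience array instead of a list.

```python
import random

def bootstrap_percentages(votes, choice, repetitions, seed=0):
    """Collect one bootstrapped percentage per repetition."""
    rng = random.Random(seed)
    n = len(votes)
    estimates = []
    for _ in range(repetitions):
        # Each repetition is a fresh resample of the same original sample.
        resample = rng.choices(votes, k=n)
        estimates.append(100 * resample.count(choice) / n)
    return estimates

# Hypothetical toy sample in which 60% chose "A".
toy_votes = ["A"] * 60 + ["B"] * 40
estimates = bootstrap_percentages(toy_votes, "A", repetitions=500)
print(len(estimates), min(estimates), max(estimates))
```

More repetitions give a smoother picture of how the estimate varies, at the cost of a longer runtime.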
def percentages_in_resamples():
    percentage_imm = make_array()
    ...
grader.check("q1_2")
In the following cell, we run the function you just defined, percentages_in_resamples, and create a histogram of the calculated statistic for the 2025 bootstrap estimates of the percentage of voters who voted for Imm Thai.
Note: This might take a few seconds to run.
resampled_percentages = percentages_in_resamples()
Table().with_column('Estimated Percentage', resampled_percentages).hist("Estimated Percentage")
Question 1.3. Using the array resampled_percentages, find the values at the two ends of the middle 95% of the bootstrapped percentage estimates. Assign the lower and upper ends of the interval to imm_lower_bound and imm_upper_bound respectively. (8 Points)
Hint: If you are stuck on this question, try looking over Chapter 13.1 of the textbook, and check out the percentile function on the python reference sheet.
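To make "the middle 95%" concrete, the sketch below cuts 2.5% off each tail of a toy list of values. The percentile helper here is a hand-written stand-in, assuming the definition "the smallest value that is at least as large as p% of the values"; it is not the course's own percentile function, and toy_estimates is hypothetical.

```python
import math

def percentile(p, values):
    """Smallest value that is at least as large as p% of the values."""
    ordered = sorted(values)
    k = math.ceil(p * len(ordered) / 100)  # 1-based position in sorted order
    return ordered[max(k, 1) - 1]

def middle_interval(values, level):
    """Ends of the middle `level`% of the values."""
    tail = (100 - level) / 2  # e.g. 2.5 on each side for a 95% interval
    return percentile(tail, values), percentile(100 - tail, values)

# Hypothetical stand-in for an array of bootstrapped estimates: 1, 2, ..., 200.
toy_estimates = list(range(1, 201))
lower, upper = middle_interval(toy_estimates, 95)
print(lower, upper)
```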
imm_lower_bound = ...
imm_upper_bound = ...
print(f"Bootstrapped 95% confidence interval for the percentage of Imm Thai voters in the population: [{imm_lower_bound:.2f}, {imm_upper_bound:.2f}]")
grader.check("q1_3")
Question 1.4. The survey results seem to indicate that Imm Thai is beating all the other Thai restaurants among the voters. We would like to use confidence intervals to determine a range of likely values for Imm Thai’s percentage lead over all the other restaurants combined. The calculation for Imm Thai’s lead over Lucky House, Thai Temple, and Thai Basil combined is:
Imm Thai’s percentage of the vote - (Lucky House’s + Thai Temple’s + Thai Basil’s percentages of the vote)
Define the function one_resampled_difference that returns exactly one value of Imm Thai’s percentage lead over Lucky House, Thai Temple, and Thai Basil combined from one bootstrap sample of tbl. (8 Points)
Hint 1: Imm Thai’s lead can be negative.
Hint 2: Given a table of votes, how can you figure out what percentage of the votes are for a certain restaurant? Be sure to use percentages, not proportions, for this question!
Note: If the skeleton code provided within the function is not helpful for you, feel free to approach the question using your own variables.
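As a quick arithmetic check of what "lead over the others combined" means, the sketch below applies it to the sample percentages from the table at the top of this section. The dictionary and the function percentage_lead are hypothetical helpers for illustration, not the bootstrap function the question asks for.

```python
# Sample percentages reported in the survey table.
sample_percentages = {
    "Lucky House": 8,
    "Imm Thai": 53,
    "Thai Temple": 25,
    "Thai Basil": 14,
}

def percentage_lead(percentages, leader):
    """Leader's percentage minus the combined percentage of everyone else."""
    others = sum(value for name, value in percentages.items() if name != leader)
    return percentages[leader] - others

lead = percentage_lead(sample_percentages, "Imm Thai")
print(lead)  # 53 - (8 + 25 + 14) = 6
```

Because the four percentages sum to 100, the same lead can be written as 53 - (100 - 53). A restaurant with less than half of the vote would have a negative lead.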
def one_resampled_difference(tbl):
    bootstrap = ...
    imm_percentage = ...
    ...
grader.check("q1_4")
Question 1.5. Write a function called leads_in_resamples that computes 2025 bootstrapped elements of Imm Thai’s percentage lead over Lucky House, Thai Temple, and Thai Basil combined (the result of calling one_resampled_difference). It should return an array of the 2025 bootstrapped estimates. Afterwards, run the cell to plot a histogram of the resulting samples. (8 Points)
Hint: If you see an error involving NoneType, consider what components a function needs to have!
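For context on where a NoneType error usually comes from, here is a small sketch with two hypothetical functions. In Python, a function that finishes without a return statement hands back None, so any later arithmetic or plotting on its result fails.

```python
def mean_without_return(values):
    total = sum(values) / len(values)  # computed, but never returned

def mean_with_return(values):
    return sum(values) / len(values)

missing = mean_without_return([1, 2, 3])  # None: nothing was returned
present = mean_with_return([1, 2, 3])     # 2.0
print(missing, present)
```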
def leads_in_resamples():
    ...
sampled_leads = leads_in_resamples()
Table().with_column('Estimated Lead', sampled_leads).hist("Estimated Lead")
Question 1.6. Use the simulated data in sampled_leads from Question 1.5 to compute an approximate 95% confidence interval for Imm Thai’s percentage lead over Lucky House, Thai Temple, and Thai Basil combined. (10 Points)
diff_lower_bound = ...
diff_upper_bound = ...
print("Bootstrapped 95% confidence interval for Imm Thai's true lead over Lucky House, Thai Temple, and Thai Basil combined: [{:f}%, {:f}%]".format(diff_lower_bound, diff_upper_bound))
grader.check("q1_6")
Tim computed the following 95% confidence interval for the percentage of Imm Thai voters:
(Your answer from 1.3 may have been a bit different due to randomness; that doesn’t mean it was wrong!)
Question 2.1. Tim also created 70%, 90%, and 99% confidence intervals from the same sample, but he forgot to label which confidence interval represented which percentages! First, match each confidence level (70%, 90%, 99%) with one of the corresponding intervals given (e.g. CI_70_percent = 1). (5 points)
The intervals are below:
1. [50.03, 55.94]
2. [52.1, 54]
3. [50.97, 54.99]
Hint: If you are stuck on this question, try looking over Chapters 13.3 and 13.4 of the textbook.
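One way to build intuition for how the confidence level relates to interval width is to compute several "middle p%" intervals from the same set of values. The sketch below does this on a hypothetical, evenly spaced list (not Tim's bootstrap estimates), with a hand-written percentile helper assumed to follow the "smallest value at least as large as p% of the values" definition.

```python
import math

def percentile(p, values):
    """Smallest value that is at least as large as p% of the values."""
    ordered = sorted(values)
    k = math.ceil(p * len(ordered) / 100)  # 1-based position in sorted order
    return ordered[max(k, 1) - 1]

def middle_interval(values, level):
    """Ends of the middle `level`% of the values."""
    tail = (100 - level) / 2
    return percentile(tail, values), percentile(100 - tail, values)

# Hypothetical stand-in for bootstrapped estimates: 1, 2, ..., 1000.
toy_estimates = list(range(1, 1001))
widths = {}
for level in (70, 90, 99):
    lower, upper = middle_interval(toy_estimates, level)
    widths[level] = upper - lower
print(widths)
```

Each interval is built from the same values; only the share of values kept in the middle changes.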
CI_70_percent = ...
CI_90_percent = ...
CI_99_percent = ...
grader.check("q2_1")
Question 2.2. How did you arrive at your answer to Question 2.1? Select all answers that correctly interpret confidence levels and assign them to the array confidence_answers. (e.g. confidence_answers = make_array(1,2,3)) (5 Points)
1. Higher confidence levels lead to wider intervals because we include more of the statistics.
2. As confidence increases, we become more confident that the population parameter lies within our confidence interval.
3. Higher confidence levels lead to smaller intervals because we include less of the statistics.
4. As confidence increases, we become less confident that the population parameter lies within our confidence interval.
confidence_answers = make_array(...)
grader.check("q2_2")
Question 2.3. Suppose we produced 6,000 new samples (each one a new/distinct uniform random sample of 1,500 students) from the population and created a 95% confidence interval from each one. Roughly how many of those 6,000 intervals do you expect will actually contain the true percentage of the population? (10 Points)
Assign your answer to true_percentage_intervals.
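The thought experiment in this question can be simulated on a small scale. The sketch below is a simplified illustration under stated assumptions: a made-up population in which the true percentage is 53, samples of 200 rather than 1,500, and 200 intervals rather than 6,000, all chosen only to keep the runtime short. The helper names are hypothetical.

```python
import math
import random

def percentile(p, values):
    """Smallest value that is at least as large as p% of the values."""
    ordered = sorted(values)
    k = math.ceil(p * len(ordered) / 100)
    return ordered[max(k, 1) - 1]

def bootstrap_interval(sample, rng, repetitions=200, level=95):
    """Percentile-bootstrap interval for the percentage of 1s in `sample`."""
    n = len(sample)
    stats = [100 * sum(rng.choices(sample, k=n)) / n for _ in range(repetitions)]
    tail = (100 - level) / 2
    return percentile(tail, stats), percentile(100 - tail, stats)

rng = random.Random(8)  # seeded only so the sketch is repeatable
true_percentage = 53    # hypothetical population parameter
population = [1] * 53 + [0] * 47  # 1 = votes for the restaurant of interest

trials = 200
covered = 0
for _ in range(trials):
    # A fresh random sample from the population, then one interval from it.
    sample = rng.choices(population, k=200)
    lower, upper = bootstrap_interval(sample, rng)
    if lower <= true_percentage <= upper:
        covered += 1

coverage = covered / trials
print(coverage)
```

The printed fraction is the share of intervals that captured the true percentage in this particular run; it varies from run to run and with the seed.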
true_percentage_intervals = ...
grader.check("q2_3")
Recall the second bootstrap confidence interval you created, which estimated Imm Thai’s lead over Lucky House, Thai Temple, and Thai Basil combined. Among voters in the sample, Imm Thai’s lead was 6%. Tim’s 95% confidence interval for the true lead (in the population of all voters) was:
Suppose we are interested in testing a simple yes-or-no question:
“Is the percentage of votes for Imm Thai equal to the percentage of votes for Lucky House, Thai Temple, and Thai Basil combined?”
Our null hypothesis is that the percentages are equal, or equivalently, that Imm Thai’s lead is exactly 0. Our alternative hypothesis is that Imm Thai’s lead is not equal to 0. In the questions below, don’t compute any confidence interval yourself—use only Tim’s 95% confidence interval.
Hint: Try thinking about the width of the 95% confidence interval in comparison to the new confidence intervals in the questions below. Drawing a picture may help.
Question 2.4. Say we use a 5% p-value cutoff. Do we reject the null, fail to reject the null, or are we unable to tell using Tim’s confidence interval? (10 Points)
Assign cutoff_five_percent to the number corresponding to the correct answer.
1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using Tim’s confidence interval
Hint: Consider the relationship between the p-value cutoff and confidence. If you’re confused, take a look at this chapter of the textbook.
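The link between a p-value cutoff and a confidence level can be sketched as a small decision rule: a two-sided test with cutoff c% corresponds to a (100 - c)% confidence interval, and the null value is rejected exactly when it falls outside that interval. The function and the two example intervals below are hypothetical and are not Tim's interval.

```python
def decide_from_interval(lower, upper, null_value=0):
    """Decision for a test whose cutoff matches the interval's confidence level.

    For example, pair a 95% confidence interval with a 5% p-value cutoff.
    """
    if lower <= null_value <= upper:
        return "fail to reject the null"
    return "reject the null"

# Hypothetical 95% intervals for a lead, in percentage points.
contains_zero = decide_from_interval(-2.0, 3.0)
excludes_zero = decide_from_interval(1.5, 4.0)
print(contains_zero)
print(excludes_zero)
```

This rule only applies directly when the cutoff and the confidence level add up to 100%. For a different cutoff, consider how a wider or narrower interval built from the same data would sit relative to the null value.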
cutoff_five_percent = ...
grader.check("q2_4")
Question 2.5. What if, instead, we use a p-value cutoff of 1%? Do we reject the null, fail to reject the null, or are we unable to tell using Tim’s confidence interval? (10 Points)
Assign cutoff_one_percent to the number corresponding to the correct answer.
1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using Tim’s confidence interval
cutoff_one_percent = ...
grader.check("q2_5")
Question 2.6. What if we use a p-value cutoff of 10%? Do we reject the null, fail to reject the null, or are we unable to tell using Tim’s confidence interval? (10 Points)
Assign cutoff_ten_percent to the number corresponding to the correct answer.
1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using Tim’s confidence interval
cutoff_ten_percent = ...
grader.check("q2_6")
3. Midsemester Feedback Form
Fill out this form to complete the homework. Please use your Berkeley email to access the form. At the end of the form, there will be a secret word that you should input into the box below. Remember to put the secret word in quotes when inputting it (e.g. "hello"). The quotation marks indicate that it is a String type!
Note: This is the same form as you filled out in lab. If you have completed Lab 07, you should have already filled out the form. If so, please feel free to copy your answer from the Lab!
secret_word = ...
grader.check("q3")
You’re done with Homework 7!
Important submission steps:
1. Run the tests and verify that they all pass.
2. Choose Save Notebook from the File menu, then run the final cell.
3. Click the link to download the zip file.
4. Go to Pensive and submit the zip file to the corresponding assignment. The name of this assignment is “HW 07 Autograder”.
It is your responsibility to make sure your work is saved before running the last cell.
To double-check your work, the cell below will rerun all of the autograder tests.
grader.check_all()
Submission
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. Please save before exporting!
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)