CS228

Probabilistic modeling languages for representing complex domains, algorithms for reasoning and decision making using these representations, and learning these representations from data. Focus is on probabilistic graphic models, including Bayesian and Markov networks, extensions to temporal modeling such as hidden Markov models and dynamic Bayesian networks, and extensions to decision making such as influence diagrams. Basic techniques and their applications to domains including speech recognition, biological modeling and discovery, medical diagnosis, message encoding, vision, and robot motion planning. Prerequisites: basic probability theory and algorithm design and analysis.

Structured Probabilistic Models: Principles and TechniquesGates B012cs228-qa@cs.stanford.edu11:00AM12:15PM12625920001268380800winter2010scpdProblem Set 1

See the "Submission Information" and "Honor Code" handouts under "Assignments" for more information

The mean on the assignment was 70 points, with a standard deviation of 19.

1. [6 points]
(a) [1 point] How many networks are equivalent to the simple directed chain X1-->X2-->...-->Xn?
(b) [5 points] How many networks are equivalent to a tree over n nodes?

2. [15 points] Exercise 3.11.

3. [14 points] Exercise 5.14.

4. [20 points]
(a) [10 points] Exercise 6.3.
(b) [10 points] Exercise 6.4.

5. [20 points] Exercise 6.6. Clarification: there are K objects and L sensors, and in each time point, the labelings of the sensed objects are independent and uniformly distributed over 1,...,K.

6. [25 points] To answer this question, read Sections 13.6.1 and 13.6.2.
(a) [10 points] Exercise 13.14.
(b) [15 points] Exercise 13.15. You will need to define a reduced Markov net over the binary T variables corresponding to the labels v1 and v2, and variables that have labels different from v1 and v2 are not part of the reduced Markov net.

Programming Assignment 2

The mean was 95.8, with 11.8 standard deviation.

Interactive session schedule

Here is a list of the interactive session topics. This is tentative and subject to change.

Jan. 7th: Knowledge engineering a Bayesian network (car insurance)

Jan. 14th: Networks for genetic pedigree analysis

Jan. 21st: Topic models for text and vision

Jan. 28th: Inference for extended HMMs, including aligning of sequences, with applications to biology, helicopter flight, and speech

Feb. 4th: MCMC algorithms and getting them to work

Feb. 11th: Models and inference methods for robot localization and mapping

Feb. 18th: The correspondence problem and its applications

Feb. 25th: Learning models with hidden variables

March 4th: Discriminative versus generative models with applications to
language and vision

March 11th: Using decision-theoretic models and value of information

Assignment submission instructions

Problem sets

Problem sets must be handed in to the submission box at the bottom of the Gates A wing (West) stairwell by the beginning of class on the day they are due. You must write the time and date of submission on the assignment. It is an honor code violation to write down the wrong time.

SCPD students only should e-mail the assignment to cs228-qa@cs.stanford.edu as a single PDF file with the subject "SCPD problem set 1" (or 2, or 3). If you write out your assignment by hand, please scan it in if you can and e-mail it to us (again, as a single file, please). If you are unable to do this, please see the SCPD student handbook for submission instructions and fax it in to SCPD. Please do not both e-mail and fax in your assignment!

Programming assignments

• Make sure that all of your necessary files (listed below for each assignment) are in a single directory on the AFS filesystem (for example, in your home directory once you log in to myth.stanford.edu). Start by going to that directory.
• To submit Programming Assignment #1, type '/afs/ir/class/cs228/bin/submit 1'. For subsequent assignments, change "1" to the assignment number. The script that you've called should submit the relevant files (that we've designated) into the course's local directory. You may read /afs/ir/class/cs228/bin/submithelp to see a small help file.
• If the submission is successful, you will see "Submission completed." If you are missing one of the necessary files in your directory, it will tell you which you are missing.
• For each assignment, you will need at least the files README and Students.m. README should be an ASCII text file with the answers to the questions in the assignment. Students.m should contain the information of all students in the group that is making the submission. To make sure that your Students.m is the correct format, you should copy a sample from /afs/ir/class/cs228/ProgrammingAssignments/General/Students.m. (Once the assignment is graded, the scores will be posted on CourseWare. If you see that your score is not posted while the rest of your group's scores are, please let us know.)
• You may submit as many times as you wish; your previous submission will be deleted each time you resubmit, so be careful. The time on your final submission will be the time counted for the purpose of late days. You may want to submit a first version early so that you are familiar with the submission process and don't run into any problems near the deadline.
• The assignments are due at 11:59 pm on the due date. Late days apply to programming assignments, and will be applied to everyone in the team.
Late day policy, grading, and Honor Code

Late day policy

The quizzes are due by Wednesday at midnight of each week with no late days allowed. Problem sets must be handed in to the submission box by the beginning of class on the day they are due. Programming assignments must be submitted electronically by 11:59 PM on the due date.

Recognizing that students may face unusual circumstances and require some flexibility in the course of the quarter, each student will have a total of seven free late (calendar) days to use as s/he sees fit. Once these late days are exhausted, any homework turned in late will be penalized at the rate of 20% per late day (or fraction thereof). Under no circumstances will a homework be accepted more than five days after its due date. Furthermore, no late days are allowed for the final problem set. Late days are in 24 hour increments after the due date/time. Late homeworks should be turned in to the submission box. You must write the time and date of submission on the assignment. It is an honor code violation to write down the wrong time.

Late days also apply to programming assignments, and will be applied to everyone in the team. If you turn in a programming assignment one day late and you have no late days left, you will be penalized 20% while your partner will be deducted one late day.

Quizzes will count for 12% of the final grade, problem sets will count for 15% each, programming assignments will count for 10% each, the programming project will count for 18% (and will have some extra credit associated with it), and the take-home final problem set will count for 20%. Each student's top eight (highest scoring) of nine weeks of quizzes will count in the grade. Some extra credit may be awarded for class participation.

CS228 may be taken pass/no-credit. The requirements are as specified in the university regulations. You take the class, and get a letter grade as per the standard grading curve. Any letter grade higher than a C- is a passing grade, otherwise not. As a side note, we recommend at least attempting most of the work, since much of the learning occurs through solving homework problems.

Honor Code

• Students may discuss and work on problem sets in groups but must write up their own solutions. When writing up the solutions, students should write the names of people with whom they dicussed the assignment. Also, when writing up the solutions students should not use notes from group work. If we start receiving solution sets which are too similar, we may reconsider this policy. Collaboration is not permitted on the quizzes. Programming assignments may be completed in teams of 2 students. Only one student needs to submit the code, but please indicate clearly in the submission README who was on the team.
• Important Note: As we often reuse problem set questions from previous years, we expect the students not to copy, refer to, or look at the solutions in preparing their answers. It will be considered an honor code violation to intentionally refer to previous year's solutions. The purpose of problem sets in this class is to help you think about the material, not just give us the right answers. Moreover, some of the problems are taken from recently published papers. Students are expected to come up with the answers on their own, rather than extracting them from published solutions. Therefore, please restrict attention to the books mentioned on the webpage when solving problems on the problem set. If you do happen to use other material, it must be acknowledged clearly with a citation on the submitted solution.

Here is roughly the reading schedule for the rest of the quarter. Please refer to the annotations of the video lectures for more precise and up-to-date information.

Background&
Chapter 1
Chapter 2
Week 1

Chapter 3
Chapter 4
Week 2

Chapter 5 except 5.4.3, 5.4.4
Chapter 6 up to (including) 6.4.1
Week 3
Chapter 9 except 9.1
Week 4

Chapter 10 except 10.2, 10.4.2
Chapter 11: Section 11.3 except for 11.3.6
Week 5
Chapter 12
Chapter 13: Sections 13.1, 13.2, 13.5
Week 6
Chapter 15 except 15.3.4, 15.4
Chapter 16
Week 7
Chapter 17
Chapter 18 except 18.6
Week 8
Chapter 19: Sections 19.1, 19.2.2
Chapter 20 except for 20.5, 20.7
Week 9
Chapter 21: Sections 21.1, 21.2, 21.3.1, 21.7, 21.8
Chapter 22: Sections 22.1, 22.2
Week 10
Chapter 23: Sections 23.2, 23.7
Chapter 24

Overview of all CS228 assignments

There will be nine weekly quizzes, two programming assignments, two problem sets, a programming project and a final take-home problem set. The problem sets will consist of five to eight problems. Programming assigments and the programming project may be completed in teams of up to 2 students and will be implemented in MATLAB. The programming assignments will typically require around 30-50 lines of coding. We will provide starter code and stubbed out functions that you will be implementing. The actual coding will tend to be relatively straightforward, so even students with little MATLAB experience should be able to complete the assignments. The programming project will include an open-ended component that will allow you to implement a range of strategies and experiments, though we will provide a list of possibilities to help guide you.

Every week the students are expected to watch the required segments of the online lectures and complete the corresponding quizzes. See the "syllabus" for the weekly reading schedule as well. The quizzes are due by Wednesday at midnight of each week, starting on 1/13. No late days are allowed on the quizzes. The quizzes are open book, but each student must take the quiz alone.

Both written assignments and programming assignments will be due two weeks after they go out, with the exception of the project. For instructions on submitting the assignments, see the "Submission instructions" section.

• Problem Set 1: Out 1/12, Due 1/26 at 11:00am
• Programming Assignment 1: Out 1/19, Due 2/2 at 11:59pm
• Problem Set 2: Out 2/2, Due 2/16 at 11:00am
• Programming Assignment 2: Out 2/9, Due 2/23 at 11:59pm
• Programming Project: Out 2/2, Due 3/9 at 11:59pm
• Final Problem Set: Out 3/4, Due 3/18 at 3:30pm
Programming Assignment 3

The mean was 83.3, with a standard deviation of 28.

CS228 course description, course format, prerequisites, and textbook

### Course Description

The course covers modeling (knowledge representation) languages suitable for dealing with an uncertain world. The focus is on probabilistic models, including Bayesian networks, Markov networks, hidden Markov models, and influence diagrams. The course will cover syntax, probabilistic semantics, inference, and learning algorithms for these frameworks. It will also discuss recent applications to domains such as speech recognition, medical diagnosis, data mining, statistical text modeling, and robot motion planning.

### Course Format

This quarter for the first time, CS228 will be offered through an innovative media-based Concept Flow Model. Short video segments will be integrated with online interactions advancing self-paced active learning. The course will be offered through the Computer Science department's new CourseWare learning management system which leverages the best of today's social networking including calendaring, course materials, discussion forum, blog functionality and a course wiki. There will be optional weekly live sessions to supplement the online material.

Starting the second week of the quarter, Tuesdays during class time we will be showing some of the core segments of the video lectures. We will make an effort to allocate time for questions about the material as well. Due to time constraints, we will not be able to show all the required segments, and students are expected to watch the rest of the lectures on their own time. We will announce ahead of time which segments are being shown.

Thursday class time will be devoted to an interactive session with guided small-group discussions about real-world problems and applications of the material covered in the online lectures. These sessions are optional but highly recommended.

Both lecture times will be televised by SCPD, but the interactive format does not lend itself nicely to being videotaped, and thus on-campus students are strongly encouraged to attend in person.

### Prerequisites

• Familiarity with the basic concepts of probability theory. (Stat116 is sufficient but not necessary.)
• Knowledge of basic computer science principles and skills at the level of CS103.
• Mathematical ability and the ability to understand and analyze fairly complicated algorithms and data structures. (CS161 is sufficient but not necessary.)

### Textbook

The primary reading materials will be the textbook by Professor Koller, co-authored with Nir Friedman:

The book is available at the Stanford Bookstore.

Optional books containing relevant material:

• Robert G. Cowell, A. Philip Dawid, Steffen L. Lauritzen, David J. Spiegelhalter. Probabilistic Networks and Expert Systems Springer-Verlag, 1999.
• Judea Pearl. Probabilistic Reasoning in Intelligent Systems Morgan Kaufmann, 1988.
Problem set 3

Deadline: Thursday, 3/18 at 3:30pm (no late submissions will be accepted)

See the "Submission Information" and "Honor Code" handouts under "Assignments" for more information

On the first 6 long questions, the mean was 69 with a standard deviation of 26. On the take-home final section with the short answers (graded out of 40 points) the mean was 26.5 with a standard deviation of 10.

Programming Assignment 1

See the "Submission Information" and "Honor Code" handouts under "Assignments" for more information

The mean was 96.5, with 15.5 standard deviation.

The .tar file has a pdf doc, pa1_0910.pdf, describing the contents of PA1, an assignment folder which has all the files needed for this assignment and that you have to complete, and a mysolution folder which is initialized empty for you to put down your solution files in.

Problem Set 2

See the "Submission Information" and "Honor Code" handouts under "Assignments" for more information.

The mean on the assignment was 78.5, with a standard deviation of 17 points.

Constructing Graph Cuts for MAP (Mandatory)Suppose we are performing MAP inference over 4 binary variables that have edges $X_1 - X_2$, $X_2 - X_3$, $X_3 - X_4$, and $X_4 - X_1$. The following are all nonzero energies that compose our energy function:

 $\epsilon_1[0] = 1$ $\epsilon_2[0] = 8$ $\epsilon_3[1] = 1$ $\epsilon_4[1] = 5$ $\lambda_{1,2} = 2$ $\lambda_{2,3} = 7$ $\lambda_{3,4} = 1$ $\lambda_{1,4} = 4$

The following is a cut of the corresponding graph construction for this problem, where we omit the graph edges:

What is the value of this cut?radio10The graph cut implies an assignment to $X_1, X_2, X_3, X_4$ of $(0,1,1,0)$. This incurs penalties from the terms $\epsilon_1[0] = 1$, $\epsilon_3[1] = 1$, $\lambda_{1,2} = 2$, and $\lambda_{3,4} = 1$, which is a total energy of 5.0203041509012The above network is now modified so that each variable can take on 3 possible values instead of 2. The above energies remain the same, and we have the following additional energy terms over the variables $X_3$ and $X_4$:

 $\epsilon_{3,4}[0,2] = 2$ $\epsilon_{3,4}[1,2] = 3$ $\epsilon_{3,4}[2,2] = 1$ $\epsilon_{3,4}[2,0] = 6$ $\epsilon_{3,4}[2,1] = 4$

We will now perform alpha-expansion to try to solve the MAP problem. We begin with the assignment $(0,0,0,0)$ to $X_1,X_2,X_3,X_4$, and the target label is 2. What is the value of the new energy potential $\epsilon'_{3,4}[t_3^0,t_4^1]$?radio100112030406Will we be able to compute the exact MAP assignment of the new energy (after the alpha-expansion) in the previous question using a graph cut?radio10The energy is submodular (the only terms you need to check are the new terms), sot he graph cut will be exact.1Yes0NoMax Product Loopy Belief Propagation (Mandatory)Suppose that you are in charge of setting up a soccer league for a bunch of kindergartners, and your job is to split the $N$ children into $K$ teams. The parents are very controlling and also uptight about which friends their kids associate with. So some of them bribe you to set up the teams in certain ways.

Each parent's bribe is of one of two forms: either "I will pay you a certain amount of dollars if you put my kid on the same team as $Y$," where $Y$ is some other child, or "I will pay you some amount of dollars if you put my kid on team $k$." In our notation, this translates to either $f_{i,j}(x_i,x_j) = A_{ij}\cdot \mathbf{1}\{x_i=x_j\}$ or $g_i(x_i) = B_i\cdot \mathbf{1}\{x_i=k\}$, respectively, where $x_i$ is the assigned team of child $i$ and $\mathbf{1}\{\}$ is the indicator function.

Being greedy and devoid of morality, you want to make as much money as possible from these bribes. What are you trying to find?radio10Each function tells you how much money you will make if the assignment to $\bar{x}$ satisfies the condition in the indicator function. The total amount of money is the sum of these functions, so you want to find the assignment that maximizes the sum.0$\textrm{argmax}_{\bar{x}} \sum_i g_i(x_i)$1$\textrm{argmax}_{\bar{x}} \sum_i g_i(x_i) + \sum_{i,j} f_{i,j}(x_i,x_j)$0$\textrm{argmax}_{\bar{x}} \prod_i g_i(x_i)$0$\textrm{argmax}_{\bar{x}} \prod_i g_i(x_i) \cdot \prod_{i,j} f_{i,j}(x_i,x_j)$You want to find the optimal solution to the above problem using a clique tree over a set of factors $\phi$. How could you accomplish the such that you are guaranteed to find the optimal solution? (Ignore issues of tractability, and assume that if you specify a set of factors $\phi$, you will be given a valid clique tree of minimum tree width.)radio10We want to compute $\textrm{argmax}_{\bar{x}} \sum_i g_i(x_i) + \sum_{i,j} f_{i,j}(x_i,x_j) = log \left[ \prod_i exp(g_i(x_i)) + \prod_{i,j} f_{i,j}(x_i,x_j) \right]$. Since maximizing $log(z)$ over $z$ is the same as maximizing $z$ over $z$, we can simply compute $\textrm{argmax}_{\bar{x}} \prod_i exp(g_i(x_i)) + \prod_{i,j} f_{i,j}(x_i,x_j)$, which is the definition max product inference. So setting the potentials appropriately and running clique tree inference (which is exact) is guaranteed to get the optimal solution.0Set $\phi_{i,j} = f_{i,j}$, $\phi_i = g_i$, get the clique tree, run sum product message passing, and decode the marginals0Set $\phi_{i,j} = f_{i,j}$, $\phi_i = g_i$, get the clique tree, run max product message passing, and decode the marginals0Set $\phi_{i,j} = exp(f_{i,j})$, $\phi_i = exp(g_i)$, get the clique tree, run sum product message passing, and decode the marginals1Set $\phi_{i,j} = exp(f_{i,j})$, $\phi_i = exp(g_i)$, get the clique tree, run max product message passing, and decode the marginals0The optimal solution is not guaranteed to be found in this manner using clique treesNow suppose you want to find an approximate solution to the above problem using a cluster graph instead (specifically, it will not necessarily satisfy all the requirements of a clique tree). How could you do this such that you are guaranteed to find an approximate solution? (Ignore issues of tractability, and assume that if you specify a set of factors $\phi$, you will be given a valid cluster graph and that message passing will converge.)radio10Even if the message passing converges, we cannot guarantee that the marginals can be decoded to find a consistent solution (for example, if there is a frustrated cycle) -- there may be no locally optimal joint assignment and no solution to the constraint satisfaction problem. See Chapter 13.4.1.2 in the book.0Set $\phi_{i,j} = f_{i,j}$, $\phi_i = g_i$, get the cluster graph, run sum product message passing, and decode the marginals0Set $\phi_{i,j} = f_{i,j}$, $\phi_i = g_i$, get the cluster graph, run max product message passing, and decode the marginals0Set $\phi_{i,j} = exp(f_{i,j})$, $\phi_i = exp(g_i)$, get the cluster graph, run sum product message passing, and decode the marginals0Set $\phi_{i,j} = exp(f_{i,j})$, $\phi_i = exp(g_i)$, get the cluster graph, run max product message passing, and decode the marginals1Even an approximate solution is not guaranteed to be found in this manner using cluster graphsLearning in Parametric Models (Mandatory)Suppose that you are playing Dungeons & Dragons, and you suspect that the 4-sided die that your Dungeon Master is using is biased. In the past 60 times that you have attacked with your dagger and the 4-sided die was rolled to calculate how many hit points of damage you inflicted, 20 times it has come up 1, 15 times it has come up 2, 15 times it has come up 3, and 10 times it has come up 4.

Let $\theta_1$ be the true probability of the die landing on $1$, and similarly for $\theta_2$, $\theta_3$, and $\theta_4$. You want to estimate these parameters from the past 60 rolls that you observed using a simple multinomial model. What are the sufficient statistics of those rolls?radio10The sufficient statistics for a multinomial model are the "counts" of each possible result.1The number of times each digit was rolled0The number of times each digit was rolled times the digit itself0The number of times each digit was rolled divided by the digit itselfIn the above scenario, what is the unique Maximum Likelihood Estimate (MLE) of the parameters $\theta$?radio10The MLE solution is $\theta_i = \frac{M[x^i]}{M}$, where $M$ are the sufficient statistics, with $M^1 = 20, M^2 = 15, M^3 = 15, M^4 = 10$, and $M = \sum_i M^i = 60$.0$\theta_1 = 20, \theta_2 = 15, \theta_3 = 15, \theta_4 = 10$1$\theta_1 = \frac{1}{3}, \theta_2 = \frac{1}{4}, \theta_3 = \frac{1}{4}, \theta_4 = \frac{1}{6}$0$\theta_1 = \frac{1}{6}, \theta_2 = \frac{1}{4}, \theta_3 = \frac{1}{4}, \theta_4 = \frac{1}{3}$0There is no unique solutionFinding the MAP Assignment from Max Product (Video)What are the values of the factor $\tau_1$?checkbox0$\tau_1(a) = \textrm{max}_b \psi(a,b) = \textrm{max}_b P(b\mid a)$.00:05:340$\tau_1(a^0) = .1$0$\tau_1(a^0) = .65$1$\tau_1(a^0) = .9$0$\tau_1(a^0) = 1$0$\tau_1(a^1) = .45$1$\tau_1(a^1) = .55$0$\tau_1(a^1) = 1$0$\tau_1(a^1) = 1.35$Decomposed Likelihood Function for a BN (Mandatory)Suppose we want to estimate the parameters of the Bayesian network $X\rightarrow Y$ with table CPDs, where both variables are binary.

We have the following set of fully observed data instances. The first column is the instance and the second column is how many times that instance appears in our dataset.

 Instance # $\langle x^0,y^0 \rangle$ 3 $\langle x^0,y^1 \rangle$ 1 $\langle x^1,y^0 \rangle$ 2 $\langle x^1,y^1 \rangle$ 4

Which of the following is equal or equivalent to the likelihood function for the given network over this dataset? (There may be more than one correct answer.)checkbox10The two correct answers are equivalent because $\Large{\theta_{x^1} = 1-\theta_{x^0}}$. For the general form, see Chapter 17.2 and in particular Equation 17.4 (p. 726).0$\Large{\theta_{x^0}^4 \cdot \theta_{y^0\mid x^0}^3 \cdot \theta_{y^0\mid x^1}^2}$0$\Large{\theta_{y^0\mid x^0}^3 \cdot \theta_{y^1\mid x^0}^1 \cdot \theta_{y^0\mid x^1}^2 \cdot \theta_{y^1\mid x^1}^4}$1$\Large{\theta_{x^0}^4 \cdot \theta_{x^1}^6 \cdot \theta_{y^0\mid x^0}^3 \cdot \theta_{y^1\mid x^0}^1 \cdot \theta_{y^0\mid x^1}^2 \cdot \theta_{y^1\mid x^1}^4}$1$\Large{\theta_{x^0}^4 \cdot (1-\theta_{x^0})^6 \cdot \theta_{y^0\mid x^0}^3 \cdot (1-\theta_{y^0\mid x^0})^1 \cdot \theta_{y^0\mid x^1}^2 \cdot (1-\theta_{y^0\mid x^1})^4}$0$\Large{\theta_{x^0}^{4/10} \cdot \theta_{y^0\mid x^0}^{3/4} \cdot \theta_{y^0\mid x^1}^{2/6}}$0$\Large{\theta_{y^0\mid x^0}^{3/4} \cdot \theta_{y^1\mid x^0}^{1/4} \cdot \theta_{y^0\mid x^1}^{2/6} \cdot \theta_{y^1\mid x^1}^{4/6}}$0$\Large{\theta_{x^0}^{4/10} \cdot \theta_{x^1}^{6/10} \cdot \theta_{y^0\mid x^0}^{3/4} \cdot \theta_{y^1\mid x^0}^{1/4} \cdot \theta_{y^0\mid x^1}^{2/6} \cdot \theta_{y^1\mid x^1}^{4/6}}$Which of the following are maximum likelihood estimates for the parameters $\bar{\theta}$ of the given Bayesian network ($X\rightarrow Y$)? (We aren't including all values of $\bar{\theta}$.)checkbox10$\Large{\theta_{x^0,y^0}}$ isn't a network parameter. See Equation 17.5 in the book (p. 726) for the general formula for the MLE estimates of $\bar{\theta}$.0$\Large{\theta_{x^1} = \frac{4}{10}}$1$\Large{\theta_{x^1} = \frac{6}{10}}$0$\Large{\theta_{x^0,y^0} = \frac{3}{4}}$0$\Large{\theta_{x^0,y^0} = \frac{3}{10}}$0$\Large{\theta_{y^0\mid x^1} = \frac{2}{4}}$1$\Large{\theta_{y^0\mid x^1} = \frac{2}{6}}$0$\Large{\theta_{y^0\mid x^1} = \frac{2}{10}}$Max Product Variable Elimination (Video)Which strategy have we used in the past for inference that can be used analogously on this equation next?radio0$P(a)$ can be pushed through $max_b$.00:01:220Exploiting independence to break up terms0Removing a term (because it sums or maxes to 1 or a constant)1Pushing terms through summations or maxesIf we eliminate all variables in this way, what do we get at the end?radio000:07:560Marginals over one variable0Marginals over all variables0A number (the value of the partition function)1A number (the unnormalized probability of the MAP assignment)0The MAP assignment for one variable0The MAP assignment for all variablesWhat is $\tau_1(A)$? It is the unnormalized probability of...radio0As will be explained now, $\tau_1(A)$ represents a conditional max, which is the maximum value of $B$ conditioned on the value of $A$.00:10:390$a^*$, the MAP solution for $B$0$a^*$ and $b^*$, the MAP solution for both variables0$A$, for each possible value of $A$0$A$ and $b^*$, for each possible value of $A$, where $b^*$ is the MAP solution for $B$1$A$ and $b^*_a$, for each possible value of $A$, where $b^*_a$ is the most likely solution for $B$ given the assignment $A=a$MAP Inference (Video)The MAP problem is a max of a product of factors. Which problem that we've seen does this seem similar to in terms of form and type of computation?radio0If we replace the max with a sum, it's the same problem. Taking the max over a variable or set of variables is similar to taking a sum in terms of the the type of computation done. For this reason, many of the methods we've used for sum product inference (such as variable elimination) can be slightly modified and used for MAP inference.00:04:301Sum product inference0Computing the partition function0Particle filteringParticle Filtering (Video)In regular likelihood weighting, a sample is an assignment to all the variables in the network. If we apply likelihood weighting to a DBN, what do you think a single sample should be?radio0If we apply straightforward likelihood weighting to the unrolled DBN, then a single sample is the full trajectory of all variables in all time slices. This is the approach we will consider first.00:01:040An assignment to all variables in a given time slice0An assignment to a single variable throughout all time slices1An assignment to all variables throughout all time slicesI thought in likelihood weighting a single sample was an assignment through all time slices, but here it's just the assignment up to the current time slice $t$. How is this still likelihood weighting?radio0Since we don't want to or can't wait until the final time slice (since there may not be one), we are continuously running the likelihood weighting algorithm as we go. We keep track of many samples all at once instead of getting each full sample one at a time.00:02:461At this point we are still in the middle of the likelihood weighting algorithm and are continuing to sample as the time slices progress0This is not equivalent to regular likelihood weightingWhich general problem of likelihood weighting are we suffering from in this case?radio0When the evidence is at or near the leaves, when we sample most or all of the variables we are ignoring the evidence, so our samples can be very unlikely given the evidence and the weights will therefore be small. This means that we'll need very many samples to get any reasonable estimate of our target distribution. In this case the evidence is all at the leaves.00:10:210The distribution is hard to accurately sample from1The weights are small when the evidence is all at the leaves0Computing the weights can be inaccurateWhy shouldn't we weight these samples?radio000:15:571Because we already took the weight into account when we "resampled" them, so we would be double-counting0Because this is now regular forward sampling without weights0Because the weight would involve the entire network and be impossible to computeEntanglement (Mandatory)What makes inference in DBNs difficult?checkbox10(a) is not true. (c) is incorrect because we can apply clique tree inference to a DBN, it just might be slow and undesirable because of (b) and (d).0(a) As $t$ grows large, we generally lose all independencies in the ground network1(b) As $t$ grows large, we generally lose independencies of the form $(X^{(t)} \perp Y^{(t)} \mid Z^{(t)})$0(c) Standard clique tree inference cannot be applied to a DBN1(d) In many networks, maintaining an exact belief state over the variables requires a full joint distribution over all variables in each time sliceWhich independencies hold in the following 2-TBN for all $t$?

(Note, it may be helpful to draw the unfolded DBN for several slices)checkbox10$Weather^t$ is blocked by $Weather^{t-1}$ for all $t$. We know immediately that none of the others can be correct from entanglement.0$(Weather^t \perp Velocity^t \mid Obs^{1...t})$0$(Failure^t \perp Velocity^t \mid Obs^{1...t})$1$(Weather^t \perp Velocity^t \mid Weather^{(t-1)}, Obs^{1...t})$0$(Weather^t \perp Location^t \mid Velocity^t, Obs^{1...t})$0None of theseParticle Filtering (Mandatory)What is the key difference between particle filtering and likelihood weighting for DBNs?radio10Likelihood weighting keeps the same set of particles, while particle filtering resamples at each step, so the set of particles stays relevant and helpful.1In particle filtering, the particles are resampled at each time step.0In particle filtering, the particles are reweighed at each time step.0The particle filter performs exact inference at each time step.0Particle filtering starts with more particles.In general, if we initialize a bootstrap particle filter with a set of particles, where all particles are the same, i.e. $x(0)[m] = x$, for all $m=1,...,M$, then as the algorithm proceeds...radio10Each particle progresses forward independently through the transition model, so equal particles can become different in the next time step. The belief state will change based on how well the particles match the observations.0the belief state and all particles will remain unchanged: $x(t)[m] = x$0the belief state will change, but all particles will be equal, $x(t)[m] = x(t)[n]$1the belief state will change and the particles will become different, $x(t)[m] != x(t)[n]$We are using bootstrap particle filtering in the following DBN, where $A, B, C, O_1$ and $O_2$ are all binary random variables, and have CPDs as provided. At time $t$ we have 5 samples, with corresponding weights, which are shown in the table below.

We now start sampling for time $t+1$, and we randomly choose the first sample ($\langle a^0, b^1, c^1 \rangle$) for propagation. What is the probability of $\langle a^0,b^1,c^0 \rangle$ being the propagated sample for time $t+1$?radio10The probability is $P(A',B',C' \mid A,B,C) = P(A' = a^0\mid A = a^0,B = b^1)P(B' = b^1 \mid B = b^1, C = c^1)P(C' = c^0 \mid C = c^1) = .3 \times .9 \times .4 = .108$.0.0360.0811.1080.40We can't compute the probability without knowing the values for $O_1$ and $O_2$.Suppose that $\langle a^0, b^1, c^0 \rangle$ was in fact generated for time $t+1$, and that we observe $O_1^{(t+1)}=o_1^1$ and $O_2^{(t+1)}=o_2^0$. What is the (unnormalized) weight of the sample $\langle a^0, b^1, c^0 \rangle$?radio10This is simply the probability of the observations given the state of the sample, so 0.3 * 0.95 = .285.0.13240.23751.2850.350.5630.750.95Gibbs Sampling (Video)Suppose we have a network $A\rightarrow B\rightarrow C$. If our initial sample (at time $t=0$) is $\langle A^0=0, B^0=0, C^0=0 \rangle$, what do we sample $A^1$ (the value of $A$ at time $t=1$) from?radio0We sample each variable from its conditional distribution given all other variables.00:07:200$P(A)$0$P(B=0|A)$1$P(A|B=0,C=0)$Again suppose we have a network $A\rightarrow B\rightarrow C$ and that our initial sample is $\langle A^0=0, B^0=0, C^0=0 \rangle$. Now suppose we sample $A^1=1$. What do we sample $B^1$ (the value of $B$ at time $t=1$) from?radio0We keep the new assignment to $A$, and again sample $B$ from its conditional distribution given all other variables.00:07:330$P(B)$0$P(B|A=0)$0$P(B|A=1)$0$P(B|A=0,C=0)$1$P(B|A=1,C=0)$How do we compute this value efficientlycheckbox0The chain rule allows us to compute the numerator by simply multiplying all factors together (operations are linear in the number of factors). We can get the denominator by simply summing out $D$ from the numerator (which is linear in the number of values of $D$). Therefore it's always tractable.00:12:161Multiply all factors together to get the numerator0Perform clique tree inference to get the numerator1Sum the numerator over $D$ to get the denominator0Perform clique tree inference to get the denominator0It can't always be computed efficientlyHow can we simplify this expression further to reduce computation?radio0Only the terms that don't involve $D$ can be summed out. Those that involve $D$ are for a specific $D$ in the numerator, and over all different values of $D$ in the denominator.00:15:331Cancel out all terms0Cancel out all terms that don't involve $D$0The denominator sums to 1What do we sample $X_I^{(1)}$ from?radio000:19:421$P_{\Phi_e}(I|d^0,g^1)$0$P_{\Phi_e}(I|d^1,g^1)$0$P_{\Phi_e}(I|d^0,g^1,L,S)$0$P_{\Phi_e}(I|d^1,g^1,L,S)$Answering Queries with MCMC Samples (Video)What's a way we could estimate $P(x_i)$ from the run of a chain?radio000:06:450Look at $P(x_i|X_{-i})$, which is what we sample $X_i$ from in the current iteration1Count how frequently $x_i$ occurred over a recent window of samples of the chain0Run clique tree inference to get $P(X_i)$What's the problem with this approach?radio000:13:200There will be too many samples to collect1The samples will be highly correlated and won't be independently distributed0The chain may "unconverge"Inference in Temporal Models (Video)The smoothing task conditions on all the observations while the tracking task only conditions on the observations up to time $t$. Since smoothing uses more information, why would we ever perform the tracking task?radio000:06:510The future evidence can be misleading1In live settings, we want the current belief state and we don't know what the future observations will be0Smoothing can be intractable, while tracking is notProperties of the Gibbs Chain (Video)What does it mean for a chain to be regular?radio0A state space can be connected in the directed sense but the chain can still not be regular. For example, if state $x_1$ has no incoming transitions, the graph may be connected, but $x_1$ cannot be reached from any other state.00:01:390That there exists a $k$ such that there are no periods of size $k$ or greater1That there exists a $k$ such that every state can be reached from every other state in $k$ steps with some probability0That the state space is connectedIf we make this modification with $\epsilon$ extremely small (say, $1\times 10^{-9}$), then what is now true of the Gibbs chain?radio0The reason these answers are wrong is...00:07:050It still doesn't technically satisfy regularity1It technically satisfies regularity, but will take an extremely large amount of samples to mix0It technically satisfies regularity, and will now mix with a reasonable number of samplesCollapsed Particles (Video)We call $g(\bar{x_d})$ an expectation. What does it correspond to?radio000:06:040${\Large E_{P_\phi(\bar{x}_d)}[f(\bar{x})]}$1${\Large E_{P_\phi(\bar{x}_d\mid \bar{x}_p)}[f(\bar{x})]}$0${\Large E_{P_\phi(\bar{x}_p)}[f(\bar{x})]}$0${\Large E_{P_\phi(\bar{x}_p\mid \bar{x}_d)}[f(\bar{x})]}$Given $\bar{g}$, what independencies hold in this graph that we can take advantage of?radio0The fact that the $I$ variables are all independent given all of the $D$ variables leads to the natural partition of variables. Given the $D$ variables, it will be easy to perform exact inference on the $I$ variables (and the opposite is true as well).00:09:450$(I_i \perp I_j \mid \bar{g})\ \forall i,j$0$(I_i \perp I_j \mid D_i, D_j, \bar{g})\ \forall i,j$1$(I_i \perp I_j \mid \bar{D}, \bar{g})\ \forall i,j$0$(I_i \perp I_j \mid \bar{I}_{-ij}, \bar{g})\ \forall i,j$, where $\bar{I}_{-ij}$ is all variables $I$ except for $I_i$ and $I_j$We've decided to make the variables $\bar{D}$ the "particle" part of the distribution and $\bar{I}$ the "distributional" part. This means that we will keep track of samples over $\bar{D}$ but not $\bar{I}$. Now if we want to perform Gibbs sampling over $P_\Phi(\bar{d})$, how do we sample the $D$'s? (Consider what the Gibbs sampling algorithm specifies, not how we can tractable perform the sampling, which we will discuss.)radio0We don't have specific values for $\bar{I}$, so it isn't possible to sample from $P(d_k\mid \bar{d}_{-k},\bar{I})$. Instead, we will only consider $P(\bar{d})$, from which we will compute and sample from $P(\bar{d}\mid \bar{d}_{-k})$. We will now discuss how to tractably find this distribution.00:12:000Rejection sampling0Sample each $D_i$ from the prior $P_\Phi(D_i)$ and weight it according to $\bar{G}$ and $\bar{I}$0Initialize $\bar{D}$ and then sample each $D_i$ from $P(d_k\mid \bar{d}_{-k},\bar{I})$1Initialize $\bar{D}$ and then sample each $D_i$ from $P(d_k\mid \bar{d}_{-k})$How do we "compute marginal probabilities" concretely?radio0This can be done tractable because each $I_i$ appears separately in its own factor, so this is just $N$ separate summations.00:20:500Sample $\bar{I}$ and then compute the resulting product of reduced factors1Sum out each $I_i$ from the reduced factors0Sum out each $D_i$ from the reduced factorsWhat allows us to compute $P_\Phi(I_i\mid \bar{d}^{(l)})$?radio0The motivation for splitting up variables as we did was that $(I_i \perp I_j \mid \bar{d},\bar{g})$. So given the sample $\bar{d}^{(l)}$, we simply multiply the $M$ reduced factors that involve $I_i$, and then normalize them by summing out $I_i$, which is tractable.00:25:521Each $I_i$ is independent of all other $\bar{I}$ given $\bar{d}$, so this is just a (normalized) product of a few factors0We can run clique tree inference0We may have to run (approximate) cluster graph inferenceUp-Down Clique Tree Message Passing (Video)Which cluster(s) can we start with according to the specification of the algorithm?checkbox0We start with any one of the leaf clusters.00:09:201$(T,V)$0$(T,L,A)$0$(B,L,A)$1$(B,L,S)$1$(X,A)$0$(D,B,A)$After multiplying these factors together, in order to complete the message from $C_i$ to $C_j$, which variables do we sum out?radio0The message has the scope of the sepset $S_{ij}$, but after multiplying these factors together their scope is $C_i$, so we need to sum out $C_i - S_{ij}$ so that the resulting scope is $S_{ij}$.00:15:320$C_i$0$C_j$0$S_{ij}$1$C_i - S_{ij}$0$C_j - S_{ij}$If this theorem is true, which of the following steps would we take to solve the inference task of computing $P_{\cal{B}}(X)$ in some Bayes Net $\cal{B}$?checkbox0The final belief of the root will be $P_{\cal{B}}(C_r)$. Since $X \in C_r$, just sum out $C_r - \{X\}$ from this belief to get $P_{\cal{B}}(X)$00:19:041Make a cluster tree over the network of $\cal{B}$1Assign the factors of $\cal{B}$ to the cluster graph1Set some cluster with $X$ in it as the root0Make sure all clusters with $X$ in them are leaves1Do a single pass of messages to the root1Sum out everything but $X$ from the final belief of the root0Sum out $X$ from the final belief of the rootTracking in Temporal Models (Video)Which of the terms in this formula can we immediately and tractably compute?checkbox0The summation is over one variable, so it's tractable. $P(X^{t+1}\mid X^t)$ is the state transition model, which is given to us in the DBN, so we can just look it up. $P(X^t\mid O^{1\ldots t})$ is the belief state from the previous time step, which we assume is given to us.00:04:271${\Large \sum_{X^t}}$1${\Large P(X^{t+1}\mid X^t)}$1${\Large P(X^t\mid O^{1\ldots t})}$0None of theseGiven the graphical structure, how can we simplify the term $P(o^{t+1}\mid X^{t+1},o^{1\ldots t})$ so that it's in closed form?radio0The observation is independent of everything else given the state in that time step. $P(o^{t+1}\mid X^{t+1})$ is closed form because we have access to this term directly from the DBN -- it's the observation model.00:07:191$P(o^{t+1}\mid X^{t+1})$0$P(o^{t+1}\mid o^{1\ldots t})$0$P(o^{t+1})$0We cannot rewrite it in closed formEntanglement in Temporal Models (Video)Are there any independencies that hold within a given time slice? That is, are the two highlighted nodes independent given any set of variables inside their time slice?radio0Given only evidence within the time slice, there is an active path between the two highlighted nodes that traces back to previous time slices. This means that if we consider just the belief state of a single time slice in isolation, we can't use independencies to factor it in any way.00:03:130Yes1NoWe say that there are no independencies that hold within a given time slice. What about independencies that hold in the network as a whole between time slices? Why can't we take advantage of those?radio0The question at hand is: can we factor the (exponentially large) belief state over each time slice so that we can tractably represent it? Only independencies within a time slice can help with that. Using beliefs in the first place does in fact use the Markov assumption, which is an independence between time slices. But exact factorization of the belief state itself is impossible in general.00:04:520These independencies disappear in a DBN1We are interested in factoring the belief state, which is only over a single time sliceMax Product Message Passing in Clique Tree (Video)In sum-product message passing, the final belief is $\beta_i(\bar{c_i}) = \sum_{\bar{w}} \tilde{P}_\Phi(\bar{c_i},\bar{w})$, where $\bar{w} = {\cal X} - \bar{c_i}$. What do you think the final $\beta_i(\bar{c_i})$ is for the case of max-product message passing?radio0The final belief for value $\bar{c_i}$ is the unnormalized probability of the most likely full assignment to $\cal{X}$ that is consistent with the value $\bar{c_i}$.00:02:321$\textrm{max}_{\bar{w}} \tilde{P}_\Phi(\bar{c_i},\bar{w})$0$\textrm{max}_{\bar{c_i}} \tilde{P}_\Phi(\bar{c_i},\bar{w})$0$\textrm{max}_{\bar{x_i}} \tilde{P}_\Phi(\bar{c_i},\bar{w})$0$\textrm{max}_{\bar{c_i}} \tilde{P}_\Phi(\bar{c_i})$In sum-product message passing, the clique tree was calibrated when neighboring clusters agreed on their marginals. What do you think the condition is for a clique tree to be max-calibrated?radio0That is, neighboring clusters agree on their max-marginals, meaning they agree on the values of the conditional max assignment.00:06:180$\sum_{C_i-S_{ij}} \beta_i = \sum_{C_j-S_{ij}} \beta_j$1$\textrm{max}_{C_i-S_{ij}} \beta_i = \textrm{max}_{C_j-S_{ij}} \beta_j$0$\textrm{max}_{C_i} \beta_i = \textrm{max}_{C_j} \beta_j$If we take one of the partial assignments that maximizes each cluster separately, will we necessarily get a consistent assignment that is one of the MAP solutions?radio0For example, $(a^1,b^1)$ is one of the max assignments to the first cluster and $(b^0,c^0)$ is one of the max assignments to the second, but these partial assignments conflict on the value of $B$.00:14:230Yes1NoFor regular (non-MAP) inference, the reason that using clique trees is better than variable elimination is that the former can help answer many queries at once, while the latter only answers a single query. But this isn't the case for MAP, because variable elimination gives us the entire answer we seek.

So why would we use clique trees to solve MAP instead of variable elimination?radio0If there is a unique MAP solution (we usually don't know in advance if there is), then we can find the MAP solution from clique tree inference by looking at each max-marginal separately. This is faster than doing traceback, which is always necessary for variable elimination.

So when there's a unique solution (we will see later how to detect this situation), clique trees are faster, but otherwise they are the same efficiency as variable elimination.00:19:291Because we can sometimes get the MAP solution more efficiently0Because we can always get the answer more efficiently0Because they can help answer other MAP queriesMax Product Loopy Belief Propagation (Video)Roughly speaking, what are the elements that help prove that unambiguous cliques imply a unique and globally consistent joint assignment?checkbox0The proof goes roughly like this: Consider a variable $X_i$ in cluster $C_j$. The unique maximum value of the max-marginal implies an assignment $x_i^*$ for $X_i$. Because of the cluster graph RIP, all occurrences of $X_i$ in the cluster graph are connected via a tree through the sepsets. Because neighboring max-marginals agree on their sepsets, the maximum value and corresponding assignment $x_i^*$ will be consistent in this tree. Therefore, if we take the maximum assignment in each cluster, we will have a joint assignment that has a consistent value for $X_i$. The same is true for all variables, so the joint assignment will be consistent over all variables.00:07:051Each max-marginal has a unique maximum1Neighboring marginals agree on their sepsets0Sepsets are the intersection of the scopes of neighboring clusters1The Running Intersection Property for cluster graphs0The max-marginals are exactThis shows that often the solution we get from max-product loopy BP is "good enough" in the sense that a large class of local moves doesn't give a better solution. What needs to be the case for us to guarantee that we can find such a locally optimal solution?checkbox0If these conditions hold, then we have a locally optimal solution, but if message-passing doesn't converge or if the max-marginals are ambiguous, then we have to resort to other methods (either within the loopy BP framework or using other methods).00:18:561Message-passing must converge0There must be a unique MAP solution1The converged max-marginals must be unambiguous0The cluster graph must be a treeBayesian Modeling with the Beta Prior (Video)Without trying to understand all the terms in this distribution, what is one reason that it might it be a good one to use for $P(\theta)$ given that we established that the likelihood is proportional to $P(\theta) \cdot \theta^{M[x^1]} \cdot (1-\theta)^{M[x^0]}$?radio0The resulting likelihood proportional to something of the form $\theta^A \cdot \theta^B$. We will soon see the simple closed form solution for the Beta distribution, and since the likelihood (posterior) will have the same form, this means that we can use the same form of solution for the likelihood.00:05:541The terms have relating to $\theta$ in the likelihood all have the same form, and this will allow us to easily find a closed form solution.0It allows us to cancel out some of the data terms.0It makes the posterior equal to the prior.Which distribution, which we have seen a closed form solution for, is this equation proportional to?radio0It would be proportional to the Bernoulli distribution if the -1 term weren't there (meaning that it's also proportional to the Bernoulli distribution with the counts incremented by 1). The reason we use the Beta distribution is that the normalizing constant happens to work out so that it's actually exactly equal to $Beta(\alpha_1+M[1],\alpha_0+M[0])$.00:14:090$Beta(M[1],M[0])$1$Beta(\alpha_1+M[1],\alpha_0+M[0])$0$Bernoulli(M[1],M[0])$0$Bernoulli(\alpha_1+M[1],\alpha_0+M[0])$The predictive posterior is $\large{\frac{\alpha_1+M[1]}{\alpha_1+M[1]+\alpha_0+M[0]}}$. What does this mean for the three cases listed?checkbox000:17:561We can set an even prior that's weaker (lower $\alpha$) on the thumbtack than the dime so the first case remains near $0.3$ but the second is closer to $0.5$0We can set an even prior that's stronger (higher $\alpha$) on the thumbtack than the dime so the first case remains near $0.3$ but the second is closer to $0.5$1We can set a strong, even prior on the dime so the estimate is near $0.5$ in the second case, but near $0.3$ in the third0We can set a weak, asymmetric prior on both the thumbtack and dime so the first case remains near $0.3$ but the second is closer to $0.5$0We can set a strong, asymmetric prior on both the thumbtack and dime so the first case remains near $0.3$ but the second is closer to $0.5$It was mentioned earlier that the hyperparameters are like "imaginary counts." Given this equation, what is one way to interpret the various components (choose multiple answers)?checkbox000:17:391The hyperparameters $\alpha$ are counts we've "seen" in the past0The hyperparameters $\alpha$ are counts we haven't yet seen1The total $\alpha_1 + M[1]$ is the total number of heads we have "seen" so far0$\alpha_0 + \alpha_1$ is the size of the current dataset1The predictive posterior $(P(X[M+1]\mid D))$ is the MLE of everything we've "seen" so farBayesian Priors for BNs (Video)What does writing the likelihood like this allow us to do?radio000:03:151Break it up by each parameter separately, and solve for each one easily0Cancel out one or more terms0Sum in the data to get a closed form equationIf we are using table CPDs, which independence property can help us further split up the likelihood function?radio0Note that this independence doesn't arise directly from the graph. It is because the CPD of Y is a multiplexer, so it depends on either $\theta_{Y\mid x^0}$ or $\theta_{Y\mid x^1}$ for a given value of $X$, but not both.00:06:130$(\theta_{Y\mid x^0}, \theta_{Y\mid x^1} \perp X)$0$(\theta_{Y\mid x^0} \perp \theta_{Y\mid x^1} \mid X)$1$(\theta_{Y\mid x^0} \perp \theta_{Y\mid x^1} \mid X,Y)$The Dirichlet Prior (Video)The Beta distribution is a special case of the Dirichlet distribution in the sense that...checkbox0The binomial distribution is a special case of the multinomial distribution, where again $k=2$.00:02:551$k=2$0$\alpha_i$ in the Beta distribution is the same as $\alpha_i-1$ in the Dirichlet distribution1Dirichlet is an appropriate prior for parameters $\theta$ of a multinomial distribution, while Beta is appropriate for a binomial, which is a special case of the multinomial distributionLearning with Missing Data (Video)In the case of fully observed data (which has been the setting for the entire course so far), how does this model apply?radio0When the $O$'s are all equal to 1 (observed), then all of the $X$'s are observed, which means that the $Y$'s are deterministically equal to the $X$'s and there are no missing values.00:07:591All of the $O$'s are 10The $O$'s are equal to the $X$'s0$Y_i$ is equal to $X_i$ plus some noiseWhich of the following is/are true of MAR?checkbox000:17:241If $P_{missing}$ is MAR, then it is MCAR0Whether the data is missing can directly depend on the missing value and still be MAR1The setting mentioned where a doctor doesn't run certain tests based on the observed symptoms and diagnosis is MAR0If each $O_x$ is the child of the corresponding $X$, then this setting is MARGradient Ascent for MN Learning (Mandatory)What makes learning in Markov Networks more difficult than learning in Bayes Nets?checkbox10The partition function prevents the likelihood function from decomposing so that the parameters cannot be learned independently. Furthermore, when learning the parameters, inference must be performed at each stage.1The presence of a global partition function couples parameters0Markov Networks allow cycles, so structure search is harder1We generally use iterative methods, each of whose steps can require inference in the networkLikelihood with Missing Data (Video)We saw earlier that each likelihood function has a unique local maximum and is easily optimized. Is this true for the sum of two likelihoods?radio0The new likelihood is now much harder to maximize.00:06:150Yes1NoWhat does having multiple global optima necessarily mean?radio0The likelihood function doesn't distinguish between the true parameters and other global optima, so there is no way to know if a given set of parameters is the true one using the likelihood function.00:09:000Finding any global optimum is difficult1Even if we find a global optimum easily, it may not be the true underlying parameters0If we find a global optimum, it is necessarily the true underlying parameters0None of theseWhat happens if we don't observe $X$?radio0There is now an active path between $\theta_{Y\mid X}$ and $\theta_X$ through the v-structure with $Y$ as the observed child.00:12:010New independencies between the parameters arise1$\theta_{Y\mid X}$ is no longer independent of $\theta_X$0$\theta_{Y\mid X}$ and $\theta_X$ are independent but correlatedExpectation Maximization Algorithm (Video)In what sense is $\theta^1$ "better" than $\theta^0$?radio0Because $\theta^1$ is explicitly the maximum likelihood estimate for $D^+$, the likelihood of $\theta^1$ is greater than or equal to all other possible $\theta$'s, including $\theta^0$.00:06:270$\forall i,\ \theta^1_i \geq \theta^0_i$1$L(\theta^1 : D^+) \geq L(\theta^0 : D^+)$, where $L$ is the likelihood function and $D^+$ is the "completed" data0It is a step "closer" intuitively, but nothing concrete can be saidWhat is the scope of the distribution $Q_i$ for this network?radio0Each $Q_i$ is a distribution over the data missing in instance $i$ given the observed data for instance $i$.00:10:480$\{B,C\}$0$\{A,B,C,D\}$1Whichever variables are missing in instance $i$Why can we not necessarily compute $M[c,d]$ in the case of missing data?radio000:12:270Because $C$ and $D$ are never directly observed1Because $C$ and $D$ are not necessarily both observed in all instances0Because we don't know if there exist instances that we have never seenGiven the process we just described, what do these statements correspond to concretely?checkbox000:18:190"complete the missing data" corresponds to finding the MAP values of the unobserved variables in each instance1"complete the missing data" corresponds to finding the expected sufficient statistics for the dataset1"use completed data as if it were real" corresponds to finding the MLE for the expected sufficient statistics0"use completed data as if it were real" corresponds to finding the best parameters for all possible completions of the dataLearning CRFs (Mandatory)Recall that a conditional random field encodes a distribution $P(Y|X) = \frac{1}{Z(X)}P'(Y, X)$. Which of the following is/are advantage(s) of a CRF?checkbox10Because we don't explicitly model $P(X)$, we can use variables $X$ that may be very complicated or rich and model only the dependency of $Y$ on $X$.1We avoid encoding the distribution over a set of variables $X$ in our model1We can use a richer set of variables on which we condition than in a generative model1We can incorporate knowledge about a domain whose dependencies may be poorly understood0None of theseCausal Models (Mandatory)A survey in some state before the 2008 election showed that among women surveyed, more said they would vote for Obama than McCain, and the same is true for men. Can we conclude that among those surveyed overall, more would vote for Obama than McCain (assuming they were telling the truth)?radio10Simpson's Paradox cannot apply here. We have that $P(Obama|Male)>P(McCain|Male)$, $P(Obama|Female)>P(McCain|Female)$, and $P(Male)+P(Female) = 1$. Therefore, $P(Obama) = \sum P(Obama|Gender)P(Gender) = P(Obama|Male)P(Male) + P(Obama|Female)P(Female) > P(McCain|Male)P(Male) + P(McCain|Female)P(Female) = P(McCain)$.1Yes0NoFor a causal network B, which of the following is true in general?
radio10The probability of an interventional query is the probability according to the mutilated network.0(a)1(b)0(c)0(d)Interventional Queries (Mandatory)In the next three questions you will compute sufficient statistics with interventional data. The following table shows the counts for different interventions:

What is $M[x^0;y^0 z^0]$?radio10We use all counts for this assignment except those corresponding to $do(X=x^0)$.04011112015019According to the table above, what is $M[y^0;x^0 z^0]$?radio10Again, we add all counts except those for $do(Y = y^0)$.040718015019Given the above information, what is $M[x^0]$?radio10Again, we aggregate all counts consistent with $x^0$ except for those for $do(X=x^0)$.0709015117024Suppose we have learned the following network from some dataset $\mathcal{D}$: $X\rightarrow Y\rightarrow Z \leftarrow W$.
Which edges can we assert are causal in the underlying distribution $P$?checkbox10With finite data we can't even conclude that the learned network is I-equivalent to $P$, let alone that the edges are causal.0$X\rightarrow Y$0$Y\rightarrow Z$0$Z\leftarrow W$1None of theseIn the setting of the previous question, if we can assume that there are no confounding factors, which edges can we assert are causal?checkbox10With finite data we still can't conclude that the learned network is I-equivalent to $P$, so the lack of confounding factors doesn't help.0$X\rightarrow Y$0$Y\rightarrow Z$0$Z\leftarrow W$1None of theseIn the setting of the previous two questions, if we can assume that there are no confounding factors and that the network is a perfect map of the underlying distribution $P$, which edges can we assert are causal?checkbox10Because these edges form an immorality, any I-equivalent graph will have these directed edges, so we can conclude that they are causal (assuming no confounding factors).0$X\rightarrow Y$1$Y\rightarrow Z$1$Z\leftarrow W$0None of theseParameter Priors (Video)How can using a prior distribution over $\theta$ help get more appropriate estimates for the three situations on the left with the dime and thumbtack?checkbox0As in a normal Bayes net, the CPDs (including the prior over $\theta$) are formed separately from the actual observations and the inference process, so the latter two answers aren't right.00:05:051The prior can be different for the dime and thumbtack, expressing that we're more confident that the dime is 50-50 than the thumbtack.1The estimate for $\theta$ combines the influence of the prior and the data, so with little data the prior has more influence but with a lot of data the prior has less influence.0The prior distribution can depend on the amount of data.0The prior distribution can depend on the maximum likelihood estimate of the data.How does this new estimate for the next coin flip compare to the maximum likelihood estimate for $\theta$?radio0The new estimate is equivalent to the MLE estimate if we had one extra head and one extra tail in our observations (because $M[x^1]$ and $M[x^0]$ would both be incremented). So this is slightly closer to being uniform.00:11:420In this simple case it's the same as the MLE estimate, but it can be different in general1It's slightly closer to being uniform (.5) than the MLE estimate0It's slightly further from being uniform (.5) than the MLE estimate0It's different than the MLE estimate, but not necessarily closer to or further from being uniform (.5)Learning with Shared Parameters (Mandatory)Consider performing EM on a DBN, and how inference and learning can be done. Which of the followings is/are true?checkbox10Inference is done as usual in DBNs, which is on the unrolled network. Because the parameters in a DBN are shared through all time slices, they are learned together, which means that the M-step is performed on the template level.1The E step needs to be performed over the variables in the unrolled (ground) network.0The E-step can be performed over the template-level variables alone.0The M step needs to be performed over the variables in the unrolled (ground) network.1The M-step can be performed over the template-level variables alone.MAP with Graph Cuts (Video)Maximizing $\tilde{P}_\Phi(\cal{X})$ is equivalent to minimizing what?radio000:03:550$\prod_i \epsilon_i(x_i) \cdot \prod_{ij} \epsilon_{ij}(x_i,x_j)$0-$\prod_i \epsilon_i(x_i) \cdot \prod_{ij} \epsilon_{ij}(x_i,x_j)$1$\sum_i \epsilon_i(x_i) + \sum_{ij} \epsilon_{ij}(x_i,x_j)$0-$\sum_i \epsilon_i(x_i) + \sum_{ij} \epsilon_{ij}(x_i,x_j)$How does this guarantee compare to other methods we've seen for solving this MAP problem (for pairwise binary MRFs) in terms of exactness and efficiency?radio000:08:361Clique tree inference can accomplish the same thing0Move making algorithms can accomplish the same thing0Third ChoiceHow does this guarantee compare to other methods we've seen for solving this MAP problem (for pairwise binary submodular MRFs) in terms of exactness and efficiency?checkbox0Clique tree inference can be intractable if the MRF is too connected and the resulting tree width is too large. Cluster graph inference isn't guaranteed to converge or find a globally consistent solution, regardless of the exactness issue. Move making algorithms that we've seen can't guarantee finding the optimal solution. This is a special case that can specifically be solved exactly with a min-cut algorithm, which runs in polynomial time.00:08:360Clique tree inference accomplishes the same thing0Cluster graph inference accomplishes the same thing except it won't be exact0Move making algorithms accomplish the same thing1We haven't seen another method for exactly solving this problem with this efficiencyConstructing Graph Cuts for MAP (Video)We are trying to reduce our energy minimization problem to the min-cut problem, which finds the graph cut of minimum cost. Looking at the form of the cost, when we do our reduction, what do you think the cost of a given cut will correspond to?checkbox0As we will see, the edges in the graph will correspond to energy terms, a cut will correspond to an assignment to $\cal{X}$, the cut cost will correspond to the total energy of an assignment to $\cal{X}$, and so minimizing the cost will correspond to minimizing the total energy (which is equivalent to maximizing the probability).00:07:300An assignment to $\cal{X}$0The probability of a given assignment to $\cal{X}$1The total energy of a given assignment to $\cal{X}$Why is this a worthwhile goal?radio000:11:001Because minimizing the cut cost will minimize the energy, which is equivalent to the original MAP problem0Because the decomposition is a good approximation to the original MAP problem0Because it will give us a convenient way to get an assignment to $\cal{X}$ from a given cut, though there are guarantees on how well this solves the MAP problemWhat should the edge weights be from $z_i$ to $z_j$?checkbox0The cost of the cut only counts edges going in one direction, so the penalty for putting $z_i$ and $z_j$ on different sides of the cut is $\lambda_{ij}$.00:13:4900 in both directions00 in one direction, $\lambda_{ij}$ in the other1$\lambda_{ij}$ in both directionsIntroduction to Parameter Learning (Video)We haven't seen models expressed in quite this way before. Which of the following is true of this plate model?checkbox0We are treating the continuous parameter $\theta$ as a random variable, though we haven't dealt much with continuous variables. The CPD of $X$ is a function of $X$ and $\theta$, where that function is written here.00:03:221$\theta$ is considered a random variable0$\theta$ is discrete (as opposed to continuous)0The CPD of $X$ is a table CPD0The CPD of $X$ is a multiplexer1The CPD of $X$ is a numerical function1Each coin toss has the same CPD0There is a different $\theta$ for every coin tossWithout doing any math, what do you think the optimal solution $\hat{\theta}$ is?radio0It's the fraction of heads that we observe.00:09:360$M[x^1]$0$M[x^1] - M[x^0]$1$\Large{\frac{M[x^1]}{M[x^1]+M[x^0]}}$Learning in Parametric Models (Video)Since the sufficient statistics for a coin toss were the total number of heads and tails, what does that imply about our model?radio000:09:161The order of the flips doesn't affect the likelihood function0The order of flips does affect the likelihood function0Doesn't imply anything about the order of the flipsParameter Learning in a Bayesian Network (Video)Based on our knowledge of the network and probabilistic operations, what is this equation equivalent to?radio000:03:380$\prod_m P(x[m]\mid \theta)P(y[m]\mid \theta)$0$\prod_m P(x[m]\mid y[m])P(y[m]\mid \theta)$1$\prod_m P(x[m]\mid y[m], \theta)P(y[m]\mid \theta)$What will observing that these independencies hold allow us to do?radio000:08:061Decompose the likelihood function0Ignore certain values of $X$ and $Y$0Ignore certain values of $\theta$The Dirichlet Prior (Mandatory)Suppose we are interested in estimating the distribution over the English letters. We assume an alphabet that consists of 26 letters and the space symbol, and we ignore all other punctuation and the upper/lower case distinction. We model the distribution over the 27 symbols as a multinomial parametrized by $\theta = (\theta_1,\ldots,\theta_{27})$ where $\sum_i \theta_i = 1$ and all $\theta_i \geq 0$.

Now we go to Green library and repeat the following experiment: randomly pick up a book, open a page, pick a spot on the page, and write down the nearest symbol that is in our alphabet. There is no correlation between the repetitions of this experiment, and we use $X[m]$ to denote the outcome of the $m$th experiment.

In the end, we have collected a dataset $D = \{x[1],\ldots,x[2000]\}$ consisting of 2000 symbols, among which "e" appears 260 times. We use a Dirichlet prior over $\theta$, i.e. $P(\theta) = Dirichlet(\alpha_1,\ldots,\alpha_{27})$ where each $\alpha_i = 10$. What is the predictive probability that letter "e" occurs with this prior, i.e., what is $P(X[2001] = "e" \mid D)$?radio10The prediction with Dirichlet priors needs to add the hyperparameters to our counts. In this problem, $\alpha = 10*27 = 270$ and $\alpha_5 = 10$ (the hyperparameter that corresponds to $\theta_5$), so the prediction is $\Large{\frac{260 + 10}{2000 + 270} = \frac{270}{2270}}$.0$\Large{\frac{260}{2000}}$0$\Large{\frac{270}{2000}}$0$\Large{\frac{270}{2010}}$0$\Large{\frac{260}{2270}}$1$\Large{\frac{270}{2270}}$0None of theseFor the previous question, suppose we have collected $M$ symbols, and let $\alpha = \sum_i \alpha_i$ (we no longer assume that each $\alpha_i = 10$). In which situation(s) does the Bayesian predictive probability using the Dirichlet prior (i.e., $P(X[M+1]\mid D)$) converge to the MLE estimation?checkbox10From the Eqn 17.11 in the book, we can see that the prediction with Dirichlet prior is a weighted average of the prior mean and the MLE estimate, and it's easy to see that when $M$ goes to $\infty$ or $\alpha$ goes to 0, the prediction with the prior converges to MLE. Intuitively, in the first case the data is washing out the influence of the prior, and in the second case the influence of the prior goes away.1$M \rightarrow \infty$ and $\alpha$ is fixed0$\alpha \rightarrow \infty$ and $M$ is fixed0$M \rightarrow 0$ and $\alpha$ is fixed1$\alpha \rightarrow 0$ and $M$ is fixed0None of theseBIC Score (Mandatory)Consider the following 4 graphs:

Which of the following statements is/are true?checkbox10I-equivalent graphs have the same likelihood score, and since mutual information is non-negative, adding edges cannot decrease the likelihood score.1$score_L(G1 : D) = score_L(G3 : D)$ for every dataset $D$0$score_L(G1 : D) \geq score_L(G2 : D)$ for every dataset $D$1$score_L(G4 : D) \geq score_L(G3 : D)$ for every dataset $D$1$score_L(G2 : D) \geq score_L(G3 : D)$ for every dataset $D$0$score_L(G2 : D) \geq score_L(G4 : D)$ for every dataset $D$0None of theseConsider the same 4 graphs as in the previous question, but now we will use the BIC score. Which of the following statements is/are true?checkbox10I-equivalent graphs have the same BIC score. Since the BIC score penalizes more complex model (more edges in the graph, i.e., more parameters), when adding edges, the likelihood score may go up but it might be canceled by the penalty term of the dimensionality.1$score_{BIC}(G1 : D) = score_{BIC}(G3 : D)$ for every dataset $D$0$score_{BIC}(G2 : D) \geq score_{BIC}(G1 : D)$ for every dataset $D$0$score_{BIC}(G4 : D) \geq score_{BIC}(G1 : D)$ for every dataset $D$0$score_{BIC}(G2 : D) \ne score_{BIC}(G3 : D)$ for every dataset $D$0None of the above is trueNow consider $G2$ and $G3$. We have a dataset $D$ generated from some probability distribution $P$, and the likelihood scores for $G2$ and $G3$ are $score_L(G2 : D)$ and $score_L(G3 : D)$, respectively. Let $\theta^*_2$ and $\theta^*_3$ be the maximum likelihood parameters for each respective network, and let $L(X : G, \theta)$ represent the likelihood of dataset $X$ given the graph $G$ and parameter $\theta$, so $score_L(G2 : D) = L(D : G2, \theta_2^*)$ and $score_L(G3 : D) = L(D : G3, \theta_3^*)$.

Suppose that $L(D : G2, \theta_2^*) > L(D : G3, \theta_3^*)$. If we draw a new dataset $E$ from the distribution $P$, we can guarantee:radio10$\theta^*_2$ and $\theta^*_3$ correspond to the ML estimation from dataset $D$. Given the new dataset $E$, they might not be the ML estimation parameters. Moreover, since the dataset $D$ and $E$ might not be sufficiently large enough to accurately characterize $P$, there is no guarantee on the relation of likelihood scores of the new dataset.0$L(E : G2, \theta_2^*) > L(E : G3, \theta_3^*)$0$L(E : G2, \theta_2^*) < L(E : G3, \theta_3^*)$0$L(E : G2, \theta_2^*) = L(E : G3, \theta_3^*)$1None of theseUsing Decomposability During Search (Mandatory)Suppose we use a greedy hill-climbing procedure to search for a graph structure over some distribution. Given a sufficiently large dataset, can we guarantee to find the true graph structure or its I-equivalence class?radio10Greedy hill-climbing procedure is not guaranteed to find the global maximum, since it may get stuck in a local maximum.1No0YesNow your friend gives you the dataset that she was talking about with the following sufficient statistics:

What is the optimal operation $o$? (Remember that there is a unique optimal operation, so only consider the ones you selected in the previous question.)radio10Based on the previous question, you only need to compute and compare the delta scores for adding edge $B\rightarrow C$ and $C\rightarrow B$. The way to compute the $\delta$ scores is the same as above, and the FamScore using likelihood score is $\textrm{FamScore}_L(X | Pa_X : {\cal D}) = M \times [I_P(X; Pa_X) - H_P(X)]$.0Delete edge $A\rightarrow B$1Add edge $C\rightarrow B$0Add edge $B\rightarrow C$0Add edge $C\rightarrow A$0Add edge $A\rightarrow C$0Reverse edge $A\rightarrow B$We are trying to decide between the following two graphs using the likelihood score.

What is the difference in likelihood scores $score_L(G_1 : D) - score_L(G_2 : D)$ given a dataset $D$ of size $M$ in terms of the entropy $H$ and mutual information $I$? The subscripts below denote empirical values according to $D$ -- for example, $H_D(X)$ is the empirical entropy of the variable $X$ in the dataset $D$. You can find the definition of entropy and mutual information in Section A.1 starting from p. 1136, and the relation to the likelihood score is in section 18.3.1.2 and 18.3.6.2.radio10From the referenced sections in the book, the difference of the likelihood scores between these two graphs are M times the difference of the mutual information of each family. (Note that $I_D(A; B) = I_D(B; A)$.)1$M \times [I_D(D;C) + I_D(E;D) - I_D(B;A) - I_D(D;C;E)]$0$M \times [I_D(D;C) + I_D(E;D) - I_D(A;B) - I_D(D;C;E) - H_D(A,B,C,D,E)]$0$M \times I_D(A;B)$0$M \times [I_D(A;B) - H_D(A,B)]$0$M \times [I_D(C;A,B) + I_D(D;C) + I_D(E;D) - I_D(B;A) - I_D(D;C;E)]$Consider performing graph structure search using a decomposable score. Suppose our current candidate is graph $G$ below.

We want to compute the changes of scores associated with applying three different operations:
• Delete the edge $A\rightarrow D$
• Reverse the edge $C\rightarrow E$
• Add the edge $F\rightarrow E$

Let $\delta(G : o_1), \delta(G : o_2), \delta(G : o_3)$ denote the score changes associated with each of these three operations, respectively. Which of the following equations is/are true for all datasets $\cal{D}$?checkbox10This follows directly from Proposition 18.5 on p. 818.1$\delta(G : o_1) = \textrm{FamScore}(D,\{B,C\} : {\cal D})\ -\ \textrm{FamScore}(D,\{A,B,C\} : {\cal D})$0$\delta(G : o_1) = \textrm{FamScore}(D,\{B,C\} : {\cal D})$0$\delta(G : o_2) = \textrm{FamScore}(C,\{A,E\} : {\cal D})$0$\delta(G : o_2) = \textrm{FamScore}(C,\{A,E\} : {\cal D})\ - \ \textrm{FamScore}(C,\{A\} : {\cal D})$0$\delta(G : o_2) = \textrm{FamScore}(C,\{A,E\} : {\cal D})\ - \ \textrm{FamScore}(C,\{A\} : {\cal D})\ - \ \textrm{FamScore}(E,\{C\} : {\cal D})$0$\delta(G : o_2) = \textrm{FamScore}(C,\{A,E\} : {\cal D})\ - \ \textrm{FamScore}(E,\{C\} : {\cal D})$1$\delta(G : o_3) = \textrm{FamScore}(E,\{C,F\} : {\cal D})\ - \ \textrm{FamScore}(E,\{C\} : {\cal D})$0$\delta(G : o_3) = \textrm{FamScore}(C,\{A,E\} : {\cal D})$0None of theseYour friend is performing greedy hill-climbing structure search over a network with three variables using three possible operations and the likelihood score with dataset $\cal{D}$:
• Delete an edge
• Reverse an edge
She tells you that after examining $\cal{D}$, she took a single step and got the following graph:

She also tells you that for the next step she has determined that there is a unique optimal greedy operation $o$ to take. Which of the following steps could $o$ be?checkbox10Reversing $A\rightarrow B$ doesn't change the score. Adding $A\rightarrow C$ or $C\rightarrow A$ result in the same score (you are told there is only a unique optimal operation). Deleting the edge $A\rightarrow B$ decreases the likelihood score. Adding $B\rightarrow C$ and $C\rightarrow B$ result in different scores, because one triggers a v-structure that involves $A$, and the other does not.0Reverse edge $A\rightarrow B$0Add edge $A\rightarrow C$0Add edge $C\rightarrow A$1Add edge $B\rightarrow C$1Add edge $C\rightarrow B$0Delete edge $A\rightarrow B$Learning with Missing Data (Mandatory)Suppose we are conducting a survey of job offers and salaries for Stanford graduates. In the survey form, each graduating student is asked to fill out his/her major, and to list up to two job offers and salaries he/she received. Which of the following scenarios is/are missing at random (MAR)?checkbox10The definition of MAR is that whether the data is missing is independent of the missing values themselves given the observed values (see p. 854). (a) and (e) are not MAR because whether the data is missing depends on the missing value (the salary). (b) and (d) are MAR because whether the data is missing depends only on the major, which is observed. (c) is MCAR.0(a) Students who accepted a low-salaried job offer tended not to reveal it.1(b) CS students get more offers than other majors and therefore might neglect to list all of them.1(c) The person recording the information accidentally lost some of the completed survey forms.1(d) The person recording the information didn't care about humanities students and neglected to record their salaries.0(e) The person recording the information didn't like people who were paid more than he is, and so removed salaries that were higher than his.0None of theseAnalysis of EM (Mandatory)Learning Bayesian Network parameters with missing data (partially observed instances) is more difficult than learning with complete data for which of the following reasons?checkbox10We do not need to throw out incomplete instances, but there is not a single optimal value for the parameters.0We require more training data, because we must throw out all incomplete instances.1We lose local decomposition, whereby each CPD can be estimated independently.0While there is still always a single optimal value for the parameters, it can only be found using an iterative method.0None of these.In a Bayesian Network with partially observed training data, computing the likelihood of observed data for a given set of parameters...radio10With fully observed data, the likelihood is computed without inference. With missing data, inference is required to complete the expected sufficient statistics for the expected likelihood function.0requires probabilistic inference, AS IN the case of fully observed data.0cannot be achieved by probabilistic inference, while it CAN in the case of fully observed data.1requires probabilistic inference, while it DOES NOT in the case of fully observed data.Alice wants to test her friend Bob's knowledge of probabilistic models. She makes a Bayes net, chooses some specific parameters $\theta^*$, and writes a script in Matlab that generates instances from her network. She adds a "hide" flag to the script that, if set to true, randomly hides some values in each instance.

She gives Bob the script along with a description of the network, which reads $\theta^*$ from an encrypted file so that he can't read it. She wants to see if he can figure out the true parameters $\theta^*$ using her script, which he can use to generate as many instances as he likes. Which of the following is true?radio10If there is missing data, the likelihood function may have multiple global optima, which means it is unidentifiable -- the true parameters $\theta^*$ have the same likelihood as other parameters, so they cannot be guaranteed to be found. If there is no missing data, there is a single global optimum that can be found using standard Bayes net learning techniques.0Bob can guarantee that he can find $\theta^*$ EVEN IF the "hide" flag is set to true.1Bob can guarantee that he can find $\theta^*$ ONLY IF the "hide" flag is set to false.0Bob cannot guarantee that he can find $\theta^*$ no matter what the "hidden" flag is set to.Given the network and data instances shown below, how do we compute the expected sufficient statistics for a particular value of the parameters?

radio10The expected sufficient statistics for the assignment $(x_0,y_0,z_0)$ are the sum of the probability that each instance is consistent with that assignment (which is 0 for all inconsistent instances).0$\bar{M}[x_0,y_0,z_0] = 3$1$\bar{M}[x_0,y_0,z_0] = P(y_0,z_0\mid x_0, \theta) + P(z_0\mid x_0, y_0, \theta) + P(x_0,z_0\mid y_0, \theta)$0$\bar{M}[x_0,y_0,z_0] = P(z_0\mid x_0, \theta) + P(z_0\mid x_0, y_0, \theta) + P(z_0\mid y_0, \theta)$0$\bar{M}[x_0,y_0,z_0] = P(y_0,z_0\mid x_0, \theta) + P(z_0\mid x_0, y_0, \theta) + P(z_0\mid x_1, y_1, \theta) + P(z_0\mid x_0, y_1, \theta) + P(x_0,z_0\mid y_0, \theta)$Value of Perfect Information (Mandatory)Let $G$ be the influence diagram as shown below.

Now consider the following graphs:

What is the value of perfect information for $X_2$ at $D_1$? (Note that $MEU(I)$ denotes the value of the strategy that maximizes expected utility in influence diagram $I$.)radio100$MEU(G) - MEU(G_1)$0$MEU(G) - MEU(G_2)$0$MEU(G) - MEU(G_3)$0$MEU(G) - MEU(G_4)$0$MEU(G_1) - MEU(G)$0$MEU(G_2) - MEU(G)$0$MEU(G_3) - MEU(G)$1$MEU(G_4) - MEU(G)$Independence (Video)In the Student example, how many free parameters are required to describe the full joint distribution, assuming no independence and that each variable is binary?radio0There are 2^3 = 8 entries in the full table for 3 variables with two values each, and since the entries have to sum to 1, only 7 of them are free parameters.00:09:0417060308In the Student example, how many free parameters are required to describe the full joint distribution, assuming that the variables are fully independent and each variable is binary?radio0Each variable needs 2 parameters since it is binary, only 1 of which is free since they must sum to 1. This was mentioned in general near 5:17 of this lecture.00:09:4403140607Reduced Factors (Video)If we condition on the context $B=b$, which of the following happens to the graph?checkbox000:10:431The node for $B$ is removed1The edge $B-A$ is removed1The edge $B-D$ is removed1The edge $B-C$ is removed0The edge $A-D$ is removed0The edge $C-D$ is removedBayes Nets vs. Markov Nets (Video)Does there exist a Bayesian Network that is I-equivalent to the Misconception Markov Network?radio0We will see why not in the video.00:16:441Yes0NoDoes there exist a Bayesian Network that is an I-map of the Misconception Markov Network?radio0The complete graph is an I-map of every distribution. So a Bayes Net that is a complete graph would be an I-map of the Misconception network, but it would not have all of its independencies (since it wouldn't have any independencies).00:16:440No1YesWhich independence is true in $G$ but not in the moralized graph $H$ in the Student example?radio000:11:410$S \perp G \mid I$0$L \perp I \mid G$1$D \perp I$0$D \perp I \mid G$0$S \perp G$Soundness of d-separation (Mandatory)Examine the Insurance Network at this link (open the link in a new tab or window). Which of the following independencies hold(s)?checkbox100(DrivingSkill $\perp$ AntiTheft)1(DrivingSkill $\perp$ AntiTheft $\mid$ SeniorTrain, Age)0(DrivingSkill $\perp$ AntiTheft $\mid$ DrivQuality, DrivHist)0(DrivingSkill $\perp$ AntiTheft $\mid$ SeniorTrain, Age, DrivQuality, DrivHist)I-maps (Mandatory)Suppose $(A \perp B) \in \mathcal{I}(P)$, and $G$ is an I-map of $P$. Is it necessarily true that $(A \perp B) \in \mathcal{I}(G)$radio10If $G$ is an I-map of $P$, that means that all independencies in $G$ are also in $P$, but this doesn't mean that all independencies in $P$ are also in $G$. An easy way to remember this is that the complete graph, which has no independencies, is an I-map of all distributions.1No0YesI-equivalence (Mandatory)In the figure below, graph $G$ is I-equivalent to which other graph? radio10II, III, and IV all have extra immoralities, and therefore extra independencies. I has the same immoralities, and is thus I-equivalent to $G$.0None of the above0IV0III0II1IGraph $G$ (shown below) is a perfect I-map for distribution $P$ ($\mathcal{I}(G)=\mathcal{I}(P)$). Which of the other graphs is an I-map for $P$? radio10I isn't because it has the extra independence $(A \perp C)$.
II has the extra independence $(B \perp C \mid D)$ (among others).
III has no extra independencies.0None of the above1III0II0IIf Bayesian Network $G$ is a simple directed chain $X_1 \rightarrow X_2 \rightarrow ... \rightarrow X_n$ for some number $n$, how many Bayesian Networks are I-equivalent to $G$ other than $G$ itself?radio10The chain $X_1 \leftarrow ... \leftarrow X_i \rightarrow ... \rightarrow X_n$ is I-equivalent, where $i$ can be $2$ through $n$ (when $i=n$, all arrows point left). Thus, there are $n-1$ I-equivalent networks.00011$n-1$0$n$0$2n$Markov Networks (Mandatory)Let $\pi_1[A,B]$, $\pi_2[B,C]$, and $\pi_3[A,C]$ be all of the factors in a particular undirected graphical model. Then $\sum_{A,B,C} \pi_1[A,B] \times \pi_2[B,C] \times \pi_3[A,C]$ ischeckbox10Since the factors can be any positive function, there is no restriction on $Z$, the partition function, other than it being positive.0Less than or equal to 10Equal to 10Greater than or equal to 11Equal to the partition function, $Z$Factors and Gibbs Distributions (Mandatory)Which of the following sets of factors could factorize over the undirected graph below? checkbox10The scope of each factor in the set must be a clique in the graph. So factors like $\phi(A,B,D)$ and $\phi(A,B,C,D)$ can't be involved in a factorization over this graph.1$\phi(A), \phi(B), \phi(C), \phi(D), \phi(E), \phi(F)$1$\phi(A,B), \phi(C,D), \phi(E,F)$1$\phi(A,B,C), \phi(D,E,F)$0$\phi(A,B,D), \phi(C,E,F)$0$\phi(A,B,C,D), \phi(C,D,E,F)$1$\phi(A,B,C), \phi(B,C,D), \phi(C,D,E), \phi(D,E,F)$1$\phi(A,B,C), \phi(A,B), \phi(C,D,E), \phi(E,F), \phi(F)$Probability Operations (Video)Is the value $\textrm{argmax}_a P(a)$ the same as the value for $a$ in $\textrm{argmax}_{ab} P(a,b)$?radio0The answer will now be explained in the video00:15:200Yes1NoWhat is $\sum_{X_2} P(X_1,X_2,X_4)$?radio000:06:020$P(X_2 | X_1,X_4)$0$P(X_1,X_4 | X_2)$0$P(X_2)$1$P(X_1,X_4)$Context-Specific Independencies (Mandatory)Which of the following are context-specific independences that DO exist in the tree CPD below? (Note: Only consider independencies in this CPD, ignoring other possible paths in the network, and there may be multiple answers) checkbox10A variable $X$ is independent of $E$ given conditioning assignments $\bar{z}$ if all paths consistent with $\bar{z}$ traversed in the tree CPD reach a leaf without querying $X$. This is true for the selected options. The first option is wrong because the tree CPD represents $P(E|A,B,C,D)$, so it does not give any information about whether $A$ and $D$ are independent.0None of the above are correct1$(E \perp_c D, B \mid a^1)$0$(E \perp_c C \mid b^0, d^0)$1$(E \perp_c D \mid b^1)$1$(E \perp_c C \mid a^0, b^0)$0$(E \perp_c D \mid a^0)$0$(A \perp_c D \mid B)$In the following Bayesian network, the node B is a deterministic function of its parent A. Which of the following is an independence statement that holds in the network? checkbox10Since $B$ is a deterministic function of $A$, observing $A$ implies that $B$ is also observed, which d-separated $C$ and $D$. Therefore, $(C \perp D \mid A)$.0$(A \perp B \mid C,D)$1$(C \perp D \mid A)$0$(B \perp D \mid C)$0$(A \perp D \mid C)$I-equivalence (Video)The graph $X \rightarrow Y \rightarrow Z$ is I-equivalent to which of the following graphs?checkbox000:04:430$X \ \ \ \ \ Y \ \ \ \ \ Z$0$X \rightarrow Y \leftarrow Z$1$X \leftarrow Y \rightarrow Z$1$X \leftarrow Y \leftarrow Z$I-maps (Video)$P$ factorizes over $G$ if and only if...checkbox000:00:541$P \models I_l(G)$ ($P$ satisfies the local independencies of $G$)0$P \models I(G)$ ($P$ satisfies all independencies of $G$)0$G$ is a tree0$G$ is a complete graphIndependence (Mandatory)How many independent parameters are required to uniquely define the CPD of C in the graphical model below, if A, B, and D are binary, and C and E have three values each? radio10There are 4 possibilities for the values of C's parents (A and B, which are binary). For each of these possibilities, there are 3 possible values for C, which corresponds to 2 free parameters (since the 3 numbers have to sum to 1). So there are 4*2=8 total free parameters.012011180704Bayesian Networks (Video)If instead of observing the class difficulty, we observe that the student did well on the SATs, how does that affect the probability that the course was difficult compared to our previous estimate?radio0This will now be answered and explained in the video.00:22:310Stays the same0Goes down1Goes upOnce we've observed that the grade is a C, if we now observe that the course is difficult, how does the new observation affect our estimate of the probability of the student being smart?radio0This will now be answered and explained in the video.00:19:351Goes up0Goes down0Stays the sameGibbs Sampling (Mandatory)Suppose we have the Bayesian network shown in this image:

If we are sampling the variable $X_{23}$ as a substep of Gibbs sampling, what is the closed form equation for the distribution we should use over the value $x_{23}'$? By closed form, we mean that all computation such as summations are tractable and that we have access to all terms without requiring extra computation.radio10$P(x_{23}' \mid x_{-23})$ is correct but not in closed form because we don't have direct access to this term. To get it, we have to expand it, and since all factors involving variables not in the Markov blanket cancel out (see equation 12.23 in Chapter 12.3.3), we get the correct answer.0$P(x_{23}' \mid x_{22}, x_{24})P(x_{15} \mid x_{23}', x_{14}, x_{9}, x_{25})$0$P(x_{23}' \mid x_{-23})$ where $x_{-23}$ is all variables except $x_{23}$0${\Large \frac{P(x_{23}' \mid x_{22}, x_{24})P(x_{15} \mid x_{23}', x_{14}, x_{9}, x_{25})}{\sum_{x_9'',x_{14}'',x_{22}'',x_{24}'',x_{25}''}P(x_{23}' \mid x_{22}'', x_{24}'')P(x_{15}'' \mid x_{23}', x_{14}'', x_{9}'', x_{25}'')}}$0$P(x_{23}' \mid x_{22}, x_{24})$1${\Large \frac{P(x_{23}' \mid x_{22}, x_{24})P(x_{15} \mid x_{23}', x_{14}, x_{9}, x_{25})}{\sum_{x_{23}''}P(x_{23}'' \mid x_{22}, x_{24})P(x_{15} \mid x_{23}'', x_{14}, x_{9}, x_{25})}}$0None of these; they are all either incorrect or not in closed formSuppose we are running the Gibbs sampling algorithm on the Bayesian network $X\rightarrow Y\rightarrow Z$. If the current sample is $\langle x_0, y_0, z_0 \rangle$ and we sample $y$ as the first substep of the Gibbs sampling process, with what probability will the first subsample be $\langle x_0, y_1, z_0 \rangle$?radio10The probability of a transition in Gibbs sampling is the conditional probability of the variable being sampled given all other variables.0$P(y_1 | x_0)$0$P(x_0, y_1, z_0)$0$P(x_0, z_0 \mid y_1)$1$P(y_1 \mid x_0, z_0)$Local Independencies in BNs (Video)What is $X_i$ not necessarily independent of given its parents?radio0This will now be explained in the video.00:01:420Its parents0Its parents' parents1Its childrenGlobal Independencies in BNs (Video)In the Student network, are $G$ and $S$ conditionally independent when we observe $I$?radio000:08:211Yes0NoIf we have an edge $X \rightarrow Y$ in the graph, is there any set of variables that we can condition on that would necessarily make $X$ and $Y$ independent?radio000:03:261No0YesLocal Independencies in BNs (Video)What are the non-descendants of D?checkbox000:03:440L0G1I1SWhat is $X_i$ not necessarily independent of given its parents?radio000:01:421Its children0Its parents' parentsMarkov Networks (Video)Why is $P(a,b,c,d)$ as defined not a proper probability distribution?checkbox0The individual factors $\phi$ cannot be less than 0, but otherwise their product is not constrained.00:08:350It can be less than 01It is not necessarily between 0 and 11It does not sum to 1 ($\sum_{a,b,c,d} P(a,b,c,d) \ne 1)$Factors and Gibbs Distributions (Video)How many parameters are in $\Phi_T$?radio000:16:321$2k^3$0$k^3$0$2k^2$0$3k^2$0$2k^8$What is the scope of the factor product $\phi(A,B) \times \phi(B,C)$?radio000:05:011{$A,B,C$}0{$B$}0{}What do you think would be a good condition for X and Y to have an undirected edge in the network? (This is just a brainstorming question.)checkbox000:10:590If they never appear together in any factor1If they appear together in some factor $\phi$1If there exists a factor $\phi(X,Y)$Independence in Markov Nets (Video)What is the difference between d-separation and separation as defined here?radio000:03:550d-separation is for Bayes Nets and separation is for Markov Nets, but otherwise they are the same1They are different in the case where $Z$ is the descendant of a v-structure0They are different in the case where $Z$ is the mutual ancestor of $X$ and $Y$Intuitively, what independence property does this correspond to?radio000:01:580$A \perp C$1$A \perp C \mid B,D$0$A \perp B,D \mid C$Concrete Temporal Models and DBNs (Mandatory)In which of the following 2-TBNs will $(X^{(t)} \perp Z^{(t)} \mid Y^{(t)})$ hold for all $t$, assuming $Obs^{(t)}$ is observed for all $t$ and the other variables are never observed? (Hint: unroll each 2-TBN) checkbox10(a) is incorrect because there is still an active path from $X^{(t)}$ to $Z^{(t)}$ through the previous time step variables $(X^{(t)} \leftarrow X^{(t-1)} \rightarrow Y^{(t-1)} \rightarrow Z^{(t-1)} \rightarrow Z)$. (c) is incorrect because of active path $X^{(t)} \rightarrow Y^{(t)} \rightarrow Obs^{(t)} \leftarrow Z^{(t)}$.0None of the above0(c)1(b)0(a)In the following DBN, which of the following independence assumptions are TRUE? checkbox10$(O^{(t)} \perp O^{(t-1)})$ is wrong because there is an active trail from $O^{(t)}$ to $O^{(t+1)}$ through $X^{(t)}$ and $X^{(t+1)}$.1$(O^{(t)} \perp X^{(t-1)}, X^{(t+1)} \mid X^{(t)})$1$(X^{(t-1)} \perp X^{(t+1)} \mid X^{(t)})$0$(O^{(t)} \perp O^{(t-1)})$1$(O^{(t)} \perp O^{(t-1)} \mid X^{(t)})$Which of the time-series graphs satisfies the Markov assumption? checkbox10(b) is the only time-series graph in which all variable in each time slice are independent of all variables in time slices at least 2 time slices before, given all variables in the previous time slice. In (a) this fails because of the direct edges. In (c) it fails because of the backwards edges, which, when given the previous time slice create an active v-structure.0None of the above0(c)1(b)0(a)Plate Models (Mandatory)Consider the plate model below (with edges removed). Which of the following might a given instance of $X$ possibly represent in the grounded model? (Keep in mind that this question addresses the variable's semantics, not its CPD) radio10In the grounded model, there will be an instance of $X$ for each combination of Teacher and Class. Thus, we are looking at a random variable that will say something about a specific teacher and class combination. The correct answer is the only one that does this.0Whether a specific teacher T is a tough grader0Whether a teacher with expertise E taught a course of difficulty D0Whether a teacher with expertise E taught a course of difficulty D at school S0Whether a teacher with expertise E taught a course of difficulty D at a school in location L1Whether a specific teacher T taught a specific course C at school S0Whether a specific course C at school S is difficultWhich of the following is a valid grounded model for the plate shown? checkbox10(a) is incorrect because only the variables in the plate get replicated when the plate model is grounded, and $\theta$ is not in the plate.0None of the above1(c)0(b)0(a)Consider the plate model shown below. Assume we are given $K$ Markets, $L$ Products, $M$ Consumers and $N$ Locations. What is the total number of instances of the variable $P$ in the grounded BN? radio10There will be one grounded instance of $P$ for each combination of Market plate, Consumer plate, and product plate. There will be $K \cdot L \cdot M$ of these combinations.0$K + L + M$1$K \cdot L \cdot M$0$K \cdot L \cdot M \cdot N$0$(L \cdot M)^K$Consider the following scenario:
On each campus there are several Computer Science students and several Psychology students (each student belongs to one or the other group). We have a binary variable $L$ for whether the campus is large, a binary variable $S$ for whether the CS student is shy, a binary variable $C$ for whether the Psychology student likes computers, and a binary variable $F$ for whether the Computer Science student is friends with the Psychology student.
Which of the following plate models can represent this scenario? radio10(B) is wrong because there is an $F$ variable for every pair of CS and Psych students, so $F$ must be inside both plates. (C) is wrong because $L$ is a property of the single campus and doesn't belong within the plates of students. (D) is wrong because it only has one plate for the students, but the set of CS students (one plate) is different than the set of Psych students (a second plate).1(A)0(B)0(C)0(D)General Definition of Variable Elimination (Mandatory)Consider running variable elimination on the following Bayesian network over binary variables. The elimination of which node results in the largest factor? radio10Eliminating $X_5$ results in the intermediate factor $\tau(X_2,X_3,X_4,X_6)$, which is larger than for any of the other options.0$X_2$0$X_4$1$X_5$0$X_6$Before answering the following two questions, you should perform Variable Elimination on the graph shown here with the variable ordering $B,A,C,F,E,D$:

Using this ordering, what is the intermediate factor (denoted by $\psi$'s in the book) produced by the third step (just before $C$ is eliminated)?radio10After eliminating $B$ we have a new factor $\phi_1(A, D, C)$, and after eliminating $A$, the factor becomes $\phi_2(D, C)$, then when eliminating $C$, the intermediate factor is $\psi(C,D,F) = \phi_2(D, C)P(F|C)$.0$\psi(D,E,F)$1$\psi(C,D,F)$0$\psi(C,E,F)$0$\psi(A,C,D,F)$Which of the following is the induced graph over the previous run of Variable Elimination?
radio10See the explanation in the book (9.4.2.3) for how to construct an induced graph from a variable elimination ordering.1I0II0III0None of theseConsider the following function: $$t(X,Y) = \sum_Z P(Z\mid X,Y)\sum_W P(X\mid W)P(Y\mid W)$$ Because this function does not represent any meaningful probability or conditional probability, it cannot be a factor produced during Variable Elimination.radio10Factors produced during Variable Elimination do not necessarily correspond to meaningful probabilities (see book section 9.3.1.3).0True1FalseParticle Based Methods (Mandatory)Consider the process of rejection sampling to generate samples from the posterior distribution $P(X \mid e)$. If we want to keep $M$ samples, what is the expected number of samples that would need to be drawn from $P(X)$?radio10This is explained in Chapter 12.1.3.0$M \cdot P(X \mid e)$0$M \cdot (1 - P(X \mid e))$0$M \cdot P(e)$0$M \cdot (1 -P(e))$1$M / P(e)$0$M / (1 - P(e))$Likelihood Weighting (Mandatory)Consider using likelihood weighting sampling in the network below to estimate the conditional probability $P(A\mid C=0)$.

Assume we have just sampled a particle with $A = 0$. What will be the weight of the particle?radio10The weight is the probability of the sample given the evidence, which is simply $P(C\mid A=0)$.00.0210.100.200.50None of theseImportance Sampling (Mandatory)Consider using unnormalized importance sampling to estimate the expected value of some function relative to $P(X,Y)$ defined by a Bayesian network over binary-valued variables, $X$ and $Y$.

Suppose we sample from a proposal distribution $Q(X,Y) = Q(X)Q(Y)$ with $Q(x0) = Q(y0) = 0.6$. We draw sample $\langle x1, y0 \rangle$. What correction weight should be applied to the sample?radio10In unnormalized importance sampling, $w\langle x1,y0\rangle = P'(x1,y0)/Q(x1,y0) = P'(x1)\cdot P'(y0|x1)/(Q(x1)\cdot Q(y0)) = (0.4*0.7)/(0.4*0.6).$0$(0.6 * 0.5) / (0.4 * 0.6)$0$(0.4 * 0.6) / (0.6 * 0.5)$1(0.4 * 0.7) / (0.4 * 0.6)0(0.4 * 0.6) / (0.4 * 0.7)0None of theseIf we are doing importance sampling in the following Bayesian Network, which operation(s) is/are necessary to create a mutilated network for evidence $C=c$?

checkbox10To make the mutilated network, we remove the parents of each evidence variable and set its CPD to be deterministically consistent with the evidence assignment.0remove edges $C \rightarrow D, C \rightarrow E$1remove edges $A \rightarrow C, B \rightarrow C$0alter the CPDs of $D$ and $E$0remove edge $B \rightarrow E$1alter the CPD of $C$0alter the CPD of $E$0None of theseNormalized Importance Sampling (Mandatory)Suppose you have a Bayesian network $\cal{B}$ over variables $\cal{X}$ and some observations ${\bf E} = {\bf e}$. You want to know the probability of the observed variables given the evidence. It is intractable to compute $P(\cal{X} \mid {\bf E})$ from the network directly, so you want to estimate it using normalized importance sampling. Recall that in order to do this, you must have some unnormalized distribution $\tilde{P}$ and a proposal distribution $Q$. If $x[m]$ is a sample assignment to all variables in $\cal{X}$ that is consistent with the evidence ${\bf e}$, which of the following can you use for $\tilde{P}(x[m])$ in all circumstances?checkbox10$\prod_{X_i \in \cal{X}} P_{\cal B}(x[m]\mid Pa_{X_i}[m])$ satisfies the requirement of being proportional to the target distribution $P(\cal{X} \mid {\bf E})$. The other three answers can be intractable to compute -- the first one by definition of the problem, and the other two because the summation may be intractable (they are also not proportional to $P(\cal{X} \mid {\bf E})$).0$\prod_{X_i \in \cal{X} - \bf{E}} P_{\cal B}(x[m]\mid Pa_{X_i}[m])$1$\prod_{X_i \in \cal{X}} P_{\cal B}(x[m]\mid Pa_{X_i}[m])$0$P_{\cal B}(x[m]\mid {\bf e})$0$\prod_{X_i \in \cal{X} - \bf{E}} P_{\cal B}(x[m]\mid Pa_{X_i}[m],\bf{e})$Suppose that the variables $\cal{X}$ represent different weather conditions (temperature, barometric pressure, humidity, etc.) on a particular day, and that the evidence $\bf{e}$ corresponds to the day being sunny with a temperature of 70 degrees. Which of the following can you use in all circumstances for the proposal distribution $Q(x[m])$ for normalized importance sampling?checkbox10The requirement is that $Q(x[m])$ be possible to sample from and that it be possible to compute so that the weight can be computed. The correct answers match these criteria. $P(x[m] \mid {\bf e})$ cannot be sampled from in general (which is why we're doing importance sampling). The weather.com sampling process can be sampled from, but the probability cannot be computed.0$P(x[m])$1Uniform distribution over each $X_i \notin \bf{E}$ separately0Use the following sampling process: Have a friend go on weather.com and randomly select a city and a day in the past that agrees with your evidence (sunny and 70 degrees), and tell you the rest of the weather information in $\cal{X}$ from that day1Forward sampling from the mutilated network $\cal{B}_{\bf{E}=\bf{e}}$Markov Chain Stationary Distributions (Mandatory)(Note: There were originally 10 total questions for this week, including 2 on this quiz. We removed 1 on this quiz, so there are now a total of 9 questions.)

Consider the simple Markov chain shown in the figure below.

By definition, a stationary distribution $\pi$ for this chain must satisfy which of the following properties?checkbox10The first answer reflects the property that the probability of a given state is the sum of probabilities of each other state times the transition probability to this state. The final condition is true of all normalized distributions.1$\pi(x_1) = 0.2\pi(x_1) + 0.3\pi(x_3)$0$\pi(x_1) = 0.2\pi(x_1) + 0.4\pi(x_2) + 0.4\pi(x_3)$0$\pi(x_3) = 0.3\pi(x_1) + 0.7\pi(x_3)$0$\pi(x_3) = 0.4\pi(x_1) + 0.5\pi(x_2)$0$\pi(x_1) = \pi(x_2) = \pi(x_3)$1$\pi(x_1) + \pi(x_2) + \pi(x_3) = 1$Properties of BP at Convergence (Video)Which term can be taken out of the sum?radio00:05:141$\beta_i$1$1/\delta_{ji}$0$\beta_i \cdot \delta_{ij}$0$\beta_i/\delta_{ij}$What is the consequence of this analysis?radioUnlike for clique trees, the sepset is not necessarily the intersection of variables in neighboring clusters. This analysis shows only that the beliefs agree on the sepset variables.00:07:010The beliefs of neighboring clusters agree on all variables1The beliefs of neighboring clusters agree on the variables in the sepset0The beliefs of neighboring clusters agree on all variables they have in commonImportance Sampling (Video)If we have a Bayes Net for $P(\cdot)$, why is $P(x \mid e)$ hard to sample from directly in general?radio000:01:051Because it requires summing over many variables at once, which may be intractable0Because multiplying the factors together may be intractable0It's not; you can always sample from $P(x\mid e)$Assuming this equation is true, what have we accomplished?radio0$P$ was hard to sample from, but by using the "correction" of the dataset sampled from $Q$, it is as if we have samples directly from $P$.00:05:100Estimated how close Q is to P0Generated samples from P, which we couldn't sample from initially1Generated a weighted dataset that matches the distribution of P, which we couldn't sample from initallyHow do we change the graph structure related to each variable $Z_i$?radio000:10:000Remove its children0Add children1Remove its parents0Add parentsThis distribution is hard to estimate because it requires summing out a lot of variables, which may be intractable. What proposal distribution $Q$ do we sample from instead to get a dataset that will give us an estimate of $P(\bar{y}\mid \bar{z})$?radio000:12:440$P(\bar{y},\bar{z})$0The mutilated network given the target assignments $\bar{y}$1The mutilated network given the evidence $\bar{z}$Normalized Importance Sampling (Video)Suppose $\mathcal{X} = \bar{Y} \cup \bar{E}$. Which of the following is a scenario that meets this criteria?radio0$P(\bar{x}\mid \bar{e})$ can be hard to sample from (again, because of the summation). $P(\bar{x}, \bar{e})$ is an unnormalized version of $P(\bar{x}\mid \bar{e})$ because $P(\bar{x}\mid \bar{e})$ = $P(\bar{x},\bar{e})/P(\bar{e}).$00:01:500Want to compute $P(\bar{x}, \bar{e}$), can compute $\tilde{P} = P(\bar{x}\mid \bar{e})$1Want to compute $P(\bar{x}\mid \bar{e})$, can compute $\tilde{P} = P(\bar{x},\bar{e})$0Want to compute $P(\bar{e})$, can compute $\tilde{P} = P(\bar{x})$What is the empirical expectation of $\tilde{w}$, $E_Q[\tilde{w}]$?radio0The empirical expectation according to $Q$ is simply the sum of samples generated by $Q$.00:07:120$\frac{1}{M}\sum_m f(\xi[m])$1$\frac{1}{M}\sum_m \tilde{w}(\xi[m])$0$\frac{1}{M}\sum_m Q(\xi[m])$Markov Chain Stationary Distributions (Video)What is $\pi(x^2)$ equal to?radio000:04:060$0.3 \pi(x_3)$0$0.5 \pi(x_3)$0$0.7 \pi(x_2) + 0.3 \pi(x_3)$1$0.7 \pi(x_2) + 0.5 \pi(x_3)$What other condition would ensure regularity?radio0The reason these answers are wrong is...00:12:200At least one variable has a self-loop with nonzero probability1Each variable has a self-loop with nonzero probability0Each variable has either a self-loop or is connected to all other variablesImproving Convergence of BP (Mandatory)Which of the following idea(s) can be used to help handle the problem of non-convergence of Generalized Belief Propagation?checkbox10See Box 11.B for more information.1Stopping BP after a fixed amount of time regardless of complete convergence0Removing edges from the cluster graph until convergence is achieved1Intelligent message scheduling heuristics1Dampening changes to messages in successive iterations0None of theseProperties of BP at Convergence (Mandatory)Consider the pairwise MRF, H, shown below with potentials over {A,B}, {B,C}, {A,D}, {B,E}, {C,F}, {D,E} and {E,F}.

Which of the following is/are valid cluster graph(s) for H? (A cluster graph is VALID means it admits inference via belief propagation.)checkbox10(a) is correct because a single cluster with all variables is always a valid cluster graph for any distribution. (c) and (f) also satisfy cluster graph RIP and family preservation. (b), (d), and (e) don't satisfy family preservation.1(a)
0(b)
1(c)
0(d)
0(e)
1(f)
Complexity of Variable Elimination (Mandatory)In a chain Bayesian Network of $n$ variables, each of which can take on $k$ values, what is the computational cost of Variable Elimination if we start from one end and go to the other?radio10When eliminating $X_1$, we sum out $X_1$ from $P(X1)P(X2|X1)$ to get $\phi_1(X_2)$. For each value of $X_2$, we have to do $k$ multiplications and $k-1$ summations, which is $O(k)$. Since $X_2$ can take $k$ different values, to get $\phi_1(X_2)$, the computation complexity is $O(k^2)$. The process continues for each $X_i$, so in total we have $O(nk^2)$.0$O(nk)$0$O(k^n)$0$O((n+1)k)$1$O(nk^2)$In the previous question, can the elimination ordering (e.g., not necessarily from one end to the other) make a difference to the complexity?radio10If we start by eliminating $X_2$, we create an intermediate factor over $X_1, X_2, X_3$, and continue from $X_3$ to the end and then $X_1$. Since the scope of the intermediate factor involves three variables, the complexity would be $O(nk^3)$.1Yes0NoClique Tree Calibration (Mandatory)Consider the following Markov Network over potentials $\phi_{A,B}, \phi_{B,C}, \phi_{A,D}, \phi_{B,E}, \phi_{C,F}, \phi_{D,E},$ and $\phi_{E,F}$:

Which of the following properties are necessary for a valid clique tree for the above network, but are NOT satisfied by this graph:
checkbox10Family Preservation is violated because $\phi_{B,E}$ cannot be assigned anywhere. The graph contains a loop so it's not a clique tree. It does, however, satisfy the Running Intersection Property. Finally, having an in-degree less than or equal to 1 is not a necessary condition of a clique tree.0Running Intersection Property0In-degree is less than or equal to 11Family Preservation1No loops0This graph is a valid clique tree for the Markov NetworkSuppose we have a factor $P(A \mid C)$ that we wish to include in our sum-product message passing inference. We should:radio10Definition 10.1 in the book (specifically Family Preservation) explains that the proper construction of a clique tree (cluster graph) requires assigning each factor to one cluster whose scope contains the scope of the factor.0Assign the factor to ALL cliques that contain $A$ OR $C$0Assign the factor to ONE clique that contain $A$ OR $C$0Assign the factor to ALL cliques that contain $A$ AND $C$1Assign the factor to ONE clique that contain $A$ AND $C$0None of theseIn the clique tree above, what is the correct form of the message from clique 3 to clique 2, $\delta_{3\rightarrow 2}$, where $\psi_i(C_i)$ is the initial potential of clique $i$.radio10We compute the message $\delta{3\rightarrow 2}$ by multiplying its initial potential ($\psi_3(C_3)$) by all its incoming messages ($\delta{4\rightarrow 3}$ and $\delta{5\rightarrow 3}$) except the one from $C_2$ ($\delta{2\rightarrow 3}$), and then eliminate the variables in $C_3-S_{3,2}$ (where $C_3=\{B,D,H,G\}$ and $S_{3,2} = \{B,D\}$ is the sepset.)0$\sum_{B,D} \psi_3(C_3) \times \sum_{D,H} \left(\psi_4(C_4) \times \delta_{4\rightarrow 3} \right)\times \sum_{B,H} \left(\psi_5(C_5) \times \delta_{5\rightarrow 3} \right)$0None of these0$\sum_{B,D,G,H} \psi_3(C_3) \times \delta_{4\rightarrow 3} \times \delta_{5\rightarrow 3}$0$\sum_{B,D} \psi_3(C_3)$0$\sum_{G,H} \psi_3(C_3) \times \delta_{2\rightarrow 3}$0$\sum_{B,D} \psi_3(C_3) \times \delta_{4\rightarrow 3} \times \delta_{5\rightarrow 3}$1$\sum_{G,H} \psi_3(C_3) \times \delta_{4\rightarrow 3} \times \delta_{5\rightarrow 3}$In the clique tree below which of the following starting message-passing orders is/are valid? (Note: These are not necessarily full sweeps that result in calibration.)
checkbox10In $C_1\rightarrow C_2, C_2\rightarrow C_3, C_5\rightarrow C_3, C_3\rightarrow C_4$, $C_4$ is the root of the clique tree and all messages are passing upward. In $C_4\rightarrow C_3, C_5\rightarrow C_3, C_3\rightarrow C_2, C_1\rightarrow C_2$, $C_2$ is the root and all messages are passing upward.0$C_4\rightarrow C_3, C_3\rightarrow C_2, C_2\rightarrow C_1$0$C_4\rightarrow C_3, C_5\rightarrow C_3, C_2\rightarrow C_3$1$C_1\rightarrow C_2, C_2\rightarrow C_3, C_5\rightarrow C_3, C_3\rightarrow C_4$0$C_4\rightarrow C_3, C_3\rightarrow C_5, C_3\rightarrow C_2, C_1\rightarrow C_2$1$C_4\rightarrow C_3, C_5\rightarrow C_3, C_3\rightarrow C_2, C_1\rightarrow C_2$0None of these0$C_1\rightarrow C_2, C_2\rightarrow C_3, C_3\rightarrow C_4, C_3\rightarrow C_5$Inference Tasks (Video)Why is the expression $$\sum_{\bar{W}}P(\bar{Y},\bar{W},\bar{e})$$ hard to compute in general?checkbox0The summation over all values of $\bar{W}$ is exponential. If $\bar{W}$ has 100 binary variables, then summing will take $2^{100}$ operations. $P(\bar{Y},\bar{W},\bar{e})$ is always easy to compute because it is just the product of all CPDs.00:04:030It may be hard to compute $P(\bar{Y},\bar{W},\bar{e})$1It may be hard to sum over $\bar{W}$0It isn't hard; it's always tractableGeneral Definition of Variable Elimination (Video)If we want to eliminate $S$ next, which factors should we multiply together?checkbox0The factors that involve $S$ are multiplied together (and then $S$ is summed out).00:14:271$\phi_S(S)$1$\phi_B(B,S)$1$\phi_L(L,S)$0$\phi_D(D,A,B)$0$\phi_A(A,T,L)$0$\phi_X(X,A)$0$\tau_1(T)$Which variables do we sum over when performing Variable Elimination with evidence?radio0We are trying to compute $$P(\bar{Y}\mid \bar{e} = P(\bar{Y},\bar{e})/P(\bar{e})$$ (where $\bar{Y}$ are the query variables and $\bar{E}$ are the evidence). Computing the numerator involves summing out everything except for the query and evidence variables, and computing the denominator requires further summing over the query variables so that the entire equation sums to 1.00:23:470All except the query and evidence variables1All except the evidence variables0All except the query variables0All variablesWhat is this expression equal to: $$\tau_1(B) \times \phi_C(B,C) \times \phi_D(C,D) \times \phi_E(D,E)$$radio0It resulted from multiplying all factors together (which is the joint distribution) and then summing out $A$. So it's equal to $\sum_A P(A,B,C,D,E) = P(B,C,D,E)$.00:07:380$P(A,B,C,D,E)$1$P(B,C,D,E)$0Nothing meaningful; it is just a product of factorsVariable Elimination on a Chain (Video)How many entries are in a table describing the full joint distribution, where each of the $n$ variables has $k$ possible values?radio000:08:031$k^n$0$nk^2$0$kn^2$0$n^k$Why do you think we've gone through this process of pushing through summations rather than just multiplying all the terms together first and then summing everything out together?checkbox0We will now see why pushing through summations saves computations and results in faster running time. Also, we are saving space because if we first multiplied everything together, we would get a factor over $(a,b,c,d,e)$, and if each variable is binary then this factor is of size $2^5$. When we push through summations, we never have a factor whose scope is greater than 2 variables, so we never need to store anything of size greater than $2^2$.00:04:481Memory storage efficiency1Fun1Running time efficiencyIf we sum out $b$ next, what will the next equation look like after pushing terms through the summation?radio0This will now be shown in the video (after momentary technical difficulties).00:03:230$$\sum_d \sum_c P(c\mid b) P(d\mid c) P(e\mid d) \sum_bP(b)$$0$$\sum_d \sum_c P(c\mid b) P(b)\sum_b P(d\mid c) P(e\mid d)$$1$$\sum_d \sum_c P(d\mid c) P(e\mid d) \sum_b P(c\mid b) P(b)$$Complexity of Variable Elimination (Video)How many times does each entry in $\psi_i$ get added in to one of the $\tau_i$ entries?radio0We are computing a single sum over some entries in $\psi_i$, so each entry is involved in the sum at most once.00:05:150We can't know0At least once0Exactly once1At most onceEliminating which of the following variables introduces a new edge (fill edge) into the induced graph?radio0These steps will now be shown in the video.00:12:210$A$1None of these0$X$0$D$0$B$0$T$Running Intersection Property (Video)Suppose that we changed the scope of cluster $(B,L,A)$ to $(B,L)$. The graph would then not satisfy the Running Intersection Property because of $A$.
Intuitively, why would this break the guarantee that message-passing to the root $(D,B,A)$ would produce the correct belief $\tilde{P}_\Phi(D,B,A)$?radio0All initial factors can still be assigned and the algorithm can still be run, but marginalizing out $A$ before reaching the root would lose information about $A$. This is just an intuitive explanation (a similar explanation is given at around 7:30 in the video), but the proof makes this precise.00:05:050Because it would be impossible to assign all initial factors to the cluster tree.0Because it would become impossible to run the Variable Elimination algorithm and pass all messages to the root.1Because the message from $(T,L,A)$ to $(B,L)$ would lose information about $A$ by marginalizing it out, but that information needs to be preserved for the final belief $\tilde{P}_\Phi(D,B,A)$.Which property tells us that none of the variables in $S_{ij}$ are in $V_i$?radio0$S_{ij}$ separates cluster $C_i$ from the left blob and $C_j$ from the right blob. Since by definition none of the variables in $V_i$ appear in the right blob, no variable $X \in V_i$ is in $C_j$. So the fact that $S_{ij} = C_i \cap C_j$ means that $X \notin S_{ij}$.00:11:300Running Intersection Property1Definition of sepsets0Family Preservation0The graph represents a proper distributionClique Tree Calibration (Video)Which messages are the same when $(B,L,A)$ is the root as when $(D,B,A)$ is the root?checkbox0The only thing that is different is that now a message is passed from $(D,B,A)$ to $(B,L,A)$, and the message in the opposite direction is irrelevant. All other messages remain exactly the same.00:04:471$(T,V)$ to $(T,L,A)$1$(T,L,A)$ to $(B,L,A)$1$(B,L,S)$ to $(B,L,A)$0$(B,L,A)$ to $(D,B,A)$1$(X,A)$ to $(D,B,A)$Suppose we did a run of the asynchronous algorithm. It would be equivalent to an up-down message passing run with which cluster as root?radio0This will be explained now in the video.00:11:240Any leaf could be considered the root0All clusters could be considered the root1Whichever cluster was the first to receive all its incoming messagesWe saw at the beginning of this chunk that a naive algorithm for computing all marginals $P(X_i)$ in a Bayes net would run in time $O(kc)$, where $k$ is the number of clusters and $c$ is the cost of a single run of Variable Elimination on the clique tree. How much computation does calibration (which also achieves this) take?radio0We have shown that an upward and a downward pass are enough for full calibration, and each takes time $c$, so it takes $2c$ computation to fully calibrate the graph, which allows us to get all the marginals.00:18:230$c$1$2c$0$\frac{k}{2}c$0$k$Clique Tree Invariant (Video)Which of the following steps can we take next?radio0Since none of the variables in $\delta_{ji}$ are being summed over, we can move it out of the summation.00:04:461Move $\delta_{ji}$ out of the summation0Move $\psi_i$ out of the summation0Move the product over $\delta_{ki}$ out of the summation0Remove $\delta_{ji}$ since it sums to $1$0Remove $\psi_i$ since it sums to $1$How many times does each message appear in the numerator and the denominator?radio0This will now be explained in the video.00:09:191Once in the numerator, once in the denominator0Once in the numerator, twice in the denominator0Twice in the numerator, once in the denominator0Twice in the numerator, twice in the denominator0Can't determine exactlyTracking in Temporal Models (Mandatory)Which of the following template clique-trees allows for exact filtering in the given 2-TBN? (See section 15.2.4 in the book. Remember the requirements for it to be a valid clique tree.)
checkbox10B violates running intersection with the $C$ variable, C violates family preservation, as $\phi(A,B,D,B')$ does not have a clique to call home, D and E do not have the correct interfaces at one end or the other of the clique (first clique must have $(A,B,C,D)$, last must have $(A',B',C',D')$).1A0B0C0D0E