Hongxu Xu

PhD Student (May 2025 – Present)
Supervised by Prof. Chengnian Sun
Cheriton School of Computer Science
University of Waterloo, Canada

B.Sc. in Computer Science (Sep 2014 – Jun 2018)
Beijing Normal University, China

Research Interests

Software Engineering
Programming Languages

without a focus yet…

Socials

GitHub: https://github.com/xuhongxu96
LinkedIn: https://www.linkedin.com/in/xuhongxu/
Email: h4️⃣4️⃣5️⃣xu 🌀 uwaterloo.ca
微信公众号: xuhongxu_it
(Chinese only. Scan the QR code to follow my WeChat Official Account.)

Misc.

Chinese Name: 许宏旭
Pronunciation: Hong-shyoo (IPA: [xʊŋ ɕy])

Publications

Books

Feb 2024 CMake构建实战：项目开发卷 (CMake Build Practice: Project Development Volume)
Published by 人民邮电出版社 (Posts & Telecom Press, China)
Produced by 异步图书 (epubit)

Experience

PhD Student (May 2025 – Present)

Supervised by Prof. Chengnian Sun

Cheriton School of Computer Science
University of Waterloo, Canada

Senior SDE (Sep 2022 – Mar 2025)

Seed/Data Speech Team
ByteDance, Shanghai, China

Led the development of the Text-to-Speech engine and contributed to the Doubao AI assistant application.

SDE-2 (Aug 2021 – Sep 2022)

MSAI Team
Microsoft STC-Asia, Suzhou, China

Led the development of Microsoft WordBreaker and initiated a modern NLP toolkit for Office 365.

SDE (Jul 2018 – Aug 2021)

MSAI Team
Microsoft STC-Asia
Beijing, China (Relocated to Suzhou, Jiangsu, China in May 2019)

Worked on 20-year-old Microsoft WordBreaker.

Short-Term Contributor (Nov 2020 – Jan 2021)

Windows APS Team (temporary assignment)
Microsoft STC-Asia, Suzhou, China

Contributed to the formation of the new team, and Windows 11 application development.

B.Sc. in Computer Science (Sep 2014 – Jun 2018)

Beijing Normal University, China

SDE Intern (Jul 2017 – Dec 2017)

Bing Search Relevance Team
Microsoft STC-Asia, Beijing, China

Answer triggering model development for Bing Search.

Awards

Outstanding Graduate, Beijing Normal University, 2018
Top Ten Volunteer, Beijing Normal University, 2015
First Prize, National Olympiad in Informatics in Provinces (NOIP), 2013

Teaching

Instructional Apprentice, CS 246 - Object-Oriented Software Development, Fall 2025
Teaching Assistant, CS 246 - Object-Oriented Software Development, Spring 2025

CS 452/652: Real-Time Programming (the Train Course)

Spring 25

I dropped this course after I learned that it is really time-consuming.

However, before I dropped it, I did some research and found some useful resources, which I summarized in MarklinSim All-in-One repo.

In short, follow the readme in the repo to set up:

MarklinSim (an emulator for the train kit)
QEMU (an emulator for the Raspberry Pi 4B)
Example Image (your real-time system).

I believe it will save you a lot of time. Good luck!

CS 480/680: Introduction to Machine Learning

Spring 25

This is not a hard course.

Download PDF

\documentclass{article}
\usepackage{fullpage}
\usepackage{float}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{mathtools}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{url, hyperref}
\usepackage{algorithm2e}
\usepackage[margin=0.25in]{geometry}
\usepackage{pgfplots}
\pgfplotsset{width=10cm,compat=1.18}

\usepgfplotslibrary{external}
\tikzexternalize

\lstset{basicstyle=\fontfamily{pcr}\footnotesize}

\graphicspath{./}

\newcommand{\paren}[1]{\left( #1 \right)}
\newcommand{\iprod}[1]{\left\langle #1 \right\rangle}

\newcommand*\MY@rightharpoonupfill@{%
    \arrowfill@\relbar\relbar\rightharpoonup
}
\newcommand*\overrightharpoon{%
    \mathpalette{\overarrow@\MY@rightharpoonupfill@}%
}

\theoremstyle{definition}
\newtheorem{question}{Question}

\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator{\diag}{diag}
\DeclareMathOperator{\sign}{sign}
\DeclareMathOperator{\Var}{Var}
\DeclarePairedDelimiter\ceil{\lceil}{\rceil}
\DeclarePairedDelimiter\floor{\lfloor}{\rfloor}
\newcommand{\llnorm}[1]{\left\| #1 \right\|_2}
\newcommand{\mnorm}[1]{\left\| #1 \right\|_1}
\newcommand{\fnorm}[1]{\left\| #1 \right\|_F}

\title{\large CS480/680, Spring 2025\\\huge Review Notes}

\author{}
\date{\today}
\setlength\parindent{0pt}

%\input{preamble}

\begin{document}

\maketitle

\newpage

\part{For Mid-Term}

\section{Perceptron}

\begin{question}[Linear Function]
    \begin{equation}
        \begin{split}
            & \forall \alpha \beta \in \mathbb{R}, \forall \boldsymbol{x}, \boldsymbol{z} \in \mathbb{R}^d, 
            f(\alpha \boldsymbol{x} + \beta \boldsymbol{z}) = \alpha \cdot f(\boldsymbol{X}) + \beta \cdot f(\boldsymbol{z}) \\
            \Longleftrightarrow
            & \exists \boldsymbol{w} \in \mathbb{R}^d, f(\boldsymbol{x}) = \iprod{\boldsymbol{x}, \boldsymbol{w}}
        \end{split}
    \end{equation}

    \textbf{Proof $\Rightarrow$:}
    Let $\boldsymbol{w} \coloneq \left[ f(\boldsymbol{e_1}), \dots, f(\boldsymbol{e_d}) \right]^T$, where
    $\boldsymbol{e_i}$ is the $i$-th coordinate vector.
    \begin{equation}
        \begin{split}
            f(\boldsymbol{x}) & = f( x_1 \boldsymbol{e_1} + \dots + x_d \boldsymbol{e_d} ) \\
            & = x_1 f(\boldsymbol{e_1}) + \dots + x_d f(\boldsymbol{e_d}) \\
            & = \iprod{\boldsymbol{x}, \boldsymbol{w}}
        \end{split}
    \end{equation}

    \textbf{Proof $\Leftarrow$:}
    \begin{equation}
        \begin{split}
            f(\alpha \boldsymbol{x} + \beta \boldsymbol{z}) & = \iprod{\alpha \boldsymbol{x} + \beta \boldsymbol{z}, \boldsymbol{w}} \\
            & = \iprod{\alpha \boldsymbol{x}, \boldsymbol{w}} + \iprod{\beta \boldsymbol{z}, \boldsymbol{w}} \\
            & = \alpha \iprod{\boldsymbol{x}, \boldsymbol{w}} + \beta \iprod{\boldsymbol{z}, \boldsymbol{w}} \\
            & = \alpha f(\boldsymbol{x}) + \beta f(\boldsymbol{z})
        \end{split}
    \end{equation}
\end{question}

\begin{question}[$\boldsymbol{w}$ is Orthogonal to Decision Boundary $H$]
    Any vector on $H$ can be written as $\overrightharpoon{\boldsymbol{x}\boldsymbol{x'}} = \boldsymbol{x'} - \boldsymbol{x}$,
    \begin{equation}
            \iprod{\boldsymbol{x'} - \boldsymbol{x}, \boldsymbol{w}} 
            = \iprod{\boldsymbol{x'}, \boldsymbol{w}} - \iprod{\boldsymbol{x}, \boldsymbol{w}}
            = -b - (-b) = 0
    \end{equation}
\end{question}

\begin{question}[Update Rule for Perceptron]
\begin{equation} \label{perceptron-update-rule}
    \begin{split}
        \boldsymbol{w} \leftarrow \boldsymbol{w} + y_i \boldsymbol{x_i} \\
        b \leftarrow = b + y_i
    \end{split}
\end{equation}
\end{question}

\begin{question}[Feasibility of Perceptron]
The goal is to find $\boldsymbol{w} \in \mathbb{R}^d, b \in \mathbb{R}$, 
such that $\forall i, y_i \paren{\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b} > 0$.

According to the update rule \ref{perceptron-update-rule},
\begin{equation}
    \begin{split}
        y\left[ \iprod{\boldsymbol{x}, \boldsymbol{w_{new}}} + b_{new} \right]
        &= y\left[ \iprod{\boldsymbol{x}, \boldsymbol{w_{old} + y \boldsymbol{x}}} + b_{old} + y \right] \\
        &= y\left[ \iprod{\boldsymbol{x}, \boldsymbol{w_{old}}} + b_{old} \right] + y\left[ \iprod{\boldsymbol{x} ,y \boldsymbol{x}} + y \right] \\
        &= y\left[ \iprod{\boldsymbol{x}, \boldsymbol{w_{old}}} + b_{old} \right] + y\left[ y \llnorm{\boldsymbol{x}}^2 + y \right] \\
        &= y\left[ \iprod{\boldsymbol{x}, \boldsymbol{w_{old}}} + b_{old} \right] + y^2 \llnorm{\boldsymbol{x}}^2 + y^2 \\
        &= y\left[ \iprod{\boldsymbol{x}, \boldsymbol{w_{old}}} + b_{old} \right] + \underbrace{\llnorm{\boldsymbol{x}}^2 + 1}_{\text{always positive}}
    \end{split}
\end{equation}
Notice that $y \in \left\{ \pm 1 \right\} \Rightarrow y^2 = 1$.

$\llnorm{\boldsymbol{x}}^2 + 1$ is always positive, which means we always increase the confidence $y \hat{y}$ after the update.
\end{question}

\begin{question}[Trick for Hiding the Bias Term -- Padding]
    \begin{equation}
        \begin{split}
            \iprod{\boldsymbol{x}, \boldsymbol{w}} + b 
            & = \iprod{ \begin{pmatrix} \boldsymbol{x} \\ 1 \end{pmatrix}, \begin{pmatrix} \boldsymbol{w} \\ b \end{pmatrix} } \\
            & = \iprod{ \boldsymbol{x_{pad}}, \boldsymbol{w_{pad}}}
        \end{split}
    \end{equation}
    Correspondingly, the update rule can be written as:
    \begin{equation}
        \boldsymbol{w_{pad}} \leftarrow \boldsymbol{w_{pad}} + y \boldsymbol{x_{pad}}
    \end{equation}
\end{question}

\begin{question}[Margin]
    Suppose $\exists \boldsymbol{w^*}$ such that $\forall i, y_i \iprod{\boldsymbol{x_i}, \boldsymbol{w^*}} > 0$.

    We normalize $\boldsymbol{w^*}$ such that $\llnorm{\boldsymbol{w^*}} = 1$.

    In other words, $\boldsymbol{w^*}$ is the normalized weight for the deicision boundary.

    \begin{equation} \label{margin}
        \text{Margin } \gamma \coloneq \min_i \left| \iprod{\boldsymbol{x_i}, \boldsymbol{w^*}} \right|
    \end{equation}
\end{question}

\begin{question}[Convergence Theorem -- Linearly Separable Case]
    Assume that $\forall i, \llnorm{\boldsymbol{x_i}} \leq C$ (i.e. within a circle with radius $C$).

    Then the Perceptron algorithm converges after $\frac{C^2}{\gamma^2}$ mistakes.

    \vspace{5mm}
    \textbf{Proof:}

    Suppose $\boldsymbol{w}$ is the updating weight, and $\theta$ is the angle between $\boldsymbol{w}$ and $\boldsymbol{w^*}$. 

    We have $\iprod{\boldsymbol{w}, \boldsymbol{w^*}} = \llnorm{\boldsymbol{w}} \llnorm{\boldsymbol{w^*}} \cos{\theta} = \llnorm{\boldsymbol{w}} \cos{\theta}$.
    
    After an update, $\llnorm{\boldsymbol{w_{new}}} \cos{\theta_{new}}$ will be
    \begin{equation}
        \begin{split}
            \iprod{\boldsymbol{w} + y \boldsymbol{x}, \boldsymbol{w^*}}
            & = \iprod{\boldsymbol{w}, \boldsymbol{w^*}} + y \iprod{\boldsymbol{x}, \boldsymbol{w^*}} \\
            & = \iprod{\boldsymbol{w}, \boldsymbol{w^*}} + \left|\iprod{\boldsymbol{x}, \boldsymbol{w^*}}\right| \\
            & \geq \iprod{\boldsymbol{w}, \boldsymbol{w^*}} + \gamma
        \end{split}
    \end{equation}

    Let's see the change of $\iprod{\boldsymbol{w_{new}}, \boldsymbol{w_{new}}} = \llnorm{\boldsymbol{w_{new}}}^2$,
    \begin{equation}
        \begin{split}
            \iprod{\boldsymbol{w} + y \boldsymbol{x}, \boldsymbol{w} + y \boldsymbol{x}}
            & = \iprod{\boldsymbol{w}, \boldsymbol{w}} + 2y \iprod{\boldsymbol{w}, \boldsymbol{x}} + y^2 \iprod{\boldsymbol{x}, \boldsymbol{x}} \\
            & = \llnorm{\boldsymbol{w}}^2 + 2y \iprod{\boldsymbol{w}, \boldsymbol{x}} + \llnorm{\boldsymbol{x}}^2
        \end{split}
    \end{equation}
    Because $y\iprod{\boldsymbol{w}, \boldsymbol{x}} < 0$ and $\llnorm{\boldsymbol{x}} \leq C$,
    \begin{equation}
        \begin{split}
            \iprod{\boldsymbol{w} + y \boldsymbol{x}, \boldsymbol{w} + y \boldsymbol{x}}
            & = \llnorm{\boldsymbol{w}}^2 + 2y \iprod{\boldsymbol{w}, \boldsymbol{x}} + \llnorm{\boldsymbol{x}}^2 \\
            & \leq \llnorm{\boldsymbol{w}}^2 + C^2
        \end{split}
    \end{equation}

    Finally, suppose it converges after $M$ updates, we have $\iprod{\boldsymbol{w}, \boldsymbol{w^*}} \geq M \gamma$ and $\llnorm{\boldsymbol{w}}^2 \leq M C^2$
    \begin{equation}
        \begin{split}
            1 = \cos{\theta} & = \frac{\iprod{\boldsymbol{w}, \boldsymbol{w^*}}}{\llnorm{\boldsymbol{w}} \llnorm{\boldsymbol{w^*}}} \\
            & \geq \frac{M \gamma}{\sqrt{M C^2} \times 1} \\
            & = \sqrt{M} \frac{\gamma}{C}
        \end{split}
    \end{equation}

    which means $M \leq \frac{C^2}{\gamma^2}$.
\end{question}

\begin{question}[Perceptron Loss]
    \begin{equation}
        \begin{split}
            l(\boldsymbol{w}, \boldsymbol{x_t}, y_t) 
            & = -y_t \iprod{\boldsymbol{w}, \boldsymbol{x_t}} \mathbb{I}[\text{mistake on }\boldsymbol{x_t}] \\
            & = -\min \left\{ y_t \iprod{\boldsymbol{w}, \boldsymbol{x_i}}, 0 \right\}
        \end{split}
    \end{equation}
    \begin{equation}
        L(\boldsymbol{w}) = -\frac{1}{n} \sum_{t=1}^n y_t \iprod{\boldsymbol{w}, \boldsymbol{x_t}} \mathbb{I} [\text{mistake on }\boldsymbol{x_t}]
    \end{equation}
\end{question}

\section{Linear Regression}

\begin{question}[Least Square Regression]
   \begin{equation}
    \min_{f: \mathcal{X} \rightarrow \mathcal{Y}} \mathbb{E} \llnorm{f(\boldsymbol{X}) - Y}^2
   \end{equation} 

   The optimal regression function is
   \begin{equation}
    f^*(\boldsymbol{x}) = m(x) = \mathbb{E} [Y | \boldsymbol{X} = \boldsymbol{x}]
   \end{equation}

   Calculating it needs to know the distribution, i.e., all pairs $(\boldsymbol{X}, Y)$.
\end{question}

\begin{question}[Bias-Variance Decomposition]
    \begin{equation}
        \begin{split}
            \mathbb{E} \llnorm{f(\boldsymbol{X}) - Y}^2
            & = \mathbb{E} \llnorm{f(\boldsymbol{X}) - m(x) + m(x) - Y}^2 \\
            & = \mathbb{E} \llnorm{f(\boldsymbol{X}) - m(x)}^2 + \mathbb{E} \llnorm{m(x) - Y}^2 + 2 \mathbb{E} \iprod{f(\boldsymbol{X}) - m(x), m(x) - Y} \\
            & = \mathbb{E} \llnorm{f(\boldsymbol{X}) - m(x)}^2 + \mathbb{E} \llnorm{m(x) - Y}^2 + \mathbb{E} \; \mathbb{E}_{Y|\boldsymbol{X}} \left[ \iprod{f(\boldsymbol{X}) - m(x), m(x) - Y} \right] \\
            & = \mathbb{E} \llnorm{f(\boldsymbol{X}) - m(x)}^2 + \mathbb{E} \llnorm{m(x) - Y}^2 + \mathbb{E} \iprod{f(\boldsymbol{X}) - m(x), m(x) - \mathbb{E}_{Y|\boldsymbol{X}} [Y]} \\
            & = \mathbb{E} \llnorm{f(\boldsymbol{X}) - m(x)}^2 + \mathbb{E} \llnorm{m(x) - Y}^2 + \mathbb{E} \iprod{f(\boldsymbol{X}) - m(x), m(x) - m(x)} \\
            & = \mathbb{E} \llnorm{f(\boldsymbol{X}) - m(x)}^2 + \underbrace{\mathbb{E} \llnorm{m(x) - Y}^2}_{\text{noise (variance)}}
        \end{split}
    \end{equation} 

    The last term is the noise (variance), irrelevant to $f$. 
    So, to minimize the squared error, we need $f \approx m$.

    However, $m(\boldsymbol{x})$ is incalculable, because $\mathbb{E} [Y | \boldsymbol{X} = \boldsymbol{x}]$ is unknown.

    Let's learn $f_D$ from the training data $D$. Define $\bar{f}(\boldsymbol{X}) = \mathbb{E}_D[f_D(\boldsymbol{X})]$.
    \begin{equation}
        \begin{split}
            \underbrace{\mathbb{E}_{\boldsymbol{X}, Y, D} \llnorm{f_D(\boldsymbol{X}) - Y}^2}_{\text{test error}}
            & = \mathbb{E}_{\boldsymbol{X}} \llnorm{f_D(\boldsymbol{X}) - m(x)}^2 + \underline{\mathbb{E}_{\boldsymbol{X}, Y} \llnorm{m(x) - Y}^2} \\
            & = \mathbb{E}_{\boldsymbol{X}, D} \llnorm{f_D(\boldsymbol{X}) - \bar{f}(\boldsymbol{X}) + \bar{f}(\boldsymbol{X}) - m(x)}^2 + \underline{\mathbb{E}_{\boldsymbol{X}, Y} \llnorm{m(x) - Y}^2} \\
            & = \mathbb{E}_{\boldsymbol{X}, D} \llnorm{f_D(\boldsymbol{X}) - \bar{f}(\boldsymbol{X})}^2 + \mathbb{E}_{\boldsymbol{X}} \llnorm{\bar{f}(\boldsymbol{X}) - m(x)}^2 \\
               & \qquad + 2 \mathbb{E}_{\boldsymbol{X}, D} \iprod{f_D(\boldsymbol{X}) - \bar{f}(\boldsymbol{X}), \bar{f}(\boldsymbol{X}) - m(x)} \\
               & \qquad + \underline{\mathbb{E}_{\boldsymbol{X}, Y} \llnorm{m(x) - Y}^2} \\
            & = \dots + 2 \mathbb{E}_{\boldsymbol{X}} \mathbb{E}_{D} \iprod{f_D(\boldsymbol{X}) - \bar{f}(\boldsymbol{X}), \bar{f}(\boldsymbol{X}) - m(x)} + \underline{\dots} \\
            & = \dots + 2 \mathbb{E}_{\boldsymbol{X}} \iprod{\mathbb{E}_{D} [f_D(\boldsymbol{X})] - \bar{f}(\boldsymbol{X}), \bar{f}(\boldsymbol{X}) - m(x)} + \underline{\dots} \\
            & = \dots + 2 \mathbb{E}_{\boldsymbol{X}} \iprod{\bar{f}(\boldsymbol{X}) - \bar{f}(\boldsymbol{X}), \bar{f}(\boldsymbol{X}) - m(x)} + \underline{\dots} \\
            & = \dots + 0 + \underline{\dots} \\
            & = \mathbb{E}_{\boldsymbol{X}, D} \llnorm{f_D(\boldsymbol{X}) - \bar{f}(\boldsymbol{X})}^2 + \mathbb{E}_{\boldsymbol{X}} \llnorm{\bar{f}(\boldsymbol{X}) - m(x)}^2 + \underline{\mathbb{E}_{\boldsymbol{X}, Y} \llnorm{m(x) - Y}^2} \\
            & = \underbrace{\mathbb{E}_{\boldsymbol{X}, D} \llnorm{f_D(\boldsymbol{X}) - \mathbb{E}_D [f_D(\boldsymbol{X})]}^2}_{\text{variance}} + \underbrace{\mathbb{E}_{\boldsymbol{X}} \llnorm{\mathbb{E}_D [f_D(\boldsymbol{X})] - m(x)}^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_{\boldsymbol{X}, Y} \llnorm{m(x) - Y}^2}_{\text{noise (variance)}} \\
        \end{split}
    \end{equation}
\end{question}

\begin{question}[Sampling $\rightarrow$ Training]
    Replace expectation with sample average: $\paren{\boldsymbol{X_i}, Y_i} \tilde P$.
    \begin{equation}
        \min_{f: \mathcal{X} \rightarrow \mathcal{Y}} \hat{\mathbb{E}} \llnorm{f(\boldsymbol{X}) - Y}^2 \coloneq \frac{1}{n} \sum_{i=1}^n \llnorm{f(\boldsymbol{X_i}) - Y_i}^2
    \end{equation}
    Uniform law of large numbers: as training data size $n \rightarrow \argmin \mathbb{E}$, 

    $\hat{\mathbb{E}} \rightarrow \mathbb{E}$ and (hopefully) $\argmin \hat{\mathbb{E}} \rightarrow \mathbb{E}$.
\end{question}

\begin{question}[Linear Regression]
    Padding: $\boldsymbol{x} \leftarrow \begin{pmatrix}\boldsymbol{x} \\ 1\end{pmatrix}$, $W \leftarrow \begin{pmatrix}W \\ \boldsymbol{b}\end{pmatrix}$
    \begin{align*}
    X & = [\boldsymbol{x_1}, \dots, \boldsymbol{x_n}] \in \mathbb{R}^{(d+1) \times n}, \\
    Y & = [\boldsymbol{y_1}, \dots, \boldsymbol{y_n}] \in \mathbb{R}^{t \times n}, \\
    W & \in \mathbb{R}^{t \times (d+1)}, \\
    \fnorm{A} & = \sqrt{\sum_{ij} a_{ij}^2}
    \end{align*}
    Linear regression is:
    \begin{equation}
        \min_{W \in \mathbb{R}^{t \times (d+1)}} \frac{1}{n} \fnorm{WX - Y}^2
    \end{equation}
\end{question}

\begin{question}[Optimality Condition]
    If $\boldsymbol{w}$ is a minimizer (or maximizer) of a differentiable function $f$
    \textbf{over an open set}, then $f'(\boldsymbol{w}) = 0$.
\end{question}

\begin{question}[Solving Linear Regression]
    \begin{equation}
        L(W) = \frac{1}{n} \fnorm{WX-Y}^2
    \end{equation}
    \begin{equation}
        \begin{split}
            \nabla_W L(W) & = \frac{2}{n} (WX-Y) X^T = 0 \\
            & \Rightarrow W X X^T = Y X^T \\
            & \Rightarrow W = Y X^T (X X^T)^{-1} \\
        \end{split}
    \end{equation}
\end{question}

\begin{question}[Ill-Conditioning]
    Slight pertubation leads to chaotic behavior, which happens whenever
    $X$ is ill-conditioned, i.e., (close to) rank-deficient.

    \vspace{5mm}
    Rank-deficient X means:
    \begin{enumerate}
        \item two columns in $X$ are linearly dependent (or simply the same)
        \item but the corresponding $y$ might be different
    \end{enumerate}
\end{question}

\begin{question}[Ridge Regression]
    \begin{equation}
        \min_{W} \frac{1}{n} \fnorm{WX - Y}^2 + \lambda \fnorm{W}^2
    \end{equation}
    \begin{equation}
        \begin{split}
            \nabla_W L(W) & = \frac{2}{n} (WX - Y) X^T + 2 \lambda W = 0 \\
            & \Rightarrow W X X^T - Y X^T + \lambda W = 0 \\
            & \Rightarrow W (X X^T + n \lambda I) = Y X^T
        \end{split}
    \end{equation}
    \begin{equation}
        \begin{split}
            & X = U \Sigma V^T \\
            \Rightarrow & X X^T = U \Sigma (V^T V) \Sigma U^T = U \Sigma^2 U^T \\
            \Rightarrow & X X^T + n \lambda I = U \underbrace{(\Sigma^2 + n \lambda I)}_{\text{strictly positive}} U^T \\
            \Rightarrow & X X^T + n \lambda I \text{ is of full-rank}
        \end{split}
    \end{equation}

    $\lambda$ is regularization parameter. $\lambda = \infty \Rightarrow W \equiv \boldsymbol{0}$.
\end{question}

\begin{question}[Regularization $\equiv$ Data Augmentation]
    \begin{equation}
        \frac{1}{n} \fnorm{WX - Y}^2 + \lambda \fnorm{W}^2
        = \frac{1}{n} \fnorm{W \left[X \quad \sqrt{n\lambda} I \right] - \left[Y \quad \boldsymbol{0}\right]}^2
    \end{equation}
\end{question}

\section{Logistic Regression}

\begin{question}[Max Likelihood Estimation]
    Let $\mathcal{Y} = \{0, 1\}$. 
    Learn confidence $p(\boldsymbol{x}; \boldsymbol{w}) \coloneq \Pr (Y=1|X=\boldsymbol{x})$.

    \begin{equation}
        \begin{split}
            \max_{\boldsymbol{w}} \Pr (Y_1 = y_1, \dots, Y_n = y_n)
            & = \max_{\boldsymbol{w}} \prod_{i=1}^n \Pr (Y_i = y_i | X_i = x_i) \\
            & \stackrel{\mathcal{Y}=\{0, 1\}}{=} \max_{\boldsymbol{w}} \prod_{i=1}^n \left[p(\boldsymbol{x_i}; \boldsymbol{w})\right]^{y_i} \left[1 - p(\boldsymbol{x_i}; \boldsymbol{w})\right]^{1 - y_i} \\
        \end{split}
    \end{equation}

    Use negative log-likelihood:

    \begin{equation}
        \min_{\boldsymbol{w}} \sum_{i=1}^n \left[ -y_i \log p(\boldsymbol{x_i}; \boldsymbol{w}) - (1 - y_i) \log (1 - p(\boldsymbol{x_i}; \boldsymbol{w})) \right]
    \end{equation}
\end{question}

\begin{question}[Odds Ratio and Sigmoid]
    \begin{equation}
        \text{Odds Ratio} = \frac{\Pr}{1 - \Pr}
    \end{equation}

    Assume $\log \frac{p(\boldsymbol{x};\boldsymbol{w})}{1 - p(\boldsymbol{x};\boldsymbol{w})} = \iprod{\boldsymbol{x}, \boldsymbol{w}}$.

    \vspace{5mm}
    The Sigmoid transformation is:
    \begin{equation}
        p(\boldsymbol{x}; \boldsymbol{w}) = \frac{1}{1 + \exp (-\iprod{\boldsymbol{x}, \boldsymbol{w}})}
    \end{equation}
\end{question}

\begin{question}[Logistic Regression]
    Plug the sigmoid in the negative log-likelihood:
    \begin{equation}
        \begin{split}
        & \min_{\boldsymbol{w}} \sum_{i=1}^n \left[ -y_i \log p(\boldsymbol{x_i}; \boldsymbol{w}) - (1 - y_i) \log (1 - p(\boldsymbol{x_i}; \boldsymbol{w})) \right] \\
        = & \min_{\boldsymbol{w}} \sum_{i=1}^n \left[ y_i \log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})] - (1 - y_i) \log \left(1 - \frac{1}{1 + \exp (-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})}\right) \right] \\
        = & \min_{\boldsymbol{w}} \sum_{i=1}^n \left[ y_i \log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})] - (1 - y_i) \log \left(\frac{\exp (-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})}{1 + \exp (-\iprod{\boldsymbol{x}, \boldsymbol{w}})}\right) \right] \\
        = & \min_{\boldsymbol{w}} \sum_{i=1}^n \left[ y_i \log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})] + (1 - y_i) \left[\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + \log(1 + \exp (-\iprod{\boldsymbol{x}, \boldsymbol{w}}))\right] \right] \\
        = & \min_{\boldsymbol{w}} \sum_{i=1}^n \left[ \log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})] + (1 - y_i) \iprod{\boldsymbol{x_i}, \boldsymbol{w}} \right] \\
        \end{split}
    \end{equation}
    Because $y_i \in \{0, 1\}$, let's map it to $\{\pm 1\}$.
    \begin{equation}
        \begin{split}
            L(\boldsymbol{w})
            & \stackrel{y_i \in \{0, 1\}}{=} \log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})] + (1 - y_i) \iprod{\boldsymbol{x_i}, \boldsymbol{w}} \\
            & \stackrel{y_i \in \{0, 1\}}{=} \log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})] + \log \left[ \exp ( (1 - y_i) \iprod{\boldsymbol{x_i}, \boldsymbol{w}} ) \right] \\
            & \stackrel{y_i \in \{0, 1\}}{=} \log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}}) \cdot \exp ( (1 - y_i) \iprod{\boldsymbol{x_i}, \boldsymbol{w}} ) ] \\
            & \stackrel{y_i \in \{0, 1\}}{=} \log [\exp((1-y_i)\iprod{\boldsymbol{x_i}, \boldsymbol{w}}) + \exp(-y_i \iprod{\boldsymbol{x_i}, \boldsymbol{w}})] \\
            & \stackrel{y_i \in \{0, 1\}}{=} \begin{cases}
                \log [\exp(\iprod{\boldsymbol{x_i}, \boldsymbol{w}}) + 1] & y_i = 0 \\
                \log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})] & y_i = 1
            \end{cases} \\
            & \stackrel{y_i \in \{\pm 1\}}{=} \log [ 1 + \exp(-y_i \iprod{\boldsymbol{x_i}, \boldsymbol{w}}) ]
        \end{split}
    \end{equation}
\end{question}

\begin{question}[Multi-Class: Sigmoid $\rightarrow$ Softmax]
    \begin{equation}
        \Pr(Y=k | X = \boldsymbol{x}; \boldsymbol{W} = [\boldsymbol{w_1}, \dots, \boldsymbol{w_c}])
        = \frac{\exp(\iprod{\boldsymbol{x}, \boldsymbol{w_k}})}{\sum_{l=1}^c \exp(\iprod{\boldsymbol{x}, \boldsymbol{w_l}})}
    \end{equation}
    Maximum likelihood estimation (log loss, cross-entropy loss):
    \begin{equation}
        \min_{\boldsymbol{W}} \sum_{i=1}^n \left[ -\log \frac{\exp(\iprod{\boldsymbol{x}, \boldsymbol{w_k}})}{\sum_{l=1}^c \exp(\iprod{\boldsymbol{x}, \boldsymbol{w_l}})} \right]
    \end{equation}
\end{question}

\section{Hard-Margin Support Vector Machines}

\begin{question}[Distance from a Point to a Hyperplane]
    Let $H \coloneq \{\boldsymbol{x}: \iprod{\boldsymbol{x}, \boldsymbol{w}} + b = 0\}$, $\boldsymbol{x}$ be any vector in $H$.
    \begin{equation}
        \text{Distance}(\boldsymbol{x_i}, \boldsymbol{w})
        = \frac{|\iprod{\boldsymbol{x_i} - \boldsymbol{x}, \boldsymbol{w}}|}{\llnorm{\boldsymbol{w}}}
        = \frac{|\iprod{\boldsymbol{x_i}, \boldsymbol{w}} - \iprod{\boldsymbol{x}, \boldsymbol{w}}|}{\llnorm{\boldsymbol{w}}}
        \stackrel{\boldsymbol{x} \in H}{=} \frac{|\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b|}{\llnorm{\boldsymbol{w}}}
        \stackrel{y_i \hat{y_i} > 0}{=} \frac{y_i \hat{y_i}}{\llnorm{\boldsymbol{w}}}
    \end{equation}
\end{question}

\begin{question}[Margin Maximization]
    Margin is the smallest distance to $H$ among all separable data.
    \begin{equation}
        \max_{\boldsymbol{w}, b} \min_i \frac{y_i \hat{y_i}}{\llnorm{\boldsymbol{w}}} 
        \text{, such that }
        \forall i, y_i \hat{y_i} > 0
    \end{equation}
    Let $c > 0$, then $\boldsymbol{w} = c\boldsymbol{w}, b = cb$ keeps the loss same:
    \begin{equation}
        \begin{split}
            \max_{\boldsymbol{w}, b} \min_i \frac{c y_i \hat{y_i}}{\llnorm{c\boldsymbol{w}}} 
            & = \max_{\boldsymbol{w}, b} \min_i \frac{y_i (\iprod{\boldsymbol{x}, c\boldsymbol{w}} + cb)}{\llnorm{c\boldsymbol{w}}} \\
            & = \max_{\boldsymbol{w}, b} \min_i \frac{c y_i (\iprod{\boldsymbol{x}, \boldsymbol{w}} + b)}{c \llnorm{\boldsymbol{w}}} \\
            & = \max_{\boldsymbol{w}, b} \min_i \frac{y_i (\iprod{\boldsymbol{x}, \boldsymbol{w}} + b)}{\llnorm{\boldsymbol{w}}} \\
            & = \max_{\boldsymbol{w}, b} \min_i \frac{y_i \hat{y_i}}{\llnorm{\boldsymbol{w}}} 
        \end{split}
    \end{equation}
    Let $c = \frac{1}{\min_i y_i \hat{y_i}}$,
    \begin{equation}
        \begin{split}
            \max_{\boldsymbol{w}, b} \min_i \frac{c y_i \hat{y_i}}{c \llnorm{\boldsymbol{w}}} 
            & = \max_{\boldsymbol{w}, b} \frac{1}{c \llnorm{\boldsymbol{w}}} \\
            & = \max_{\boldsymbol{w}, b} \frac{1}{\llnorm{\boldsymbol{w}}} \text{ s.t. } \min_i y_i \hat{y_i} = 1
        \end{split}
    \end{equation}
    Max $\rightarrow$ Min:
    \begin{equation}
        \min_{\boldsymbol{w}, b} \frac{1}{2} \llnorm{\boldsymbol{w}}^2 \text{ s.t. } \forall i, y_i \hat{y_i} \geq 1
    \end{equation}
\end{question}

\begin{question}[Hard-Margin SVM v.s. Perceptron]
    \begin{align}
        \text{Hard-Margin SVM:} \qquad
        & \min_{\boldsymbol{w}, b} \frac{1}{2} \llnorm{\boldsymbol{w}}^2 & \text{ s.t. } \forall i, y_i \hat{y_i} \geq 1 \\
        \text{Perceptron:} \qquad
        & \min_{\boldsymbol{w}, b} 0 & \text{ s.t. } \forall i, y_i \hat{y_i} \geq 1
    \end{align}
\end{question}

\begin{question}[Lagrangian Dual]
    Dual variables $\boldsymbol{\alpha} \in \mathbb{R}^n$.
    \begin{equation}
        \begin{split}
            \min_{\boldsymbol{w}, b} \; \max_{\boldsymbol{\alpha} \geq 0} \; \frac{1}{2} \llnorm{\boldsymbol{w}}^2 - \sum_i \alpha_i \left[ y_i (\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b) - 1 \right] 
            & = \min_{\boldsymbol{w}, b} \; \begin{cases}
                + \infty, & \text{if } \exists i, y_i (\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b) < 1 \quad (\alpha_i = + \infty) \\
                \frac{1}{2} \llnorm{\boldsymbol{w}}^2 , & \text{if } \forall i, y_i (\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b) \geq 1 \quad (\forall i, \alpha_i = 0) \\
            \end{cases} \\
            & = \min_{\boldsymbol{w}, b} \; \frac{1}{2} \llnorm{\boldsymbol{w}}^2, \quad \text{s.t. } \forall i, y_i (\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b) \geq 1 \\
            & = \min_{\boldsymbol{w}, b} \frac{1}{2} \llnorm{\boldsymbol{w}}^2 \text{ s.t. } \forall i, y_i \hat{y_i} \geq 1
        \end{split}
    \end{equation}

    Swap $\min$ and $\max$:
    \begin{equation} \label{lagrangian-max-min}
        \max_{\boldsymbol{\alpha} \geq 0} \; \min_{\boldsymbol{w}, b} \; \frac{1}{2} \llnorm{\boldsymbol{w}}^2 - \sum_i \alpha_i \left[ y_i (\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b) - 1 \right] 
    \end{equation}

    Solve inner problem by setting derivative to 0:
    \begin{equation} \label{solve-derivative-for-svm-dual}
        \frac{\delta}{\delta \boldsymbol{w}} = \boldsymbol{w} - \sum_i \alpha_i y_i \boldsymbol{x_i} = 0,
        \qquad
        \frac{\delta}{\delta b} = - \sum_i \alpha_i y_i = 0,
    \end{equation}

    Plug them into the loss:
    \begin{equation}
        \begin{split}
            L(\boldsymbol{\alpha}) 
            & = \min_{\boldsymbol{w}, b} \; \frac{1}{2} \llnorm{\boldsymbol{w}}^2 - \sum_i \alpha_i \left[ y_i (\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b) - 1 \right] \\
            & = \frac{1}{2} \llnorm{\sum_i \alpha_i y_i \boldsymbol{x_i}}^2 - \sum_i \alpha_i y_i \iprod{\boldsymbol{x_i}, \boldsymbol{w}} - \sum_i \alpha_i y_i b + \sum_i \alpha_i \\
            & = \frac{1}{2} \llnorm{\sum_i \alpha_i y_i \boldsymbol{x_i}}^2 - \iprod{\sum_i \alpha_i y_i \boldsymbol{x_i}, \sum_i \alpha_i y_i \boldsymbol{x_i}} - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
            & = \frac{1}{2} \llnorm{\sum_i \alpha_i y_i \boldsymbol{x_i}}^2 - \llnorm{\sum_i \alpha_i y_i \boldsymbol{x_i}}^2 + \sum_i \alpha_i \\
            & = \sum_i \alpha_i - \frac{1}{2} \llnorm{\sum_i \alpha_i y_i \boldsymbol{x_i}}^2,
            \qquad \text{s.t. } \sum_i \alpha_i y_i = 0
        \end{split}
    \end{equation}

    So, \ref{lagrangian-max-min} is solved as:
    \begin{equation}
        \max_{\boldsymbol{\alpha} \geq 0} \sum_i \alpha_i - \frac{1}{2} \llnorm{\sum_i \alpha_i y_i \boldsymbol{x_i}}^2, 
        \qquad \text{s.t. } \sum_i \alpha_i y_i = 0
    \end{equation}

    Max $\rightarrow$ min and expand the norm:
    \begin{equation}
        \min_{\boldsymbol{\alpha} \geq 0} - \sum_i \alpha_i + \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \underbrace{\iprod{\boldsymbol{x_i}, \boldsymbol{x_j}}}_{\text{Kernel, closed form w.r.t. }\boldsymbol{x_i}, \boldsymbol{x_j}}, 
        \qquad \text{s.t. } \sum_i \alpha_i y_i = 0
    \end{equation}
\end{question}

\begin{question}[Support Vectors]
    From \ref{solve-derivative-for-svm-dual}, we know $\boldsymbol{w} = \sum_i \alpha_i y_i \boldsymbol{x_i}$.

    Vectors with $\alpha_i \neq 0$ are support vectors, which lie on the margin.
\end{question}

\section{Soft-Margin Support Vector Machines}

\begin{question}[Goal]
    minimize over $\boldsymbol{w}, b$,
    \begin{equation}
        \Pr(Y \neq \sign(\hat Y)) = \Pr(Y \hat Y \leq 0)
        = \mathbb{E} \; \underbrace{\mathbb{I} [Y \hat Y \leq 0]}_{indicator function}
        \coloneq \mathbb{E} \; l_{0-1}(Y \hat Y)
    \end{equation}
    where $\hat Y = \iprod{X, \boldsymbol{w}} + b, Y = \pm 1$.

    \begin{equation}
        \begin{split}
            \min_{\hat Y: \mathcal{X} \rightarrow \mathbb{R}} \mathbb{E} \; l_{0-1}(Y \hat Y)
            & = \min_{\hat Y: \mathcal{X} \rightarrow \mathbb{R}} \mathbb{E}_X \mathbb{E}_{Y|X} \; l_{0-1}(Y \hat Y) \\
            & = \mathbb{E}_X \min_{\hat Y: \mathcal{X} \rightarrow \mathbb{R}} \mathbb{E}_{Y|X} \; l_{0-1}(Y \hat Y) \\
        \end{split}
    \end{equation}

    Minimizing the 0-1 error is \textbf{NP-hard}.
\end{question}

\begin{question}[Bayes Rule]
    \begin{equation}
        \eta(\boldsymbol{x}) \coloneq \argmax_{\hat y \in \mathbb{R}} \Pr (Y = \hat y | X = \boldsymbol{x})
    \end{equation}
    \begin{equation}
        \eta(\boldsymbol{x}) \coloneq \argmin_{\hat y \in \mathbb{R}} \mathbb{E}_{Y|X=\boldsymbol{x}} \; l_{0-1}(Y \hat y)
    \end{equation}
\end{question}

\begin{question}[Classification Calibrated]
    A loss $l(y \hat y)$ is classification calibrated, iff $\forall \boldsymbol{x}$,
    \begin{equation*}
    \hat y(\boldsymbol{x}) \coloneq \argmin_{\hat y \in \mathbb{R}} \mathbb{E}_{Y | X = \boldsymbol{x}} \; l(Y \hat y)
    \end{equation*}
    has the same sign as the Bayes rule $\eta(\boldsymbol{x}) \coloneq \argmin_{\hat y \in \mathbb{R}} \mathbb{E}_{Y|X=\boldsymbol{x}} \; l_{0-1}(Y \hat y)$

    Notice: $\eta(\boldsymbol{x}), \hat y (\boldsymbol{x})$ provide the \textbf{score}, their sign provides the prediction.
\end{question}

\begin{question}[Characterization under Convexity]
    Any \textbf{convex} loss $l$ is classification calibrated iff $l$ is differentiable
    at $0$ and $l'(0) < 0$.
\end{question}

\begin{question}[Hinge Loss]
    \begin{equation}
        l_{hinge}(y \hat y) = (1 - y \hat y)^+ \coloneq \max \{0, 1 - y \hat y\}
        = \begin{cases}
            1 - y \hat y, & \text{if } y \hat y < 1 \\
            0, & \text{otherwise}
        \end{cases}
    \end{equation}
    The classifier that minimizes the expected hinge loss minimizes the expected 0-1 loss.
\end{question}

\begin{question}[Soft-Margin SVM]
    \begin{equation}
        \min_{\boldsymbol{w}, b} \frac{1}{2} \llnorm{\boldsymbol{w}}^2 + C \cdot \sum_i l_{hinge}(y_i \hat y_i), \quad \text{s.t. } \hat y_i = \iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b
    \end{equation}
\end{question}
    

\begin{question}[Lagrangian Dual]
    Apply $C \cdot l_{hinge}(t) \coloneq \max\{0, C(1 - t)\} = \max_{0 \leq \alpha \leq C} \alpha (1 - t)$
    \begin{equation}
        \min_{\boldsymbol{w}, b} \max_{0 \leq \boldsymbol{\alpha} \leq C} \frac{1}{2} \llnorm{\boldsymbol{w}}^2 + \sum_i \alpha_i (1 - y_i \hat y_i)
    \end{equation}

    Swap $\min$ with $\max$:
    \begin{equation}
        \max_{0 \leq \boldsymbol{\alpha} \leq C} \min_{\boldsymbol{w}, b} \frac{1}{2} \llnorm{\boldsymbol{w}}^2 + \sum_i \alpha_i (1 - y_i \hat y_i)
    \end{equation}

    Solve it by setting derivative to 0:
    \begin{equation} 
        \frac{\delta}{\delta \boldsymbol{w}} = \boldsymbol{w} - \sum_i \alpha_i y_i \boldsymbol{x_i} = 0,
        \qquad
        \frac{\delta}{\delta b} = - \sum_i \alpha_i y_i = 0,
    \end{equation}

    Plug them into the loss:
    \begin{equation}
        \begin{split}
            \max_{0 \leq \boldsymbol{\alpha} \leq C} \min_{\boldsymbol{w}, b} \frac{1}{2} \llnorm{\boldsymbol{w}}^2 + \sum_i \alpha_i (1 - y_i \hat y_i)
            & = \max_{0 \leq \boldsymbol{\alpha} \leq C} \min_{\boldsymbol{w}, b} \; \frac{1}{2} \llnorm{\boldsymbol{w}}^2 + \sum_i \alpha_i \left[ 1 - y_i (\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b) \right] \\
            & = \max_{0 \leq \boldsymbol{\alpha} \leq C} \sum_i \alpha_i - \frac{1}{2} \llnorm{\sum_i \alpha_i y_i \boldsymbol{x_i}}^2,
            \qquad \text{s.t. } \sum_i \alpha_i y_i = 0
        \end{split}
    \end{equation}

    Max $\rightarrow$ min and expand the norm:
    \begin{equation}
        \min_{0 \leq \boldsymbol{\alpha} \leq C} - \sum_i \alpha_i + \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \underbrace{\iprod{\boldsymbol{x_i}, \boldsymbol{x_j}}}_{\text{Kernel, closed form w.r.t. }\boldsymbol{x_i}, \boldsymbol{x_j}}, 
        \qquad \text{s.t. } \sum_i \alpha_i y_i = 0
    \end{equation}

    $C \rightarrow \infty \Rightarrow \text{Hard-margin SVM}$,
    $C \rightarrow 0 \Rightarrow \text{a constant classifier}$
\end{question}

\section{Reproducing Kernels}

\begin{question}[(Reproducing) Kernels]
    $k: \mathcal(X) \times \mathcal{X} \rightarrow \mathbb{R}$ is a (reproducing) kernel
    iff there exists some $\Phi: \mathcal{X} \rightarrow \mathcal{H}$ so that
    $\iprod{\Phi(\boldsymbol{x}), \Phi(\boldsymbol{z})} = k(\boldsymbol{x}, \boldsymbol{z})$.
    
    \begin{itemize}
        \item A feature transform $\Phi$ determines the corresponding kernel $k$.
        \item A kernel $k$ determines some feature transforms $\Phi$, but may not be unique. \\
        E.g. $\iprod{\phi(\boldsymbol{x}), \phi(\boldsymbol{z})} = \iprod{\phi'(\boldsymbol{x}), \phi'(\boldsymbol{z})}$
        \begin{enumerate}
            \item $\phi(x) \coloneq [x_1^2, \sqrt{2} x_1 x_2] \in \mathbb{R}^2$
            \item $\phi'(x) \coloneq [x_1^2, x_1 x_2, x_1 x_2] \in \mathbb{R}^3$
        \end{enumerate}
    \end{itemize}
\end{question}

\begin{question}[Mercer's Theorem]
    $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is a kernel, \\
    iff $\forall n \in \mathbb{N}, \forall \boldsymbol{x_1}, \dots, \boldsymbol{x_n} \in \mathcal{X}$,
    the kernel matrix $K$ such that $K_{ij} \coloneq k(\boldsymbol{x_i}, \boldsymbol{x_j})$ is symmetric and positive semi-definite (PSD).

    \begin{equation}
        k \text{ is a kernel} \Leftrightarrow \begin{cases}
            K_{ij} = K_{ji} & \text{(symmetric)}\\
            \iprod{\boldsymbol{\alpha}, K \boldsymbol{\alpha}} = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j K_{ij} \geq 0 \qquad \forall \boldsymbol{\alpha} \in \mathbb{R}^n & \text{(PSD)}
        \end{cases}
    \end{equation}
\end{question}

\begin{question}[Symmetric PSD]
    For a symmetric matrix $A$, the following conditions are equivalent.
    \begin{enumerate}
        \item $A \succeq 0$
        \item $A = U^T U$ for some matrix $U$
        \item $x^T A x \geq 0$ for every $x \in \mathbb{R}^n$
        \item All principal minors of $A$ are nonnegative
    \end{enumerate}
\end{question}

\section{Gradient Descent}

\begin{question}[Gradient Descent Template]
    Choose initial point $x^{(0)} \in \mathbb{R}^d$ and repeat:
    \begin{equation}
        x^{(k)} = x^{(k-1)} - \underbrace{\eta}_{\text{step size}} \nabla f(x^{(k-1)}), \quad k = 1, 2, \dots
    \end{equation}
\end{question}

\begin{question}[Interpretation from Taylor Expansion]
    Expand $f$ locally at $x$:
    \begin{equation}
        \begin{split}
            f(y) & \approx f(x) + \nabla f(x)^T(y-x) + \frac{1}{2t}\llnorm{y-x}^2 \\
            \Rightarrow \min_y f(y) & \approx \min_y \left[ f(x) + \nabla f(x)^T(y-x) + \frac{1}{2t}\llnorm{y-x}^2 \right] \\
        \end{split}
    \end{equation}
    When $y - x = \frac{\nabla f(x)}{-2 \frac{1}{2t}} = -t \nabla f(x) \Longrightarrow y = x - t \nabla f(x)$, it reaches the minimum.
\end{question}

\begin{question}[$L$-smooth or $L$-Lipschitz Continuous]
    $f$ is convex and differentiable. $\nabla f$ is $L$-Lipschitz continuous ($L$-smooth):
    \begin{equation}
        L \boldsymbol{I}  \succeq \nabla^2 f(x), \forall x
    \end{equation}
\end{question}

\begin{question}[Convergence Rate for Convex Case]
    Assume $f$ is $L$-smooth.
    Gradient descent with fixed step size $t \leq \frac{1}{L}$ satisfies:
    \begin{equation}
        f(x^{(k)}) - f(x^*) \leq \frac{\llnorm{x^{(0)} - x^*}^2}{2tk}
    \end{equation}
    We say gradient descent has convergence rate $O(\frac{1}{k})$,
    i.e. $f(x^{(k)}) - f(x^*) \leq \epsilon$ can be achieved using only $O(\frac{1}{\epsilon})$ iterations.

    \vspace{5mm}
    \textbf{Proof}

    \begin{equation}
        \begin{split}
            f(y) 
            & = f(x) + \nabla f(x)^T (y-x) + \frac{1}{2} (y-x)^T \nabla^2 f(\xi)(y-x) \\
            & \leq f(x) + \nabla f(x)^T(y-x) + \frac{1}{2} L \llnorm{y-x}^2 \qquad (L\text{-smooth}, L \boldsymbol{I} \succeq \nabla^2 f(\xi))
        \end{split}
    \end{equation}

    Plug in gradient descent:
    \begin{equation} \label{convergence-1}
        \begin{split}
            f(x^+) 
            & = f(y) \\
            & \leq f(x) + \nabla f(x)^T(x - t\nabla f(x)-x) + \frac{1}{2} L \llnorm{x - t\nabla f(x)-x}^2 \\
            & = f(x) -(1 - \frac{1}{2}Lt) t \llnorm{\nabla f(x)}^2 \\
            & \leq f(x) - \frac{1}{2} t \llnorm{\nabla f(x)}^2 \quad (t \leq \frac{1}{L})
        \end{split}
    \end{equation}

    $f$ is convex $\Rightarrow f(x^*) \geq f(x) + \nabla f(X)^T(x^* - x) \Rightarrow f(x) \leq f(x^*) + \nabla f(x)^T(x - x^*)$

    Plug this into \ref{convergence-1}:
    \begin{equation}
        \begin{split}
            & f(x^+) \leq f(x^*) + \nabla f(x)^T (x - x^*) - \frac{1}{2} t \llnorm{\nabla f(x)}^2 \\
            \Rightarrow & f(x^+) - f(x^*) \leq \frac{1}{2t}\left( 2t \nabla f(x)^T(x - x^*) - t^2 \llnorm{\nabla f(x)}^2 \right) \\
            \Rightarrow & f(x^+) - f(x^*) \leq \frac{1}{2t}\left( 2t \nabla f(x)^T(x - x^*) - t^2 \llnorm{\nabla f(x)}^2 - \llnorm{x-x^*}^2 + \llnorm{x-x^*}^2 \right) \\
            \Rightarrow & f(x^+) - f(x^*) \leq \frac{1}{2t}\left( \llnorm{x-x^*}^2 - (\llnorm{x-x^*}^2 + t^2 \llnorm{\nabla f(x)}^2 - 2t \nabla f(x)^T(x - x^*) ) \right) \\
            \Rightarrow & f(x^+) - f(x^*) \leq \frac{1}{2t}\left( \llnorm{x-x^*}^2 - \llnorm{x - x^* - t\nabla f(x)}^2 \right) \\
            \Rightarrow & f(x^+) - f(x^*) \leq \frac{1}{2t}\left( \llnorm{x-x^*}^2 - \llnorm{x^+ - x^*}^2 \right) \\
        \end{split}
    \end{equation}

    Viewing $x^+$ as $x^{(i)}$ and $x$ as $x^{(x-1)}$:
    \begin{equation}
        \begin{split}
            \sum_{i=1}^k \paren{ f(x^{(i)}) - f(x^*) }
            & \leq \sum_{i=1}^k \frac{1}{2t} \paren{ \llnorm{x^{(i-1)} - x^*}^2 - \llnorm{x^{(i)} - x^*}^2 } \\
            & = \frac{1}{2t} \paren{ \llnorm{x^{(0)} - x^*}^2 - \llnorm{x^{(k)} - x^*}^2 } \\
            & \leq \frac{1}{2t} \llnorm{x^{(0)} - x^*}^2 \\
        \end{split}
    \end{equation}

    which implies
    \begin{equation}
        f(x^{(k)}) \leq \frac{1}{k} \sum_{i=1}^k f(x^{(i)}) \leq f(x^*) + \frac{\llnorm{x^{(0)} - x^*}^2}{2tk}
    \end{equation}
\end{question}

\begin{question}[Convergence Rate for Strong Convexity]
    $f$ is differentiable, $L$-smooth, and $m$-strongly convex.

    $m$-strong convexity of $f$ means $f(x) - \frac{m}{2} \llnorm{x}^2$ is convex, i.e. $\nabla^2 f(x) \succeq m \boldsymbol{I}$

    Then, there is a constant $0 < \gamma < 1$ such that gradient descent with fixed step size $t \leq \frac{2}{m + L}$ satisfies:
    \begin{equation}
        f(x^{(k)}) - f(x^*) \leq \gamma^k \frac{L}{2} \llnorm{x^{(0)} - x^*}^2
    \end{equation}
    Rate is $O(\gamma^k)$. Only $O(\log_{\frac{1}{\gamma}}(\frac{1}{\epsilon}))$ iterations needed.
\end{question}

\begin{question}[Convergence Rate for Non-Convex Case]
    $f$ is differentiable and $L$-smooth, but non-convex.

    Gradient descent with fixed step size $t \leq \frac{1}{L}$ satisifes:
    \begin{equation}
        \min_{i=0,\dots,k} \llnorm{\nabla f(x^{(i)})} \leq \sqrt(\frac{2 (f(x^{(0)} - f^*))}{t(k+1)})
    \end{equation}

    Rate is $O(\frac{1}{\sqrt{k}})$ for finding stationary point.
    $O(\frac{1}{\epsilon^2})$ iterations are needed.
\end{question}

\begin{question}[Convergence Rate for Stochastic Gradient Descent]
    For convex and $L$-smooth $f$,

    \begin{itemize}
        \item Gradient Descent
        \begin{equation}
            \boldsymbol{w^+} = \boldsymbol{w} - t \cdot \frac{1}{n} \sum_{i=1}^n \nabla f_i(\boldsymbol{w})
        \end{equation}
            \begin{itemize}
                \item Step size $t \leq \frac{1}{L}$
                \item Time complexity $O(\frac{n}{\epsilon})$
            \end{itemize}
        \item Stochastic Gradient Descent
        \begin{equation}
            \boldsymbol{w^+} = \boldsymbol{w} - t \cdot \nabla f_{I_{random}}(\boldsymbol{w})
        \end{equation}
            \begin{itemize}
                \item Step size $t = \frac{1}{k}, k = 1, 2, 3, \dots$ (adaptive step size)
                \item Time complexity $O(\frac{1}{\epsilon^2})$
            \end{itemize}
    \end{itemize}
\end{question}

\section{Fully-Connected Neural Networks}

\begin{question}[Forward and Backward Pass of a 2-Layer MLP]
    A 2-layer MLP ($k$ is the NN width, $c$ is the output dim):
    \begin{align}
        \boldsymbol{x} & = \text{input} & (\boldsymbol{x} \in \mathbb{R}^d) \\
        \boldsymbol{z} & = \boldsymbol{W} \boldsymbol{x} + \boldsymbol{b_1} & (\boldsymbol{W} \in \mathbb{R}^{k \times d}, \boldsymbol{z}, \boldsymbol{b} \in \mathbb{R}^k) \\
        \boldsymbol{h} & = \text{ReLU}(\boldsymbol{z}) & (\boldsymbol{h} \in \mathbb{R}^k) \\
        \boldsymbol{\theta} & = \boldsymbol{U} \boldsymbol{h} + \boldsymbol{b_2} & (\boldsymbol{U} \in \mathbb{R}^{c \times k}, \boldsymbol{\theta}, \boldsymbol{b_2} \in \mathbb{R}^c) \\
        J & = \frac{1}{2} \llnorm{\boldsymbol{\theta} - \boldsymbol{y}}^2 & (\boldsymbol{y} \in \mathbb{R}^c, J \in \mathbb{R})
    \end{align}
    \begin{align}
        \text{ReLU} = \begin{cases}
            x & x > 0 \\
            0 & x \leq 0
        \end{cases} \\
        \text{ReLU}' = \begin{cases}
            1 & x > 0 \\
            0 & x \leq 0
        \end{cases}
    \end{align}
    
    Backward pass ($\odot$ is the Hadamard product, i.e. element-wise product):
    \begin{align}
        \frac{\delta J}{\delta \boldsymbol{\theta}} & = \boldsymbol{\theta} - \boldsymbol{y} \\
        \frac{\delta J}{\delta \boldsymbol{U}} & = \frac{\delta J}{\delta \boldsymbol{\theta}} \circ \frac{\delta \boldsymbol{\theta}}{\delta \boldsymbol{U}} = (\boldsymbol{\theta} - \boldsymbol{y}) \boldsymbol{h}^T \\
        \frac{\delta J}{\delta \boldsymbol{b_2}} & = \frac{\delta J}{\delta \boldsymbol{\theta}} \circ \frac{\delta \boldsymbol{\theta}}{\delta \boldsymbol{b_2}} = \boldsymbol{\theta} - \boldsymbol{y} \\
        \frac{\delta J}{\delta \boldsymbol{h}} & = \frac{\delta J}{\delta \boldsymbol{\theta}} \circ \frac{\delta \boldsymbol{\theta}}{\delta \boldsymbol{h}} = \boldsymbol{U^T}(\boldsymbol{\theta} - \boldsymbol{y}) \\
        \frac{\delta J}{\delta \boldsymbol{z}} & = \frac{\delta J}{\delta \boldsymbol{h}} \circ \frac{\delta \boldsymbol{h}}{\delta \boldsymbol{z}} = \boldsymbol{U^T}(\boldsymbol{\theta} - \boldsymbol{y}) \odot \text{ReLU}'(\boldsymbol{z}) \\
        \frac{\delta J}{\delta \boldsymbol{W}} & = \frac{\delta J}{\delta \boldsymbol{z}} \circ \frac{\delta \boldsymbol{z}}{\delta \boldsymbol{W}} = \boldsymbol{U^T}(\boldsymbol{\theta} - \boldsymbol{y}) \odot \text{ReLU}'(\boldsymbol{z}) \boldsymbol{x}^T \\
        \frac{\delta J}{\delta \boldsymbol{b_1}} & = \frac{\delta J}{\delta \boldsymbol{z}} \circ \frac{\delta \boldsymbol{z}}{\delta \boldsymbol{b_1}} = \boldsymbol{U^T}(\boldsymbol{\theta} - \boldsymbol{y}) \odot \text{ReLU}'(\boldsymbol{z}) \\
    \end{align}
\end{question}

\begin{question}[Universal Approximation Theorem by 2-Layer NNs]
    For any continuous function $f: \mathbb{R}^d \rightarrow \mathbb{R}^c$
    and any $\epsilon > 0$,
    there exists $k \in \mathbb{N}, \boldsymbol{W} \in \mathbb{R}^{k \times d}, \boldsymbol{b} \in \mathbb{R}^k, \boldsymbol{U} \in \mathbb{R}^{c \times k}$
    such that
    \begin{equation}
        \sup_{\boldsymbol{x}} \llnorm{f(\boldsymbol{x}) - g(\boldsymbol{x})} < \epsilon
    \end{equation}
    where $g(\boldsymbol{x}) = \boldsymbol{U}(\sigma(\boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}))$
    and $\sigma$ is the element-wise ReLU operation.

    As long as the 2-layer MLP is wide enough, it can approximate any
    continuous function arbitrarily closely.
\end{question}

\section{Convolutional Neural Networks}

\begin{question}[Controlling the Convolution]
    Hyperparameters.

    \begin{itemize}
        \item \textbf{Filter (kernel) size}: width $times$ height.
        \item \textbf{Number of filters (kernels)}. \\
        Weights are not shared between different filters (kernels)
        \item \textbf{Stride}: how many pixels the filter moves each time.
        \item \textbf{Padding}: add zeros around the boundary of the input.
    \end{itemize}
\end{question}

\begin{question}[Size Calculation]
    \begin{align*}
        \text{Input size: } & m \times n \times c_{in} \\
        \text{Filter size: } & a \times b \times c_{in} \\
        \text{Stride: } & s \times t \\
        \text{Padding: } & p \times q
    \end{align*}

    Output size:
    \begin{equation}
        \floor*{1 + \frac{m + 2p - a}{s}} \times \floor*{1 + \frac{n + 2q - b}{t}}
    \end{equation}
\end{question}

\part{For Final}

\section{Transformer}

\begin{question}[Attention Layer Inputs and Outputs]
    Inputs: $V \in \mathcal{R}^{n \times d}$, $K \in \mathcal{R}^{n \times d}$, $Q \in \mathcal{R}^{m \times d}$,
    Outputs: an $m \times d$ matrix.

    \begin{itemize}
        \item \textbf{Self Attention}: $m = n$,
        \item \textbf{Cross Attention}: $m \neq n$ where $m$ is the sequence length of decoder, $n$ is the sequence length of encoder.
    \end{itemize}
\end{question}

\begin{question}[Attention Layer Calculation]
    \begin{equation}
        \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V
    \end{equation}

    Softmax is row-wise, i.e. for each row of $QK^T$, it is normalized to sum to 1.
\end{question}

\begin{question}[Learnable Attention Layer]
    \begin{equation}
        \text{Attention}(XW^v, XW^k, XW^q) = \text{softmax}\left(\frac{XW^q(XW^k)^T}{\sqrt{d}}\right)XW^v
    \end{equation}
\end{question}

\begin{question}[RMSNorm (LLaMA's Choice)]
    \begin{equation}
        \bar{a_i} = \frac{a_i}{\text{RMS}(a)}\gamma = \frac{a_i}{\sqrt{\frac{1}{d}\sum_{j=1}^d a_j^2}}\gamma
    \end{equation}
\end{question}

\begin{question}[Transformer Loss]
    \begin{equation}
        \min_{W} \hat{\mathbb{E}}\left[ - \iprod{Y, \log \hat{Y}} \right]
    \end{equation}
    $Y$ is output sequence, one-hot; \\
    $\hat{Y}$ is the predicted probabilities
\end{question}

\begin{question}[Transformer Implementation] As following.
    \begin{lstlisting}[language=python]
import torch
import torch.nn as nn
import torch.F as F
import math

class RMSNorm(nn.Module):
  def __init__(self, hidden_dim, eps = 1e-6):
    super().__init__()
    self.eps = eps
    self.weight = nn.Parameter(torch.ones(hidden_dim))

  def forward(self, hidden_state):
    norm = hidden_state.pow(2).mean(-1, keepdim = True)
    output = hidden_state * self.weight * torch.rsqrt(norm + self.eps)
    return output

class MultiHeadAttention(nn.Module):
  def __init__(self, hidden_dim, num_heads):
    super().__init__()
    self.hidden_dim = hidden_dim
    self.num_heads = num_heads
    self.head_dim = hidden_dim // num_heads
    self.q_linear = nn.linear(hidden_dim, hidden_dim)
    self.k_linear = nn.linear(hidden_dim, hidden_dim)
    self.v_linear = nn.linear(hidden_dim, hidden_dim)
    self.o_linear = nn.linear(hidden_dim, hidden_dim)
    self.norm = RMSNorm(hidden_dim)

  def forward(self, hidden_state, mask, past_kv = None, use_cache = True):
    bs = hidden_state.shape[0]

    residual = hidden_state

    hidden_state = self.norm(hidden_state) # LLAMA style normalization

    q = self.q_linear(hidden_state) # (bs, seqlen, hidden_dim)
    k = self.k_linear(hidden_state) # (bs, seqlen, hidden_dim)
    v = self.v_linear(hidden_state) # (bs, seqlen, hidden_dim)

    q = q.view(bs, -1, self.num_heads, self.head_dim).tranpose(1, 2)
    k = k.view(bs, -1, self.num_heads, self.head_dim).tranpose(1, 2)
    v = v.view(bs, -1, self.num_heads, self.head_dim).tranpose(1, 2)
    # (bs, nums_head, seqlen, head_dim)

    q, k = apply_rope(q, k)

    # kv cache
    if past_kv is not None:
      past_k, past_v = past_kv
      k = torch.cat([past_k, k], dim = 2)
      v = torch.cat([past_v, v], dim = 2)
    new_past_kv = (k, v) if use_cache else None

    # compute attention
    attention_scores = torch.matmul(q, k.tranpose(-1, -2)) / math.sqrt(self.head_dim)
    attention_scores += mask * -1e9
    attention_scores = F.softmax(attention_scores, dim = -1)
    output = torch.matmul(attention_scores, v)

    # concat
    output = output.tranpose(1, 2).contiguous().view(bs, -1, self.hidden_dim)

    # o_linear
    output = self.o_linear(output)

    output += residual

    return output, new_past_kv if use_cache else output
    \end{lstlisting} 
\end{question}

\section{Large Language Models}

\begin{question}[BERT v.s. GPT]
    BERT is encoder; GPT is decoder.

    \begin{itemize}
        \item BERT predicts middle words; GPT predicts the next word.
        \item BERT is \textbf{NOT} auto-regressive; GPT is auto-regressive.
    \end{itemize}
\end{question}

\begin{question}[GPT -- Generative Pre-Training]
    \begin{equation}
        \min_{\Theta} \hat{\mathbb{E}} - \log \prod_{j=1}^m \Pr(x_j \vert x_1, \dots, x_{j-1}; \Theta)
    \end{equation}
\end{question}

\begin{question}[Fine-Tuning Tasks]
    Supervised fine-tuning tasks:
    \begin{equation}
        \min_{\Theta} \underbrace{- \hat{\mathbb{E}} \log \Pr(y \vert X_{1:m}; \Theta)}_{\text{task-aware supervised loss}} \underbrace{- \lambda \hat{\mathbb{E}} \log \prod_{j=1}^m \Pr(x_j \vert X_{1:j-1}; \Theta)}_{\text{pretraining loss}}
    \end{equation}
\end{question}

\begin{question}[BERT $\rightarrow$ RoBERTa]
    Training longer, with bigger batches, over more data and longer sequence.

    Removing the next sentence prediction objective.
\end{question}

\begin{question}[Sentence-BERT]
    a twin network architecture that uses BERT to derive sentence embeddings.
\end{question}

\begin{question}[GPT-2]
    1.5B parameters.

    \begin{itemize}
        \item 10x larger than GPT-1
        \item Training method is same as GPT-1.
        \item Performs on par with BERT on fine-tuning tasks.
        \item Good at zero-shot learning.
        \item Open-source.
    \end{itemize}
\end{question}

\begin{question}[GPT-3]
    175B parameters.

    \begin{itemize}
        \item 100x larger than GPT-2.
        \item Training method is same as GPT-2.
        \item New phenomenon: \textbf{in-context learning} (ICL, or few-shot learning) and \textbf{chain-of-thoughts} (CoT).
    \end{itemize}
\end{question}

\begin{question}[GPT-3.5 -- RLHF]
    Reinforcement Learning from Human Feedback (RLHF).
    \begin{itemize}
        \item state = prompt
        \item action = model output
        \item policy function = LLM
        \item reward = levels of matching human feedback
    \end{itemize}

    Pari-wise comparison loss to train reward model $r_\theta$:
    \begin{equation}
        \mathcal{L}_{\text{pair}}(\theta) = - \frac{1}{\binom{K}{2}} \mathbb{E}_{(x, y_w, y_l)} \left[ \log ( \sigma(r_\theta(x, y_w) - r_\theta(x, y_l)) ) \right]
    \end{equation}

    Proximal Policy Optimization (PPO) to maximize objective:
    \begin{equation}
        \max_{\Phi} \mathbb{E}_{(x, y)} \left[ \underbrace{r_\theta(x, y)}_{\text{maximize reward}} - \beta \underbrace{\log\left( \frac{\pi_\Phi^{\text{RL}}(y \vert x)}{\pi^{\text{SFT}}(y \vert x)}\right) }_{\text{model is close to SFT model}} + \gamma \underbrace{\mathbb{E}[\log(\pi_\Phi^{\text{RL}}(x))]}_{\text{pretraining loss}} \right]
    \end{equation}
\end{question}

\section{Speculative Sampling}

\begin{question}[Reject Sampling for Check]
    Check in parallel.

    \begin{itemize}
        \item $r \sim U(0, 1), \quad \text{if } r < \underbrace{\min\left(1, \frac{p(t)}{q(t)}\right)}_{\text{accept rate}}$, next token $= t$.
        \item else: next token $= t' \sim \underbrace{\text{norm}(\max(0, p-q))}_{\text{residual distribution}}$.
    \end{itemize}
\end{question}

\begin{question}[Proof: Reject Sampling $\equiv t \sim p$]
    \begin{align}
        \begin{split}
            \min(p(t), q(t)) + \max(0, p(t) - q(t)) 
            &= \begin{cases}
            p(t) + 0 & \text{if } p(t) < q(t) \\
            q(t) + p(t) - q(t) & \text{if } p(t) \geq q(t)
            \end{cases} \\
            &= p(t)
        \end{split} \\
        \Rightarrow & \sum_t (\min(p(t), q(t)) + \max(0, p(t) - q(t))) = \sum_t p(t) = 1 \\
        \Rightarrow & 1 - \sum_t \min(p(t), q(t)) = \sum_t \max(0, p(t) - q(t))
    \end{align}

    \begin{equation}
        \begin{split}
            \Pr(X = t) &= \Pr(\tilde{X} = t) \Pr(\tilde{X}\text{ accept} \vert \tilde{X} = t) + \Pr(\tilde{X}\text{ reject}) \Pr(\tilde{\tilde{X}} = t \vert \tilde{X}\text{ reject}) \\
            &= q(t) \cdot \min\left(1, \frac{p(t)}{q(t)}\right) + (1 - \Pr(\tilde{X}\text{ accept})) \cdot \text{norm}(\max(0, p(t) - q(t))) \\
            &= \min\left(q(t), p(t)\right) + (1 - \sum_t \min\left(p(t), q(t)\right)) \cdot \frac{\max(0, p(t) - q(t))}{\sum_t \max(0, p(t) - q(t))} \\
            &= \min\left(q(t), p(t)\right) + \max(0, p(t) - q(t)) \\
            &= p(t)
        \end{split}
    \end{equation}
\end{question}

\section{Generative Adversarial Networks}

\begin{question}[Representation through Push-Forward]
    Let $r$ be any continuous distribution on $\mathbb{R}^h$.
    For any distribution $p$ on $\mathbb{R}^d$, 
    there exists push-forward maps $G: \mathbb{R}^h \rightarrow \mathbb{R}^d$ such that
    \begin{equation}
        \boldsymbol{z} \sim r \Longrightarrow G(\boldsymbol{z}) \sim p
    \end{equation}
\end{question}

\begin{question}[Discriminator's Goal]
    For a fixed generator $G$, minimize a log loss over $D$:
    \begin{itemize}
        \item If $x$ is real, minimize $-\log D(x)$;
        \item If $x$ is fake, minimize $-\log(1 - D(x))$.
    \end{itemize}

    \begin{equation}
        \min_D -\frac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}} \left[\log D(x)\right] - \frac{1}{2} \mathbb{E}_{z \sim r} \left[\log(1 - D(G(z)))\right]
    \end{equation}
\end{question}

\begin{question}[Generator's Goal]
    For a fixed discriminator $D$, maximize a log loss over $G$:
    
    \begin{equation}
        \begingroup\color{red}{\max_G}\endgroup -\frac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}} \left[\log D(x)\right] - \frac{1}{2} \mathbb{E}_{z \sim r} \left[\log(1 - D(G(z)))\right]
    \end{equation}
\end{question}

\begin{question}[Solver]
    \begin{equation}
        \min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\text{data}}} \left[\log D(x)\right] + \mathbb{E}_{z \sim r} \left[\log(1 - D(G(z)))\right]
    \end{equation}

    Solved by alternative minimization-maximization:
    \begin{itemize}
        \item G step: Fix $D$ and update $G$ by one-step gradient descent
        \item D step: Fix $G$ and update $D$ by one-step gradient descent
        \item Repeat until the algorithm reaches an approximate equilibrium
    \end{itemize}
    
\end{question}

\begin{question}[Solution of $D^*$]
    Let $p_g(x)$ be the density of $x$ estimated by the generator $G$.
    For $G$ fixed, the optimal discriminator $D$ is $D_G^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$

    \textbf{Proof:}

    \begin{equation}
        \begin{split}
            V(G, D) &\coloneq \mathbb{E}_{x \sim p_{\text{data}}} \left[\log D(x)\right] + \mathbb{E}_{z \sim r} \left[\log(1 - D(G(z)))\right] \\
            &= \int \log D(x) p_{\text{data}}(x) dx + \int_z \log(1 - D(G(z))) p_z(z) dz \\
            &= \int \underbrace{\log D(x) p_{\text{data}}(x) + p_g(x) \log(1 - D(x))}_{f(D(x))} dx
        \end{split}
    \end{equation}

    For any fixed $x$, taking derivative $= 0$:
    \begin{equation}
        \begin{split}
            f'(D(x)) &= \frac{p_{\text{data}}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0 \\
            D_G^*(x) &= \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}
        \end{split}
    \end{equation}
\end{question}

\begin{question}[Solution of $G^*$]
    The global minimum of $\min_G \max_D V(G, D)$ is achieved if
    and only if $p_g = p_{\text{data}}$. The optimal objective value is $-\log4$.

    \textbf{Proof:}
    \begin{equation}
        \begin{split}
        V(G, D_G^*) &= \mathbb{E}_{x \sim p_{\text{data}}} \left[\log D_G^*(x)\right] + \mathbb{E}_{z \sim r} \left[\log(1 - D_G^*(G(z)))\right] \\
        &= \mathbb{E}_{x \sim p_{\text{data}}} \left[\log D_G^*(x)\right] + \mathbb{E}_{x \sim p_g} \left[\log(1 - D_G^*(x))\right] \\
        &= \mathbb{E}_{x \sim p_{\text{data}}} \left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g} \left[\frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right] \\
        \end{split}
    \end{equation}

    By definition of KL divergence $\text{KL}(P \| Q) = \mathbb{E}_{x \sim P} \left[\log \frac{p(x)}{q(x)}\right]$, we have:
    \begin{equation}
    \begin{split}
    V(G, D_G^*) &= \mathbb{E}_{x \sim p_{\text{data}}} \left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g} \left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right] \\
    &= -\log4 + \text{KL}\left(p_{\text{data}} \| \frac{p_\text{data} + p_g}{2}\right) + \text{KL}\left(p_g \| \frac{p_{\text{data}} + p_g}{2}\right) \\
    &= -\log4 + 2 \cdot \text{JSD}(p_{\text{data}} \| p_g) \\
    & \geq -\log4
    \end{split}
    \end{equation}

    The equality holds if and only if $p_{\text{data}} = p_g$.
\end{question}

\section{Adversarial Attacks}

\begin{question}[Principle of Generating Adversarial Attacks]
    \begin{equation}
        \max_{\left\|x_{\text{adv}}-x\right\|_{\infty} \leq \epsilon} \mathcal{L}(C(x_{\text{adv}}), y)
    \end{equation}

    where $C$ is the composition of $h$ and $f$.
\end{question}

\begin{question}[Different Solvers] to optimize the adversarial attack.
    \begin{itemize}
        \item \textbf{Zero-Order Solvers} (only access to the output of NN)
        \begin{itemize}
            \item Black-box attack
        \end{itemize}
        \item \textbf{First-Order Solvers} (access to the gradient of NN)
        \begin{itemize}
            \item White-box attack
            \item Fast Gradient Sign Method (FGSM), BIM, PGD, CW attack, \dots
        \end{itemize}
        \item \textbf{Second-Order Solvers} (access to the Hessian matrix)
        \begin{itemize}
            \item White-box attack
            \item L-BFGS attack
        \end{itemize}
    \end{itemize}
\end{question}

\begin{question}[Holder Inequality]
    For any $p, q \geq 1$ such that $\frac{1}{p} + \frac{1}{q} = 1$,
    \begin{equation} \label{eq:holder-inequality}
        \left\|x\right\|_p \cdot \left\|y\right\|_q \geq \left\langle x, y \right\rangle
    \end{equation}
    where $\left\langle x, y \right\rangle$ is the inner product.

    $\| \cdot \|_p$ and $\| \cdot \|_q$ are also known as dual norms.
    \begin{itemize}
        \item $\| \cdot \|_2$ is self-dual.
        \item $\| \cdot \|_\infty$ and $\| \cdot \|_1$ are dual norms.
    \end{itemize}
\end{question}

\begin{question}[FGSM -- Fast Gradient Sign Method]
    White-box and non-targeted (maximize the loss w.r.t. the true label).

    Do linear expansion at $x$:
    \begin{equation}
        \mathcal{L}(C(x + \delta), y) \approx \underbrace{\mathcal{L}(C(x), y)}_{\text{constant}} + \nabla_x \mathcal{L}(C(x), y) \cdot \delta
    \end{equation}

    The problem reduces to:
    \begin{equation}
        \max_{\left\|\delta\right\|_{\infty} \leq \epsilon} \nabla_x \mathcal{L}(C(x), y) \cdot \delta
    \end{equation}

    Because of holder inequality \eqref{eq:holder-inequality}, we have:
    \begin{equation}
        \nabla_x \mathcal{L}(C(x), y) \cdot \delta \leq \left\|\delta\right\|_{\infty} \cdot \left\|\nabla_x \mathcal{L}(C(x), y)\right\|_1
        \leq \epsilon \cdot \left\|\nabla_x \mathcal{L}(C(x), y)\right\|_1
    \end{equation}

    Thus, the adversarial example is generated by:

    \begin{equation}
        x_{\text{adv}} = x + \epsilon \cdot \text{sign}(\nabla_x \mathcal{L}(C(x), y))
    \end{equation}

    where $\epsilon$ is the perturbation size.
\end{question}

\begin{question}[BIM -- Basic Iterative Method]
    BIM is an iterative version of FGSM.
    \begin{itemize}
        \item Initialize $x^{(0)} = x$.
        \item For $k = 1, 2, \dots, K$:
        \begin{equation}
            x^{(k)} = x^{(k-1)} + \gamma \cdot \text{sign}(\nabla_x \mathcal{L}(C(x^{(k-1)}), y))
        \end{equation}
    \end{itemize}

    Issues:
    \begin{itemize}
        \item By repeating, the pertubation size $\epsilon$ will become larger.
        \item For a pre-defined $\epsilon$, $x^{(k)}$ may not satisfy $\|x^{(k)} - x\|_\infty \leq \epsilon$.
    \end{itemize}
\end{question}

\begin{question}[PGD -- Projected Gradient Descent]
    To resolve the issue of BIM, PGD involves a truncation operation:
    \begin{itemize}
        \item Initialize $x^{(0)} = x + \delta, \text{ where } \delta \in (-\epsilon, \epsilon)$.
        \item For $k = 1, 2, \dots, K$:
        \begin{equation}
            x^{(k)} = \text{clip}_{(-\epsilon, \epsilon)}(x^{(k-1)} + \gamma \cdot \text{sign}(\nabla_x \mathcal{L}(C(x^{(k-1)}), y)))
        \end{equation}
    \end{itemize}

    where $\text{clip}_{(-\epsilon, \epsilon)}(x)$ projects $x$ back to the $\ell_\infty$ ball of radius $\epsilon$ around $x$.
\end{question}

\begin{question}[Targeted PGD Attack]
    Objective:

    \begin{itemize}
        \item \textbf{Untargeted}: \begin{equation}
            \max_{\left\|\delta\right\|_{\infty} \leq \epsilon} \mathcal{L}(C(x + \delta), y_{\text{true}})
        \end{equation}
        \item \textbf{Targeted}: \begin{equation}
            \begingroup\color{red}\min_{\left\|\delta\right\|_{\infty} \leq \epsilon}\endgroup \mathcal{L}(C(x + \delta), \begingroup\color{red}y_{\text{target}}\endgroup)
        \end{equation}
    \end{itemize}
\end{question}

\section{Adversarial Robustness}

\begin{question}[Defense Mechanisms]
    Categorized into two types:
    \begin{itemize}
        \item \textbf{Gradient Masking}: hide the gradients and make first-order attacks fail
        \begin{itemize}
            \item \textbf{Shasttered Gradients}
            By applying a non-smooth or non-differentiable preprocessor $g$ to the inputs,
            and then training a DNN model $f$ on the preprocessed inputs $g(x)$.

            \item \textbf{Stochastic/Randomized Gradients}
            Apply some form of randomization of the DNN model.
            E.g. train a set of classifiers and during the testing phase randomly select one classifier to predict the labels.
        \end{itemize}
        \item \textbf{Adversarial Training}: 
        \begin{equation}
            \min_{C} \mathbb{E}_{(x, y) \sim \mathcal{D}^n} \left[ \max_{\left\|\delta\right\|_{\infty} \leq \epsilon} \mathcal{L}(C(x + \delta), y) \right]
        \end{equation}
    \end{itemize}
\end{question}

\begin{question}[Trade-Off between Natural and Robust Error]
    \begin{equation}
    \min_f R_{\text{nat}}(f) + R_{\text{rob}}(f) / \lambda
    \end{equation}

    \begin{equation}
        \begin{split}
        R_{\text{nat}}(f) &\coloneq \Pr_{x,y \sim \mathcal{D}} \{f(x) y \leq 0\} \\
        &= \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \mathbb{I}(f(x) y \leq 0) \right] \\
        \end{split}
    \end{equation}
    \begin{equation}
        \begin{split}
        R_{\text{rob}}(f) &\coloneq \Pr_{x,y \sim \mathcal{D}} \{\exists \delta \in B_\epsilon(x) \text{ s.t. } f(x + \delta) y \leq 0\} \\
        &= \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\left\|\delta\right\|_{\infty} \leq \epsilon} \mathbb{I}(f(x + \delta) y \leq 0) \right] \\
        \end{split}
    \end{equation}

    Approximate by a differentiable surrogate loss $\Phi$:
    \begin{equation}
        R_{\text{nat}}(f) \approx \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \Phi(f(x) y) \right]
    \end{equation}
\end{question}

\begin{question}[TRADES]
    \begin{equation}
        \min_f \underbrace{\left[ \underbrace{\mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \Phi(f(x) y) \right]}_{\text{minimize diff btw } f(x) \text{ and } y \text{ for accuracy}} + \underbrace{\mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\left\|\delta\right\|_{\infty} \leq \epsilon} \Phi(f(x + \delta) f(x)) \right]}_{\text{minimize diff btw }f(x) \text{ and } f(x+\delta){ for robustness}} / \lambda \right]}_{\text{TRADES Loss}}
    \end{equation}
\end{question}

\section{Self-Supervised Learning}

\begin{question}[Contrastive Learning]
    Loss:

    \begin{equation}
        \max_\Theta \Pr_1 = \frac{\exp(z_1)}{\sum_j \exp(z_j)}
    \end{equation}
\end{question}

\end{document}

CS 848: Advanced Topics in Database: Algorithmic Aspects of Database Query Processing

Fall 25

https://cs.uwaterloo.ca/~xiaohu/courses/CS848/CS848-F25.html

Notes for Foundations of Databases

Foundations of Databases (The Alice Book)

The Alice Book is helpful for understanding the basic notations in database theory. I recommend to read the first two chapters to get familiar with the notations used in this course.

That said, the book is not required. In fact, paying real attention to the first several lectures should be sufficient to survive the course, and it turns out to be interesting.

The Relational Model

Mathjax may have issues to display math symbols in particular styles, such as \(\mathcal{A}\).
You can try to mitigate this by changing the Math Renderer to Common HTML (right click on any math expressions): \( \newcommand{\att}{\textbf{att}} \) \( \newcommand{\attTotalOrder}{\leq_{\att}} \) \( \newcommand{\dom}{\textbf{dom}} \) \( \newcommand{\Dom}{\textit{Dom}} \) \( \newcommand{\relname}{\textbf{relname}} \) \( \newcommand{\sort}{\textit{sort}} \) \( \newcommand{\arity}{\textit{arity}} \) \( \newcommand{\finPowerSet}{\mathcal{P}^{\text{fin}}} \) \( \newcommand{\dbschema}[1]{\textbf{#1}} \) \( \newcommand{\dbinst}[1]{\textbf{#1}} \) \( \newcommand{\rel}[1]{\textit{#1}} \) \( \newcommand{\attname}[1]{\textit{#1}} \) \( \newcommand{\varSet}{\textbf{var}} \)

Informal Terminology

Relation: table
- Relation Name: name of the table
Attribute: column name
Tuple/Record: line in the table
Domain: set of constants that entries of tuples can take
Database Schema: specifying the structure of the database
Database Instance: specifying the actual content in the database

Formal Definitions

\(\att\): a countably infinite set of attributes
- assume \(\att\) is fixed
\(\attTotalOrder\): a total order on \(\att\)
- the elements of a set of attributes \(U\) are written according to \(\attTotalOrder\) unless otherwise specified
\(\dom\): a countably infinite set of domains (disjoint from \(\att\))
- assume \(\dom\) is fixed
- a constant is an element of \(\dom\)
- for most cases, the same domain of values is used for all attributes;
  - otherwise, assume a mapping \(\Dom\) on \(\att\), where \(\Dom(A)\) is a set called the domain of attribute \(A\)
\(\relname\): a countably infinite set of relation names (disjoint from \(\att\) and \(\dom\))
\(\sort\): a function from \(\relname\) to \(\finPowerSet(\att)\) (the finitary powerset of \(\att\))
- \(\sort^{-1}(U)\) is inifinite for each (possibly empty) finite set of attributes \(U\)
  - which allows multiple relations to have the same set of attributes
- \(\sort(R)\) is called the \(\sort\) of a relation name \(R\)
- \(\arity(R) = |\sort(R)|\): the \(\arity\) of relation name \(R\)
\(R\): a relation name or a relation schema
- Alternative notations:
  - \(R[U]\): indicating \(\sort(R) = U\)
  - \(R[n]\): indicating \(\arity(R) = n\)
\(\dbschema{R}\): a database schema
- a nonempty finite set of relation names
- might be written as \(\dbschema{R} = \{R_1[U_1], \ldots, R_n[U_n]\}\)

Examples

Table Movies

Title	Director	Actor
A	X	Y

Table Location

Theater	Address	Phone
T1	Addr1	P1

Table Pariscope

Theater	Title	Schedule
T1	A	9:00

The database schema \(\dbschema{CINEMA} = \{ \rel{Movies}, \rel{Location}, \rel{Pariscope} \}\)
The sorts of the relation names
- \(\sort(\rel{Movies}) = \{ \attname{Title}, \attname{Director}, \attname{Actor} \}\)
- \(\sort(\rel{Location}) = \{ \attname{Theater}, \attname{Address}, \attname{Phone} \}\)
- \(\sort(\rel{Pariscope}) = \{ \attname{Theater}, \attname{Title}, \attname{Schedule} \}\)

Named v.s. Unnamed Attributes/Tuples

Named Perspective

E.g. \( \langle A:5, B:3 \rangle \)

a tuple over a finite set of attributes \(U\) (or over a relation schema \(R[U]\)) is a total mapping (viewed as a function) \(u\) from \(U\) to \(\dom\).

\(u(U)\): \(u\) is a tuple over \(U\)
\(u(A)\): the value of \(u\) on an attribute \(A\) in \(U\) (\(A \in U\))
\(u[V] = u \vert_V\): the restriction of \(u\) to a subset \(V \subseteq U\), i.e., \(u[V]\) denotes the tuple \(v\) over \(V\) such that \(v(A) = u(A)\) for all \(A \in V\)

Unnamed Perspective

E.g. \( \langle 5, 3 \rangle \)

a tuple is an ordered n-tuple (\(n \geq 0\)) of constants (i.e., an element of the Cartesian product \(\dom^n\))

\(u(i)\): the \(i\)-th coordinate of \(u\)

Because of the total order \(\attTotalOrder\), a tuple \(\langle A_1:a_1, A_2:a_2 \rangle\) (defined as a function) can be viewed as an ordered tuple with \((A_1:a_1)\) as a first component and \((A_2:a_2)\) as a second component. Ignoring the names, this tuple can be viewed as the ordered tuple \(\langle a_1, a_2 \rangle\).

Conversely, the ordered tuple \(t = \langle a_1, a_2 \rangle\) can be viewed as a function over the set of integers \(\{1, 2\}\) with \(t(i) = a_i\) for each \(i\).

Conventional v.s. Logic Programming Relational Model

Conventional Perspective

a relation (instance) over a relation schema \(\rel{R}[U]\) is a finite set of tuples \(I\) with sort \(U\).

a database instance of database schema \(\dbschema{R}\) is a mapping \(\dbinst{I}\) with domain \(\dbschema{R}\), such that \(\dbinst{I}(\rel{R})\) is a relation over \(\rel{R}\) for each \(\rel{R} \in \dbschema{R}\).

Logic Programming Perspective

This perspective is used primarily with the ordered-tuple perspective on tuples.

A fact over \(\rel{R}\) is an expression of the form \(\rel{R}(a_1, \ldots, a_n)\), where \(a_i \in \dom\) for \(i \in [1, n]\).
If \(u = \langle a_1, \ldots, a_n \rangle\), we sometimes write \(\rel{R}(u)\) for \(\rel{R}(a_1, \ldots, a_n)\).

a relation (instance) over a relation schema \(\rel{R}\) is a finite set of facts over \(\rel{R}\).

a database instance of database schema \(\dbschema{R}\) is a finite set \(\dbinst{I}\) that is the union of relation instances over \(\rel{R}\) for all \(\rel{R} \in \dbschema{R}\).

Examples

Assume \(\sort(R) = AB, \sort(S) = A\).

Named and Conventional

\[ \begin{split} I(R) = \{ f_1, f_2, f_3 \} \quad && \\ & f_1(A) = a &\quad f_1(B) = b \\ & f_2(A) = c &\quad f_2(B) = b \\ & f_3(A) = a &\quad f_3(B) = a \\ I(S) = \{g\} \quad && \\ & g(A) = d & \end{split} \]

Unnamed and Conventional

\[ \begin{split} I(R) = \{ \langle a, b \rangle, \langle c, b \rangle, \langle a, a \rangle \} \quad && \\ I(S) = \{ \langle d \rangle \} \quad && \end{split} \]

Named and Logic Programming

\[ \{ R(A:a, B:b), R(A:c, B:b), R(A:a, B:a), S(A:d) \} \]

Unnamed and Logic Programming

\[ \{ R(a, b), R(c, b), R(a, a), S(d) \} \]

Variables

an infinite set of variables \(\varSet\) will be used to range over elements of \(\dom\).

a free tuple over \(U\) or \(\rel{R}(U)\) is a function \(u\) from \(U\) to \(\dom \cup \varSet\)
an atom over \(\rel{R}\) is an expression of the form \(\rel{R}(e_1, \ldots, e_n)\), where \(n = \arity(\rel{R})\) and each term \(e_i \in \dom \cup \varSet\)
a ground atom is an atom with no variables, i.e., a fact (see Logic Programming Perspective)

Notations

Notations

Object	Notation
Constants	\(a, b, c\)
Variables	\(x, y\)
Sets of Variables	\(X, Y\)
Terms	\(e\)
Attributes	\(A, B, C\)
Sets of Attributes	\(U, V, W\)
Relation Names	\(\rel{R}, \rel{S}; \rel{R}[U], \rel{S}[V] \)
Database Schemas	\(\dbschema{R}, \dbschema{S} \)
Tuples	\(t, s\)
Free Tuples	\(u, v, w\)
Facts	\(\rel{R}(a_1, \ldots, a_n), R(t)\)
Atoms	\(\rel{R}(e_1, \ldots, e_n), R(u)\)
Relation Instances	\(I, J\)
Database Instances	\(\dbinst{I}, \dbinst{J} \)

Conjunctive Queries

Mathjax may have issues to display math symbols in particular styles, such as \(\mathcal{A}\).
You can try to mitigate this by changing the Math Renderer to Common HTML (right click on any math expressions): \( \newcommand{\att}{\textbf{att}} \) \( \newcommand{\attTotalOrder}{\leq_{\att}} \) \( \newcommand{\dom}{\textbf{dom}} \) \( \newcommand{\Dom}{\textit{Dom}} \) \( \newcommand{\relname}{\textbf{relname}} \) \( \newcommand{\sort}{\textit{sort}} \) \( \newcommand{\arity}{\textit{arity}} \) \( \newcommand{\finPowerSet}{\mathcal{P}^{\text{fin}}} \) \( \newcommand{\dbschema}[1]{\textbf{#1}} \) \( \newcommand{\dbinst}[1]{\textbf{#1}} \) \( \newcommand{\rel}[1]{\textit{#1}} \) \( \newcommand{\attname}[1]{\textit{#1}} \) \( \newcommand{\varSet}{\textbf{var}} \) \( \newcommand{\ans}{\textit{ans}} \) \( \newcommand{\var}{\textit{var}} \) \( \newcommand{\adom}{\textit{adom}} \) \( \newcommand{\tableau}{\textbf{T}} \) \( \newcommand{\free}{\textit{free}} \) \( \newcommand{\body}{\text{body}} \)

a query (mapping) is from (or over) its input schema and to its output schema.

query: a syntactic object
query mapping: a function defined by a query interpreted under a specified semantics
- its domain: the family of all instances of an input schema
- its range: the family of instances of an output schema
- we often blur query and query mapping when the meaning is clear from context
input schema: a specified relation or database schema
output schema: a relation schema or database schema
- For a relation schema, the relation name may be specified as part of the query syntax or by the context
\(q_1 \equiv q_2\) denotes two queries \(q_1\) and \(q_2\) over \(\dbschema{R}\) are equivalent, i.e., they have the same output schema and \(q_1(\dbinst{I}) = q_2(\dbinst{I})\) for each instance \(\dbinst{I}\) over \(\dbschema{R}\).

Logic-Based Perspective

Three versions of conjunctive queries:

Rule-Based Conjunctive Queries
Tableau Queries
Conjunctive Calculus

Rule-Based Conjunctive Queries

Definition

A rule-based conjunctive query (or often more simply called rules) \(q\) over a relation schema \(\dbschema{R}\) is an expression of the form \[ \ans(u) \leftarrow \rel{R}_1(u_1), \ldots, \rel{R}_n(u_n) \] where \(n \geq 0\), \(\rel{R}_1, \ldots, \rel{R}_n\) are relation names in \(\dbschema{R}\); \(\ans\) is a relation name not in \(\dbschema{R}\); and \(u, u_1, \ldots, u_n\) are free tuples (i.e., may use either variables or constants).

\(\var(q)\): the set of variables occurring in \(q\).
body: \(\rel{R}_1(u_1), \ldots, \rel{R}_n(u_n)\)
head: \(\ans(u)\)
range restricted: each variable occurring in the head also occurs in the body
- all conjunctive queries considered here are range restricted
valuation: a valuation \(v\) over \(V\), a finite subset of \(\varSet\), is a total function \(v\) from \(V\) to \(\dom\) (of constants)
- extended to be identity on \(\dom\) (so that the domain can contain both variables and constants)
- extended to map free tuples to tuples (so that the domain/range can be in the tuple form)

Semantics

Let \(q\) be the query, and let \(\dbinst{I}\) be a database instance of schema \(\dbschema{R}\). The image of \(\dbinst{I}\) under \(q\) is:

\[ q(\dbinst{I}) = \{ v(u) \mid v \text{ is a valuation over } \var(q) \text{ and } v(u_i) \in \dbinst{I}(\rel{R}_i) \text{ for each } i \in [1, n] \} \]

active domain
- of a database instance \(\dbinst{I}\), denoted \(\adom(\dbinst{I})\), is the set of constants that occur in \(\dbinst{I}\)
- of a relation instance \(I\), denoted \(\adom(I)\), is the set of constants that occur in \(I\)
- of a query \(q\), denoted \(\adom(q)\), is the set of constants that occur in \(q\)
- \(\adom(q, \dbinst{I})\) is an abbreviation for \(\adom(q) \cup \adom(\dbinst{I})\)
  - \(\adom(q(\dbinst{I})) \subseteq \adom(q, \dbinst{I})\)
extensional relations: relations in the body of the query, i.e., \(\rel{R}_1, \ldots, \rel{R}_n\)
- because they are known/provided by the input instance \(\dbinst{I}\)
intensional relation: the relation in the head of the query, i.e., \(\ans\)
- because it is not stored and its value is computed on request by the query
extensional database (edb): a database instance associated with the extensional relations
intensional database (idb): the rule itself
idb relation: the relation defined by the idb

Properties

Conjunctive queries are:

monotonic: a query \(q\) over \(\dbschema{R}\) is monotonic if for each \(\dbinst{I}, \dbinst{J}\) over \(\dbschema{R}\), \(\dbinst{I} \subseteq \dbinst{J} \implies q(\dbinst{I}) \subseteq q(\dbinst{J})\)
satisfiable: a query \(q\) is satisfiable if there exists a database instance \(\dbinst{I}\) such that \(q(\dbinst{I}) \neq \emptyset\)

Tableau Queries

A tableau query is simply a pair \((\tableau, u)\) where \(\tableau\) is a tableau and each variable in \(u\) also occurs in \(\tableau\).

This is closest to the visual form provided by Query-By-Example (QBE).

summary: the free tuple \(u\) representing the tuples included in the answer to the query
embedding: a valuation \(v\) for the variables occurring in \(\tableau\) such that \(v(\tableau) \subseteq \dbinst{I}\)
- the output of \((\tableau, u)\) on \(\dbinst{I}\) consists of all tuples \(v(u)\) for each embedding \(v\) of \(\tableau\) into \(\dbinst{I}\)
typed: a tableau query \(q = (\tableau, u)\) under the named perspective, where \(\tableau\) is over relation schema \(R\) and \(\sort(u) \subseteq \sort(R)\), is typed if no variable of \(\tableau\) is associated with two distinct attributes in \(q\)

Examples

Table Movies

Title	Director	Actor
\(x_{ti}\)	“Bergman”	\(x_{ac}\)

Table Location

Theater	Address	Phone
\(x_{th}\)	\(x_{ad}\)	\(x_p\)

Table Pariscope

Theater	Title	Schedule
\(x_{th}\)	\(x_{ti}\)	\(x_{s}\)

The above tableau query is typed because each variable is associated with only one attribute:

\(x_{ti}\): \(\attname{Title}\)
\(x_{ac}\): \(\attname{Actor}\)
\(x_{th}\): \(\attname{Theater}\)
\(x_{ad}\): \(\attname{Address}\)
\(x_p\): \(\attname{Phone}\)
\(x_{s}\): \(\attname{Schedule}\)

However, the following tableau query is untyped:

Table Movies

Title	Director	Actor
\(x_{ti}\)	\(x_{ac}\)	\(x_{ac}\)

Because \(x_{ac}\) is associated with both \(\attname{Director}\) and \(\attname{Actor}\).

Conjunctive Calculus

The conjunctive query \[ \ans(u) \leftarrow \rel{R}_1(u_1), \ldots, \rel{R}_n(u_n) \]

can be expressed as the following conjunctive calculus query that has the same semantics: \[ \left\{ e_1, \ldots, e_m \vert \exists x_1, \ldots, x_k \left( \rel{R}(u_1) \wedge \ldots \wedge \rel{R}(u_n) \right) \right\} \] where \(x_1, \ldots, x_k\) are all the variables occurring in the body and not the head.

Conjunctive Calculus Formula

Let \(\dbschema{R}\) be a relation schema. A (well-formed) formula over \(\dbschema{R}\) for the conjunctive calculus is an expression having one of the following forms:

an atom over \(\dbschema{R}\);
\(\varphi \wedge \psi \), where \(\varphi\) and \(\psi\) are formulas over \(\dbschema{R}\); or
\(\exists x \varphi\), where \(x\) is a variable and \(\varphi\) is a formula over \(\dbschema{R}\).

An occurrence of a variable \(x\) in formula \(\varphi\) is free if:

\(\varphi\) is an atom; or
\(\varphi = (\psi \wedge \xi)\) and the occurrence of \(x\) is free in \(\psi\) or \(\xi\); or
\(\varphi = \exists y \psi\), \(x \neq y\), and the occurrence of \(x\) is free in \(\psi\).

\(\free(\varphi)\): the set of free variables in \(\varphi\).

An occurrence of a variable that is not free is bound.

Conjunctive Calculus Query

A conjunctive calculus query over database schema \(\dbschema{R}\) is an expression of the form \[ \{ e_1, \ldots, e_m \mid \varphi \} \] where

\(\varphi\) is a conjunctive calculus formula,
\(\langle e_1, \ldots, e_m \rangle\) is a free tuple, and
the set of variables occurring in \(\langle e_1, \ldots, e_m \rangle\) is exactly \(\free(\varphi)\).

For named perspective, the above query can be written as: \[ \{ \langle e_1, \ldots, e_m \rangle : A_1, \ldots, A_m \mid \varphi \} \]

Note: In the previous chapter, a named tuple was usually denoted as:

\[ \langle A_1:a_1, \ldots, A_m:a_m \rangle \]

But here we denote the named tuple as:

\[ \langle e_1, \ldots, e_m \rangle : A_1, \ldots, A_m \]

Not sure which one is better or whether the author did this intentionally.

Semantics

Valuation

A valuation over \(V \subset \varSet \) is a total function \(v\) from \(V\) to \(\dom\), which can be viewed as a syntactic expression of the form: \[ \{ x_1/a_1, \ldots, x_n/a_n \} \] where

\(x_1, \ldots, x_n\) is a listing of \(V\),
\(a_i = v(x_i)\) for each \(i \in [1, n]\).

Interpretation as a Set

If \(x \not \in V\), and \(c \in \dom\), then \(v \cup \{ x/c \}\) is the valuation over \(V \cup \{ x \}\) that agrees with \(v\) on \(V\) and maps \(x\) to \(c\).

Satisfaction

Let \(\dbschema{R}\) be a database schema, \(\varphi\) a conjunctive calculus formula over \(\dbschema{R}\), and \(v\) a valuation over \(\free(\varphi)\). Then \(\dbinst{I} \models \varphi[v]\), if

\(\varphi = \rel{R}(u)\) is an atom and \(v(u) \in \dbinst{I}(\rel{R})\); or
\(\varphi = (\psi \wedge \xi)\) and \(\dbinst{I} \models \psi[v \vert_{\free(\psi)}]\) and \(\dbinst{I} \models \xi[v \vert_{\free(\xi)}]\); or
\(\varphi = \exists x \psi\) and there exists a constant \(c \in \dom\) such that \(\dbinst{I} \models \psi[v \cup \{ x/c \}]\).

Image

Let \(q = \{ e_1, \ldots, e_m \mid \varphi \}\) be a conjunctive calculus query over \(\dbschema{R}\). For an instance \(\dbinst{I}\) over \(\dbschema{R}\), the image of \(\dbinst{I}\) under \(q\) is: \[ q(\dbinst{I}) = \{ v( \langle e_1, \ldots, e_m \rangle ) \mid v \text{ is a valuation over } \free(\varphi) \text{ and } \dbinst{I} \models \varphi[v] \} \]

active domain
- of a formula \(\varphi\), denoted \(\adom(\varphi)\), is the set of constants that occur in \(\varphi\)
- \(\adom(\varphi, \dbinst{I})\) is an abbreviation for \(\adom(\varphi) \cup \adom(\dbinst{I})\)
- If \(\dbinst{I} \models \varphi[v]\), then the range of \(v\) is contained in \(\adom(\dbinst{I})\)
  - to evaluate a conjunctive calculus query, one need only consider valuations with range constrained in \(\adom(\varphi, \dbinst{I})\), i.e., only a finite number of them
equivalent: conjunctive calculus formulas \(\varphi\) and \(\psi\) over \(\dbschema{R}\) are equivalent, if
- they have the same free variables and,
- for each \(\dbinst{I}\) over \(\dbschema{R}\) and valuation \(v\) over \(\free(\varphi) = \free(\psi)\), \(\dbinst{I} \models \varphi[v]\) if and only if \(\dbinst{I} \models \psi[v]\)

Normal Form

\[ \exists x_1, \ldots, x_m \left( \rel{R}_1(u_1) \wedge \ldots \wedge \rel{R}_n(u_n) \right) \]

Each conjunctive calculus query is equivalent to a conjunctive calculus query in normal form.

Expressiveness of Query Languages

Let \(\mathcal{Q}_1\) and \(\mathcal{Q}_2\) be two query languages.

\(\mathcal{Q}_1 \sqsubseteq \mathcal{Q}_2\): \(\mathcal{Q}_1\) is dominated by \(\mathcal{Q}_2\) (or \(\mathcal{Q}_2\) is weaker than \(\mathcal{Q}_1\)), if for each query \(q_1 \in \mathcal{Q}_1\), there exists a query \(q_2 \in \mathcal{Q}_2\) such that \(q_1 \equiv q_2\).
\(\mathcal{Q}_1 \equiv \mathcal{Q}_2\): \(\mathcal{Q}_1\) and \(\mathcal{Q}_2\) are equivalent, if \(\mathcal{Q}_1 \sqsubseteq \mathcal{Q}_2\) and \(\mathcal{Q}_2 \sqsubseteq \mathcal{Q}_1\).

The rule-based conjunctive queries, tableau queries, and conjunctive calculus queries are all equivalent.

Incorporating Equality

The conjunctive query

\[ \begin{split} \ans(x_{th}, x_{ad}) \leftarrow & \rel{Movies}(x_{ti}, \text{“Bergman”}, x_{ac}), \\ & \qquad \rel{Location}(x_{th}, x_{ad}, x_p), \rel{Pariscope}(x_{th}, x_{ti}, x_s) \end{split} \]

can be expressed as:

\[ \begin{split} \ans(x_{th}, x_{ad}) \leftarrow & \rel{Movies}(x_{ti}, x_d, x_{ac}), x_d = \text{“Bergman”}, \\ & \qquad \rel{Location}(x_{th}, x_{ad}, x_p), \rel{Pariscope}(x_{th}, x_{ti}, x_s) \end{split} \]

Problem 1: Infinite Answers

Unrestricted rules with equality may yield infinite answers: \[ \ans(x, y) \leftarrow \rel{R}(x), y = z \]

A rule-based conjunctive query with equality is a range-restricted rule-based conjunctive query with equality.

Problem 2: Unsatisfiable Queries

Consider the following query: \[ \ans(x) \leftarrow \rel{R}(x), x = a, x = b \] where \(\rel{R}\) is a unary relation and \(a, b \in \dom\) with \(a \neq b\).

Each satisfiable rule with equality is equivalent to a rule without equality.
No expressive power is gained if the query is satisfiable.

Query Composition and Views

a conjunctive query program (with or without equality) is a sequence \(P\) of rules having the form: \[ \begin{split} S_1(u_1) & \leftarrow \text{body}_1 \\ S_2(u_2) & \leftarrow \text{body}_2 \\ & \vdots \\ S_m(u_m) & \leftarrow \text{body}_m \end{split} \] where

each \(S_i\) is a distinct relation name not in \(\dbschema{R}\),
for each \(i \in [1, m]\), the only relation names that may occur in \(\body_i\) are in \(\dbschema{R} \cup \{ S_1, \ldots, S_{i-1} \}\).

Closure under Composition

If conjunctive query program \(P\) defines final relation \(S\), then there is a conjunctive query \(q\), possibly with equality, such that on all input instances \(\dbinst{I}\), \(q(\dbinst{I}) = P(\dbinst{I})(S)\).

If \(P\) is satisfiable, then there is a \(q\) without equality.

Composition and User Views

views are specified as queries (or query programs), which may be

materialized: a physical copy of the view is stored and maintained
virtual: relevant information about the view is computed as needed
- queries against the virtual view generate composed queries against the underlying database

Algebraic Perspectives

The Unnamed Perspective: The SPC Algebra

Unnamed conjunctive algebra:

Selection (\(\sigma\))
Projection (\(\pi\))
Cross-Product (or Cartesian Product, \(\times\))

Selection (“Horizontal” Operator)

Primitive Forms: \(\sigma_{j=a}\) and \(\sigma_{j=k}\)
where \(j, k\) are positive integers, and \(a \in \dom\) (constants are usually surrounded by quotes).

\[ \sigma_{j=a}(I) = \{ t \in I \mid t(j) = a \} \]

\(\sigma_{j=k}\) is sometimes called atomic selection.

Projection (“Vertical” Operator)

General Form: \(\pi_{j_1, \ldots, j_n}\)
where \(n \geq 0\) and \(j_1, \ldots, j_n\) are positive integers (empty sequence is written \([\;]\)).

\[ \pi_{j_1, \ldots, j_n}(I) = \{ \langle t(j_1), \ldots, t(j_n) \rangle \mid t \in I \} \]

Cross-Product (or Cartesian Product)

\[ I \times J = \{ \langle t(1), \ldots, t(n), s(1), \ldots, s(m) \rangle \mid t \in I, s \in J \} \]

Formal Inductive Definition

Let \(\dbschema{R}\) be a relation schema.

There are two kinds of base SPC (algebra) queries:

Input Relation: Expression \(R\); with arity equal to \(\arity(R)\)
Unary Singleton Constant: Expression \(\{\langle a \rangle \}\), where \(a \in \dom\); with arity equal to 1

The family of SPC (algebra) queries contains all above base SPC queries and, for SPC queries \(q_1, q_2\) with arities \(\alpha_1, \alpha_2\), respectively,

Selection: \(\sigma_{j=1}(q_1)\) and \(\sigma_{j=k}(q_1)\), whenever \(j, k \leq \alpha_1\) and \(a \in \dom\); these have arity \(\alpha_1\).
Projection: \(\pi_{j_1, \ldots, j_n}(q_1)\), where \(j_1, \ldots, j_n \leq \alpha_1\); this has arity \(n\).
Cross-Product: \(q_1 \times q_2\); this has arity \(\alpha_1 + \alpha_2\).

Unsatisifiable Queries

E.g. \(\sigma_{1=a}(\sigma_{1=b}(R))\) where \(\arity(R) \geq 1\) and \(a \neq b\).

This is equivalent to \(q^\emptyset\).

Generalized SPC Algebra

Intersection (\(\cap\)): is easily simulated by selection and cross-product.
Generalized Selection (\(\sigma\)): permits the specification of multiple conditions.
- Positive Conjunctive Selection Formula: \(F = \gamma_1 \land \ldots \land \gamma_n (n \geq 1)\) where each \(\gamma_i\) is either \(j = k\) or \(j = a\)
- Positive Conjunctive Selection Operator: an expression of the form \(\sigma_F\), where \(F\) is a positive conjunctive selection formula
- can be simulated by a sequence of selections
Equi-Join (\(\bowtie\)): a binary operator that combines cross-product and selection
- Equi-Join Condition: \(F = \gamma_1 \land \ldots \land \gamma_n (n \geq 1)\) where each \(\gamma_i\) is of the form \(j = k\)
- Equi-Join Operator: an expression of the form \(I \bowtie_F J\), where \(F\) is an equi-join condition
- can be simulated by \(\sigma_{F’}(I \times J)\), where \(F’\) is obtained from \(F\) by replacing each condition \(j = k\) with \(j = \alpha(I) + k\)

Normal Form

\[ \pi_{j_1, \ldots, j_n}( \{ \langle a_1 \rangle \} \times \ldots \times \{ \langle a_m \rangle \} \times \sigma_F(R_1 \times \ldots \times R_k) ) \] where

\(n \geq 0; m \geq 0\);
\(a_1, \ldots, a_m \in \dom\);
\(\{1, \ldots, m\} \subseteq \{j_1, \ldots, j_n\}\);
\(R_1, \ldots, R_k\) are relation names (repeats permitted);
\(F\) is a positive conjunctive selection formula.

For each (generalized) SPC query, there is an equivalent SPC query in normal form.

The proof is based on repeated application of the following equivalence-preserving SPC algebra rewrite rules (or transformations):

Merge-Select: \[ \sigma_{F}( \sigma_{F’}(q) ) \equiv \sigma_{F \land F’}(q) \]
Merge-Project: \[ \pi_{\boldsymbol{j}}( \pi_{\boldsymbol{k}}(q) ) \equiv \pi_{\boldsymbol{l}}(q), \text{ where } l_i = k_{j_i} \text{ for each term } l_i \text{ in } \boldsymbol{l} \]
Push-Select-Though-Project: \[ \sigma_F( \pi_{\boldsymbol{j}}(q) ) \equiv \pi_{\boldsymbol{j}}( \sigma_{F’}(q) ), \text{ where } F’ \text{ is obtained from } F \text{ by replacing each } k \text{ with } j_k \]
Push-Select-Though-Singleton: \[ \sigma_{1=j}(\langle a \rangle \times q) \equiv \langle a \rangle \times \sigma_{(j - 1) = a}(q) \]
Associate-Cross: \[ (q_1 \times \ldots \times q_n) \times q \equiv q_1 \times \ldots \times q_n \times q \] \[ q \times (q_1 \times \ldots \times q_n) \equiv q_1 \times \ldots \times q_n \times q \]
Commute-Cross: \[ (q \times q’) \equiv \pi_{\boldsymbol{j} \boldsymbol{j’}} (q’ \times q) \] where \(\boldsymbol{j’} = [1, \ldots, \arity(q’)]\) and \(\boldsymbol{j} = [\arity(q’) + 1, \ldots, \arity(q’) + \arity(q)]\)
Push-Cross-Though-Select: \[ \sigma_F(q) \times q’ \equiv \sigma_{F}(q \times q’) \] \[ q \times \sigma_F(q’) \equiv \sigma_{F’}(q \times q’) \] where \(F’\) is obtained from \(F\) by replacing each \(j\) with \(\arity(q) + j\)
Push-Cross-Though-Project: \[ \pi_{\boldsymbol{j}}(q) \times q’ \equiv \pi_{\boldsymbol{j}}(q \times q’) \] \[ q \times \pi_{\boldsymbol{j}}(q’) \equiv \pi_{\boldsymbol{j’}}(q \times q’) \] where \(\boldsymbol{j’}\) is obtained from \(\boldsymbol{j}\) by replacing each \(j\) with \(\arity(q) + j\)

Notation for Rewriting

\[ \begin{split} q &\rightarrow_{\mathcal{S}} q’ \quad \text{or} \\ \text{simply} \quad q &\rightarrow q’ \end{split} \]

for a set \(\mathcal{S}\) of rewrite rules and algebra expressions \(q, q’\),
if \(\mathcal{S}\) is understood from the context,
if \(q’\) is the result of replacing a subexpression of \(q\) according to one of the rules in \(\mathcal{S}\).

\(\rightarrow^*_S\): the reflexive and transitive closure of \(\rightarrow\).

A family \(\mathcal{S}\) of rewrite rules is sound if \(q \rightarrow_{\mathcal{S}} q’\) implies \(q \equiv q’\). If \(\mathcal{S}\) is sound, then \(q \rightarrow^*_{\mathcal{S}} q’\) implies \(q \equiv q’\).

The Named Perspective: The SPJR Algebra

Named conjunctive algebra:

Selection (\(\sigma\))
Projection (\(\pi\)); repeats not permitted
Join (\(\bowtie\)); natural join
Renaming (\(\delta\))

Selection

\(\sigma_{A=a}, \sigma_{A=B}\) where \(A, B \in \att, a \in \dom\)

These operators apply to any instance \(I\) with \(A, B \in \sort(I)\).

Projection

\(\pi_{A_1, \ldots, A_n}\) where \(n \geq 0\) and \(A_1, \ldots, A_n \in \sort(I)\)

(Natural) Join

\[ \begin{split} I \bowtie J = \{ t \text{ over } V \cup W \vert & \text{ for some } v \in I \text{ and } w \in J, \\ & \quad t[V] = v \land t[W] = w \} \end{split} \] where \(I\) and \(J\) have sorts \(V\) and \(W\), respectively.

When \(\sort(I) = \sort(J)\), \(I \bowtie J = I \cap J\).
When \(\sort(I) \cap \sort(J) = \emptyset\), \(I \bowtie J = I \times J\).

Renaming

An attribute renaming for a finite set \(U\) of attributes is a one-one mapping from \(U\) to \(\att\).

\[ \delta_f(I) = \{ v \text{ over } f[U] \vert \text{ for some } u \in I, v(f(A)) = u(A) \text{ for each } A \in U \} \] where

input is an instance \(I\) over \(U\)
\(f\) is an attribute renaming for \(U\)
- which can be described by \((A, f(A))\), where \(f(A) \neq A\)
  - usually written as \(A_1 A_2 \ldots A_n \rightarrow B_1 B_2 \ldots B_n\) to indicate that \(f(A_i) = B_i\) for each \(i \in [1, n] (n \geq 0)\)

Formal Inductive Definition

The base SPJR algebra queries are:

Input Relation: Expression \(R\); with sort equal to \(\sort(R)\)
Unary Singleton Constant: Expression \(\{ \langle A:a \rangle \}\), where \(a \in \dom\); with \(\sort = A\)

The remainder of the syntax and semantics of the SPJR algebra is defined in analogy to those of the SPC algebra.

Normal Form

\[ \pi_{B_1, \ldots, B_m}( \{ \langle A_1:a_1 \rangle \} \bowtie \ldots \bowtie \{ \langle A_m:a_m \rangle \} \bowtie \sigma_F(\delta_{f_1}(R_1) \bowtie \ldots \bowtie \delta_{f_k}(R_k)) ) \] where

\(n \geq 0; m \geq 0\);
\(a_1, \ldots, a_m \in \dom\);
\({A_1, \ldots, A_m}\) occurs in \({B_1, \ldots, B_n}\);
the \(A_i\)s are distinct;
\(R_1, \ldots, R_k\) are relation names (repeats permitted);
\(\delta_{f_i}\) is a renaming operator for \(\sort(R_i)\);
no \(A_i\) occur in any \(\delta_{f_j}(R_j)\);
the sorts of \(\delta_{f_1}(R_1), \ldots, \delta_{f_k}(R_k)\) are pairwise disjoint;
\(F\) is a positive conjunctive selection formula.

For each SPJR query, there is an equivalent SPJR query in normal form.

Equivalence Theorem

Lemma: The SPC and SPJR algebras are equivalent.
Equivalence Theorem: The rule-based conjunctive queries, tableau queries, conjunctive calculus queries, satisifiable SPC algebra, and satisifiable SPJR algebra are all equivalent.

Adding Union

The following have equivalent expressive power:

the nonrecursive datalog programs (with single relation target)
the SPCU queries
the SPJRU queries
- union can only be applied to expressions having the same sort

Nonrecursive Datalog Program

A nonrecursive datalog program (nr-datalog program) over schema \(\dbschema{R}\) is a set of rules:

\[ \begin{split} S_1 & \leftarrow \body_1 \\ S_2 & \leftarrow \body_2 \\ & \vdots \\ S_m & \leftarrow \body_m \end{split} \] where

no relation name in \(\dbschema{R}\) occurs in a rule head
the same relation name may appear in more than one rule head
there is some ordering \(r_1, \ldots, r_m\) of the rules such that the relation name in the head of \(r_i\) does not occur in the body of any rule \(r_j\) with \(j \leq i\)

Union and the Conjunctive Calculus

Simply permitting disjunction (denoted \(\lor\)) in the formula of a conjunctive calculus along with conjunction can have serious consequences.

E.g.,

\[ q = \{ x, y, z \vert R(x, y) \lor R(y, z) \} \]

The answer of \(q\) on nonempty instance \(I\) will be

\[ q(I)= (I \times \dom) \cup (\dom \times I) \]

Because \(R(x, y)\) decides that the first two components in \(\langle x, y, z \rangle\) are in \(I\) and \(z\) can be anything in \(\dom\), and \(R(y, z)\) decides that the last two components in \(\langle x, y, z \rangle\) are in \(I\) and \(x\) can be anything in \(\dom\).

G1 Pain Points

About Alcohol

G1/G2: You must not drive if you have been drinking alcohol. Your blood-alcohol level must be zero.
All drivers who are 21 and under, regardless of license class, must have a BAC level of zero when operating a motor vehicle.
Blood-alcohol level <0.05% is ok
0.05%-0.08% -> warn range -> not ok for warn-range suspension, not ok for accompanying driver.
>0.08% is not ok

About Demerit Points

New drivers: 9+ -> suspended for 60d
Fully licensed drivers: 15 -> suspended for 30d
7: Failing to remain at the scene of a collision, Failing to stop for police
6: careless, racing, exceeding the speed limit by 40+km/h if limit <=80, or 50+km/h, failing to stop for a school bus.
5: only bus drivers
4: exceeding the speed limit by 30-49 km/h, following too closely, failing to stop at a pedestrian crossover.
3: exceeding the speed limit by 16-29 km/h, stop/yield sign, failing to report collision, wrong way (closed road/one-way road), emergency vehicles, HOV, opening a door, radar detector, crowded seat.
2: seatbelt, improper turn, lower headlight beam, sharing the road, signal, slow, towing, reversing on a highway

About Distance

150m

Headlights must shine a white light that can be seen at least 150 meters in front.
Red rear lights should be seen 150 meters away.
When you use high-beam headlights, remember to switch to low beams within 150 meters of an oncoming vehicle.
Never make a U-turn unless you can see at least 150 meters in both directions.
It is illegal to follow within 150 meters of a fire vehicle responding to an alarm.
Snow-removal vehicles on public roadways are equipped with flashing blue lights that can be seen from 150 meters.
Roadside stop: A 150-meter gap in both directions provides enough space to perform the move safely.

110m

Headlights must shine a white light that is strong enough to light up objects 110 meters away.

100m

Do not park on or within 100 meters of a bridge.

60m

Use your low beams when you are less than 60 meters behind another vehicle unless you are passing it.

30m

Passing within 30 meters of a pedestrian crossover is not permitted.
Passing left of a centerline is not permitted 30 meters from a bridge, viaduct or tunnel.

28m

In fact, a recent study found that drivers who were texting or changing music on their phones traveled 28 meters further (nearly half a hockey rink) before responding to a hazard than drivers who were paying attention.

20m

If you are coming from behind the bus, stop at least 20 meters away.

15m

Do not park within 15 meters of the nearest rail of a level railway crossing.
Do not park within 15 meters if it is controlled by traffic lights.

9m

Do not park within 9 meters of an intersection.

6m

Do not park within six meters of a public entrance to a hotel, theatre or public hall when it is open to the public.

5m

If a train is coming, stop at least 5 meters from the nearest rail or gate.

3m

Do not park within 3 meters of a fire hydrant.

2m

At streetcar stops, stay at least 2 meters behind the rear doors where passengers are getting off or on.

1m

Bicycles and mopeds traveling at a lower speed than other traffic are expected to ride about 1 meter from the curb or parked cars.
When passing a cyclist, drivers of motor vehicles must maintain a minimum distance of 1 meter, where practical between their vehicle and the cyclist.
Drive alongside, or parallel to, the vehicle ahead of the empty space, leaving about 1 meter between the vehicles.

60cm

Roadside stop: Leave at least 60 centimeters between your vehicle and the parked vehicle.

30cm

Roadside stop: You should not be more than about 30 centimeters away from the curb or edge of the road.

Other Numbers

10 times

Texting or browsing on your phone takes your eyes off the road and increases your risk of crashing by 10 times.

Keyboard shortcuts

Hongxu Xu