Hongxu Xu
PhD Student (May 2025 – Present)
Supervised by Prof. Chengnian Sun
Cheriton School of Computer Science
University of Waterloo, Canada
B.Sc. in Computer Science (Sep 2014 – Jun 2018)
Beijing Normal University, China
Research Interests
- Software Engineering
- Programming Languages
(no particular focus yet…)
Socials

- GitHub: https://github.com/xuhongxu96
- LinkedIn: https://www.linkedin.com/in/xuhongxu/
- Email: h445xu [at] uwaterloo.ca
- WeChat Official Account: xuhongxu_it
(Chinese only. Scan the QR code to follow my WeChat Official Account.)
Misc.
- Chinese Name: 许宏旭
- Pronunciation: Hong-shyoo (IPA: [xʊŋ ɕy])
Publications
Papers
Nothing yet…
Books
Feb 2024
CMake构建实战:项目开发卷 (CMake Build Practice: Project Development Volume)
Published by 人民邮电出版社 (Posts & Telecom Press, China)
Produced by 异步图书 (epubit)
Experience
PhD Student (May 2025 – Present)
Supervised by Prof. Chengnian Sun
Cheriton School of Computer Science
University of Waterloo, Canada
Senior SDE (Sep 2022 – Mar 2025)
Seed/Data Speech Team
ByteDance, Shanghai, China
Led the development of the Text-to-Speech engine and contributed to the Doubao AI assistant application.
SDE-2 (Jan 2021 – Sep 2022)
MSAI Team
Microsoft STC-Asia, Suzhou, China
Led the development of Microsoft WordBreaker and initiated a modern NLP toolkit for Office 365.
SDE (Jul 2018 – Aug 2021)
MSAI Team
Microsoft STC-Asia
Beijing, China (Relocated to Suzhou, Jiangsu, China in May 2019)
Worked on the 20-year-old Microsoft WordBreaker.
Short-Term Contributor (Nov 2020 – Jan 2021)
Windows APS Team (temporary assignment)
Microsoft STC-Asia, Suzhou, China
Helped form the new team and contributed to Windows 11 application development.
B.Sc. in Computer Science (Sep 2014 – Jun 2018)
Beijing Normal University, China
SDE Intern (Jul 2017 – Dec 2017)
Bing Search Relevance Team
Microsoft STC-Asia, Beijing, China
Developed answer-triggering models for Bing Search.
Awards
- Outstanding Graduate, Beijing Normal University, 2018
- Top Ten Volunteer, Beijing Normal University, 2015
- First Prize, National Olympiad in Informatics in Provinces (NOIP), 2013
Teaching
- Instructional Apprentice, CS 246 - Object-Oriented Software Development, Fall 2025
- Teaching Assistant, CS 246 - Object-Oriented Software Development, Spring 2025
CS 452/652: Real-Time Programming (the Train Course)
Spring 25
I dropped this course after learning how time-consuming it really is.
Before dropping it, though, I did some research and found some useful resources, which I summarized in the MarklinSim All-in-One repo.
In short, follow the readme in the repo to set up:
- MarklinSim (an emulator for the train kit)
- QEMU (an emulator for the Raspberry Pi 4B)
- Example Image (your real-time system).
I believe it will save you a lot of time. Good luck!
CS 480/680: Introduction to Machine Learning
Spring 25
This is not a hard course.
\documentclass{article}
\usepackage{fullpage}
\usepackage{float}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{mathtools}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{url, hyperref}
\usepackage{algorithm2e}
\usepackage[margin=0.25in]{geometry}
\usepackage{pgfplots}
\pgfplotsset{width=10cm,compat=1.18}
\usepgfplotslibrary{external}
\tikzexternalize
\lstset{basicstyle=\fontfamily{pcr}\footnotesize}
\graphicspath{{./}}
\newcommand{\paren}[1]{\left( #1 \right)}
\newcommand{\iprod}[1]{\left\langle #1 \right\rangle}
\makeatletter
\newcommand*\MY@rightharpoonupfill@{%
\arrowfill@\relbar\relbar\rightharpoonup
}
\newcommand*\overrightharpoon{%
\mathpalette{\overarrow@\MY@rightharpoonupfill@}%
}
\makeatother
\theoremstyle{definition}
\newtheorem{question}{Question}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator{\diag}{diag}
\DeclareMathOperator{\sign}{sign}
\DeclareMathOperator{\Var}{Var}
\DeclarePairedDelimiter\ceil{\lceil}{\rceil}
\DeclarePairedDelimiter\floor{\lfloor}{\rfloor}
\newcommand{\llnorm}[1]{\left\| #1 \right\|_2}
\newcommand{\mnorm}[1]{\left\| #1 \right\|_1}
\newcommand{\fnorm}[1]{\left\| #1 \right\|_F}
\title{\large CS480/680, Spring 2025\\\huge Review Notes}
\author{Student: Hongxu Xu (h445xu@uwaterloo.ca)}
\date{\today}
\setlength\parindent{0pt}
%\input{preamble}
\begin{document}
\maketitle
\newpage
\part{For Mid-Term}
\section{Perceptron}
\begin{question}[Linear Function]
\begin{equation}
\begin{split}
& \forall \alpha, \beta \in \mathbb{R}, \forall \boldsymbol{x}, \boldsymbol{z} \in \mathbb{R}^d,
f(\alpha \boldsymbol{x} + \beta \boldsymbol{z}) = \alpha \cdot f(\boldsymbol{x}) + \beta \cdot f(\boldsymbol{z}) \\
\Longleftrightarrow
& \exists \boldsymbol{w} \in \mathbb{R}^d, f(\boldsymbol{x}) = \iprod{\boldsymbol{x}, \boldsymbol{w}}
\end{split}
\end{equation}
\textbf{Proof $\Rightarrow$:}
Let $\boldsymbol{w} \coloneq \left[ f(\boldsymbol{e_1}), \dots, f(\boldsymbol{e_d}) \right]^T$, where
$\boldsymbol{e_i}$ is the $i$-th coordinate vector.
\begin{equation}
\begin{split}
f(\boldsymbol{x}) & = f( x_1 \boldsymbol{e_1} + \dots + x_d \boldsymbol{e_d} ) \\
& = x_1 f(\boldsymbol{e_1}) + \dots + x_d f(\boldsymbol{e_d}) \\
& = \iprod{\boldsymbol{x}, \boldsymbol{w}}
\end{split}
\end{equation}
\textbf{Proof $\Leftarrow$:}
\begin{equation}
\begin{split}
f(\alpha \boldsymbol{x} + \beta \boldsymbol{z}) & = \iprod{\alpha \boldsymbol{x} + \beta \boldsymbol{z}, \boldsymbol{w}} \\
& = \iprod{\alpha \boldsymbol{x}, \boldsymbol{w}} + \iprod{\beta \boldsymbol{z}, \boldsymbol{w}} \\
& = \alpha \iprod{\boldsymbol{x}, \boldsymbol{w}} + \beta \iprod{\boldsymbol{z}, \boldsymbol{w}} \\
& = \alpha f(\boldsymbol{x}) + \beta f(\boldsymbol{z})
\end{split}
\end{equation}
\end{question}
\begin{question}[$\boldsymbol{w}$ is Orthogonal to Decision Boundary $H$]
Any vector on $H$ can be written as $\overrightharpoon{\boldsymbol{x}\boldsymbol{x'}} = \boldsymbol{x'} - \boldsymbol{x}$,
\begin{equation}
\iprod{\boldsymbol{x'} - \boldsymbol{x}, \boldsymbol{w}}
= \iprod{\boldsymbol{x'}, \boldsymbol{w}} - \iprod{\boldsymbol{x}, \boldsymbol{w}}
= -b - (-b) = 0
\end{equation}
\end{question}
\begin{question}[Update Rule for Perceptron]
\begin{equation} \label{perceptron-update-rule}
\begin{split}
\boldsymbol{w} \leftarrow \boldsymbol{w} + y_i \boldsymbol{x_i} \\
b \leftarrow b + y_i
\end{split}
\end{equation}
\end{question}
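As a quick sanity check (my own illustration, not from the lecture), the update rule can be run on a toy linearly separable dataset until a full pass produces no mistakes; the data below are made up:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Run the perceptron update rule. X: (n, d) samples as rows, y in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ w + b) <= 0:  # mistake (or exactly on the boundary)
                w += y[i] * X[i]            # w <- w + y_i x_i
                b += y[i]                   # b <- b + y_i
                mistakes += 1
        if mistakes == 0:                   # a full pass with no mistakes: done
            break
    return w, b

# Made-up linearly separable toy data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = perceptron(X, y)
assert np.all(y * (X @ w + b) > 0)  # every point ends up on the correct side
```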
\begin{question}[Feasibility of Perceptron]
The goal is to find $\boldsymbol{w} \in \mathbb{R}^d, b \in \mathbb{R}$,
such that $\forall i, y_i \paren{\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b} > 0$.
According to the update rule \eqref{perceptron-update-rule},
\begin{equation}
\begin{split}
y\left[ \iprod{\boldsymbol{x}, \boldsymbol{w_{new}}} + b_{new} \right]
&= y\left[ \iprod{\boldsymbol{x}, \boldsymbol{w_{old}} + y \boldsymbol{x}} + b_{old} + y \right] \\
&= y\left[ \iprod{\boldsymbol{x}, \boldsymbol{w_{old}}} + b_{old} \right] + y\left[ \iprod{\boldsymbol{x} ,y \boldsymbol{x}} + y \right] \\
&= y\left[ \iprod{\boldsymbol{x}, \boldsymbol{w_{old}}} + b_{old} \right] + y\left[ y \llnorm{\boldsymbol{x}}^2 + y \right] \\
&= y\left[ \iprod{\boldsymbol{x}, \boldsymbol{w_{old}}} + b_{old} \right] + y^2 \llnorm{\boldsymbol{x}}^2 + y^2 \\
&= y\left[ \iprod{\boldsymbol{x}, \boldsymbol{w_{old}}} + b_{old} \right] + \underbrace{\llnorm{\boldsymbol{x}}^2 + 1}_{\text{always positive}}
\end{split}
\end{equation}
Notice that $y \in \left\{ \pm 1 \right\} \Rightarrow y^2 = 1$.
$\llnorm{\boldsymbol{x}}^2 + 1$ is always positive, which means we always increase the confidence $y \hat{y}$ after the update.
\end{question}
\begin{question}[Trick for Hiding the Bias Term -- Padding]
\begin{equation}
\begin{split}
\iprod{\boldsymbol{x}, \boldsymbol{w}} + b
& = \iprod{ \begin{pmatrix} \boldsymbol{x} \\ 1 \end{pmatrix}, \begin{pmatrix} \boldsymbol{w} \\ b \end{pmatrix} } \\
& = \iprod{ \boldsymbol{x_{pad}}, \boldsymbol{w_{pad}}}
\end{split}
\end{equation}
Correspondingly, the update rule can be written as:
\begin{equation}
\boldsymbol{w_{pad}} \leftarrow \boldsymbol{w_{pad}} + y \boldsymbol{x_{pad}}
\end{equation}
\end{question}
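The padding identity is easy to verify numerically (the vectors below are arbitrary):

```python
import numpy as np

x = np.array([1.5, -2.0, 0.5])
w = np.array([0.3, 0.8, -1.1])
b = 0.7

x_pad = np.append(x, 1.0)  # append the constant feature 1 to x
w_pad = np.append(w, b)    # append the bias b to w

# <x, w> + b == <x_pad, w_pad>
assert np.isclose(x @ w + b, x_pad @ w_pad)
```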
\begin{question}[Margin]
Suppose $\exists \boldsymbol{w^*}$ such that $\forall i, y_i \iprod{\boldsymbol{x_i}, \boldsymbol{w^*}} > 0$.
We normalize $\boldsymbol{w^*}$ such that $\llnorm{\boldsymbol{w^*}} = 1$.
In other words, $\boldsymbol{w^*}$ is the normalized weight for the decision boundary.
\begin{equation} \label{margin}
\text{Margin } \gamma \coloneq \min_i \left| \iprod{\boldsymbol{x_i}, \boldsymbol{w^*}} \right|
\end{equation}
\end{question}
\begin{question}[Convergence Theorem -- Linearly Separable Case]
Assume that $\forall i, \llnorm{\boldsymbol{x_i}} \leq C$ (i.e., all data lie within a ball of radius $C$).
Then the Perceptron algorithm converges after $\frac{C^2}{\gamma^2}$ mistakes.
\vspace{5mm}
\textbf{Proof:}
Suppose $\boldsymbol{w}$ is the updating weight, and $\theta$ is the angle between $\boldsymbol{w}$ and $\boldsymbol{w^*}$.
We have $\iprod{\boldsymbol{w}, \boldsymbol{w^*}} = \llnorm{\boldsymbol{w}} \llnorm{\boldsymbol{w^*}} \cos{\theta} = \llnorm{\boldsymbol{w}} \cos{\theta}$.
After an update, $\llnorm{\boldsymbol{w_{new}}} \cos{\theta_{new}}$ will be
\begin{equation}
\begin{split}
\iprod{\boldsymbol{w} + y \boldsymbol{x}, \boldsymbol{w^*}}
& = \iprod{\boldsymbol{w}, \boldsymbol{w^*}} + y \iprod{\boldsymbol{x}, \boldsymbol{w^*}} \\
& = \iprod{\boldsymbol{w}, \boldsymbol{w^*}} + \left|\iprod{\boldsymbol{x}, \boldsymbol{w^*}}\right| \\
& \geq \iprod{\boldsymbol{w}, \boldsymbol{w^*}} + \gamma
\end{split}
\end{equation}
Let's see the change of $\iprod{\boldsymbol{w_{new}}, \boldsymbol{w_{new}}} = \llnorm{\boldsymbol{w_{new}}}^2$,
\begin{equation}
\begin{split}
\iprod{\boldsymbol{w} + y \boldsymbol{x}, \boldsymbol{w} + y \boldsymbol{x}}
& = \iprod{\boldsymbol{w}, \boldsymbol{w}} + 2y \iprod{\boldsymbol{w}, \boldsymbol{x}} + y^2 \iprod{\boldsymbol{x}, \boldsymbol{x}} \\
& = \llnorm{\boldsymbol{w}}^2 + 2y \iprod{\boldsymbol{w}, \boldsymbol{x}} + \llnorm{\boldsymbol{x}}^2
\end{split}
\end{equation}
Because the update happens only on a mistake, $y\iprod{\boldsymbol{w}, \boldsymbol{x}} \leq 0$, and $\llnorm{\boldsymbol{x}} \leq C$,
\begin{equation}
\begin{split}
\iprod{\boldsymbol{w} + y \boldsymbol{x}, \boldsymbol{w} + y \boldsymbol{x}}
& = \llnorm{\boldsymbol{w}}^2 + 2y \iprod{\boldsymbol{w}, \boldsymbol{x}} + \llnorm{\boldsymbol{x}}^2 \\
& \leq \llnorm{\boldsymbol{w}}^2 + C^2
\end{split}
\end{equation}
Finally, suppose the algorithm converges after $M$ updates; then $\iprod{\boldsymbol{w}, \boldsymbol{w^*}} \geq M \gamma$ and $\llnorm{\boldsymbol{w}}^2 \leq M C^2$, so
\begin{equation}
\begin{split}
1 \geq \cos{\theta} & = \frac{\iprod{\boldsymbol{w}, \boldsymbol{w^*}}}{\llnorm{\boldsymbol{w}} \llnorm{\boldsymbol{w^*}}} \\
& \geq \frac{M \gamma}{\sqrt{M C^2} \times 1} \\
& = \sqrt{M} \frac{\gamma}{C}
\end{split}
\end{equation}
which means $M \leq \frac{C^2}{\gamma^2}$.
\end{question}
\begin{question}[Perceptron Loss]
\begin{equation}
\begin{split}
l(\boldsymbol{w}, \boldsymbol{x_t}, y_t)
& = -y_t \iprod{\boldsymbol{w}, \boldsymbol{x_t}} \mathbb{I}[\text{mistake on }\boldsymbol{x_t}] \\
& = -\min \left\{ y_t \iprod{\boldsymbol{w}, \boldsymbol{x_t}}, 0 \right\}
\end{split}
\end{equation}
\begin{equation}
L(\boldsymbol{w}) = -\frac{1}{n} \sum_{t=1}^n y_t \iprod{\boldsymbol{w}, \boldsymbol{x_t}} \mathbb{I} [\text{mistake on }\boldsymbol{x_t}]
\end{equation}
\end{question}
\section{Linear Regression}
\begin{question}[Least Square Regression]
\begin{equation}
\min_{f: \mathcal{X} \rightarrow \mathcal{Y}} \mathbb{E} \llnorm{f(\boldsymbol{X}) - Y}^2
\end{equation}
The optimal regression function is
\begin{equation}
f^*(\boldsymbol{x}) = m(x) = \mathbb{E} [Y | \boldsymbol{X} = \boldsymbol{x}]
\end{equation}
Computing it requires knowing the conditional distribution of $Y$ given $\boldsymbol{X}$, i.e., the full joint distribution of $(\boldsymbol{X}, Y)$.
\end{question}
\begin{question}[Bias-Variance Decomposition]
\begin{equation}
\begin{split}
\mathbb{E} \llnorm{f(\boldsymbol{X}) - Y}^2
& = \mathbb{E} \llnorm{f(\boldsymbol{X}) - m(x) + m(x) - Y}^2 \\
& = \mathbb{E} \llnorm{f(\boldsymbol{X}) - m(x)}^2 + \mathbb{E} \llnorm{m(x) - Y}^2 + 2 \mathbb{E} \iprod{f(\boldsymbol{X}) - m(x), m(x) - Y} \\
& = \mathbb{E} \llnorm{f(\boldsymbol{X}) - m(x)}^2 + \mathbb{E} \llnorm{m(x) - Y}^2 + 2 \, \mathbb{E} \; \mathbb{E}_{Y|\boldsymbol{X}} \left[ \iprod{f(\boldsymbol{X}) - m(x), m(x) - Y} \right] \\
& = \mathbb{E} \llnorm{f(\boldsymbol{X}) - m(x)}^2 + \mathbb{E} \llnorm{m(x) - Y}^2 + 2 \, \mathbb{E} \iprod{f(\boldsymbol{X}) - m(x), m(x) - \mathbb{E}_{Y|\boldsymbol{X}} [Y]} \\
& = \mathbb{E} \llnorm{f(\boldsymbol{X}) - m(x)}^2 + \mathbb{E} \llnorm{m(x) - Y}^2 + 2 \, \mathbb{E} \iprod{f(\boldsymbol{X}) - m(x), m(x) - m(x)} \\
& = \mathbb{E} \llnorm{f(\boldsymbol{X}) - m(x)}^2 + \underbrace{\mathbb{E} \llnorm{m(x) - Y}^2}_{\text{noise (variance)}}
\end{split}
\end{equation}
The last term is the noise (variance), irrelevant to $f$.
So, to minimize the squared error, we need $f \approx m$.
However, $m(\boldsymbol{x})$ cannot be computed, because $\mathbb{E} [Y | \boldsymbol{X} = \boldsymbol{x}]$ is unknown.
Let's learn $f_D$ from the training data $D$. Define $\bar{f}(\boldsymbol{X}) = \mathbb{E}_D[f_D(\boldsymbol{X})]$.
\begin{equation}
\begin{split}
\underbrace{\mathbb{E}_{\boldsymbol{X}, Y, D} \llnorm{f_D(\boldsymbol{X}) - Y}^2}_{\text{test error}}
& = \mathbb{E}_{\boldsymbol{X}, D} \llnorm{f_D(\boldsymbol{X}) - m(x)}^2 + \underline{\mathbb{E}_{\boldsymbol{X}, Y} \llnorm{m(x) - Y}^2} \\
& = \mathbb{E}_{\boldsymbol{X}, D} \llnorm{f_D(\boldsymbol{X}) - \bar{f}(\boldsymbol{X}) + \bar{f}(\boldsymbol{X}) - m(x)}^2 + \underline{\mathbb{E}_{\boldsymbol{X}, Y} \llnorm{m(x) - Y}^2} \\
& = \mathbb{E}_{\boldsymbol{X}, D} \llnorm{f_D(\boldsymbol{X}) - \bar{f}(\boldsymbol{X})}^2 + \mathbb{E}_{\boldsymbol{X}} \llnorm{\bar{f}(\boldsymbol{X}) - m(x)}^2 \\
& \qquad + 2 \mathbb{E}_{\boldsymbol{X}, D} \iprod{f_D(\boldsymbol{X}) - \bar{f}(\boldsymbol{X}), \bar{f}(\boldsymbol{X}) - m(x)} \\
& \qquad + \underline{\mathbb{E}_{\boldsymbol{X}, Y} \llnorm{m(x) - Y}^2} \\
& = \dots + 2 \mathbb{E}_{\boldsymbol{X}} \mathbb{E}_{D} \iprod{f_D(\boldsymbol{X}) - \bar{f}(\boldsymbol{X}), \bar{f}(\boldsymbol{X}) - m(x)} + \underline{\dots} \\
& = \dots + 2 \mathbb{E}_{\boldsymbol{X}} \iprod{\mathbb{E}_{D} [f_D(\boldsymbol{X})] - \bar{f}(\boldsymbol{X}), \bar{f}(\boldsymbol{X}) - m(x)} + \underline{\dots} \\
& = \dots + 2 \mathbb{E}_{\boldsymbol{X}} \iprod{\bar{f}(\boldsymbol{X}) - \bar{f}(\boldsymbol{X}), \bar{f}(\boldsymbol{X}) - m(x)} + \underline{\dots} \\
& = \dots + 0 + \underline{\dots} \\
& = \mathbb{E}_{\boldsymbol{X}, D} \llnorm{f_D(\boldsymbol{X}) - \bar{f}(\boldsymbol{X})}^2 + \mathbb{E}_{\boldsymbol{X}} \llnorm{\bar{f}(\boldsymbol{X}) - m(x)}^2 + \underline{\mathbb{E}_{\boldsymbol{X}, Y} \llnorm{m(x) - Y}^2} \\
& = \underbrace{\mathbb{E}_{\boldsymbol{X}, D} \llnorm{f_D(\boldsymbol{X}) - \mathbb{E}_D [f_D(\boldsymbol{X})]}^2}_{\text{variance}} + \underbrace{\mathbb{E}_{\boldsymbol{X}} \llnorm{\mathbb{E}_D [f_D(\boldsymbol{X})] - m(x)}^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_{\boldsymbol{X}, Y} \llnorm{m(x) - Y}^2}_{\text{noise (variance)}} \\
\end{split}
\end{equation}
\end{question}
\begin{question}[Sampling $\rightarrow$ Training]
Replace the expectation with the sample average, where $\paren{\boldsymbol{X_i}, Y_i} \sim P$ i.i.d.
\begin{equation}
\min_{f: \mathcal{X} \rightarrow \mathcal{Y}} \hat{\mathbb{E}} \llnorm{f(\boldsymbol{X}) - Y}^2 \coloneq \frac{1}{n} \sum_{i=1}^n \llnorm{f(\boldsymbol{X_i}) - Y_i}^2
\end{equation}
Uniform law of large numbers: as the training data size $n \rightarrow \infty$,
$\hat{\mathbb{E}} \rightarrow \mathbb{E}$ and (hopefully) $\argmin \hat{\mathbb{E}} \rightarrow \argmin \mathbb{E}$.
\end{question}
\begin{question}[Linear Regression]
Padding: $\boldsymbol{x} \leftarrow \begin{pmatrix}\boldsymbol{x} \\ 1\end{pmatrix}$, $W \leftarrow \begin{pmatrix}W & \boldsymbol{b}\end{pmatrix}$
\begin{align*}
X & = [\boldsymbol{x_1}, \dots, \boldsymbol{x_n}] \in \mathbb{R}^{(d+1) \times n}, \\
Y & = [\boldsymbol{y_1}, \dots, \boldsymbol{y_n}] \in \mathbb{R}^{t \times n}, \\
W & \in \mathbb{R}^{t \times (d+1)}, \\
\fnorm{A} & = \sqrt{\sum_{ij} a_{ij}^2}
\end{align*}
Linear regression is:
\begin{equation}
\min_{W \in \mathbb{R}^{t \times (d+1)}} \frac{1}{n} \fnorm{WX - Y}^2
\end{equation}
\end{question}
\begin{question}[Optimality Condition]
If $\boldsymbol{w}$ is a minimizer (or maximizer) of a differentiable function $f$
\textbf{over an open set}, then $f'(\boldsymbol{w}) = 0$.
\end{question}
\begin{question}[Solving Linear Regression]
\begin{equation}
L(W) = \frac{1}{n} \fnorm{WX-Y}^2
\end{equation}
\begin{equation}
\begin{split}
\nabla_W L(W) & = \frac{2}{n} (WX-Y) X^T = 0 \\
& \Rightarrow W X X^T = Y X^T \\
& \Rightarrow W = Y X^T (X X^T)^{-1} \\
\end{split}
\end{equation}
\end{question}
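A minimal NumPy check of this closed form (dimensions follow the notes, with samples as columns of $X$; the data are fabricated and noiseless so the recovery is exact):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, t = 3, 50, 2
X = rng.normal(size=(d + 1, n))
X[-1] = 1.0                        # last row of ones plays the padded bias
W_true = rng.normal(size=(t, d + 1))
Y = W_true @ X                     # noiseless targets, so recovery is exact

# W = Y X^T (X X^T)^{-1}; prefer solve() over explicitly inverting X X^T
W = np.linalg.solve(X @ X.T, X @ Y.T).T

assert np.allclose(W, W_true)
```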
\begin{question}[Ill-Conditioning]
A slight perturbation of the data can lead to wildly different solutions, which happens whenever
$X$ is ill-conditioned, i.e., (close to) rank-deficient.
\vspace{5mm}
A rank-deficient $X$ means:
\begin{enumerate}
\item two columns in $X$ are linearly dependent (or simply the same)
\item but the corresponding $y$ might be different
\end{enumerate}
\end{question}
\begin{question}[Ridge Regression]
\begin{equation}
\min_{W} \frac{1}{n} \fnorm{WX - Y}^2 + \lambda \fnorm{W}^2
\end{equation}
\begin{equation}
\begin{split}
\nabla_W L(W) & = \frac{2}{n} (WX - Y) X^T + 2 \lambda W = 0 \\
& \Rightarrow W X X^T - Y X^T + n \lambda W = 0 \\
& \Rightarrow W (X X^T + n \lambda I) = Y X^T
\end{split}
\end{equation}
\begin{equation}
\begin{split}
& X = U \Sigma V^T \\
\Rightarrow & X X^T = U \Sigma (V^T V) \Sigma U^T = U \Sigma^2 U^T \\
\Rightarrow & X X^T + n \lambda I = U \underbrace{(\Sigma^2 + n \lambda I)}_{\text{strictly positive}} U^T \\
\Rightarrow & X X^T + n \lambda I \text{ is of full-rank}
\end{split}
\end{equation}
$\lambda$ is the regularization parameter; $\lambda \rightarrow \infty \Rightarrow W \equiv \boldsymbol{0}$.
\end{question}
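The ridge normal equation can be spot-checked numerically. The sketch below (random made-up data) solves $W (X X^T + n\lambda I) = Y X^T$ and verifies that the gradient of the regularized objective vanishes there:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, t = 3, 40, 2
X = rng.normal(size=(d + 1, n))
Y = rng.normal(size=(t, n))
lam = 0.1

# Solve W (X X^T + n*lam*I) = Y X^T; the system is always full-rank for lam > 0
A = X @ X.T + n * lam * np.eye(d + 1)
W = np.linalg.solve(A, X @ Y.T).T

# Gradient of (1/n)||WX - Y||_F^2 + lam*||W||_F^2 should vanish at W
grad = (2 / n) * (W @ X - Y) @ X.T + 2 * lam * W
assert np.allclose(grad, 0)
```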
\begin{question}[Regularization $\equiv$ Data Augmentation]
\begin{equation}
\frac{1}{n} \fnorm{WX - Y}^2 + \lambda \fnorm{W}^2
= \frac{1}{n} \fnorm{W \left[X \quad \sqrt{n\lambda} I \right] - \left[Y \quad \boldsymbol{0}\right]}^2
\end{equation}
\end{question}
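This equivalence can also be verified numerically: plain least squares on the augmented data reproduces the ridge solution exactly (random illustrative data):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, t = 3, 30, 2
X = rng.normal(size=(d + 1, n))
Y = rng.normal(size=(t, n))
lam = 0.5

# Ridge solution from the normal equation
W_ridge = np.linalg.solve(X @ X.T + n * lam * np.eye(d + 1), X @ Y.T).T

# Plain least squares on the augmented data [X  sqrt(n*lam) I] and [Y  0]
X_aug = np.hstack([X, np.sqrt(n * lam) * np.eye(d + 1)])
Y_aug = np.hstack([Y, np.zeros((t, d + 1))])
W_aug = np.linalg.solve(X_aug @ X_aug.T, X_aug @ Y_aug.T).T

assert np.allclose(W_ridge, W_aug)
```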
\section{Logistic Regression}
\begin{question}[Max Likelihood Estimation]
Let $\mathcal{Y} = \{0, 1\}$.
Learn confidence $p(\boldsymbol{x}; \boldsymbol{w}) \coloneq \Pr (Y=1|X=\boldsymbol{x})$.
\begin{equation}
\begin{split}
\max_{\boldsymbol{w}} \Pr (Y_1 = y_1, \dots, Y_n = y_n)
& = \max_{\boldsymbol{w}} \prod_{i=1}^n \Pr (Y_i = y_i | X_i = x_i) \\
& \stackrel{\mathcal{Y}=\{0, 1\}}{=} \max_{\boldsymbol{w}} \prod_{i=1}^n \left[p(\boldsymbol{x_i}; \boldsymbol{w})\right]^{y_i} \left[1 - p(\boldsymbol{x_i}; \boldsymbol{w})\right]^{1 - y_i} \\
\end{split}
\end{equation}
Use negative log-likelihood:
\begin{equation}
\min_{\boldsymbol{w}} \sum_{i=1}^n \left[ -y_i \log p(\boldsymbol{x_i}; \boldsymbol{w}) - (1 - y_i) \log (1 - p(\boldsymbol{x_i}; \boldsymbol{w})) \right]
\end{equation}
\end{question}
\begin{question}[Odds and Sigmoid]
\begin{equation}
\text{Odds} = \frac{\Pr}{1 - \Pr}
\end{equation}
Assume $\log \frac{p(\boldsymbol{x};\boldsymbol{w})}{1 - p(\boldsymbol{x};\boldsymbol{w})} = \iprod{\boldsymbol{x}, \boldsymbol{w}}$.
\vspace{5mm}
The Sigmoid transformation is:
\begin{equation}
p(\boldsymbol{x}; \boldsymbol{w}) = \frac{1}{1 + \exp (-\iprod{\boldsymbol{x}, \boldsymbol{w}})}
\end{equation}
\end{question}
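The sigmoid is exactly the inverse of the log-odds map, which a two-line check confirms (the value of $a$ is arbitrary):

```python
import numpy as np

def sigmoid(a):
    """p = 1 / (1 + exp(-a)), where a plays the role of <x, w>."""
    return 1.0 / (1.0 + np.exp(-a))

a = 1.7                                    # an arbitrary log-odds value
p = sigmoid(a)
assert np.isclose(np.log(p / (1 - p)), a)  # log-odds of p recovers a
```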
\begin{question}[Logistic Regression]
Plug the sigmoid into the negative log-likelihood:
\begin{equation}
\begin{split}
& \min_{\boldsymbol{w}} \sum_{i=1}^n \left[ -y_i \log p(\boldsymbol{x_i}; \boldsymbol{w}) - (1 - y_i) \log (1 - p(\boldsymbol{x_i}; \boldsymbol{w})) \right] \\
= & \min_{\boldsymbol{w}} \sum_{i=1}^n \left[ y_i \log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})] - (1 - y_i) \log \left(1 - \frac{1}{1 + \exp (-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})}\right) \right] \\
= & \min_{\boldsymbol{w}} \sum_{i=1}^n \left[ y_i \log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})] - (1 - y_i) \log \left(\frac{\exp (-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})}{1 + \exp (-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})}\right) \right] \\
= & \min_{\boldsymbol{w}} \sum_{i=1}^n \left[ y_i \log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})] + (1 - y_i) \left[\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + \log(1 + \exp (-\iprod{\boldsymbol{x_i}, \boldsymbol{w}}))\right] \right] \\
= & \min_{\boldsymbol{w}} \sum_{i=1}^n \left[ \log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})] + (1 - y_i) \iprod{\boldsymbol{x_i}, \boldsymbol{w}} \right] \\
\end{split}
\end{equation}
Because $y_i \in \{0, 1\}$, let's map it to $\{\pm 1\}$.
\begin{equation}
\begin{split}
L(\boldsymbol{w})
& \stackrel{y_i \in \{0, 1\}}{=} \log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})] + (1 - y_i) \iprod{\boldsymbol{x_i}, \boldsymbol{w}} \\
& \stackrel{y_i \in \{0, 1\}}{=} \log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})] + \log \left[ \exp ( (1 - y_i) \iprod{\boldsymbol{x_i}, \boldsymbol{w}} ) \right] \\
& \stackrel{y_i \in \{0, 1\}}{=} \log \left[ (1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})) \cdot \exp ( (1 - y_i) \iprod{\boldsymbol{x_i}, \boldsymbol{w}} ) \right] \\
& \stackrel{y_i \in \{0, 1\}}{=} \log [\exp((1-y_i)\iprod{\boldsymbol{x_i}, \boldsymbol{w}}) + \exp(-y_i \iprod{\boldsymbol{x_i}, \boldsymbol{w}})] \\
& \stackrel{y_i \in \{0, 1\}}{=} \begin{cases}
\log [\exp(\iprod{\boldsymbol{x_i}, \boldsymbol{w}}) + 1] & y_i = 0 \\
\log [1 + \exp(-\iprod{\boldsymbol{x_i}, \boldsymbol{w}})] & y_i = 1
\end{cases} \\
& \stackrel{y_i \in \{\pm 1\}}{=} \log [ 1 + \exp(-y_i \iprod{\boldsymbol{x_i}, \boldsymbol{w}}) ]
\end{split}
\end{equation}
\end{question}
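The two label conventions give the same loss value, which is easy to confirm numerically (random weights and features, made up for the check):

```python
import numpy as np

def loss_01(w, x, y01):
    """Per-sample negative log-likelihood with labels in {0, 1}."""
    a = x @ w
    return np.log1p(np.exp(-a)) + (1 - y01) * a

def loss_pm(w, x, ypm):
    """Compact logistic loss with labels in {-1, +1}."""
    return np.log1p(np.exp(-ypm * (x @ w)))

rng = np.random.default_rng(3)
w, x = rng.normal(size=4), rng.normal(size=4)
for y01 in (0, 1):
    ypm = 2 * y01 - 1  # map {0, 1} -> {-1, +1}
    assert np.isclose(loss_01(w, x, y01), loss_pm(w, x, ypm))
```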
\begin{question}[Multi-Class: Sigmoid $\rightarrow$ Softmax]
\begin{equation}
\Pr(Y=k | X = \boldsymbol{x}; \boldsymbol{W} = [\boldsymbol{w_1}, \dots, \boldsymbol{w_c}])
= \frac{\exp(\iprod{\boldsymbol{x}, \boldsymbol{w_k}})}{\sum_{l=1}^c \exp(\iprod{\boldsymbol{x}, \boldsymbol{w_l}})}
\end{equation}
Maximum likelihood estimation (log loss, cross-entropy loss):
\begin{equation}
\min_{\boldsymbol{W}} \sum_{i=1}^n \left[ -\log \frac{\exp(\iprod{\boldsymbol{x_i}, \boldsymbol{w_{y_i}}})}{\sum_{l=1}^c \exp(\iprod{\boldsymbol{x_i}, \boldsymbol{w_l}})} \right]
\end{equation}
\end{question}
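A small sketch of the softmax cross-entropy (the max-subtraction for numerical stability is a standard implementation detail, not from the notes):

```python
import numpy as np

def softmax_xent(W, x, k):
    """Cross-entropy -log Pr(Y = k | x) under the softmax model."""
    scores = W @ x                  # one score <x, w_l> per class
    scores = scores - scores.max()  # shift by the max for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[k]

rng = np.random.default_rng(4)
W = rng.normal(size=(3, 5))          # c = 3 classes, d = 5 features
x = rng.normal(size=5)
probs = np.exp([-softmax_xent(W, x, k) for k in range(3)])
assert np.isclose(probs.sum(), 1.0)  # the implied class probabilities sum to 1
```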
\section{Hard-Margin Support Vector Machines}
\begin{question}[Distance from a Point to a Hyperplane]
Let $H \coloneq \{\boldsymbol{x}: \iprod{\boldsymbol{x}, \boldsymbol{w}} + b = 0\}$, and let $\boldsymbol{x}$ be any point on $H$.
\begin{equation}
\text{Distance}(\boldsymbol{x_i}, \boldsymbol{w})
= \frac{|\iprod{\boldsymbol{x_i} - \boldsymbol{x}, \boldsymbol{w}}|}{\llnorm{\boldsymbol{w}}}
= \frac{|\iprod{\boldsymbol{x_i}, \boldsymbol{w}} - \iprod{\boldsymbol{x}, \boldsymbol{w}}|}{\llnorm{\boldsymbol{w}}}
\stackrel{\boldsymbol{x} \in H}{=} \frac{|\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b|}{\llnorm{\boldsymbol{w}}}
\stackrel{y_i \hat{y_i} > 0}{=} \frac{y_i \hat{y_i}}{\llnorm{\boldsymbol{w}}}
\end{equation}
\end{question}
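A quick numeric instance of the distance formula (the plane and point are chosen so the answer is a round number):

```python
import numpy as np

w = np.array([3.0, 4.0])   # ||w|| = 5
b = -5.0
x_i = np.array([3.0, 4.0])

# distance = |<x_i, w> + b| / ||w|| = |25 - 5| / 5 = 4
dist = abs(x_i @ w + b) / np.linalg.norm(w)
assert np.isclose(dist, 4.0)
```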
\begin{question}[Margin Maximization]
The margin is the smallest distance to $H$ over all (correctly classified) data points.
\begin{equation}
\max_{\boldsymbol{w}, b} \min_i \frac{y_i \hat{y_i}}{\llnorm{\boldsymbol{w}}}
\text{, such that }
\forall i, y_i \hat{y_i} > 0
\end{equation}
Let $c > 0$; rescaling $\boldsymbol{w} \rightarrow c\boldsymbol{w}, b \rightarrow cb$ leaves the objective unchanged:
\begin{equation}
\begin{split}
\max_{\boldsymbol{w}, b} \min_i \frac{c y_i \hat{y_i}}{\llnorm{c\boldsymbol{w}}}
& = \max_{\boldsymbol{w}, b} \min_i \frac{y_i (\iprod{\boldsymbol{x}, c\boldsymbol{w}} + cb)}{\llnorm{c\boldsymbol{w}}} \\
& = \max_{\boldsymbol{w}, b} \min_i \frac{c y_i (\iprod{\boldsymbol{x}, \boldsymbol{w}} + b)}{c \llnorm{\boldsymbol{w}}} \\
& = \max_{\boldsymbol{w}, b} \min_i \frac{y_i (\iprod{\boldsymbol{x}, \boldsymbol{w}} + b)}{\llnorm{\boldsymbol{w}}} \\
& = \max_{\boldsymbol{w}, b} \min_i \frac{y_i \hat{y_i}}{\llnorm{\boldsymbol{w}}}
\end{split}
\end{equation}
Let $c = \frac{1}{\min_i y_i \hat{y_i}}$,
\begin{equation}
\begin{split}
\max_{\boldsymbol{w}, b} \min_i \frac{c y_i \hat{y_i}}{c \llnorm{\boldsymbol{w}}}
& = \max_{\boldsymbol{w}, b} \frac{1}{c \llnorm{\boldsymbol{w}}} \\
& = \max_{\boldsymbol{w}, b} \frac{1}{\llnorm{\boldsymbol{w}}} \text{ s.t. } \min_i y_i \hat{y_i} = 1
\end{split}
\end{equation}
Max $\rightarrow$ Min:
\begin{equation}
\min_{\boldsymbol{w}, b} \frac{1}{2} \llnorm{\boldsymbol{w}}^2 \text{ s.t. } \forall i, y_i \hat{y_i} \geq 1
\end{equation}
\end{question}
\begin{question}[Hard-Margin SVM vs.\ Perceptron]
\begin{align}
\text{Hard-Margin SVM:} \qquad
& \min_{\boldsymbol{w}, b} \frac{1}{2} \llnorm{\boldsymbol{w}}^2 & \text{ s.t. } \forall i, y_i \hat{y_i} \geq 1 \\
\text{Perceptron:} \qquad
& \min_{\boldsymbol{w}, b} 0 & \text{ s.t. } \forall i, y_i \hat{y_i} \geq 1
\end{align}
\end{question}
\begin{question}[Lagrangian Dual]
Dual variables $\boldsymbol{\alpha} \in \mathbb{R}^n$.
\begin{equation}
\begin{split}
\min_{\boldsymbol{w}, b} \; \max_{\boldsymbol{\alpha} \geq 0} \; \frac{1}{2} \llnorm{\boldsymbol{w}}^2 - \sum_i \alpha_i \left[ y_i (\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b) - 1 \right]
& = \min_{\boldsymbol{w}, b} \; \begin{cases}
+ \infty, & \text{if } \exists i, y_i (\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b) < 1 \quad (\alpha_i = + \infty) \\
\frac{1}{2} \llnorm{\boldsymbol{w}}^2 , & \text{if } \forall i, y_i (\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b) \geq 1 \quad (\forall i, \alpha_i = 0) \\
\end{cases} \\
& = \min_{\boldsymbol{w}, b} \; \frac{1}{2} \llnorm{\boldsymbol{w}}^2, \quad \text{s.t. } \forall i, y_i (\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b) \geq 1 \\
& = \min_{\boldsymbol{w}, b} \frac{1}{2} \llnorm{\boldsymbol{w}}^2 \text{ s.t. } \forall i, y_i \hat{y_i} \geq 1
\end{split}
\end{equation}
Swap $\min$ and $\max$:
\begin{equation} \label{lagrangian-max-min}
\max_{\boldsymbol{\alpha} \geq 0} \; \min_{\boldsymbol{w}, b} \; \frac{1}{2} \llnorm{\boldsymbol{w}}^2 - \sum_i \alpha_i \left[ y_i (\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b) - 1 \right]
\end{equation}
Solve inner problem by setting derivative to 0:
\begin{equation} \label{solve-derivative-for-svm-dual}
\frac{\partial L}{\partial \boldsymbol{w}} = \boldsymbol{w} - \sum_i \alpha_i y_i \boldsymbol{x_i} = 0,
\qquad
\frac{\partial L}{\partial b} = - \sum_i \alpha_i y_i = 0
\end{equation}
Plug them into the loss:
\begin{equation}
\begin{split}
L(\boldsymbol{\alpha})
& = \min_{\boldsymbol{w}, b} \; \frac{1}{2} \llnorm{\boldsymbol{w}}^2 - \sum_i \alpha_i \left[ y_i (\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b) - 1 \right] \\
& = \frac{1}{2} \llnorm{\sum_i \alpha_i y_i \boldsymbol{x_i}}^2 - \sum_i \alpha_i y_i \iprod{\boldsymbol{x_i}, \boldsymbol{w}} - \sum_i \alpha_i y_i b + \sum_i \alpha_i \\
& = \frac{1}{2} \llnorm{\sum_i \alpha_i y_i \boldsymbol{x_i}}^2 - \iprod{\sum_i \alpha_i y_i \boldsymbol{x_i}, \sum_i \alpha_i y_i \boldsymbol{x_i}} - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
& = \frac{1}{2} \llnorm{\sum_i \alpha_i y_i \boldsymbol{x_i}}^2 - \llnorm{\sum_i \alpha_i y_i \boldsymbol{x_i}}^2 + \sum_i \alpha_i \\
& = \sum_i \alpha_i - \frac{1}{2} \llnorm{\sum_i \alpha_i y_i \boldsymbol{x_i}}^2,
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0
\end{split}
\end{equation}
So, \eqref{lagrangian-max-min} is solved as:
\begin{equation}
\max_{\boldsymbol{\alpha} \geq 0} \sum_i \alpha_i - \frac{1}{2} \llnorm{\sum_i \alpha_i y_i \boldsymbol{x_i}}^2,
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0
\end{equation}
Max $\rightarrow$ min and expand the norm:
\begin{equation}
\min_{\boldsymbol{\alpha} \geq 0} - \sum_i \alpha_i + \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \underbrace{\iprod{\boldsymbol{x_i}, \boldsymbol{x_j}}}_{\text{Kernel, closed form w.r.t. }\boldsymbol{x_i}, \boldsymbol{x_j}},
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0
\end{equation}
\end{question}
\begin{question}[Support Vectors]
From \eqref{solve-derivative-for-svm-dual}, we know $\boldsymbol{w} = \sum_i \alpha_i y_i \boldsymbol{x_i}$.
Vectors with $\alpha_i \neq 0$ are support vectors, which lie on the margin.
\end{question}
\section{Soft-Margin Support Vector Machines}
\begin{question}[Goal]
Minimize over $\boldsymbol{w}, b$:
\begin{equation}
\Pr(Y \neq \sign(\hat Y)) = \Pr(Y \hat Y \leq 0)
= \mathbb{E} \; \underbrace{\mathbb{I} [Y \hat Y \leq 0]}_{\text{indicator function}}
\coloneq \mathbb{E} \; l_{0-1}(Y \hat Y)
\end{equation}
where $\hat Y = \iprod{\boldsymbol{X}, \boldsymbol{w}} + b$ and $Y = \pm 1$.
\begin{equation}
\begin{split}
\min_{\hat Y: \mathcal{X} \rightarrow \mathbb{R}} \mathbb{E} \; l_{0-1}(Y \hat Y)
& = \min_{\hat Y: \mathcal{X} \rightarrow \mathbb{R}} \mathbb{E}_X \mathbb{E}_{Y|X} \; l_{0-1}(Y \hat Y) \\
& = \mathbb{E}_X \min_{\hat Y: \mathcal{X} \rightarrow \mathbb{R}} \mathbb{E}_{Y|X} \; l_{0-1}(Y \hat Y) \\
\end{split}
\end{equation}
Minimizing the 0-1 error is \textbf{NP-hard}.
\end{question}
\begin{question}[Bayes Rule]
\begin{equation}
\eta(\boldsymbol{x}) \coloneq \argmax_{\hat y \in \{\pm 1\}} \Pr (Y = \hat y | X = \boldsymbol{x})
\end{equation}
Equivalently,
\begin{equation}
\eta(\boldsymbol{x}) \coloneq \argmin_{\hat y \in \mathbb{R}} \mathbb{E}_{Y|X=\boldsymbol{x}} \; l_{0-1}(Y \hat y)
\end{equation}
\end{question}
\begin{question}[Classification Calibrated]
A loss $l(y \hat y)$ is classification calibrated iff $\forall \boldsymbol{x}$,
\begin{equation*}
\hat y(\boldsymbol{x}) \coloneq \argmin_{\hat y \in \mathbb{R}} \mathbb{E}_{Y | X = \boldsymbol{x}} \; l(Y \hat y)
\end{equation*}
has the same sign as the Bayes rule $\eta(\boldsymbol{x}) \coloneq \argmin_{\hat y \in \mathbb{R}} \mathbb{E}_{Y|X=\boldsymbol{x}} \; l_{0-1}(Y \hat y)$.
Notice: $\eta(\boldsymbol{x})$ and $\hat y (\boldsymbol{x})$ provide the \textbf{score}; their signs provide the predictions.
\end{question}
\begin{question}[Characterization under Convexity]
Any \textbf{convex} loss $l$ is classification calibrated iff $l$ is differentiable
at $0$ and $l'(0) < 0$.
\end{question}
\begin{question}[Hinge Loss]
\begin{equation}
l_{hinge}(y \hat y) = (1 - y \hat y)^+ \coloneq \max \{0, 1 - y \hat y\}
= \begin{cases}
1 - y \hat y, & \text{if } y \hat y < 1 \\
0, & \text{otherwise}
\end{cases}
\end{equation}
The classifier that minimizes the expected hinge loss also minimizes the expected 0-1 loss, since the hinge loss is convex and classification calibrated.
\end{question}
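A direct implementation of the hinge loss, also checking that it upper-bounds the 0-1 loss:

```python
import numpy as np

def hinge(margin):
    """l_hinge(y * yhat) = max(0, 1 - y * yhat), applied elementwise."""
    return np.maximum(0.0, 1.0 - margin)

margins = np.array([-0.5, 0.0, 0.5, 1.0, 2.0])   # values of y * yhat
assert np.allclose(hinge(margins), [1.5, 1.0, 0.5, 0.0, 0.0])

l01 = (margins <= 0).astype(float)               # 0-1 loss on the same margins
assert np.all(hinge(margins) >= l01)             # hinge upper-bounds 0-1 loss
```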
\begin{question}[Soft-Margin SVM]
\begin{equation}
\min_{\boldsymbol{w}, b} \frac{1}{2} \llnorm{\boldsymbol{w}}^2 + C \cdot \sum_i l_{hinge}(y_i \hat y_i), \quad \text{s.t. } \hat y_i = \iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b
\end{equation}
\end{question}
\begin{question}[Lagrangian Dual]
Apply $C \cdot l_{hinge}(t) \coloneq \max\{0, C(1 - t)\} = \max_{0 \leq \alpha \leq C} \alpha (1 - t)$
\begin{equation}
\min_{\boldsymbol{w}, b} \max_{0 \leq \boldsymbol{\alpha} \leq C} \frac{1}{2} \llnorm{\boldsymbol{w}}^2 + \sum_i \alpha_i (1 - y_i \hat y_i)
\end{equation}
Swap $\min$ with $\max$:
\begin{equation}
\max_{0 \leq \boldsymbol{\alpha} \leq C} \min_{\boldsymbol{w}, b} \frac{1}{2} \llnorm{\boldsymbol{w}}^2 + \sum_i \alpha_i (1 - y_i \hat y_i)
\end{equation}
Solve it by setting derivative to 0:
\begin{equation}
\frac{\delta}{\delta \boldsymbol{w}} = \boldsymbol{w} - \sum_i \alpha_i y_i \boldsymbol{x_i} = 0,
\qquad
\frac{\delta}{\delta b} = - \sum_i \alpha_i y_i = 0,
\end{equation}
Plug them into the loss:
\begin{equation}
\begin{split}
\max_{0 \leq \boldsymbol{\alpha} \leq C} \min_{\boldsymbol{w}, b} \frac{1}{2} \llnorm{\boldsymbol{w}}^2 + \sum_i \alpha_i (1 - y_i \hat y_i)
& = \max_{0 \leq \boldsymbol{\alpha} \leq C} \min_{\boldsymbol{w}, b} \; \frac{1}{2} \llnorm{\boldsymbol{w}}^2 + \sum_i \alpha_i \left[ 1 - y_i (\iprod{\boldsymbol{x_i}, \boldsymbol{w}} + b) \right] \\
& = \max_{0 \leq \boldsymbol{\alpha} \leq C} \sum_i \alpha_i - \frac{1}{2} \llnorm{\sum_i \alpha_i y_i \boldsymbol{x_i}}^2,
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0
\end{split}
\end{equation}
Max $\rightarrow$ min and expand the norm:
\begin{equation}
\min_{0 \leq \boldsymbol{\alpha} \leq C} - \sum_i \alpha_i + \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \underbrace{\iprod{\boldsymbol{x_i}, \boldsymbol{x_j}}}_{\text{Kernel, closed form w.r.t. }\boldsymbol{x_i}, \boldsymbol{x_j}},
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0
\end{equation}
$C \rightarrow \infty \Rightarrow \text{Hard-margin SVM}$,
$C \rightarrow 0 \Rightarrow \text{a constant classifier}$
\end{question}
\section{Reproducing Kernels}
\begin{question}[(Reproducing) Kernels]
$k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is a (reproducing) kernel
iff there exists some $\Phi: \mathcal{X} \rightarrow \mathcal{H}$ so that
$\iprod{\Phi(\boldsymbol{x}), \Phi(\boldsymbol{z})} = k(\boldsymbol{x}, \boldsymbol{z})$.
\begin{itemize}
\item A feature transform $\Phi$ determines the corresponding kernel $k$.
\item A kernel $k$ is induced by some feature transform $\Phi$, which may not be unique. \\
E.g. $\iprod{\phi(\boldsymbol{x}), \phi(\boldsymbol{z})} = \iprod{\phi'(\boldsymbol{x}), \phi'(\boldsymbol{z})}$
\begin{enumerate}
\item $\phi(x) \coloneq [x_1^2, \sqrt{2} x_1 x_2] \in \mathbb{R}^2$
\item $\phi'(x) \coloneq [x_1^2, x_1 x_2, x_1 x_2] \in \mathbb{R}^3$
\end{enumerate}
\end{itemize}
\end{question}
\begin{question}[Mercer's Theorem]
$k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is a kernel, \\
iff $\forall n \in \mathbb{N}, \forall \boldsymbol{x_1}, \dots, \boldsymbol{x_n} \in \mathcal{X}$,
the kernel matrix $K$ such that $K_{ij} \coloneq k(\boldsymbol{x_i}, \boldsymbol{x_j})$ is symmetric and positive semi-definite (PSD).
\begin{equation}
k \text{ is a kernel} \Leftrightarrow \begin{cases}
K_{ij} = K_{ji} & \text{(symmetric)}\\
\iprod{\boldsymbol{\alpha}, K \boldsymbol{\alpha}} = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j K_{ij} \geq 0 \qquad \forall \boldsymbol{\alpha} \in \mathbb{R}^n & \text{(PSD)}
\end{cases}
\end{equation}
\end{question}
\begin{question}[Symmetric PSD]
For a symmetric matrix $A$, the following conditions are equivalent.
\begin{enumerate}
\item $A \succeq 0$
\item $A = U^T U$ for some matrix $U$
\item $x^T A x \geq 0$ for every $x \in \mathbb{R}^n$
\item All principal minors of $A$ are nonnegative
\end{enumerate}
\end{question}
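Mercer's condition can be sanity-checked numerically for a candidate kernel. A minimal NumPy sketch (the Gaussian/RBF kernel and the sampled points are illustrative):

\begin{lstlisting}[language=python]
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # Gaussian (RBF) kernel: k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

assert np.allclose(K, K.T)                     # symmetric
assert np.min(np.linalg.eigvalsh(K)) > -1e-10  # PSD up to numerical error
\end{lstlisting}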
\section{Gradient Descent}
\begin{question}[Gradient Descent Template]
Choose initial point $x^{(0)} \in \mathbb{R}^d$ and repeat:
\begin{equation}
x^{(k)} = x^{(k-1)} - \underbrace{\eta}_{\text{step size}} \nabla f(x^{(k-1)}), \quad k = 1, 2, \dots
\end{equation}
\end{question}
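A minimal NumPy sketch of the template on a convex quadratic $f(x) = \frac{1}{2} x^T A x - b^T x$ (the matrix $A$ and vector $b$ are made up; the step size is set from the largest eigenvalue of $A$, i.e. the smoothness constant):

\begin{lstlisting}[language=python]
import numpy as np

# f(x) = 0.5 * x^T A x - b^T x  (convex quadratic, minimizer solves A x = b)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b

L = np.max(np.linalg.eigvalsh(A))  # smoothness constant = largest eigenvalue
eta = 1.0 / L                      # fixed step size t <= 1/L
x = np.zeros(2)
for _ in range(500):
    x = x - eta * grad(x)

assert np.allclose(x, np.linalg.solve(A, b), atol=1e-6)
\end{lstlisting}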
\begin{question}[Interpretation from Taylor Expansion]
Expand $f$ locally at $x$:
\begin{equation}
\begin{split}
f(y) & \approx f(x) + \nabla f(x)^T(y-x) + \frac{1}{2t}\llnorm{y-x}^2 \\
\Rightarrow \min_y f(y) & \approx \min_y \left[ f(x) + \nabla f(x)^T(y-x) + \frac{1}{2t}\llnorm{y-x}^2 \right] \\
\end{split}
\end{equation}
Setting the derivative w.r.t.\ $y$ to zero gives $y - x = -t \nabla f(x)$, i.e. $y = x - t \nabla f(x)$, which attains the minimum.
\end{question}
\begin{question}[$L$-smooth or $L$-Lipschitz Continuous]
$f$ is convex and differentiable. $\nabla f$ is $L$-Lipschitz continuous (equivalently, $f$ is $L$-smooth):
\begin{equation}
\llnorm{\nabla f(x) - \nabla f(y)} \leq L \llnorm{x - y}, \quad \forall x, y
\end{equation}
For twice-differentiable $f$, this is equivalent to $L \boldsymbol{I} \succeq \nabla^2 f(x), \forall x$.
\end{question}
\begin{question}[Convergence Rate for Convex Case]
Assume $f$ is $L$-smooth.
Gradient descent with fixed step size $t \leq \frac{1}{L}$ satisfies:
\begin{equation}
f(x^{(k)}) - f(x^*) \leq \frac{\llnorm{x^{(0)} - x^*}^2}{2tk}
\end{equation}
We say gradient descent has convergence rate $O(\frac{1}{k})$,
i.e. $f(x^{(k)}) - f(x^*) \leq \epsilon$ can be achieved using only $O(\frac{1}{\epsilon})$ iterations.
\vspace{5mm}
\textbf{Proof}
\begin{equation}
\begin{split}
f(y)
& = f(x) + \nabla f(x)^T (y-x) + \frac{1}{2} (y-x)^T \nabla^2 f(\xi)(y-x) \\
& \leq f(x) + \nabla f(x)^T(y-x) + \frac{1}{2} L \llnorm{y-x}^2 \qquad (L\text{-smooth}, L \boldsymbol{I} \succeq \nabla^2 f(\xi))
\end{split}
\end{equation}
Plug in gradient descent:
\begin{equation} \label{convergence-1}
\begin{split}
f(x^+)
& = f(y) \\
& \leq f(x) + \nabla f(x)^T(x - t\nabla f(x)-x) + \frac{1}{2} L \llnorm{x - t\nabla f(x)-x}^2 \\
& = f(x) -(1 - \frac{1}{2}Lt) t \llnorm{\nabla f(x)}^2 \\
& \leq f(x) - \frac{1}{2} t \llnorm{\nabla f(x)}^2 \quad (t \leq \frac{1}{L})
\end{split}
\end{equation}
$f$ is convex $\Rightarrow f(x^*) \geq f(x) + \nabla f(x)^T(x^* - x) \Rightarrow f(x) \leq f(x^*) + \nabla f(x)^T(x - x^*)$. \\
Plug this into \eqref{convergence-1}:
\begin{equation}
\begin{split}
& f(x^+) \leq f(x^*) + \nabla f(x)^T (x - x^*) - \frac{1}{2} t \llnorm{\nabla f(x)}^2 \\
\Rightarrow & f(x^+) - f(x^*) \leq \frac{1}{2t}\left( 2t \nabla f(x)^T(x - x^*) - t^2 \llnorm{\nabla f(x)}^2 \right) \\
\Rightarrow & f(x^+) - f(x^*) \leq \frac{1}{2t}\left( 2t \nabla f(x)^T(x - x^*) - t^2 \llnorm{\nabla f(x)}^2 - \llnorm{x-x^*}^2 + \llnorm{x-x^*}^2 \right) \\
\Rightarrow & f(x^+) - f(x^*) \leq \frac{1}{2t}\left( \llnorm{x-x^*}^2 - (\llnorm{x-x^*}^2 + t^2 \llnorm{\nabla f(x)}^2 - 2t \nabla f(x)^T(x - x^*) ) \right) \\
\Rightarrow & f(x^+) - f(x^*) \leq \frac{1}{2t}\left( \llnorm{x-x^*}^2 - \llnorm{x - x^* - t\nabla f(x)}^2 \right) \\
\Rightarrow & f(x^+) - f(x^*) \leq \frac{1}{2t}\left( \llnorm{x-x^*}^2 - \llnorm{x^+ - x^*}^2 \right) \\
\end{split}
\end{equation}
Viewing $x^+$ as $x^{(i)}$ and $x$ as $x^{(i-1)}$:
\begin{equation}
\begin{split}
\sum_{i=1}^k \paren{ f(x^{(i)}) - f(x^*) }
& \leq \sum_{i=1}^k \frac{1}{2t} \paren{ \llnorm{x^{(i-1)} - x^*}^2 - \llnorm{x^{(i)} - x^*}^2 } \\
& = \frac{1}{2t} \paren{ \llnorm{x^{(0)} - x^*}^2 - \llnorm{x^{(k)} - x^*}^2 } \\
& \leq \frac{1}{2t} \llnorm{x^{(0)} - x^*}^2 \\
\end{split}
\end{equation}
which implies
\begin{equation}
f(x^{(k)}) \leq \frac{1}{k} \sum_{i=1}^k f(x^{(i)}) \leq f(x^*) + \frac{\llnorm{x^{(0)} - x^*}^2}{2tk}
\end{equation}
\end{question}
\begin{question}[Convergence Rate for Strong Convexity]
$f$ is differentiable, $L$-smooth, and $m$-strongly convex.
$m$-strong convexity of $f$ means $f(x) - \frac{m}{2} \llnorm{x}^2$ is convex, i.e. $\nabla^2 f(x) \succeq m \boldsymbol{I}$. \\
Then, there is a constant $0 < \gamma < 1$ such that gradient descent with fixed step size $t \leq \frac{2}{m + L}$ satisfies:
\begin{equation}
f(x^{(k)}) - f(x^*) \leq \gamma^k \frac{L}{2} \llnorm{x^{(0)} - x^*}^2
\end{equation}
Rate is $O(\gamma^k)$. Only $O(\log_{\frac{1}{\gamma}}(\frac{1}{\epsilon}))$ iterations needed.
\end{question}
\begin{question}[Convergence Rate for Non-Convex Case]
$f$ is differentiable and $L$-smooth, but non-convex.
Gradient descent with fixed step size $t \leq \frac{1}{L}$ satisfies:
\begin{equation}
\min_{i=0,\dots,k} \llnorm{\nabla f(x^{(i)})} \leq \sqrt{\frac{2 \left( f(x^{(0)}) - f^* \right)}{t(k+1)}}
\end{equation}
Rate is $O(\frac{1}{\sqrt{k}})$ for finding stationary point.
$O(\frac{1}{\epsilon^2})$ iterations are needed.
\end{question}
\begin{question}[Convergence Rate for Stochastic Gradient Descent]
For convex and $L$-smooth $f$,
\begin{itemize}
\item Gradient Descent
\begin{equation}
\boldsymbol{w^+} = \boldsymbol{w} - t \cdot \frac{1}{n} \sum_{i=1}^n \nabla f_i(\boldsymbol{w})
\end{equation}
\begin{itemize}
\item Step size $t \leq \frac{1}{L}$
\item Time complexity $O(\frac{n}{\epsilon})$
\end{itemize}
\item Stochastic Gradient Descent
\begin{equation}
\boldsymbol{w^+} = \boldsymbol{w} - t \cdot \nabla f_{I_{random}}(\boldsymbol{w})
\end{equation}
\begin{itemize}
\item Step size $t = \frac{1}{k}, k = 1, 2, 3, \dots$ (adaptive step size)
\item Time complexity $O(\frac{1}{\epsilon^2})$
\end{itemize}
\end{itemize}
\end{question}
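The two updates can be contrasted on a least-squares problem $f_i(\boldsymbol{w}) = \frac{1}{2}(\boldsymbol{x_i}^T \boldsymbol{w} - y_i)^2$. A minimal NumPy sketch (data is synthetic; step sizes follow the template above):

\begin{lstlisting}[language=python]
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = 0.5 * rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true  # noiseless targets: f_i(w) = 0.5 * (x_i^T w - y_i)^2

# Gradient descent: average of all n per-sample gradients each step.
w_gd = np.zeros(d)
for _ in range(300):
    w_gd -= 0.5 * (X.T @ (X @ w_gd - y)) / n

# Stochastic gradient descent: one random sample, adaptive step size 1/k.
w_sgd = np.zeros(d)
for k in range(1, 20001):
    i = rng.integers(n)
    w_sgd -= (1.0 / k) * (X[i] @ w_sgd - y[i]) * X[i]

assert np.allclose(w_gd, w_true, atol=1e-3)
assert np.linalg.norm(w_sgd - w_true) < 0.5
\end{lstlisting}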
\section{Fully-Connected Neural Networks}
\begin{question}[Forward and Backward Pass of a 2-Layer MLP]
A 2-layer MLP ($k$ is the NN width, $c$ is the output dim):
\begin{align}
\boldsymbol{x} & = \text{input} & (\boldsymbol{x} \in \mathbb{R}^d) \\
\boldsymbol{z} & = \boldsymbol{W} \boldsymbol{x} + \boldsymbol{b_1} & (\boldsymbol{W} \in \mathbb{R}^{k \times d}, \boldsymbol{z}, \boldsymbol{b_1} \in \mathbb{R}^k) \\
\boldsymbol{h} & = \text{ReLU}(\boldsymbol{z}) & (\boldsymbol{h} \in \mathbb{R}^k) \\
\boldsymbol{\theta} & = \boldsymbol{U} \boldsymbol{h} + \boldsymbol{b_2} & (\boldsymbol{U} \in \mathbb{R}^{c \times k}, \boldsymbol{\theta}, \boldsymbol{b_2} \in \mathbb{R}^c) \\
J & = \frac{1}{2} \llnorm{\boldsymbol{\theta} - \boldsymbol{y}}^2 & (\boldsymbol{y} \in \mathbb{R}^c, J \in \mathbb{R})
\end{align}
\begin{align}
\text{ReLU}(x) = \begin{cases}
x & x > 0 \\
0 & x \leq 0
\end{cases} \\
\text{ReLU}'(x) = \begin{cases}
1 & x > 0 \\
0 & x \leq 0
\end{cases}
\end{align}
Backward pass ($\odot$ is the Hadamard product, i.e. element-wise product):
\begin{align}
\frac{\partial J}{\partial \boldsymbol{\theta}} & = \boldsymbol{\theta} - \boldsymbol{y} \\
\frac{\partial J}{\partial \boldsymbol{U}} & = \frac{\partial J}{\partial \boldsymbol{\theta}} \circ \frac{\partial \boldsymbol{\theta}}{\partial \boldsymbol{U}} = (\boldsymbol{\theta} - \boldsymbol{y}) \boldsymbol{h}^T \\
\frac{\partial J}{\partial \boldsymbol{b_2}} & = \frac{\partial J}{\partial \boldsymbol{\theta}} \circ \frac{\partial \boldsymbol{\theta}}{\partial \boldsymbol{b_2}} = \boldsymbol{\theta} - \boldsymbol{y} \\
\frac{\partial J}{\partial \boldsymbol{h}} & = \frac{\partial J}{\partial \boldsymbol{\theta}} \circ \frac{\partial \boldsymbol{\theta}}{\partial \boldsymbol{h}} = \boldsymbol{U}^T(\boldsymbol{\theta} - \boldsymbol{y}) \\
\frac{\partial J}{\partial \boldsymbol{z}} & = \frac{\partial J}{\partial \boldsymbol{h}} \circ \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}} = \boldsymbol{U}^T(\boldsymbol{\theta} - \boldsymbol{y}) \odot \text{ReLU}'(\boldsymbol{z}) \\
\frac{\partial J}{\partial \boldsymbol{W}} & = \frac{\partial J}{\partial \boldsymbol{z}} \circ \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}} = \left[ \boldsymbol{U}^T(\boldsymbol{\theta} - \boldsymbol{y}) \odot \text{ReLU}'(\boldsymbol{z}) \right] \boldsymbol{x}^T \\
\frac{\partial J}{\partial \boldsymbol{b_1}} & = \frac{\partial J}{\partial \boldsymbol{z}} \circ \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{b_1}} = \boldsymbol{U}^T(\boldsymbol{\theta} - \boldsymbol{y}) \odot \text{ReLU}'(\boldsymbol{z})
\end{align}
\end{question}
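The backward-pass formulas can be verified against finite differences. A minimal NumPy sketch (dimensions and random data are illustrative):

\begin{lstlisting}[language=python]
import numpy as np

rng = np.random.default_rng(0)
d, k, c = 4, 5, 3
x = rng.normal(size=d); y = rng.normal(size=c)
W = rng.normal(size=(k, d)); b1 = rng.normal(size=k)
U = rng.normal(size=(c, k)); b2 = rng.normal(size=c)

def loss(W):
    z = W @ x + b1
    h = np.maximum(z, 0.0)  # ReLU
    theta = U @ h + b2
    return 0.5 * np.sum((theta - y) ** 2)

# Backward pass following the formulas above
z = W @ x + b1
h = np.maximum(z, 0.0)
theta = U @ h + b2
dz = (U.T @ (theta - y)) * (z > 0)  # dJ/dz
dW = np.outer(dz, x)                # dJ/dW

# Finite-difference check of dJ/dW
eps = 1e-6
num = np.zeros_like(W)
for i in range(k):
    for j in range(d):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
assert np.allclose(dW, num, atol=1e-5)
\end{lstlisting}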
\begin{question}[Universal Approximation Theorem by 2-Layer NNs]
For any continuous function $f: \mathbb{R}^d \rightarrow \mathbb{R}^c$
and any $\epsilon > 0$,
there exists $k \in \mathbb{N}, \boldsymbol{W} \in \mathbb{R}^{k \times d}, \boldsymbol{b} \in \mathbb{R}^k, \boldsymbol{U} \in \mathbb{R}^{c \times k}$
such that
\begin{equation}
\sup_{\boldsymbol{x}} \llnorm{f(\boldsymbol{x}) - g(\boldsymbol{x})} < \epsilon
\end{equation}
where $g(\boldsymbol{x}) = \boldsymbol{U}(\sigma(\boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}))$
and $\sigma$ is the element-wise ReLU operation.
As long as the 2-layer MLP is wide enough, it can approximate any
continuous function arbitrarily closely.
\end{question}
\section{Convolutional Neural Networks}
\begin{question}[Controlling the Convolution]
Hyperparameters.
\begin{itemize}
\item \textbf{Filter (kernel) size}: width $\times$ height.
\item \textbf{Number of filters (kernels)}. \\
Weights are not shared between different filters (kernels)
\item \textbf{Stride}: how many pixels the filter moves each time.
\item \textbf{Padding}: add zeros around the boundary of the input.
\end{itemize}
\end{question}
\begin{question}[Size Calculation]
\begin{align*}
\text{Input size: } & m \times n \times c_{in} \\
\text{Filter size: } & a \times b \times c_{in} \\
\text{Stride: } & s \times t \\
\text{Padding: } & p \times q
\end{align*}
Output size:
\begin{equation}
\floor*{1 + \frac{m + 2p - a}{s}} \times \floor*{1 + \frac{n + 2q - b}{t}}
\end{equation}
\end{question}
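A small helper implementing the formula (argument names follow the notation above):

\begin{lstlisting}[language=python]
def conv_output_size(m, n, a, b, s=1, t=1, p=0, q=0):
    """Spatial output size of a convolution with input m x n,
    filter a x b, stride s x t, and padding p x q."""
    return (1 + (m + 2 * p - a) // s, 1 + (n + 2 * q - b) // t)

# 32x32 input, 5x5 filter, stride 1, no padding -> 28x28
assert conv_output_size(32, 32, 5, 5) == (28, 28)
# 224x224 input, 7x7 filter, stride 2, padding 3 -> 112x112
assert conv_output_size(224, 224, 7, 7, s=2, t=2, p=3, q=3) == (112, 112)
\end{lstlisting}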
\part{For Final}
\section{Transformer}
\begin{question}[Attention Layer Inputs and Outputs]
Inputs: $V \in \mathbb{R}^{n \times d}$, $K \in \mathbb{R}^{n \times d}$, $Q \in \mathbb{R}^{m \times d}$. \\
Outputs: an $m \times d$ matrix.
\begin{itemize}
\item \textbf{Self Attention}: $m = n$,
\item \textbf{Cross Attention}: $m \neq n$ where $m$ is the sequence length of decoder, $n$ is the sequence length of encoder.
\end{itemize}
\end{question}
\begin{question}[Attention Layer Calculation]
\begin{equation}
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V
\end{equation}
Softmax is row-wise, i.e. for each row of $QK^T$, it is normalized to sum to 1.
\end{question}
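A minimal NumPy sketch of the formula with row-wise softmax (shapes follow the previous question; all matrices are random placeholders):

\begin{lstlisting}[language=python]
import numpy as np

def softmax_rows(M):
    E = np.exp(M - M.max(axis=-1, keepdims=True))  # numerically stable
    return E / E.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    return softmax_rows(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
n, m, d = 6, 4, 8
Q = rng.normal(size=(m, d)); K = rng.normal(size=(n, d)); V = rng.normal(size=(n, d))
out = attention(Q, K, V)
assert out.shape == (m, d)
# each row of the attention weights sums to 1
w = softmax_rows(Q @ K.T / np.sqrt(d))
assert np.allclose(w.sum(axis=1), 1.0)
\end{lstlisting}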
\begin{question}[Learnable Attention Layer]
\begin{equation}
\text{Attention}(XW^v, XW^k, XW^q) = \text{softmax}\left(\frac{XW^q(XW^k)^T}{\sqrt{d}}\right)XW^v
\end{equation}
\end{question}
\begin{question}[RMSNorm (LLaMA's Choice)]
\begin{equation}
\bar{a}_i = \frac{a_i}{\text{RMS}(\boldsymbol{a})}\gamma_i = \frac{a_i}{\sqrt{\frac{1}{d}\sum_{j=1}^d a_j^2}}\gamma_i
\end{equation}
\end{question}
\begin{question}[Transformer Loss]
\begin{equation}
\min_{W} \hat{\mathbb{E}}\left[ - \iprod{Y, \log \hat{Y}} \right]
\end{equation}
$Y$ is output sequence, one-hot; \\
$\hat{Y}$ is the predicted probabilities
\end{question}
\begin{question}[Transformer Implementation] As following.
\begin{lstlisting}[language=python]
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class RMSNorm(nn.Module):
    def __init__(self, hidden_dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_dim))

    def forward(self, hidden_state):
        norm = hidden_state.pow(2).mean(-1, keepdim=True)
        return hidden_state * self.weight * torch.rsqrt(norm + self.eps)

class MultiHeadAttention(nn.Module):
    def __init__(self, hidden_dim, num_heads):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        self.q_linear = nn.Linear(hidden_dim, hidden_dim)
        self.k_linear = nn.Linear(hidden_dim, hidden_dim)
        self.v_linear = nn.Linear(hidden_dim, hidden_dim)
        self.o_linear = nn.Linear(hidden_dim, hidden_dim)
        self.norm = RMSNorm(hidden_dim)

    def forward(self, hidden_state, mask, past_kv=None, use_cache=True):
        bs = hidden_state.shape[0]
        residual = hidden_state
        hidden_state = self.norm(hidden_state)  # LLaMA-style pre-normalization
        q = self.q_linear(hidden_state)  # (bs, seqlen, hidden_dim)
        k = self.k_linear(hidden_state)  # (bs, seqlen, hidden_dim)
        v = self.v_linear(hidden_state)  # (bs, seqlen, hidden_dim)
        # split heads: (bs, num_heads, seqlen, head_dim)
        q = q.view(bs, -1, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(bs, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(bs, -1, self.num_heads, self.head_dim).transpose(1, 2)
        q, k = apply_rope(q, k)  # rotary position embedding (defined elsewhere)
        # KV cache: prepend cached keys/values along the sequence dimension
        if past_kv is not None:
            past_k, past_v = past_kv
            k = torch.cat([past_k, k], dim=2)
            v = torch.cat([past_v, v], dim=2)
        new_past_kv = (k, v) if use_cache else None
        # scaled dot-product attention
        attention_scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(self.head_dim)
        attention_scores = attention_scores + mask * -1e9  # mask == 1 where disallowed
        attention_scores = F.softmax(attention_scores, dim=-1)
        output = torch.matmul(attention_scores, v)
        # concatenate heads
        output = output.transpose(1, 2).contiguous().view(bs, -1, self.hidden_dim)
        output = self.o_linear(output) + residual
        if use_cache:
            return output, new_past_kv
        return output
\end{lstlisting}
\end{question}
\section{Large Language Models}
\begin{question}[BERT vs.\ GPT]
BERT is an encoder; GPT is a decoder.
\begin{itemize}
\item BERT predicts middle words; GPT predicts the next word.
\item BERT is \textbf{NOT} auto-regressive; GPT is auto-regressive.
\end{itemize}
\end{question}
\begin{question}[GPT -- Generative Pre-Training]
\begin{equation}
\min_{\Theta} \hat{\mathbb{E}} - \log \prod_{j=1}^m \Pr(x_j \vert x_1, \dots, x_{j-1}; \Theta)
\end{equation}
\end{question}
\begin{question}[Fine-Tuning Tasks]
Supervised fine-tuning tasks:
\begin{equation}
\min_{\Theta} \underbrace{- \hat{\mathbb{E}} \log \Pr(y \vert X_{1:m}; \Theta)}_{\text{task-aware supervised loss}} \underbrace{- \lambda \hat{\mathbb{E}} \log \prod_{j=1}^m \Pr(x_j \vert X_{1:j-1}; \Theta)}_{\text{pretraining loss}}
\end{equation}
\end{question}
\begin{question}[BERT $\rightarrow$ RoBERTa]
Training longer, with bigger batches, over more data and longer sequences.
Removing the next-sentence-prediction objective.
\end{question}
\begin{question}[Sentence-BERT]
A twin (Siamese) network architecture that uses BERT to derive sentence embeddings.
\end{question}
\begin{question}[GPT-2]
1.5B parameters.
\begin{itemize}
\item 10x larger than GPT-1
\item The training method is the same as GPT-1's.
\item Performs on par with BERT on fine-tuning tasks.
\item Good at zero-shot learning.
\item Open-source.
\end{itemize}
\end{question}
\begin{question}[GPT-3]
175B parameters.
\begin{itemize}
\item 100x larger than GPT-2.
\item The training method is the same as GPT-2's.
\item New phenomenon: \textbf{in-context learning} (ICL, or few-shot learning) and \textbf{chain-of-thoughts} (CoT).
\end{itemize}
\end{question}
\begin{question}[GPT-3.5 -- RLHF]
Reinforcement Learning from Human Feedback (RLHF).
\begin{itemize}
\item state = prompt
\item action = model output
\item policy function = LLM
\item reward = levels of matching human feedback
\end{itemize}
Pairwise comparison loss to train the reward model $r_\theta$:
\begin{equation}
\mathcal{L}_{\text{pair}}(\theta) = - \frac{1}{\binom{K}{2}} \mathbb{E}_{(x, y_w, y_l)} \left[ \log ( \sigma(r_\theta(x, y_w) - r_\theta(x, y_l)) ) \right]
\end{equation}
Proximal Policy Optimization (PPO) to maximize objective:
\begin{equation}
\max_{\Phi} \mathbb{E}_{(x, y)} \left[ \underbrace{r_\theta(x, y)}_{\text{maximize reward}} - \beta \underbrace{\log\left( \frac{\pi_\Phi^{\text{RL}}(y \vert x)}{\pi^{\text{SFT}}(y \vert x)}\right) }_{\text{model is close to SFT model}} + \gamma \underbrace{\mathbb{E}[\log(\pi_\Phi^{\text{RL}}(x))]}_{\text{pretraining loss}} \right]
\end{equation}
\end{question}
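The pairwise comparison loss reduces to $-\log\sigma(r_w - r_l)$ averaged over sampled pairs. A minimal NumPy sketch (the reward scores below are made up):

\begin{lstlisting}[language=python]
import numpy as np

def pairwise_loss(r_w, r_l):
    # -E[ log sigmoid(r_theta(x, y_w) - r_theta(x, y_l)) ]
    # note: -log sigmoid(t) = log(1 + exp(-t))
    return float(np.mean(np.log1p(np.exp(-(r_w - r_l)))))

# reward-model scores for preferred (y_w) vs.\ dispreferred (y_l) completions
r_w = np.array([2.0, 1.5, 0.3])
r_l = np.array([0.5, 1.0, 0.8])
assert pairwise_loss(r_w, r_l) > 0
# well-separated rewards drive the loss toward 0
assert pairwise_loss(r_l + 10.0, r_l) < 1e-4
\end{lstlisting}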
\section{Speculative Sampling}
\begin{question}[Rejection Sampling for Verification]
Draft tokens $t \sim q$ (draft model) are checked in parallel against the target model $p$.
\begin{itemize}
\item $r \sim U(0, 1), \quad \text{if } r < \underbrace{\min\left(1, \frac{p(t)}{q(t)}\right)}_{\text{accept rate}}$, next token $= t$.
\item else: next token $= t' \sim \underbrace{\text{norm}(\max(0, p-q))}_{\text{residual distribution}}$.
\end{itemize}
\end{question}
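The acceptance rule can be checked empirically: sampling through the draft-then-verify procedure should reproduce $p$ exactly. A sketch with made-up 3-token distributions:

\begin{lstlisting}[language=python]
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])  # target model distribution
q = np.array([0.2, 0.5, 0.3])  # draft model distribution

def speculative_sample():
    t = rng.choice(3, p=q)                     # draft token t ~ q
    if rng.uniform() < min(1.0, p[t] / q[t]):  # accept with rate min(1, p/q)
        return t
    resid = np.maximum(p - q, 0.0)             # residual distribution
    return rng.choice(3, p=resid / resid.sum())

counts = np.bincount([speculative_sample() for _ in range(200_000)], minlength=3)
assert np.allclose(counts / counts.sum(), p, atol=0.01)
\end{lstlisting}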
\begin{question}[Proof: Rejection Sampling $\equiv t \sim p$]
\begin{align}
\begin{split}
\min(p(t), q(t)) + \max(0, p(t) - q(t))
&= \begin{cases}
p(t) + 0 & \text{if } p(t) < q(t) \\
q(t) + p(t) - q(t) & \text{if } p(t) \geq q(t)
\end{cases} \\
&= p(t)
\end{split} \\
\Rightarrow & \sum_t (\min(p(t), q(t)) + \max(0, p(t) - q(t))) = \sum_t p(t) = 1 \\
\Rightarrow & 1 - \sum_t \min(p(t), q(t)) = \sum_t \max(0, p(t) - q(t))
\end{align}
\begin{equation}
\begin{split}
\Pr(X = t) &= \Pr(\tilde{X} = t) \Pr(\tilde{X}\text{ accept} \vert \tilde{X} = t) + \Pr(\tilde{X}\text{ reject}) \Pr(\tilde{\tilde{X}} = t \vert \tilde{X}\text{ reject}) \\
&= q(t) \cdot \min\left(1, \frac{p(t)}{q(t)}\right) + (1 - \Pr(\tilde{X}\text{ accept})) \cdot \text{norm}(\max(0, p(t) - q(t))) \\
&= \min\left(q(t), p(t)\right) + (1 - \sum_t \min\left(p(t), q(t)\right)) \cdot \frac{\max(0, p(t) - q(t))}{\sum_t \max(0, p(t) - q(t))} \\
&= \min\left(q(t), p(t)\right) + \max(0, p(t) - q(t)) \\
&= p(t)
\end{split}
\end{equation}
\end{question}
\section{Generative Adversarial Networks}
\begin{question}[Representation through Push-Forward]
Let $r$ be any continuous distribution on $\mathbb{R}^h$.
For any distribution $p$ on $\mathbb{R}^d$,
there exists a push-forward map $G: \mathbb{R}^h \rightarrow \mathbb{R}^d$ such that
\begin{equation}
\boldsymbol{z} \sim r \Longrightarrow G(\boldsymbol{z}) \sim p
\end{equation}
\end{question}
\begin{question}[Discriminator's Goal]
For a fixed generator $G$, minimize a log loss over $D$:
\begin{itemize}
\item If $x$ is real, minimize $-\log D(x)$;
\item If $x$ is fake, minimize $-\log(1 - D(x))$.
\end{itemize}
\begin{equation}
\min_D -\frac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}} \left[\log D(x)\right] - \frac{1}{2} \mathbb{E}_{z \sim r} \left[\log(1 - D(G(z)))\right]
\end{equation}
\end{question}
\begin{question}[Generator's Goal]
For a fixed discriminator $D$, maximize a log loss over $G$:
\begin{equation}
\begingroup\color{red}{\max_G}\endgroup -\frac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}} \left[\log D(x)\right] - \frac{1}{2} \mathbb{E}_{z \sim r} \left[\log(1 - D(G(z)))\right]
\end{equation}
\end{question}
\begin{question}[Solver]
\begin{equation}
\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\text{data}}} \left[\log D(x)\right] + \mathbb{E}_{z \sim r} \left[\log(1 - D(G(z)))\right]
\end{equation}
Solved by alternating minimization--maximization:
\begin{itemize}
\item G step: Fix $D$ and update $G$ by one-step gradient descent
\item D step: Fix $G$ and update $D$ by one-step gradient ascent
\item Repeat until the algorithm reaches an approximate equilibrium
\end{itemize}
\end{question}
\begin{question}[Solution of $D^*$]
Let $p_g(x)$ be the density of $x$ estimated by the generator $G$.
For $G$ fixed, the optimal discriminator $D$ is $D_G^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$
\textbf{Proof:}
\begin{equation}
\begin{split}
V(G, D) &\coloneq \mathbb{E}_{x \sim p_{\text{data}}} \left[\log D(x)\right] + \mathbb{E}_{z \sim r} \left[\log(1 - D(G(z)))\right] \\
&= \int \log D(x) p_{\text{data}}(x) dx + \int_z \log(1 - D(G(z))) p_z(z) dz \\
&= \int \underbrace{\log D(x) p_{\text{data}}(x) + p_g(x) \log(1 - D(x))}_{f(D(x))} dx
\end{split}
\end{equation}
For any fixed $x$, taking derivative $= 0$:
\begin{equation}
\begin{split}
f'(D(x)) &= \frac{p_{\text{data}}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0 \\
D_G^*(x) &= \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}
\end{split}
\end{equation}
\end{question}
\begin{question}[Solution of $G^*$]
The global minimum of $\min_G \max_D V(G, D)$ is achieved if
and only if $p_g = p_{\text{data}}$. The optimal objective value is $-\log4$.
\textbf{Proof:}
\begin{equation}
\begin{split}
V(G, D_G^*) &= \mathbb{E}_{x \sim p_{\text{data}}} \left[\log D_G^*(x)\right] + \mathbb{E}_{z \sim r} \left[\log(1 - D_G^*(G(z)))\right] \\
&= \mathbb{E}_{x \sim p_{\text{data}}} \left[\log D_G^*(x)\right] + \mathbb{E}_{x \sim p_g} \left[\log(1 - D_G^*(x))\right] \\
&= \mathbb{E}_{x \sim p_{\text{data}}} \left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g} \left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right] \\
\end{split}
\end{equation}
By definition of KL divergence $\text{KL}(p \| q) = \mathbb{E}_{x \sim p} \left[\log \frac{p(x)}{q(x)}\right]$, we have:
\begin{equation}
\begin{split}
V(G, D_G^*) &= \mathbb{E}_{x \sim p_{\text{data}}} \left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g} \left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right] \\
&= -\log4 + \text{KL}\left(p_{\text{data}} \| \frac{p_\text{data} + p_g}{2}\right) + \text{KL}\left(p_g \| \frac{p_{\text{data}} + p_g}{2}\right) \\
&= -\log4 + 2 \cdot \text{JSD}(p_{\text{data}} \| p_g) \\
& \geq -\log4
\end{split}
\end{equation}
The equality holds if and only if $p_{\text{data}} = p_g$.
\end{question}
\section{Adversarial Attacks}
\begin{question}[Principle of Generating Adversarial Attacks]
\begin{equation}
\max_{\left\|x_{\text{adv}}-x\right\|_{\infty} \leq \epsilon} \mathcal{L}(C(x_{\text{adv}}), y)
\end{equation}
where $C$ is the composition of $h$ and $f$.
\end{question}
\begin{question}[Different Solvers]
Solvers used to optimize the adversarial attack objective:
\begin{itemize}
\item \textbf{Zero-Order Solvers} (only access to the output of NN)
\begin{itemize}
\item Black-box attack
\end{itemize}
\item \textbf{First-Order Solvers} (access to the gradient of NN)
\begin{itemize}
\item White-box attack
\item Fast Gradient Sign Method (FGSM), BIM, PGD, CW attack, \dots
\end{itemize}
\item \textbf{Second-Order Solvers} (access to the Hessian matrix)
\begin{itemize}
\item White-box attack
\item L-BFGS attack
\end{itemize}
\end{itemize}
\end{question}
\begin{question}[H\"older's Inequality]
For any $p, q \geq 1$ such that $\frac{1}{p} + \frac{1}{q} = 1$,
\begin{equation} \label{eq:holder-inequality}
\left\|x\right\|_p \cdot \left\|y\right\|_q \geq \left\langle x, y \right\rangle
\end{equation}
where $\left\langle x, y \right\rangle$ is the inner product.
$\| \cdot \|_p$ and $\| \cdot \|_q$ are also known as dual norms.
\begin{itemize}
\item $\| \cdot \|_2$ is self-dual.
\item $\| \cdot \|_\infty$ and $\| \cdot \|_1$ are dual norms.
\end{itemize}
\end{question}
\begin{question}[FGSM -- Fast Gradient Sign Method]
White-box and non-targeted (maximize the loss w.r.t. the true label).
Linearize the loss at $x$:
\begin{equation}
\mathcal{L}(C(x + \delta), y) \approx \underbrace{\mathcal{L}(C(x), y)}_{\text{constant}} + \nabla_x \mathcal{L}(C(x), y) \cdot \delta
\end{equation}
The problem reduces to:
\begin{equation}
\max_{\left\|\delta\right\|_{\infty} \leq \epsilon} \nabla_x \mathcal{L}(C(x), y) \cdot \delta
\end{equation}
Because of H\"older's inequality \eqref{eq:holder-inequality}, we have:
\begin{equation}
\nabla_x \mathcal{L}(C(x), y) \cdot \delta \leq \left\|\delta\right\|_{\infty} \cdot \left\|\nabla_x \mathcal{L}(C(x), y)\right\|_1
\leq \epsilon \cdot \left\|\nabla_x \mathcal{L}(C(x), y)\right\|_1
\end{equation}
Thus, the adversarial example is generated by:
\begin{equation}
x_{\text{adv}} = x + \epsilon \cdot \text{sign}(\nabla_x \mathcal{L}(C(x), y))
\end{equation}
where $\epsilon$ is the perturbation size.
\end{question}
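A minimal white-box sketch of one FGSM step, using a toy logistic-regression classifier in place of a DNN (the weights $w$, input $x$, and label $y$ below are made up):

\begin{lstlisting}[language=python]
import numpy as np

# Toy white-box setting: C(x) = sigmoid(w^T x), cross-entropy loss.
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.1, -0.4])
y = 1.0

def loss(x):
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# gradient of the loss w.r.t. the input: (sigmoid(w^T x) - y) * w
grad_x = (1.0 / (1.0 + np.exp(-(w @ x))) - y) * w

eps = 0.1
x_adv = x + eps * np.sign(grad_x)
assert np.max(np.abs(x_adv - x)) <= eps + 1e-12  # stays in the l_inf ball
assert loss(x_adv) > loss(x)                     # one FGSM step raises the loss
\end{lstlisting}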
\begin{question}[BIM -- Basic Iterative Method]
BIM is an iterative version of FGSM.
\begin{itemize}
\item Initialize $x^{(0)} = x$.
\item For $k = 1, 2, \dots, K$:
\begin{equation}
x^{(k)} = x^{(k-1)} + \gamma \cdot \text{sign}(\nabla_x \mathcal{L}(C(x^{(k-1)}), y))
\end{equation}
\end{itemize}
Issues:
\begin{itemize}
\item By repeating, the accumulated perturbation can grow beyond the budget $\epsilon$.
\item For a pre-defined $\epsilon$, $x^{(k)}$ may not satisfy $\|x^{(k)} - x\|_\infty \leq \epsilon$.
\end{itemize}
\end{question}
\begin{question}[PGD -- Projected Gradient Descent]
To resolve the issue of BIM, PGD involves a truncation operation:
\begin{itemize}
\item Initialize $x^{(0)} = x + \delta$, where $\delta$ is a random perturbation with $\left\|\delta\right\|_\infty \leq \epsilon$.
\item For $k = 1, 2, \dots, K$:
\begin{equation}
x^{(k)} = \text{clip}_{(-\epsilon, \epsilon)}(x^{(k-1)} + \gamma \cdot \text{sign}(\nabla_x \mathcal{L}(C(x^{(k-1)}), y)))
\end{equation}
\end{itemize}
where $\text{clip}_{(-\epsilon, \epsilon)}$ clips each coordinate of $x^{(k)} - x$ to $[-\epsilon, \epsilon]$, i.e. it projects $x^{(k)}$ back onto the $\ell_\infty$ ball of radius $\epsilon$ around the original input $x$.
\end{question}
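A sketch of the PGD loop on the same kind of toy logistic model (all values illustrative; $\gamma$ and $K$ chosen arbitrarily):

\begin{lstlisting}[language=python]
import numpy as np

# Toy white-box model (stand-in for a DNN): sigmoid(w^T x), label y = 1.
w = np.array([1.0, -2.0, 0.5])
x0 = np.array([0.3, 0.1, -0.4])
y, eps, gamma, K = 1.0, 0.1, 0.03, 10

def loss(x):
    return np.log1p(np.exp(-(w @ x)))  # logistic loss for y = 1

rng = np.random.default_rng(0)
x = x0 + rng.uniform(-eps, eps, size=3)  # random start inside the ball
for _ in range(K):
    sigma = 1.0 / (1.0 + np.exp(-(w @ x)))
    grad = (sigma - y) * w
    x = x + gamma * np.sign(grad)        # FGSM-style ascent step
    x = x0 + np.clip(x - x0, -eps, eps)  # project back onto the l_inf ball

assert np.max(np.abs(x - x0)) <= eps + 1e-12
assert loss(x) >= loss(x0)  # PGD does not decrease the loss here
\end{lstlisting}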
\begin{question}[Targeted PGD Attack]
Objective:
\begin{itemize}
\item \textbf{Untargeted}: \begin{equation}
\max_{\left\|\delta\right\|_{\infty} \leq \epsilon} \mathcal{L}(C(x + \delta), y_{\text{true}})
\end{equation}
\item \textbf{Targeted}: \begin{equation}
\begingroup\color{red}\min_{\left\|\delta\right\|_{\infty} \leq \epsilon}\endgroup \mathcal{L}(C(x + \delta), \begingroup\color{red}y_{\text{target}}\endgroup)
\end{equation}
\end{itemize}
\end{question}
\section{Adversarial Robustness}
\begin{question}[Defense Mechanisms]
Categorized into two types:
\begin{itemize}
\item \textbf{Gradient Masking}: hide the gradients and make first-order attacks fail
\begin{itemize}
\item \textbf{Shattered Gradients}:
Apply a non-smooth or non-differentiable preprocessor $g$ to the inputs,
then train a DNN model $f$ on the preprocessed inputs $g(x)$.
\item \textbf{Stochastic/Randomized Gradients}:
Apply some form of randomization to the DNN model,
e.g. train a set of classifiers and randomly select one of them at test time to predict the label.
\end{itemize}
\item \textbf{Adversarial Training}:
\begin{equation}
\min_{C} \mathbb{E}_{(x, y) \sim \mathcal{D}^n} \left[ \max_{\left\|\delta\right\|_{\infty} \leq \epsilon} \mathcal{L}(C(x + \delta), y) \right]
\end{equation}
\end{itemize}
\end{question}
\begin{question}[Trade-Off between Natural and Robust Error]
\begin{equation}
\min_f R_{\text{nat}}(f) + R_{\text{rob}}(f) / \lambda
\end{equation}
\begin{equation}
\begin{split}
R_{\text{nat}}(f) &\coloneq \Pr_{x,y \sim \mathcal{D}} \{f(x) y \leq 0\} \\
&= \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \mathbb{I}(f(x) y \leq 0) \right] \\
\end{split}
\end{equation}
\begin{equation}
\begin{split}
R_{\text{rob}}(f) &\coloneq \Pr_{x,y \sim \mathcal{D}} \{\exists \delta \in B_\epsilon(x) \text{ s.t. } f(x + \delta) y \leq 0\} \\
&= \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\left\|\delta\right\|_{\infty} \leq \epsilon} \mathbb{I}(f(x + \delta) y \leq 0) \right] \\
\end{split}
\end{equation}
Approximate by a differentiable surrogate loss $\Phi$:
\begin{equation}
R_{\text{nat}}(f) \approx \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \Phi(f(x) y) \right]
\end{equation}
\end{question}
\begin{question}[TRADES]
\begin{equation}
\min_f \underbrace{\left[ \underbrace{\mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \Phi(f(x) y) \right]}_{\text{minimize diff btw } f(x) \text{ and } y \text{ for accuracy}} + \underbrace{\mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\left\|\delta\right\|_{\infty} \leq \epsilon} \Phi(f(x + \delta) f(x)) \right]}_{\text{minimize diff btw } f(x) \text{ and } f(x+\delta) \text{ for robustness}} / \lambda \right]}_{\text{TRADES Loss}}
\end{equation}
\end{question}
\section{Self-Supervised Learning}
\begin{question}[Contrastive Learning]
Loss: maximize the probability assigned to the positive pair (equivalently, minimize $-\log \Pr_1$):
\begin{equation}
\max_\Theta \Pr_1, \quad \text{where } \Pr_1 = \frac{\exp(z_1)}{\sum_j \exp(z_j)}
\end{equation}
\end{question}
\end{document}
CS 848: Advanced Topics in Database: Algorithmic Aspects of Database Query Processing
Fall 25
Notes for Foundations of Databases
Foundations of Databases (The Alice Book)
I plan to read only Chapters 3-6, which should equip me with the necessary background to understand the material in CS 848.
The Relational Model
\( \newcommand{\att}{\textbf{att}} \) \( \newcommand{\attTotalOrder}{\leq_{\att}} \) \( \newcommand{\dom}{\textbf{dom}} \) \( \newcommand{\Dom}{\textit{Dom}} \) \( \newcommand{\relname}{\textbf{relname}} \) \( \newcommand{\sort}{\textit{sort}} \) \( \newcommand{\arity}{\textit{arity}} \) \( \newcommand{\finPowerSet}{\mathcal{P}^{\text{fin}}} \) \( \newcommand{\dbschema}[1]{\textbf{#1}} \) \( \newcommand{\dbinst}[1]{\textbf{#1}} \) \( \newcommand{\rel}[1]{\textit{#1}} \) \( \newcommand{\attname}[1]{\textit{#1}} \) \( \newcommand{\varSet}{\textbf{var}} \)
Informal Terminology
- Relation: table
- Relation Name: name of the table
- Attribute: column name
- Tuple/Record: row in the table
- Domain: set of constants that entries of tuples can take
- Database Schema: specifying the structure of the database
- Database Instance: specifying the actual content in the database
Formal Definitions
- \(\att\): a countably infinite set of attributes
- assume \(\att\) is fixed
- \(\attTotalOrder\): a total order on \(\att\)
- the elements of a set of attributes \(U\) are written according to \(\attTotalOrder\) unless otherwise specified
- \(\dom\): a countably infinite set of constants, the underlying domain (disjoint from \(\att\))
- assume \(\dom\) is fixed
- a constant is an element of \(\dom\)
- for most cases, the same domain of values is used for all attributes;
- otherwise, assume a mapping \(\Dom\) on \(\att\), where \(\Dom(A)\) is a set called the domain of attribute \(A\)
- \(\relname\): a countably infinite set of relation names (disjoint from \(\att\) and \(\dom\))
- \(\sort\): a function from \(\relname\) to \(\finPowerSet(\att)\) (the finitary powerset of \(\att\))
- \(\sort^{-1}(U)\) is infinite for each (possibly empty) finite set of attributes \(U\)
- which allows multiple relations to have the same set of attributes
- \(\sort(R)\) is called the \(\sort\) of a relation name \(R\)
- \(\arity(R) = |\sort(R)|\): the \(\arity\) of relation name \(R\)
- \(R\): a relation name or a relation schema
- Alternative notations:
- \(R[U]\): indicating \(\sort(R) = U\)
- \(R[n]\): indicating \(\arity(R) = n\)
- \(\dbschema{R}\): a database schema
- a nonempty finite set of relation names
- might be written as \(\dbschema{R} = \{R_1[U_1], \ldots, R_n[U_n]\}\)
Examples
Table Movies

| Title | Director | Actor |
|---|---|---|
| A | X | Y |

Table Location

| Theater | Address | Phone |
|---|---|---|
| T1 | Addr1 | P1 |

Table Pariscope

| Theater | Title | Schedule |
|---|---|---|
| T1 | A | 9:00 |
- The database schema \(\dbschema{CINEMA} = \{ \rel{Movies}, \rel{Location}, \rel{Pariscope} \}\)
- The sorts of the relation names
- \(\sort(\rel{Movies}) = \{ \attname{Title}, \attname{Director}, \attname{Actor} \}\)
- \(\sort(\rel{Location}) = \{ \attname{Theater}, \attname{Address}, \attname{Phone} \}\)
- \(\sort(\rel{Pariscope}) = \{ \attname{Theater}, \attname{Title}, \attname{Schedule} \}\)
Named vs. Unnamed Attributes/Tuples
Named Perspective
E.g. \( \langle A:5, B:3 \rangle \)
a tuple over a finite set of attributes \(U\) (or over a relation schema \(R[U]\)) is a total mapping (viewed as a function) \(u\) from \(U\) to \(\dom\).
- \(u(U)\): \(u\) is a tuple over \(U\)
- \(u(A)\): the value of \(u\) on an attribute \(A\) in \(U\) (\(A \in U\))
- \(u[V] = u \vert_V\): the restriction of \(u\) to a subset \(V \subseteq U\), i.e., \(u[V]\) denotes the tuple \(v\) over \(V\) such that \(v(A) = u(A)\) for all \(A \in V\)
Unnamed Perspective
E.g. \( \langle 5, 3 \rangle \)
a tuple is an ordered n-tuple (\(n \geq 0\)) of constants (i.e., an element of the Cartesian product \(\dom^n\))
- \(u(i)\): the \(i\)-th coordinate of \(u\)
Correspondence
Because of the total order \(\attTotalOrder\), a tuple \(\langle A_1:a_1, A_2:a_2 \rangle\) (defined as a function) can be viewed as an ordered tuple with \((A_1:a_1)\) as its first component and \((A_2:a_2)\) as its second. Ignoring the attribute names, this is the ordered tuple \(\langle a_1, a_2 \rangle\).
Conversely, the ordered tuple \(t = \langle a_1, a_2 \rangle\) can be viewed as a function over the set of integers \(\{1, 2\}\) with \(t(i) = a_i\) for each \(i\).
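This correspondence can be sketched in Python. The snippet below is a toy illustration (the names `to_unnamed`, `to_named`, and `restrict` are my own, not the book's), using alphabetical order on attribute names as a stand-in for \(\attTotalOrder\):

```python
# Toy model: a named tuple is a dict (attribute -> constant);
# an unnamed tuple is an ordered Python tuple.
# Alphabetical sorting plays the role of the total order <=_att.

def to_unnamed(named):
    """Named tuple to ordered tuple, listing attributes in <=_att order."""
    return tuple(named[a] for a in sorted(named))

def to_named(ordered, attributes):
    """Ordered tuple back to a function over the attribute set."""
    return dict(zip(sorted(attributes), ordered))

def restrict(named, subset):
    """u[V]: the restriction of tuple u to a subset V of its attributes."""
    return {a: named[a] for a in subset}

u = {"A": 5, "B": 3}          # the tuple <A:5, B:3>
assert to_unnamed(u) == (5, 3)
assert to_named((5, 3), {"A", "B"}) == u
assert restrict(u, {"A"}) == {"A": 5}
```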
Conventional vs. Logic Programming Relational Model
Conventional Perspective
a relation (instance) over a relation schema \(\rel{R}[U]\) is a finite set of tuples \(I\) with sort \(U\).
a database instance of database schema \(\dbschema{R}\) is a mapping \(\dbinst{I}\) with domain \(\dbschema{R}\), such that \(\dbinst{I}(\rel{R})\) is a relation over \(\rel{R}\) for each \(\rel{R} \in \dbschema{R}\).
Logic Programming Perspective
This perspective is used primarily with the ordered-tuple perspective on tuples.
A fact over \(\rel{R}\) is an expression of the form \(\rel{R}(a_1, \ldots, a_n)\),
where \(a_i \in \dom\) for \(i \in [1, n]\).
If \(u = \langle a_1, \ldots, a_n \rangle\), we sometimes write \(\rel{R}(u)\) for \(\rel{R}(a_1, \ldots, a_n)\).
a relation (instance) over a relation schema \(\rel{R}\) is a finite set of facts over \(\rel{R}\).
a database instance of database schema \(\dbschema{R}\) is a finite set \(\dbinst{I}\) that is the union of relation instances over \(\rel{R}\) for all \(\rel{R} \in \dbschema{R}\).
Examples
Assume \(\sort(R) = AB, \sort(S) = A\).
Named and Conventional
\[ \begin{split} I(R) = \{ f_1, f_2, f_3 \} \quad && \\ & f_1(A) = a &\quad f_1(B) = b \\ & f_2(A) = c &\quad f_2(B) = b \\ & f_3(A) = a &\quad f_3(B) = a \\ I(S) = \{g\} \quad && \\ & g(A) = d & \end{split} \]
Unnamed and Conventional
\[ \begin{split} I(R) = \{ \langle a, b \rangle, \langle c, b \rangle, \langle a, a \rangle \} \quad && \\ I(S) = \{ \langle d \rangle \} \quad && \end{split} \]
Named and Logic Programming
\[ \{ R(A:a, B:b), R(A:c, B:b), R(A:a, B:a), S(A:d) \} \]
Unnamed and Logic Programming
\[ \{ R(a, b), R(c, b), R(a, a), S(d) \} \]
Variables
an infinite set of variables \(\varSet\) will be used to range over elements of \(\dom\).
- a free tuple over \(U\) or \(\rel{R}[U]\) is a function \(u\) from \(U\) to \(\dom \cup \varSet\)
- an atom over \(\rel{R}\) is an expression of the form \(\rel{R}(e_1, \ldots, e_n)\), where \(n = \arity(\rel{R})\) and each term \(e_i \in \dom \cup \varSet\)
- a ground atom is an atom with no variables, i.e., a fact (see Logic Programming Perspective)
Notations
Object | Notation |
---|---|
Constants | \(a, b, c\) |
Variables | \(x, y\) |
Sets of Variables | \(X, Y\) |
Terms | \(e\) |
Attributes | \(A, B, C\) |
Sets of Attributes | \(U, V, W\) |
Relation Names | \(\rel{R}, \rel{S}; \rel{R}[U], \rel{S}[V] \) |
Database Schemas | \(\dbschema{R}, \dbschema{S} \) |
Tuples | \(t, s\) |
Free Tuples | \(u, v, w\) |
Facts | \(\rel{R}(a_1, \ldots, a_n), R(t)\) |
Atoms | \(\rel{R}(e_1, \ldots, e_n), R(u)\) |
Relation Instances | \(I, J\) |
Database Instances | \(\dbinst{I}, \dbinst{J} \) |
Conjunctive Queries
\( \newcommand{\att}{\textbf{att}} \) \( \newcommand{\attTotalOrder}{\leq_{\att}} \) \( \newcommand{\dom}{\textbf{dom}} \) \( \newcommand{\Dom}{\textit{Dom}} \) \( \newcommand{\relname}{\textbf{relname}} \) \( \newcommand{\sort}{\textit{sort}} \) \( \newcommand{\arity}{\textit{arity}} \) \( \newcommand{\finPowerSet}{\mathcal{P}^{\text{fin}}} \) \( \newcommand{\dbschema}[1]{\textbf{#1}} \) \( \newcommand{\dbinst}[1]{\textbf{#1}} \) \( \newcommand{\rel}[1]{\textit{#1}} \) \( \newcommand{\attname}[1]{\textit{#1}} \) \( \newcommand{\varSet}{\textbf{var}} \) \( \newcommand{\ans}{\textit{ans}} \) \( \newcommand{\var}{\textit{var}} \) \( \newcommand{\adom}{\textit{adom}} \) \( \newcommand{\tableau}{\textbf{T}} \) \( \newcommand{\free}{\textit{free}} \) \( \newcommand{\body}{\text{body}} \)
- Logic-Based Perspective
- Incorporating Equality
- Query Composition and Views
- Algebraic Perspectives
- Adding Union
a query (mapping) is said to be from (or over) its input schema to its output schema.
- query: a syntactic object
- query mapping: a function defined by a query interpreted under a specified semantics
- its domain: the family of all instances of an input schema
- its range: the family of instances of an output schema
- we often blur query and query mapping when the meaning is clear from context
- input schema: a specified relation or database schema
- output schema: a relation schema or database schema
- For a relation schema, the relation name may be specified as part of the query syntax or by the context
- \(q_1 \equiv q_2\) denotes that two queries \(q_1\) and \(q_2\) over \(\dbschema{R}\) are equivalent, i.e., they have the same output schema and \(q_1(\dbinst{I}) = q_2(\dbinst{I})\) for each instance \(\dbinst{I}\) over \(\dbschema{R}\).
Logic-Based Perspective
Three versions of conjunctive queries:
- Rule-Based Conjunctive Queries
- Tableau Queries
- Conjunctive Calculus
Rule-Based Conjunctive Queries
Definition
A rule-based conjunctive query (often simply called a rule) \(q\) over a relation schema \(\dbschema{R}\) is an expression of the form \[ \ans(u) \leftarrow \rel{R}_1(u_1), \ldots, \rel{R}_n(u_n) \] where \(n \geq 0\); \(\rel{R}_1, \ldots, \rel{R}_n\) are relation names in \(\dbschema{R}\); \(\ans\) is a relation name not in \(\dbschema{R}\); and \(u, u_1, \ldots, u_n\) are free tuples (i.e., may use either variables or constants).
- \(\var(q)\): the set of variables occurring in \(q\).
- body: \(\rel{R}_1(u_1), \ldots, \rel{R}_n(u_n)\)
- head: \(\ans(u)\)
- range restricted: each variable occurring in the head also occurs in the body
- all conjunctive queries considered here are range restricted
- valuation: a valuation \(v\) over a finite subset \(V\) of \(\varSet\) is a total function from \(V\) to \(\dom\) (the set of constants)
- extended to be identity on \(\dom\) (so that the domain can contain both variables and constants)
- extended to map free tuples to tuples (so that the domain/range can be in the tuple form)
Semantics
Let \(q\) be the query, and let \(\dbinst{I}\) be a database instance of schema \(\dbschema{R}\). The image of \(\dbinst{I}\) under \(q\) is:
\[ q(\dbinst{I}) = \{ v(u) \mid v \text{ is a valuation over } \var(q) \text{ and } v(u_i) \in \dbinst{I}(\rel{R}_i) \text{ for each } i \in [1, n] \} \]
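This semantics can be evaluated naively by enumerating valuations. The sketch below is a brute-force toy, not the book's algorithm; the representation of rules (head as a tuple of terms, body as a list of `(relname, terms)` atoms) and the convention that variables are strings starting with `x` or `y` are my own assumptions:

```python
from itertools import product

def is_var(t):
    # Toy convention: terms that are strings starting with "x"/"y" are
    # variables; every other term is a constant.
    return isinstance(t, str) and t.startswith(("x", "y"))

def evaluate(head, body, db):
    """Compute q(I) for the rule head <- body over instance db (dict:
    relname -> set of tuples), by enumerating valuations over adom(I)."""
    variables = sorted({t for _, u in body for t in u if is_var(t)}
                       | {t for t in head if is_var(t)})
    adom = {c for rel in db.values() for t in rel for c in t}
    answers = set()
    for vals in product(adom, repeat=len(variables)):
        v = dict(zip(variables, vals))                 # the valuation
        ground = lambda t: v[t] if is_var(t) else t    # identity on dom
        if all(tuple(map(ground, u)) in db[r] for r, u in body):
            answers.add(tuple(map(ground, head)))
    return answers

db = {"R": {("a", "b"), ("c", "b"), ("a", "a")}}
# ans(x1) <- R(x1, x2), R(x2, x1): elements on a loop or 2-cycle in R
assert evaluate(("x1",), [("R", ("x1", "x2")), ("R", ("x2", "x1"))], db) == {("a",)}
```

It is enough to range over \(\adom(\dbinst{I})\) here because every satisfying valuation maps variables to constants occurring in the instance.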
- active domain
- of a database instance \(\dbinst{I}\), denoted \(\adom(\dbinst{I})\), is the set of constants that occur in \(\dbinst{I}\)
- of a relation instance \(I\), denoted \(\adom(I)\), is the set of constants that occur in \(I\)
- of a query \(q\), denoted \(\adom(q)\), is the set of constants that occur in \(q\)
- \(\adom(q, \dbinst{I})\) is an abbreviation for \(\adom(q) \cup \adom(\dbinst{I})\)
- \(\adom(q(\dbinst{I})) \subseteq \adom(q, \dbinst{I})\)
- extensional relations: relations in the body of the query, i.e., \(\rel{R}_1, \ldots, \rel{R}_n\)
- because they are known/provided by the input instance \(\dbinst{I}\)
- intensional relation: the relation in the head of the query, i.e., \(\ans\)
- because it is not stored and its value is computed on request by the query
- extensional database (edb): a database instance associated with the extensional relations
- intensional database (idb): the rule itself
- idb relation: the relation defined by the idb
Properties
Conjunctive queries are:
- monotonic: a query \(q\) over \(\dbschema{R}\) is monotonic if for each \(\dbinst{I}, \dbinst{J}\) over \(\dbschema{R}\), \(\dbinst{I} \subseteq \dbinst{J} \implies q(\dbinst{I}) \subseteq q(\dbinst{J})\)
- satisfiable: a query \(q\) is satisfiable if there exists a database instance \(\dbinst{I}\) such that \(q(\dbinst{I}) \neq \emptyset\)
Tableau Queries
A tableau query is simply a pair \((\tableau, u)\) where \(\tableau\) is a tableau (an instance that may contain variables in place of constants) and each variable in \(u\) also occurs in \(\tableau\).
This is closest to the visual form provided by Query-By-Example (QBE).
- summary: the free tuple \(u\) representing the tuples included in the answer to the query
- embedding: a valuation \(v\) for the variables occurring in \(\tableau\) such that \(v(\tableau) \subseteq \dbinst{I}\)
- the output of \((\tableau, u)\) on \(\dbinst{I}\) consists of all tuples \(v(u)\) for each embedding \(v\) of \(\tableau\) into \(\dbinst{I}\)
- typed: a tableau query \(q = (\tableau, u)\) under the named perspective, where \(\tableau\) is over relation schema \(R\) and \(\sort(u) \subseteq \sort(R)\), is typed if no variable of \(\tableau\) is associated with two distinct attributes in \(q\)
Examples
Table Movies
Title | Director | Actor |
---|---|---|
\(x_{ti}\) | “Bergman” | \(x_{ac}\) |
Table Location
Theater | Address | Phone |
---|---|---|
\(x_{th}\) | \(x_{ad}\) | \(x_p\) |
Table Pariscope
Theater | Title | Schedule |
---|---|---|
\(x_{th}\) | \(x_{ti}\) | \(x_{s}\) |
The above tableau query is typed because each variable is associated with only one attribute:
- \(x_{ti}\): \(\attname{Title}\)
- \(x_{ac}\): \(\attname{Actor}\)
- \(x_{th}\): \(\attname{Theater}\)
- \(x_{ad}\): \(\attname{Address}\)
- \(x_p\): \(\attname{Phone}\)
- \(x_{s}\): \(\attname{Schedule}\)
However, the following tableau query is untyped:
Table Movies
Title | Director | Actor |
---|---|---|
\(x_{ti}\) | \(x_{ac}\) | \(x_{ac}\) |
Because \(x_{ac}\) is associated with both \(\attname{Director}\) and \(\attname{Actor}\).
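The typedness condition is easy to check mechanically. Below is a toy sketch (the row-as-dict representation and the `is_typed` name are my own, not from the book):

```python
def is_typed(tableau_rows):
    """A tableau (list of dicts: attribute -> term) is typed if no
    variable appears under two distinct attributes. Toy convention:
    variables are strings starting with "x"."""
    seen = {}  # variable -> the attribute it first appeared under
    for row in tableau_rows:
        for attr, term in row.items():
            if isinstance(term, str) and term.startswith("x"):
                if seen.setdefault(term, attr) != attr:
                    return False
    return True

typed = [{"Title": "x_ti", "Director": "Bergman", "Actor": "x_ac"}]
untyped = [{"Title": "x_ti", "Director": "x_ac", "Actor": "x_ac"}]
assert is_typed(typed)
assert not is_typed(untyped)
```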
Conjunctive Calculus
The conjunctive query \[ \ans(u) \leftarrow \rel{R}_1(u_1), \ldots, \rel{R}_n(u_n) \]
can be expressed as the following conjunctive calculus query that has the same semantics: \[ \left\{ e_1, \ldots, e_m \vert \exists x_1, \ldots, x_k \left( \rel{R}_1(u_1) \wedge \ldots \wedge \rel{R}_n(u_n) \right) \right\} \] where \(x_1, \ldots, x_k\) are all the variables occurring in the body and not the head.
Conjunctive Calculus Formula
Let \(\dbschema{R}\) be a relation schema. A (well-formed) formula over \(\dbschema{R}\) for the conjunctive calculus is an expression having one of the following forms:
- an atom over \(\dbschema{R}\);
- \(\varphi \wedge \psi \), where \(\varphi\) and \(\psi\) are formulas over \(\dbschema{R}\); or
- \(\exists x \varphi\), where \(x\) is a variable and \(\varphi\) is a formula over \(\dbschema{R}\).
An occurrence of a variable \(x\) in formula \(\varphi\) is free if:
- \(\varphi\) is an atom; or
- \(\varphi = (\psi \wedge \xi)\) and the occurrence of \(x\) is free in \(\psi\) or \(\xi\); or
- \(\varphi = \exists y \psi\), \(x \neq y\), and the occurrence of \(x\) is free in \(\psi\).
\(\free(\varphi)\): the set of free variables in \(\varphi\).
An occurrence of a variable that is not free is bound.
Conjunctive Calculus Query
A conjunctive calculus query over database schema \(\dbschema{R}\) is an expression of the form \[ \{ e_1, \ldots, e_m \mid \varphi \} \] where
- \(\varphi\) is a conjunctive calculus formula,
- \(\langle e_1, \ldots, e_m \rangle\) is a free tuple, and
- the set of variables occurring in \(\langle e_1, \ldots, e_m \rangle\) is exactly \(\free(\varphi)\).
Under the named perspective, the above query can be written as: \[ \{ \langle e_1, \ldots, e_m \rangle : A_1, \ldots, A_m \mid \varphi \} \]
Note: In the previous chapter, a named tuple was usually denoted as:
\[ \langle A_1:a_1, \ldots, A_m:a_m \rangle \]
But here we denote the named tuple as:
\[ \langle e_1, \ldots, e_m \rangle : A_1, \ldots, A_m \]
Not sure which one is better or whether the author did this intentionally.
Semantics
Valuation
A valuation over \(V \subset \varSet \) is a total function \(v\) from \(V\) to \(\dom\), which can be viewed as a syntactic expression of the form: \[ \{ x_1/a_1, \ldots, x_n/a_n \} \] where
- \(x_1, \ldots, x_n\) is a listing of \(V\),
- \(a_i = v(x_i)\) for each \(i \in [1, n]\).
Interpretation as a Set
If \(x \not \in V\), and \(c \in \dom\), then \(v \cup \{ x/c \}\) is the valuation over \(V \cup \{ x \}\) that agrees with \(v\) on \(V\) and maps \(x\) to \(c\).
Satisfaction
Let \(\dbschema{R}\) be a database schema, \(\varphi\) a conjunctive calculus formula over \(\dbschema{R}\), and \(v\) a valuation over \(\free(\varphi)\). Then \(\dbinst{I} \models \varphi[v]\), if
- \(\varphi = \rel{R}(u)\) is an atom and \(v(u) \in \dbinst{I}(\rel{R})\); or
- \(\varphi = (\psi \wedge \xi)\) and \(\dbinst{I} \models \psi[v \vert_{\free(\psi)}]\) and \(\dbinst{I} \models \xi[v \vert_{\free(\xi)}]\); or
- \(\varphi = \exists x \psi\) and there exists a constant \(c \in \dom\) such that \(\dbinst{I} \models \psi[v \cup \{ x/c \}]\).
Image
Let \(q = \{ e_1, \ldots, e_m \mid \varphi \}\) be a conjunctive calculus query over \(\dbschema{R}\). For an instance \(\dbinst{I}\) over \(\dbschema{R}\), the image of \(\dbinst{I}\) under \(q\) is: \[ q(\dbinst{I}) = \{ v( \langle e_1, \ldots, e_m \rangle ) \mid v \text{ is a valuation over } \free(\varphi) \text{ and } \dbinst{I} \models \varphi[v] \} \]
- active domain
- of a formula \(\varphi\), denoted \(\adom(\varphi)\), is the set of constants that occur in \(\varphi\)
- \(\adom(\varphi, \dbinst{I})\) is an abbreviation for \(\adom(\varphi) \cup \adom(\dbinst{I})\)
- If \(\dbinst{I} \models \varphi[v]\), then the range of \(v\) is contained in \(\adom(\dbinst{I})\)
- to evaluate a conjunctive calculus query, one need only consider valuations with range contained in \(\adom(\varphi, \dbinst{I})\), i.e., only a finite number of them
- equivalent: conjunctive calculus formulas \(\varphi\) and \(\psi\) over \(\dbschema{R}\) are equivalent,
if
- they have the same free variables and,
- for each \(\dbinst{I}\) over \(\dbschema{R}\) and valuation \(v\) over \(\free(\varphi) = \free(\psi)\), \(\dbinst{I} \models \varphi[v]\) if and only if \(\dbinst{I} \models \psi[v]\)
Normal Form
\[ \exists x_1, \ldots, x_m \left( \rel{R}_1(u_1) \wedge \ldots \wedge \rel{R}_n(u_n) \right) \]
Each conjunctive calculus query is equivalent to a conjunctive calculus query in normal form.
Expressiveness of Query Languages
Let \(\mathcal{Q}_1\) and \(\mathcal{Q}_2\) be two query languages.
- \(\mathcal{Q}_1 \sqsubseteq \mathcal{Q}_2\): \(\mathcal{Q}_1\) is dominated by \(\mathcal{Q}_2\) (or \(\mathcal{Q}_1\) is weaker than \(\mathcal{Q}_2\)), if for each query \(q_1 \in \mathcal{Q}_1\), there exists a query \(q_2 \in \mathcal{Q}_2\) such that \(q_1 \equiv q_2\).
- \(\mathcal{Q}_1 \equiv \mathcal{Q}_2\): \(\mathcal{Q}_1\) and \(\mathcal{Q}_2\) are equivalent, if \(\mathcal{Q}_1 \sqsubseteq \mathcal{Q}_2\) and \(\mathcal{Q}_2 \sqsubseteq \mathcal{Q}_1\).
The rule-based conjunctive queries, tableau queries, and conjunctive calculus queries are all equivalent.
Incorporating Equality
The conjunctive query
\[ \begin{split} \ans(x_{th}, x_{ad}) \leftarrow & \rel{Movies}(x_{ti}, \text{“Bergman”}, x_{ac}), \\ & \qquad \rel{Location}(x_{th}, x_{ad}, x_p), \rel{Pariscope}(x_{th}, x_{ti}, x_s) \end{split} \]
can be expressed as:
\[ \begin{split} \ans(x_{th}, x_{ad}) \leftarrow & \rel{Movies}(x_{ti}, x_d, x_{ac}), x_d = \text{“Bergman”}, \\ & \qquad \rel{Location}(x_{th}, x_{ad}, x_p), \rel{Pariscope}(x_{th}, x_{ti}, x_s) \end{split} \]
Problem 1: Infinite Answers
Unrestricted rules with equality may yield infinite answers: \[ \ans(x, y) \leftarrow \rel{R}(x), y = z \]
To rule this out, a rule-based conjunctive query with equality is required to be range restricted: every variable occurring in the rule must be linked, possibly through a chain of equality atoms, to a variable occurring in a relational atom of the body.
Problem 2: Unsatisfiable Queries
Consider the following query: \[ \ans(x) \leftarrow \rel{R}(x), x = a, x = b \] where \(\rel{R}\) is a unary relation and \(a, b \in \dom\) with \(a \neq b\). Since no \(x\) can equal two distinct constants, this query returns the empty answer on every instance, i.e., it is unsatisfiable.
Each satisfiable rule with equality is equivalent to a rule without equality.
No expressive power is gained if the query is satisfiable.
Query Composition and Views
a conjunctive query program (with or without equality) is a sequence \(P\) of rules having the form: \[ \begin{split} S_1(u_1) & \leftarrow \text{body}_1 \\ S_2(u_2) & \leftarrow \text{body}_2 \\ & \vdots \\ S_m(u_m) & \leftarrow \text{body}_m \end{split} \] where
- each \(S_i\) is a distinct relation name not in \(\dbschema{R}\),
- for each \(i \in [1, m]\), the only relation names that may occur in \(\body_i\) are in \(\dbschema{R} \cup \{ S_1, \ldots, S_{i-1} \}\).
Closure under Composition
If conjunctive query program \(P\) defines final relation \(S\), then there is a conjunctive query \(q\), possibly with equality, such that on all input instances \(\dbinst{I}\), \(q(\dbinst{I}) = P(\dbinst{I})(S)\).
If \(P\) is satisfiable, then there is a \(q\) without equality.
Composition and User Views
views are specified as queries (or query programs), which may be
- materialized: a physical copy of the view is stored and maintained
- virtual: relevant information about the view is computed as needed
- queries against the virtual view generate composed queries against the underlying database
Algebraic Perspectives
The Unnamed Perspective: The SPC Algebra
Unnamed conjunctive algebra:
- Selection (\(\sigma\))
- Projection (\(\pi\))
- Cross-Product (or Cartesian Product, \(\times\))
Selection (“Horizontal” Operator)
Primitive Forms: \(\sigma_{j=a}\) and \(\sigma_{j=k}\)
where \(j, k\) are positive integers, and \(a \in \dom\)
(constants are usually surrounded by quotes).
\[ \sigma_{j=a}(I) = \{ t \in I \mid t(j) = a \} \qquad \sigma_{j=k}(I) = \{ t \in I \mid t(j) = t(k) \} \]
These primitive forms are sometimes called atomic selections.
Projection (“Vertical” Operator)
General Form: \(\pi_{j_1, \ldots, j_n}\)
where \(n \geq 0\) and \(j_1, \ldots, j_n\) are positive integers
(empty sequence is written \([\;]\)).
\[ \pi_{j_1, \ldots, j_n}(I) = \{ \langle t(j_1), \ldots, t(j_n) \rangle \mid t \in I \} \]
Cross-Product (or Cartesian Product)
\[ I \times J = \{ \langle t(1), \ldots, t(n), s(1), \ldots, s(m) \rangle \mid t \in I, s \in J \} \]
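The three unnamed operators can be sketched over instances represented as sets of Python tuples. This is a toy model (function names are my own); coordinates are 1-based to match the text:

```python
def select_const(I, j, a):          # sigma_{j=a}
    return {t for t in I if t[j - 1] == a}

def select_eq(I, j, k):             # sigma_{j=k}
    return {t for t in I if t[j - 1] == t[k - 1]}

def project(I, *js):                # pi_{j1,...,jn}; repeats permitted
    return {tuple(t[j - 1] for j in js) for t in I}

def cross(I, J):                    # I x J: concatenate coordinates
    return {t + s for t in I for s in J}

I = {("a", "b"), ("c", "b"), ("a", "a")}
assert select_const(I, 2, "b") == {("a", "b"), ("c", "b")}
assert select_eq(I, 1, 2) == {("a", "a")}
assert project(I, 2) == {("b",), ("a",)}
assert cross({("a",)}, {("b", "c")}) == {("a", "b", "c")}
```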
Formal Inductive Definition
Let \(\dbschema{R}\) be a relation schema.
There are two kinds of base SPC (algebra) queries:
- Input Relation: Expression \(R\); with arity equal to \(\arity(R)\)
- Unary Singleton Constant: Expression \(\{\langle a \rangle \}\), where \(a \in \dom\); with arity equal to 1
The family of SPC (algebra) queries contains all above base SPC queries and, for SPC queries \(q_1, q_2\) with arities \(\alpha_1, \alpha_2\), respectively,
- Selection: \(\sigma_{j=a}(q_1)\) and \(\sigma_{j=k}(q_1)\), whenever \(j, k \leq \alpha_1\) and \(a \in \dom\); these have arity \(\alpha_1\).
- Projection: \(\pi_{j_1, \ldots, j_n}(q_1)\), where \(j_1, \ldots, j_n \leq \alpha_1\); this has arity \(n\).
- Cross-Product: \(q_1 \times q_2\); this has arity \(\alpha_1 + \alpha_2\).
Unsatisfiable Queries
E.g. \(\sigma_{1=a}(\sigma_{1=b}(R))\) where \(\arity(R) \geq 1\) and \(a \neq b\).
This is equivalent to \(q^\emptyset\).
Generalized SPC Algebra
- Intersection (\(\cap\)): can be simulated using selection, projection, and cross-product.
- Generalized Selection (\(\sigma\)): permits the specification of multiple conditions.
- Positive Conjunctive Selection Formula: \(F = \gamma_1 \land \ldots \land \gamma_n\) (\(n \geq 1\)) where each \(\gamma_i\) is either \(j = k\) or \(j = a\)
- Positive Conjunctive Selection Operator: an expression of the form \(\sigma_F\), where \(F\) is a positive conjunctive selection formula
- can be simulated by a sequence of selections
- Equi-Join (\(\bowtie\)): a binary operator that combines cross-product and selection
- Equi-Join Condition: \(F = \gamma_1 \land \ldots \land \gamma_n\) (\(n \geq 1\)) where each \(\gamma_i\) is of the form \(j = k\)
- Equi-Join Operator: an expression of the form \(I \bowtie_F J\), where \(F\) is an equi-join condition
- can be simulated by \(\sigma_{F’}(I \times J)\), where \(F’\) is obtained from \(F\) by replacing each condition \(j = k\) with \(j = \arity(I) + k\)
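The equi-join simulation can be sketched as follows (toy code; the `(j, k)` encoding of join conditions and the function names are my own conventions):

```python
def cross(I, J):
    # Cross-product of instances represented as sets of tuples.
    return {t + s for t in I for s in J}

def equijoin(I, J, conditions, arity_I):
    """I join_F J: conditions is a list of (j, k), meaning coordinate j
    of I must equal coordinate k of J. Simulated by shifting each
    condition j = k to j = arity(I) + k on the cross-product."""
    shifted = [(j, arity_I + k) for j, k in conditions]
    return {t for t in cross(I, J)
            if all(t[j - 1] == t[k - 1] for j, k in shifted)}

I = {(1, "a"), (2, "b")}
J = {("a", 10), ("c", 20)}
# join on: 2nd coordinate of I = 1st coordinate of J
assert equijoin(I, J, [(2, 1)], arity_I=2) == {(1, "a", "a", 10)}
```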
Normal Form
\[ \pi_{j_1, \ldots, j_n}( \{ \langle a_1 \rangle \} \times \ldots \times \{ \langle a_m \rangle \} \times \sigma_F(R_1 \times \ldots \times R_k) ) \] where
- \(n \geq 0; m \geq 0\);
- \(a_1, \ldots, a_m \in \dom\);
- \(\{1, \ldots, m\} \subseteq \{j_1, \ldots, j_n\}\);
- \(R_1, \ldots, R_k\) are relation names (repeats permitted);
- \(F\) is a positive conjunctive selection formula.
For each (generalized) SPC query, there is an equivalent SPC query in normal form.
The proof is based on repeated application of the following equivalence-preserving SPC algebra rewrite rules (or transformations):
- Merge-Select: \[ \sigma_{F}( \sigma_{F’}(q) ) \equiv \sigma_{F \land F’}(q) \]
- Merge-Project: \[ \pi_{\boldsymbol{j}}( \pi_{\boldsymbol{k}}(q) ) \equiv \pi_{\boldsymbol{l}}(q), \text{ where } l_i = k_{j_i} \text{ for each term } l_i \text{ in } \boldsymbol{l} \]
- Push-Select-Through-Project: \[ \sigma_F( \pi_{\boldsymbol{j}}(q) ) \equiv \pi_{\boldsymbol{j}}( \sigma_{F’}(q) ), \text{ where } F’ \text{ is obtained from } F \text{ by replacing each } k \text{ with } j_k \]
- Push-Select-Through-Singleton: \[ \sigma_{1=j}(\langle a \rangle \times q) \equiv \langle a \rangle \times \sigma_{(j - 1) = a}(q) \]
- Associate-Cross: \[ (q_1 \times \ldots \times q_n) \times q \equiv q_1 \times \ldots \times q_n \times q \] \[ q \times (q_1 \times \ldots \times q_n) \equiv q_1 \times \ldots \times q_n \times q \]
- Commute-Cross: \[ (q \times q’) \equiv \pi_{\boldsymbol{j} \boldsymbol{j’}} (q’ \times q) \] where \(\boldsymbol{j’} = [1, \ldots, \arity(q’)]\) and \(\boldsymbol{j} = [\arity(q’) + 1, \ldots, \arity(q’) + \arity(q)]\)
- Push-Cross-Through-Select: \[ \sigma_F(q) \times q’ \equiv \sigma_{F}(q \times q’) \] \[ q \times \sigma_F(q’) \equiv \sigma_{F’}(q \times q’) \] where \(F’\) is obtained from \(F\) by replacing each \(j\) with \(\arity(q) + j\)
- Push-Cross-Through-Project: \[ \pi_{\boldsymbol{j}}(q) \times q’ \equiv \pi_{\boldsymbol{j} \boldsymbol{k}}(q \times q’) \] \[ q \times \pi_{\boldsymbol{j}}(q’) \equiv \pi_{\boldsymbol{k’} \boldsymbol{j’}}(q \times q’) \] where \(\boldsymbol{k} = [\arity(q) + 1, \ldots, \arity(q) + \arity(q’)]\), \(\boldsymbol{k’} = [1, \ldots, \arity(q)]\), and \(\boldsymbol{j’}\) is obtained from \(\boldsymbol{j}\) by replacing each \(j\) with \(\arity(q) + j\)
Notation for Rewriting
\[ \begin{split} q &\rightarrow_{\mathcal{S}} q’ \quad \text{or} \\ \text{simply} \quad q &\rightarrow q’ \end{split} \]
- for a set \(\mathcal{S}\) of rewrite rules and algebra expressions \(q, q’\),
- if \(\mathcal{S}\) is understood from the context,
- if \(q’\) is the result of replacing a subexpression of \(q\) according to one of the rules in \(\mathcal{S}\).
\(\rightarrow^*_{\mathcal{S}}\): the reflexive and transitive closure of \(\rightarrow_{\mathcal{S}}\).
A family \(\mathcal{S}\) of rewrite rules is sound if \(q \rightarrow_{\mathcal{S}} q’\) implies \(q \equiv q’\). If \(\mathcal{S}\) is sound, then \(q \rightarrow^*_{\mathcal{S}} q’\) implies \(q \equiv q’\).
The Named Perspective: The SPJR Algebra
Named conjunctive algebra:
- Selection (\(\sigma\))
- Projection (\(\pi\)); repeats not permitted
- Join (\(\bowtie\)); natural join
- Renaming (\(\delta\))
Selection
\(\sigma_{A=a}, \sigma_{A=B}\) where \(A, B \in \att, a \in \dom\)
These operators apply to any instance \(I\) with \(A, B \in \sort(I)\).
Projection
\(\pi_{A_1, \ldots, A_n}\) where \(n \geq 0\) and \(A_1, \ldots, A_n \in \sort(I)\)
(Natural) Join
\[ \begin{split} I \bowtie J = \{ t \text{ over } V \cup W \vert & \text{ for some } v \in I \text{ and } w \in J, \\ & \quad t[V] = v \land t[W] = w \} \end{split} \] where \(I\) and \(J\) have sorts \(V\) and \(W\), respectively.
- When \(\sort(I) = \sort(J)\), \(I \bowtie J = I \cap J\).
- When \(\sort(I) \cap \sort(J) = \emptyset\), \(I \bowtie J = I \times J\).
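The natural join, including the two special cases noted above, can be sketched with tuples as attribute-to-constant dicts (a toy model; names are my own):

```python
def natural_join(I, J):
    """I join J over instances represented as lists of dicts
    (attribute -> constant): pair up tuples that agree on all
    shared attributes, then merge them."""
    return [{**v, **w} for v in I for w in J
            if all(v[A] == w[A] for A in v.keys() & w.keys())]

Location = [{"Theater": "T1", "Address": "Addr1", "Phone": "P1"}]
Pariscope = [{"Theater": "T1", "Title": "A", "Schedule": "9:00"}]
assert natural_join(Location, Pariscope) == [
    {"Theater": "T1", "Address": "Addr1", "Phone": "P1",
     "Title": "A", "Schedule": "9:00"}
]
```

With equal sorts the shared-attribute test forces `v == w` (intersection); with disjoint sorts it is vacuously true (cross-product).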
Renaming
An attribute renaming for a finite set \(U\) of attributes is a one-one mapping from \(U\) to \(\att\).
\[ \delta_f(I) = \{ v \text{ over } f[U] \vert \text{ for some } u \in I, v(f(A)) = u(A) \text{ for each } A \in U \} \] where
- input is an instance \(I\) over \(U\)
- \(f\) is an attribute renaming for \(U\)
- which can be described by \((A, f(A))\), where \(f(A) \neq A\)
- usually written as \(A_1 A_2 \ldots A_n \rightarrow B_1 B_2 \ldots B_n\) to indicate that \(f(A_i) = B_i\) for each \(i \in [1, n] (n \geq 0)\)
Formal Inductive Definition
The base SPJR algebra queries are:
- Input Relation: Expression \(R\); with sort equal to \(\sort(R)\)
- Unary Singleton Constant: Expression \(\{ \langle A:a \rangle \}\), where \(a \in \dom\); with \(\sort = A\)
The remainder of the syntax and semantics of the SPJR algebra is defined in analogy to those of the SPC algebra.
Normal Form
\[ \pi_{B_1, \ldots, B_n}( \{ \langle A_1:a_1 \rangle \} \bowtie \ldots \bowtie \{ \langle A_m:a_m \rangle \} \bowtie \sigma_F(\delta_{f_1}(R_1) \bowtie \ldots \bowtie \delta_{f_k}(R_k)) ) \] where
- \(n \geq 0; m \geq 0\);
- \(a_1, \ldots, a_m \in \dom\);
- \(\{A_1, \ldots, A_m\} \subseteq \{B_1, \ldots, B_n\}\);
- the \(A_i\)s are distinct;
- \(R_1, \ldots, R_k\) are relation names (repeats permitted);
- \(\delta_{f_i}\) is a renaming operator for \(\sort(R_i)\);
- no \(A_i\) occurs in any \(\delta_{f_j}(R_j)\);
- the sorts of \(\delta_{f_1}(R_1), \ldots, \delta_{f_k}(R_k)\) are pairwise disjoint;
- \(F\) is a positive conjunctive selection formula.
For each SPJR query, there is an equivalent SPJR query in normal form.
Equivalence Theorem
- Lemma: The SPC and SPJR algebras are equivalent.
- Equivalence Theorem: The rule-based conjunctive queries, tableau queries, conjunctive calculus queries, satisfiable SPC algebra, and satisfiable SPJR algebra are all equivalent.
Adding Union
The following have equivalent expressive power:
- the nonrecursive datalog programs (with single relation target)
- the SPCU queries
- the SPJRU queries
- union can only be applied to expressions having the same sort
Nonrecursive Datalog Program
A nonrecursive datalog program (nr-datalog program) over schema \(\dbschema{R}\) is a set of rules:
\[ \begin{split} S_1 & \leftarrow \body_1 \\ S_2 & \leftarrow \body_2 \\ & \vdots \\ S_m & \leftarrow \body_m \end{split} \] where
- no relation name in \(\dbschema{R}\) occurs in a rule head
- the same relation name may appear in more than one rule head
- there is some ordering \(r_1, \ldots, r_m\) of the rules such that the relation name in the head of \(r_i\) does not occur in the body of any rule \(r_j\) with \(j \leq i\)
Union and the Conjunctive Calculus
Simply permitting disjunction (denoted \(\lor\)) in the formula of a conjunctive calculus along with conjunction can have serious consequences.
E.g.,
\[ q = \{ x, y, z \vert R(x, y) \lor R(y, z) \} \]
The answer of \(q\) on nonempty instance \(I\) will be
\[ q(I)= (I \times \dom) \cup (\dom \times I) \]
The reason is that \(R(x, y)\) only requires the first two components of \(\langle x, y, z \rangle\) to form a tuple in \(I\), so \(z\) can be anything in \(\dom\); symmetrically, \(R(y, z)\) only constrains the last two components, so \(x\) can be anything in \(\dom\).
Adding Negation: Algebra and Calculus
\( \newcommand{\att}{\textbf{att}} \) \( \newcommand{\attTotalOrder}{\leq_{\att}} \) \( \newcommand{\dom}{\textbf{dom}} \) \( \newcommand{\Dom}{\textit{Dom}} \) \( \newcommand{\relname}{\textbf{relname}} \) \( \newcommand{\sort}{\textit{sort}} \) \( \newcommand{\arity}{\textit{arity}} \) \( \newcommand{\finPowerSet}{\mathcal{P}^{\text{fin}}} \) \( \newcommand{\dbschema}[1]{\textbf{#1}} \) \( \newcommand{\dbinst}[1]{\textbf{#1}} \) \( \newcommand{\rel}[1]{\textit{#1}} \) \( \newcommand{\attname}[1]{\textit{#1}} \) \( \newcommand{\varSet}{\textbf{var}} \) \( \newcommand{\ans}{\textit{ans}} \) \( \newcommand{\var}{\textit{var}} \) \( \newcommand{\adom}{\textit{adom}} \) \( \newcommand{\tableau}{\textbf{T}} \) \( \newcommand{\free}{\textit{free}} \) \( \newcommand{\body}{\text{body}} \) \( \newcommand{\datalogNeg}{\textit{datalog}^\neg} \)
first-order queries can be expressed in:
- relational algebra
- relational calculus
- relational calculus is essentially first-order predicate calculus without function symbols
- nonrecursive datalog with negation
The Relational Algebra
difference operator (\(-\))
it can only be applied to expressions that have the same sort/arity.
The unnamed relational algebra is obtained by adding the difference operator to the SPCU algebra (usually also with intersection operator). Because union is present, nonsingleton constant relations may be used.
The named relational algebra is obtained in an analogous way.
The unnamed and named relational algebras are equivalent in expressive power.
Nonrecursive Datalog with Negation
To obtain a rule-based language with expressive power equivalent to the relational algebra, we extend nonrecursive datalog programs by permitting negative literals in rule bodies.
nonrecursive datalog with negation (nonrecursive \(\datalogNeg\), nr-\(\datalogNeg\)) has the rule of the form:
\[ q : \quad S(u) \leftarrow L_1, \ldots, L_n \] where
- \(S\) is a relation name
- \(u\) is a free tuple of appropriate arity
- each \(L_i\) is a literal
- a literal is an expression of the form \(R(v)\) or \(\neg R(v)\), where
- \(R\) is a relation name and
- \(v\) is a free tuple of appropriate arity
- the rule is range restricted if each variable \(x\) occurring in the rule occurs in at least one literal of the form \(R(v)\) in the rule body
- unless otherwise specified, all \(\datalogNeg\) rules are range restricted
Semantics
The image of \(\dbinst{I}\) under \(q\) is
\[ \begin{split} q(\dbinst{I}) = \{ v(u) \vert & v \text{ is a valuation and for each } i \in [1, n], \\ & \quad v(u_i) \in \dbinst{I}(R_i), \text{ if } L_i = R_i(u_i), \text{ and } \\ & \quad v(u_i) \not \in \dbinst{I}(R_i), \text{ if } L_i = \neg R_i(u_i) \} \end{split} \]
This image can be expressed as a difference \(q_1 - q_2\),
where \(q_1\) is an SPC query and \(q_2\) is an SPCU query.
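This semantics can again be evaluated by brute force over the active domain. The sketch below is a toy (the literal encoding `(positive, relname, terms)` and the variable convention are my own assumptions, not from the book); range restriction is what makes the enumeration safe:

```python
from itertools import product

def is_var(t):
    # Toy convention: strings starting with "x" are variables.
    return isinstance(t, str) and t.startswith("x")

def eval_rule(head, literals, db):
    """Evaluate one range-restricted nr-datalog^neg rule.
    literals: list of (positive: bool, relname, terms)."""
    variables = sorted({t for _, _, u in literals for t in u if is_var(t)})
    adom = {c for rel in db.values() for t in rel for c in t}
    answers = set()
    for vals in product(adom, repeat=len(variables)):
        v = dict(zip(variables, vals))
        ground = lambda u: tuple(v[t] if is_var(t) else t for t in u)
        # A positive literal must match the instance; a negative one must not.
        if all((ground(u) in db[r]) == positive
               for positive, r, u in literals):
            answers.add(ground(head))
    return answers

db = {"R": {("a",), ("b",)}, "S": {("b",)}}
# ans(x1) <- R(x1), not S(x1)
assert eval_rule(("x1",),
                 [(True, "R", ("x1",)), (False, "S", ("x1",))],
                 db) == {("a",)}
```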
Example
Table Movies
Title | Director | Actor |
---|---|---|
A | X | Y |
List those movies for which all actors of the movie have acted under Hitchcock’s direction.
\[ \begin{split} \rel{Hitch-actor}(z) & \leftarrow \rel{Movies}(x, \text{“Hitchcock”}, z) \\ \rel{not-ans}(x) & \leftarrow \rel{Movies}(x, y, z), \neg \rel{Hitch-actor}(z) \\ \rel{ans}(x) & \leftarrow \rel{Movies}(x, y, z), \neg \rel{not-ans}(x) \end{split} \]
A Bad Example
The following program forms a kind of merging of the first two rules of the previous program.
\[ \begin{split} \rel{bad-not-ans}(x) & \leftarrow \rel{Movies}(x, y, z), \\ & \qquad \quad \neg \rel{Movies}(x’, \text{“Hitchcock”}, z), \\ & \qquad \quad \rel{Movies}(x’, \text{“Hitchcock”}, z’) \\ \rel{ans}(x) & \leftarrow \rel{Movies}(x, y, z), \neg \rel{bad-not-ans}(x) \end{split} \]
Here \(x'\) ranges over all of Hitchcock's movies, and \(z\) over all actors.
Thus, \(\neg \rel{Movies}(x', \text{“Hitchcock”}, z)\) fails exactly for those \(z\) who acted in every one of Hitchcock's movies, so only such \(z\) are excluded.
Then, \(x\) is put into \(\rel{bad-not-ans}\) whenever some actor of \(x\) missed at least one of Hitchcock's movies.
So, the final \(\rel{ans}\) actually answers the following query:
List those movies for which no actor is not in all of Hitchcock’s movies.
i.e.
List those movies for which all actors of the movie acted in all of Hitchcock’s movies.
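On a small invented instance (names hypothetical), one can check directly that the merged program computes this stronger query and disagrees with the correct three-rule program:

```python
# Hypothetical instance of Movies(title, director, actor).
movies = {
    ("M1", "Hitchcock", "a1"),
    ("M1", "Hitchcock", "a2"),
    ("M2", "Hitchcock", "a1"),
    ("M3", "Other", "a1"),
}

titles = {x for (x, y, z) in movies}
hitch_actor = {z for (x, d, z) in movies if d == "Hitchcock"}

# Correct program: all actors of the movie acted under Hitchcock.
ans = titles - {x for (x, y, z) in movies if z not in hitch_actor}

# Bad program: bad-not-ans(x) holds if some actor z of x and some
# Hitchcock movie x' exist such that z did not act in x'.
hitch_movies = {x for (x, d, z) in movies if d == "Hitchcock"}
bad_not_ans = {x for (x, y, z) in movies
               if any((xp, "Hitchcock", z) not in movies for xp in hitch_movies)}
bad_ans = titles - bad_not_ans

print(sorted(ans))      # ['M1', 'M2', 'M3']
print(sorted(bad_ans))  # ['M2', 'M3']: M1 is lost because a2 missed M2
```

M1 is a Hitchcock movie, yet the bad program rejects it: its actor a2 did not act in the Hitchcock movie M2.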
Expressive Power
The relational algebras and the family of nr-\(\datalogNeg\) programs with a single output relation have equivalent expressive power.
The Relational Calculus
This flexibility brings a nontrivial cost:
used without restriction, the calculus can easily express queries whose answers are infinite.
The calculus presented in this chapter is sometimes called the domain calculus because the variables range over elements of the underlying domain of values.
Well-Formed Formulas, Revisited
We obtain the relational calculus by adding negation (\(\neg\)), disjunction (\(\lor\)), and universal quantification (\(\forall\)) to the conjunctive calculus with equality.
Both disjunction and universal quantification can be viewed as consequences of adding negation:
- \(\varphi \lor \psi \equiv \neg (\neg \varphi \land \neg \psi)\)
- \(\forall x \, \varphi \equiv \neg \exists x \, \neg \varphi\)
Formal Definition
For a given input schema \(\dbschema{R}\),
the base formulas include:
- atoms over \(\dbschema{R}\)
- equality (inequality) atoms of the form \(e = e'\) (\(e \neq e'\)) for terms \(e, e'\)
the (well-formed) formulas include the base formulas and formulas of the form:
- \(\varphi \land \psi\)
- where \(\varphi\) and \(\psi\) are formulas over \(\dbschema{R}\)
- \(\varphi \lor \psi\)
- where \(\varphi\) and \(\psi\) are formulas over \(\dbschema{R}\)
- \(\neg \varphi\)
- where \(\varphi\) is a formula over \(\dbschema{R}\)
- \(\exists x \, \varphi\)
- where \(x\) is a variable and \(\varphi\) is a formula over \(\dbschema{R}\)
- \(\forall x \, \varphi\)
- where \(x\) is a variable and \(\varphi\) is a formula over \(\dbschema{R}\)
\(e \neq e'\) is viewed as an abbreviation of \(\neg (e = e')\) in some contexts.
Additional Logical Connectives
- implies (\(\rightarrow\))
- \(\varphi \rightarrow \psi \equiv \neg \varphi \lor \psi\)
- is equivalent to (\(\leftrightarrow\))
- \(\varphi \leftrightarrow \psi \equiv (\varphi \land \psi) \lor (\neg \varphi \land \neg \psi)\)
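The two derived connectives can be sanity-checked the same way, by exhausting all truth assignments:

```python
from itertools import product

# φ → ψ defined as ¬φ ∨ ψ; φ ↔ ψ defined as (φ ∧ ψ) ∨ (¬φ ∧ ¬ψ).
implies = lambda p, q: (not p) or q
iff = lambda p, q: (p and q) or ((not p) and (not q))

for p, q in product([False, True], repeat=2):
    assert implies(p, q) == (not (p and not q))  # false only when p ∧ ¬q
    assert iff(p, q) == (p == q)                 # true iff same truth value
```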
G1 Pain Points
About Alcohol
- G1/G2: You must not drive if you have been drinking alcohol. Your blood-alcohol level must be zero.
- All drivers who are 21 and under, regardless of license class, must have a BAC level of zero when operating a motor vehicle.
- Below 0.05%: OK
- 0.05%–0.08%: the "warn range" -> results in a warn-range suspension; also not OK for an accompanying driver
- Over 0.08%: not OK
About Demerit Points
- New drivers: 9+ points -> suspended for 60 days
- Fully licensed drivers: 15+ points -> suspended for 30 days
- 7: Failing to remain at the scene of a collision, Failing to stop for police
- 6: careless driving, racing, exceeding the speed limit by 40+ km/h (where the limit is at most 80 km/h) or by 50+ km/h, failing to stop for a school bus.
- 5: only bus drivers
- 4: exceeding the speed limit by 30-49 km/h, following too closely, failing to stop at a pedestrian crossover.
- 3: exceeding the speed limit by 16-29 km/h, stop/yield sign, failing to report collision, wrong way (closed road/one-way road), emergency vehicles, HOV, opening a door, radar detector, crowded seat.
- 2: seatbelt, improper turn, lower headlight beam, sharing the road, signal, slow, towing, reversing on a highway
About Distance
150m
- Headlights must shine a white light that can be seen at least 150 meters in front.
- Red rear lights should be seen 150 meters away.
- When you use high-beam headlights, remember to switch to low beams within 150 meters of an oncoming vehicle.
- Never make a U-turn unless you can see at least 150 meters in both directions.
- It is illegal to follow within 150 meters of a fire vehicle responding to an alarm.
- Snow-removal vehicles on public roadways are equipped with flashing blue lights that can be seen from 150 meters.
- Roadside stop: A 150-meter gap in both directions provides enough space to perform the move safely.
110m
- Headlights must shine a white light that is strong enough to light up objects 110 meters away.
100m
- Do not park on or within 100 meters of a bridge.
60m
- Use your low beams when you are less than 60 meters behind another vehicle unless you are passing it.
30m
- Passing within 30 meters of a pedestrian crossover is not permitted.
- Passing left of a centerline is not permitted within 30 meters of a bridge, viaduct or tunnel.
28m
- In fact, a recent study found that drivers who were texting or changing music on their phones traveled 28 meters further (nearly half a hockey rink) before responding to a hazard than drivers who were paying attention.
20m
- If you are coming from behind the bus, stop at least 20 meters away.
15m
- Do not park within 15 meters of the nearest rail of a level railway crossing.
- Do not park within 15 meters of an intersection controlled by traffic lights.
9m
- Do not park within 9 meters of an intersection.
6m
- Do not park within six meters of a public entrance to a hotel, theatre or public hall when it is open to the public.
5m
- If a train is coming, stop at least 5 meters from the nearest rail or gate.
3m
- Do not park within 3 meters of a fire hydrant.
2m
- At streetcar stops, stay at least 2 meters behind the rear doors where passengers are getting off or on.
1m
- Bicycles and mopeds traveling at a lower speed than other traffic are expected to ride about 1 meter from the curb or parked cars.
- When passing a cyclist, drivers of motor vehicles must maintain a minimum distance of 1 meter, where practical, between their vehicle and the cyclist.
- Drive alongside, or parallel to, the vehicle ahead of the empty space, leaving about 1 meter between the vehicles.
60cm
- Roadside stop: Leave at least 60 centimeters between your vehicle and the parked vehicle.
30cm
- Roadside stop: You should not be more than about 30 centimeters away from the curb or edge of the road.
Other Numbers
10 times
- Texting or browsing on your phone takes your eyes off the road and increases your risk of crashing by 10 times.