Hongxu Xu
PhD Student (May 2025 – Present)
Supervised by Prof. Chengnian Sun
Cheriton School of Computer Science
University of Waterloo, Canada
B.Sc. in Computer Science (Sep 2014 – Jun 2018)
Beijing Normal University, China
Research Interests
- Software Engineering
- Programming Languages
without a focus yet…
Socials
- GitHub: https://github.com/xuhongxu96
- LinkedIn: https://www.linkedin.com/in/xuhongxu/
- Email: h4️⃣4️⃣5️⃣xu 🌀 uwaterloo.ca
If you’d like to hear more about my thoughts and life, consider following me on:
-
Substack: https://hongxuxu.substack.com/ (English only)
Subscribe to my newsletter on Substack to get my latest updates. -
微信公众号: xuhongxu_it (Chinese only)
Scan the QR code on the right to follow my WeChat Official Account.
扫描右侧二维码 关注我的微信公众号。
Misc.
- Chinese Name: 许宏旭
- Literally means “宏: Great 旭: Rising Sun”
- Pronunciation:
Hong-shyoo(IPA:[xʊŋ ɕy])
Publications
Papers
Nothing yet…
Books
Feb 2024CMake构建实战:项目开发卷 (CMake Build Practice: Project Development Volume)
Published by 人民邮电出版社 (Posts & Telecom Press, China)
Produced by 异步图书 (epubit)
Experience
PhD Student (May 2025 – Present)
Supervised by Prof. Chengnian Sun
Cheriton School of Computer Science
University of Waterloo, Canada
Senior SDE (Sep 2022 – Mar 2025)
Seed/Data Speech Team
ByteDance, Shanghai, China
Led the development of the Text-to-Speech engine and contributed to the Doubao AI assistant application.
SDE-2 (Aug 2021 – Sep 2022)
MSAI Team
Microsoft STC-Asia, Suzhou, China
Led the development of Microsoft WordBreaker and initiated a modern NLP toolkit for Office 365.
SDE (Jul 2018 – Aug 2021)
MSAI Team
Microsoft STC-Asia
Beijing, China (Relocated to Suzhou, Jiangsu, China in May 2019)
Worked on 20-year-old Microsoft WordBreaker.
Short-Term Contributor (Nov 2020 – Jan 2021)
Windows APS Team (temporary assignment)
Microsoft STC-Asia, Suzhou, China
Contributed to the formation of the new team, and Windows 11 application development.
B.Sc. in Computer Science (Sep 2014 – Jun 2018)
Beijing Normal University, China
SDE Intern (Jul 2017 – Dec 2017)
Bing Search Relevance Team
Microsoft STC-Asia, Beijing, China
Answer triggering model development for Bing Search.
Awards
- Outstanding Graduate, Beijing Normal University, 2018
- Top Ten Volunteer, Beijing Normal University, 2015
- First Prize, National Olympiad in Informatics in Provinces (NOIP), 2013
Teaching
Note
IA: Instructional ApprenticeTA: Teaching Assistant
IA, CS 136L - Tools and Techniques for Software Development, Winter 2026IA, CS 246 - Object-Oriented Software Development, Fall 2025TA, CS 246 - Object-Oriented Software Development, Spring 2025
My Thoughts and Life
Note
I share technical content on this website, including course notes, projects, and research.
For more personal thoughts and life updates, please check out my Substack newsletter and WeChat Official Account below.
Tip
Follow me on Substack and WeChat!
![]()
Substack: https://hongxuxu.substack.com/ (English only)
Subscribe to my newsletter on Substack to get my latest updates.微信公众号: xuhongxu_it (Chinese only)
Scan the QR code on the right to follow my WeChat Official Account.
扫描右侧二维码 关注我的微信公众号。
Selected Substack Posts
My 2025
G1 Pain Points
CS 452/652: Real-Time Programming (the Train Course)
Spring 25
I dropped this course after I learned that it is really time-consuming.
However, before I dropped it, I did some research and found some useful resources, which I summarized in MarklinSim All-in-One repo.
In short, follow the readme in the repo to set up:
- MarklinSim (an emulator for the train kit)
- QEMU (an emulator for the Raspberry Pi 4B)
- Example Image (your real-time system).
I believe it will save you a lot of time. Good luck!
CS 480/680: Introduction to Machine Learning
Spring 25
This is not a hard course.
CS 848: Advanced Topics in Database: Algorithmic Aspects of Database Query Processing
Fall 25
https://cs.uwaterloo.ca/~xiaohu/courses/CS848/CS848-F25.html
My Notes
My course on database theory, specifically regarding Query Processing algorithms, is coming to an end soon. It has been very rewarding, so I’m writing this down to record what I’ve learned.
I originally chose this course as a challenge because I couldn’t understand a single paper mentioned in the syllabus—I didn’t even know what problems they were trying to solve. After a semester of struggling to grasp just the surface, I discovered that this is actually quite an interesting—albeit still very difficult—field.
I am writing this note to document the relatively simple and fundamental parts. I also want to organize the logical thread—what is the problem, how is it solved, what new problems arise from that solution, and how are those solved? I hope this will help me pick it up quickly if I need it in the future, and also serve as a simple introduction for readers.
Note
These notes were originally written in Chinese, and translated by Gemini.
Therefore, some expressions may seem a bit unnatural in English. I will gradually refine them over time.
From Pairwise Join to Global Perspective
Motivating Example: Different Orders of Pairwise JOINs
Table 1: Student_Teacher
| Student | Teacher |
|---|---|
| S0 | T0 |
| S0 | T1 |
| S1 | T2 |
| S2 | T2 |
Table 2: Teacher_Lab
| Teacher | Lab |
|---|---|
| T0 | L0 |
| T0 | L1 |
| T1 | L0 |
| T1 | L1 |
| T2 | L2 |
| T3 | L3 |
Table 3: Lab_Device
| Lab | Device |
|---|---|
| L2 | GPU0 |
| L3 | GPU0 |
| L3 | GPU1 |
Assume we have the 3 tables above in a database. We allow the same student to have multiple teachers, and the same teacher to guide multiple students, etc. Now, we want to see which devices each student can access via the laboratories their teachers belong to.
Any reader who has used SQL should know how to query this (the SQL below is simplified for demonstration and is not standard syntax):
SELECT * FROM Student_Teacher
INNER JOIN Teacher_Lab ON Teacher
INNER JOIN Lab_Device ON Lab
So, how do we obtain the execution results of this SQL statement in the database?
Since this involves three data tables, it is easy to see that we can perform Pairwise Joins. There are several options for the order:
Order 1 (Join Table 1 and Table 2 first)
Intermediate_Result = SELECT * FROM Student_Teacher INNER JOIN Teacher_Lab ON Teacher
Final_Result = SELECT * FROM Intermediate_Result INNER JOIN Lab_Device ON Lab
Order 2 (Join Table 2 and Table 3 first)
Intermediate_Result = SELECT * FROM Teacher_Lab INNER JOIN Lab_Device ON Lab
Final_Result = SELECT * FROM Student_Teacher INNER JOIN Intermediate_Result ON Teacher
Order 3 (Join Table 1 and Table 3 first)
Intermediate_Result = SELECT * FROM Student_Teacher CROSS JOIN Lab_Device
Final_Result = SELECT * FROM Intermediate_Result INNER JOIN Teacher_Lab ON Teacher, Lab
graph BT
subgraph "Order 3"
direction BT
T3_A[("Student_Teacher")]:::table
T3_C[("Lab_Device")]:::table
T3_B[("Teacher_Lab")]:::table
J3_1{{"CROSS JOIN (Cartesian)"}}:::op
Res3["_Final Result_<br>JOIN ON Teacher & Lab"]:::result
%% Cross join first
T3_A --> J3_1
T3_C --> J3_1
%% Then join with the middle table
J3_1 --> Res3
T3_B --> Res3
end
subgraph "Order 2"
direction BT
T2_A[("Student_Teacher")]:::table
T2_B[("Teacher_Lab")]:::table
T2_C[("Lab_Device")]:::table
J2_1{{"JOIN ON Lab"}}:::op
Res2["_Final Result_<br>JOIN ON Teacher"]:::result
T2_B --> J2_1
T2_C --> J2_1
T2_A --> Res2
J2_1 --> Res2
end
subgraph "Order 1"
direction BT
T1_A[("Student_Teacher")]:::table
T1_B[("Teacher_Lab")]:::table
T1_C[("Lab_Device")]:::table
J1_1{{"JOIN ON Teacher"}}:::op
Res1["_Final Result_<br>JOIN ON Lab"]:::result
T1_A --> J1_1
T1_B --> J1_1
J1_1 --> Res1
T1_C --> Res1
%% Labeling the final join edge for clarity
linkStyle 2 stroke-width:2px,fill:none,stroke:green;
linkStyle 3 stroke-width:2px,fill:none,stroke:green;
end
Which one is better? Order 3 involves a CROSS JOIN (Cartesian product), so it can basically be ruled out immediately.
For any execution order, the input data (database/tables) is the same, and the output final result must also be the same. The only place where differences can occur is in the intermediate results. Let’s simulate the execution process.
The intermediate result for Order 1 is as follows, with a total of 6 rows:
| Student | Teacher | Lab |
|---|---|---|
| S0 | T0 | L0 |
| S0 | T1 | L1 |
| S0 | T0 | L0 |
| S0 | T1 | L1 |
| S1 | T2 | L2 |
| S2 | T2 | L2 |
The intermediate result for Order 2 is as follows, with a total of 3 rows:
| Teacher | Lab | Device |
|---|---|---|
| T2 | L2 | GPU0 |
| T3 | L3 | GPU0 |
| T3 | L3 | GPU1 |
Regardless of the order, the final result is as follows, with a total of 2 rows:
| Student | Teacher | Lab | Device |
|---|---|---|---|
| S1 | T2 | L2 | GPU0 |
| S2 | T2 | L2 | GPU0 |
It is evident that Order 2 is superior because it outputs fewer intermediate results. These intermediate results not only take up memory but also participate in subsequent JOIN calculations, so obviously, the fewer the better.
This is exactly what modern Database Management Systems (DBMS) usually do: use Pairwise JOINs while selecting a potentially optimal JOIN order based on statistics and rules.
A More Intuitive View
We can represent this database as the path graph below, where each edge represents a row of data in a table:
graph LR
%% Define styling to match the handwritten dot style
classDef dot stroke-width:2px;
%% Layer definitions for alignment
subgraph Students
S0(( S0 ))
S1(( S1 ))
S2(( S2 ))
end
subgraph Teachers
T0(( T0 ))
T1(( T1 ))
T2(( T2 ))
T3(( T3 ))
end
subgraph Labs
L0(( L0 ))
L1(( L1 ))
L2(( L2 ))
L3(( L3 ))
end
subgraph Devices
GPU0(( GPU0 ))
GPU1(( GPU1 ))
end
S0 --- T0
S0 --- T1
S1 --- T2
S2 --- T2
T0 --- L0
T0 --- L1
T1 --- L0
T1 --- L1
T2 --- L2
T3 --- L3
L2 --- GPU0
L3 --- GPU0
L3 --- GPU1
%% Apply styling
class S0,S1,S2,T0,T1,T2,T3,L0,L1,L2,L3,GPU0,GPU1 dot;
Consider Order 1, which looks for paths that can traverse all fields completely from Left to Right:
graph LR
%% Define styling to match the handwritten dot style
classDef dot stroke-width:2px;
classDef thick-line stroke-width:10px;
%% Layer definitions for alignment
subgraph Students
S0(( S0 ))
S1(( S1 ))
S2(( S2 ))
end
subgraph Teachers
T0(( T0 ))
T1(( T1 ))
T2(( T2 ))
T3(( T3 ))
end
subgraph Labs
L0(( L0 ))
L1(( L1 ))
L2(( L2 ))
L3(( L3 ))
end
subgraph Devices
GPU0(( GPU0 ))
GPU1(( GPU1 ))
end
S0 === T0
S0 === T1
S1 === T2
S2 === T2
T0 === L0
T0 === L1
T1 === L0
T1 === L1
T2 === L2
T3 --- L3
L2 === GPU0
L3 --- GPU0
L3 --- GPU1
%% Apply styling
class S0,S1,S2,T0,T1,T2,T3,L0,L1,L2,L3,GPU0,GPU1 dot;
The intermediate result of Order 1 contains three fields: “Student”, “Teacher”, and “Lab”. There are a total of 6 paths that can go from “Student” to “Lab” (Left to Right), which is exactly the number of rows in the intermediate result.
Similarly, consider Order 2, looking for paths from Right to Left:
graph RL
%% Define styling to match the handwritten dot style
classDef dot stroke-width:2px;
classDef thick-line stroke-width:10px;
%% Layer definitions for alignment
subgraph Students
S0(( S0 ))
S1(( S1 ))
S2(( S2 ))
end
subgraph Teachers
T0(( T0 ))
T1(( T1 ))
T2(( T2 ))
T3(( T3 ))
end
subgraph Labs
L0(( L0 ))
L1(( L1 ))
L2(( L2 ))
L3(( L3 ))
end
subgraph Devices
GPU0(( GPU0 ))
GPU1(( GPU1 ))
end
T0 --- S0
T1 --- S0
T2 === S1
T2 === S2
L0 --- T0
L1 --- T0
L0 --- T1
L1 --- T1
L2 === T2
L3 === T3
GPU0 === L2
GPU0 === L3
GPU1 === L3
%% Apply styling
class S0,S1,S2,T0,T1,T2,T3,L0,L1,L2,L3,GPU0,GPU1 dot;
The number of rows in the intermediate result for Order 2 is also equal to the number of paths from the “Device” field to the “Teacher” field.
In fact, whether choosing Left-to-Right or Right-to-Left, both orders produce some waste in the intermediate results:
- When going Left-to-Right, reaching L0 and L1 is a dead end (cannot reach a Device).
- When going Right-to-Left, reaching T3 is a dead end (cannot reach a Student).
Taking it to the Extreme
graph LR
%% Define styling to match the handwritten dot style
classDef dot stroke-width:2px;
classDef thick-line stroke-width:4px;
%% Layer definitions for alignment
subgraph Students
S0(( S0 ))
S1(( S1 ))
S2(( S2 ))
S3(( S3 ))
S4(( S4 ))
end
subgraph Teachers
T0(( T0 ))
T1(( T1 ))
T2(( T2 ))
T3(( T3 ))
end
subgraph Labs
L0(( L0 ))
L1(( L1 ))
L2(( L2 ))
L3(( L3 ))
end
subgraph Devices
GPU0(( GPU0 ))
GPU1(( GPU1 ))
GPU2(( GPU2 ))
GPU3(( GPU3 ))
GPU4(( GPU4 ))
end
S0 --- T0
S1 --- T0
S2 --- T0
S3 --- T0
T0 --- L0
T0 --- L1
T1 --- L2
T2 --- L2
L2 --- GPU0
L2 --- GPU1
L2 --- GPU2
L2 --- GPU3
S4 --- T3 --- L3 --- GPU4
%% Apply styling
class S0,S1,S2,S3,S4,T0,T1,T2,T3,L0,L1,L2,L3,GPU0,GPU1,GPU2,GPU3,GPU4 dot;
The example shown above is very extreme: almost all students (S0-S3) chose the same teacher (T0), and almost all devices (GPU0-GPU3) are concentrated in the same lab (L2). However, the most popular teacher (T0) belongs to labs (L0/L1) that don’t have a single machine…
Assuming this is reality, it is easy to see that after performing the same JOIN query on this database instance, the result should contain only one row:
| Student | Teacher | Lab | Device |
|---|---|---|---|
| S4 | T3 | L3 | GPU4 |
However, if we adopt the “Pairwise JOIN” scheme, regardless of which JOIN order is chosen, the intermediate results will have 9 rows (interactions between T0’s students and L2’s devices or similar paths). When the data volume is larger and similar Data Skew exists, the waste of computing resources can be immense.
What should we do?
Global Perspective
Let’s continue to observe this extreme example. What causes the waste in intermediate results are those dead-end paths. So, is it possible to filter out these dead ends before generating intermediate results?
Yes, but the Pairwise JOIN method obviously won’t work; we need a Global Perspective.
The cause of data expansion is that data rows joining on the same field value in two tables create a Cartesian product. Therefore, we need to avoid performing a JOIN (or strictly speaking, avoid performing an INNER JOIN) before completing the filtering.
Why speak strictly? Because we actually still have to JOIN, but it is a special kind of JOIN: SEMI JOIN. It is perfectly suited for this filtering task.
First Pass: Left to Right
To filter out rows in the right table that cannot JOIN.
Teacher_Lab_1 = SELECT * FROM Teacher_Lab WHERE EXISTS(
SELECT * FROM Student_Teacher WHERE (Student_Teacher.Teacher = Teacher_Lab.Teacher)
)
Lab_Device_1 = SELECT * FROM Lab_Device WHERE EXISTS(
SELECT * FROM Teacher_Lab_1 WHERE (Teacher_Lab_1.Lab = Lab_Device.Lab)
)
graph LR
%% Define styling to match the handwritten dot style
classDef dot stroke-width:2px;
classDef dotrm fill:#aaa,stroke:#666,stroke-width:2px;
%% Layer definitions for alignment
subgraph Students
S0(( S0 ))
S1(( S1 ))
S2(( S2 ))
S3(( S3 ))
S4(( S4 ))
end
subgraph Teachers
T0(( T0 ))
T1(( T1 ))
T2(( T2 ))
T3(( T3 ))
end
subgraph Labs
L0(( L0 ))
L1(( L1 ))
L2(( L2 ))
L3(( L3 ))
end
subgraph Devices
GPU0(( GPU0 ))
GPU1(( GPU1 ))
GPU2(( GPU2 ))
GPU3(( GPU3 ))
GPU4(( GPU4 ))
end
S0 --- T0
S1 --- T0
S2 --- T0
S3 --- T0
T0 --- L0
T0 --- L1
T1 ~~~ L2
T2 ~~~ L2
L2 ~~~ GPU0
L2 ~~~ GPU1
L2 ~~~ GPU2
L2 ~~~ GPU3
S4 --- T3 --- L3 --- GPU4
%% Apply styling
class S0,S1,S2,S3,S4,T0,T3,L0,L1,L3,GPU4 dot;
class T1,T2,L2,GPU0,GPU1,GPU2,GPU3 dotrm;
We guarantee that the filtered data in the right-side table can definitely JOIN with all data tables to its left (including non-adjacent ones). Therefore, after completing the first pass of filtering, the data remaining in the right-most table is guaranteed to participate in the final result.
Second Pass: Right to Left
To filter out rows in the left table that cannot JOIN.
Teacher_Lab_2 = SELECT * FROM Teacher_Lab_1 WHERE EXISTS(
SELECT * FROM Lab_Device_1 WHERE (Teacher_Lab_1.Lab = Lab_Device_1.Lab)
)
Student_Teacher_1 = SELECT * FROM Student_Teacher WHERE EXISTS(
SELECT * FROM Teacher_Lab_2 WHERE (Student_Teacher.Teacher = Teacher_Lab_2.Teacher)
)
graph LR
%% Define styling to match the handwritten dot style
classDef dot stroke-width:2px;
classDef dotrm fill:#aaa,stroke:#666,stroke-width:2px;
%% Layer definitions for alignment
subgraph Students
S0(( S0 ))
S1(( S1 ))
S2(( S2 ))
S3(( S3 ))
S4(( S4 ))
end
subgraph Teachers
T0(( T0 ))
T1(( T1 ))
T2(( T2 ))
T3(( T3 ))
end
subgraph Labs
L0(( L0 ))
L1(( L1 ))
L2(( L2 ))
L3(( L3 ))
end
subgraph Devices
GPU0(( GPU0 ))
GPU1(( GPU1 ))
GPU2(( GPU2 ))
GPU3(( GPU3 ))
GPU4(( GPU4 ))
end
S0 ~~~ T0
S1 ~~~ T0
S2 ~~~ T0
S3 ~~~ T0
T0 ~~~ L0
T0 ~~~ L1
T1 ~~~ L2
T2 ~~~ L2
L2 ~~~ GPU0
L2 ~~~ GPU1
L2 ~~~ GPU2
L2 ~~~ GPU3
S4 --- T3 --- L3 --- GPU4
%% Apply styling
class S4,T3,L3,GPU4 dot;
class S0,S1,S2,S3,T0,L0,L1 dotrm;
class T1,T2,L2,GPU0,GPU1,GPU2,GPU3 dotrm;
We guarantee that the filtered data in the left-side table can definitely JOIN with all data tables to its right (including non-adjacent ones). Since the first pass guaranteed the right-most data participates in the final result, the second pass ensures that the remaining data in ALL tables will participate in the final result.
Third Pass: Left to Right
To complete the JOIN.
At this point, all paths that do not participate in the final JOIN result have been cleared away. We can now safely perform Pairwise JOINs.
Summary: Yannakakis Algorithm (Simplified)
What is introduced here is the Yannakakis Algorithm specialized for linear JOIN queries. Compared to Pairwise JOIN, its advantage lies in being sensitive to the result size.
The complexity of Pairwise JOIN is independent of the JOIN result size. Therefore, even if the final result is actually very small, it may produce a massive amount of intermediate results. For the database instance mentioned above, the time complexity of Pairwise JOIN is , whereas the time complexity of the Yannakakis algorithm shown here is , where is the data size and is the size of the JOIN result.
The worst-case time complexity for both is consistent. Assuming there are data tables with no common attributes, equivalent to doing a full Cartesian product: Pairwise JOIN time complexity is , and Yannakakis time complexity is , where .
JOIN algorithms with complexity are also known as Output-Optimal algorithms. Regardless of the JOIN algorithm, one must read all the data () and output all the results (), so is already optimal for a specific output size.
Graphical Representation of Joins
In the previous note, we discussed the specialized version of the Yannakakis algorithm for linear JOIN queries. However, queries in practice go far beyond just linear structures. So, how do we generalize the Yannakakis algorithm to various types of JOIN queries?
Graphical Representation
Recap: Path Graph of a Database Instance
Previously, we showed the path graph of a database instance. This represents the specific data content of a database instance (where every edge represents a row of data in a table):
graph LR
%% Define styling to match the handwritten dot style
classDef dot stroke-width:2px;
%% Layer definitions for alignment
subgraph Students
S0(( S0 ))
S1(( S1 ))
S2(( S2 ))
end
subgraph Teachers
T0(( T0 ))
T1(( T1 ))
T2(( T2 ))
T3(( T3 ))
end
subgraph Labs
L0(( L0 ))
L1(( L1 ))
L2(( L2 ))
L3(( L3 ))
end
subgraph Devices
GPU0(( GPU0 ))
GPU1(( GPU1 ))
end
S0 --- T0
S0 --- T1
S1 --- T2
S2 --- T2
T0 --- L0
T0 --- L1
T1 --- L0
T1 --- L1
T2 --- L2
T3 --- L3
L2 --- GPU0
L3 --- GPU0
L3 --- GPU1
%% Apply styling
class S0,S1,S2,T0,T1,T2,T3,L0,L1,L2,L3,GPU0,GPU1 dot;
But now, we want to represent the structure of the JOIN query itself graphically, without involving specific database instances—in other words, in a data-independent way.
Graphical Representation of Query Plans
Now, disregarding data content, the path graph above transforms into paths between data tables. Considering the Left-to-Right and Right-to-Left orders, we can draw the following two charts:
flowchart TD
subgraph Order 2
direction BT
A2["Student_Teacher (Student, Teacher)"]
B2["Teacher_Lab (Teacher, Lab)"]
C2["Lab_Device (Lab, Device)"]
C2 --> B2 --> A2
end
subgraph Order 1
direction BT
A["Student_Teacher (Student, Teacher)"]
B["Teacher_Lab (Teacher, Lab)"]
C["Lab_Device (Lab, Device)"]
A --> B --> C
end
In a path graph, paths can only be linked if adjacent tables share common attributes. We can mark these common attributes on the edges of the graph. Additionally, if we view this as an undirected graph, we don’t need to distinguish between the two orders.
flowchart BT
A["Student_Teacher (Student, Teacher)"]
B["Teacher_Lab (Teacher, Lab)"]
C["Lab_Device (Lab, Device)"]
A--Teacher---B
B--Lab---C
Now, let’s consider a slightly more complex example.
Graphical Representation of Star JOIN Queries
Star JOIN query structures are very common in practice, primarily existing in scenarios using foreign keys. Assume we have the following data tables:
- Orders (Order_ID, Warehouse_ID, Product_ID, Buyer_ID)
- Warehouses (Warehouse_ID, Warehouse_Address)
- Products (Product_ID, Product_Name)
- Buyers (Buyer_ID, Buyer_Name, Buyer_Address)
We want to retrieve all order details:
SELECT * FROM Orders
JOIN Warehouses ON Warehouse_ID
JOIN Products ON Product_ID
JOIN Buyers ON Buyer_ID
The graphical representation of this query is as follows:
flowchart TB
B[Warehouses]
A[Orders]
C[Products]
D[Buyers]
A --Warehouse_ID --- B
A --Product_ID --- C
D --Buyer_ID --- A
At first glance, this graph might not seem “star-shaped”; it looks more like a tree. However, if the Orders table had more foreign keys, I believe the reader would understand why this category of queries is called a Star Query.
Additionally, calling it a tree is also correct; a star structure is essentially a tree. In fact, the graphical representation of a JOIN query is called a Join Tree.
Hypergraphs and Join Trees
We have gained a superficial understanding of Join Trees, but we have not yet provided a formal definition because we are missing a tool—the Hypergraph. Hypergraphs can also be used to visually display JOIN queries; the difference is that Hypergraphs merely represent the structure of the JOIN query, while Join Trees imply information about the query execution plan.
Hypergraph
A Hypergraph consists of a finite number of vertices and a set of hyperedges, where represents the finite set of vertices and represents the set of hyperedges.
Unlike edges in a standard Graph which can only connect two endpoints, a Hyperedge can connect multiple vertices simultaneously, represented as a subset of vertices in the hypergraph, i.e., .
Hypergraphs map very naturally to JOIN queries:
- Each vertex represents a column (attribute) in the data tables.
- Each hyperedge contains multiple vertices, thus representing a data table.
Using the Star Query mentioned earlier as an example:

Note
In this hypergraph, a “circle” is actually a hyperedge, corresponding to a data table (the table name is connected by a dashed line); each word inside the circle is a vertex, corresponding to a column.
It took me some time to get used to the graphical representation of hypergraphs—after all, the “edges” in the graphs we usually encounter are represented by thin lines, unlike these “hyperedges” which actually have area.
Join Tree
The formal definition of a Join Tree corresponds to that of a hypergraph.
Given a hypergraph , its corresponding Join Tree is a tree that satisfies the following two conditions:
- The nodes of tree correspond one-to-one with the hyperedges () of hypergraph .
- For any hypergraph vertex , the set of tree nodes in corresponding to the hyperedges containing forms a connected subgraph. (This is often called the Running Intersection Property).
Let’s stick with the previous example. The hypergraph has 4 hyperedges, corresponding to 4 data tables: Buyers, Orders, Warehouses, Products.
Its corresponding Join Tree is as follows. We can see that the Join Tree has 4 nodes, corresponding one-to-one with the 4 hyperedges, thus satisfying the first condition.
flowchart TB
B["Warehouses(Warehouse_ID, Address)"]
A["Orders(Order_ID, Warehouse_ID, Product_ID, Buyer_ID)"]
C["Products(Product_ID, Name)"]
D["Buyers(Buyer_ID, Name, Address)"]
A --Warehouse_ID --- B
A --Product_ID --- C
D --Buyer_ID --- A
The phrasing of the second condition is slightly more complex, but understanding it is quite easy:
- “Hyperedges containing a common vertex ” simply means data tables that share a common field (e.g., “Buyers” and “Orders” share the “Buyer_ID” field).
- “Corresponding nodes in tree ” refers to finding the tree nodes corresponding to these data tables in the Join Tree (since condition 1 requires a one-to-one mapping).
- “Forms a connected subgraph”: This is a basic graph theory concept. For a tree (an undirected graph), if there is a path between two points, they are connected (the path doesn’t have to be a direct edge; it can pass through multiple edges).
In the Join Tree above, we can see that the two nodes “Buyers” and “Orders”, which share the common field “Buyer_ID”, are indeed connected. Iterating through every field can prove that this Join Tree satisfies the second condition.
Yannakakis Algorithm (Full Version)
With the Join Tree, we can now complete the Yannakakis algorithm introduced earlier on this tree. Unlike before, where we only performed JOINs on a linear sequence of tables, we can now perform JOINs on a tree, thereby supporting more types of queries.
Let’s Recap
The simplified algorithm flow mentioned in the previous note was:
- First Pass: Left to Right, filter out rows in the right-side table that cannot JOIN.
- After the first pass, data remaining in the right-most table is guaranteed to participate in the final result.
- Second Pass: Right to Left, filter out rows in the left-side table that cannot JOIN.
- The second pass guarantees that the remaining data in ALL tables will participate in the final result.
- Third Pass: Left to Right, complete the JOIN.
Generalizing to Join Trees
The characteristics of the two-pass filtering can easily be generalized to a tree structure.
- First Pass: Bottom-Up (from leaf nodes up to the root), filtering out rows in parent nodes that cannot JOIN with any child node.
- After the first pass, the data remaining in the root node is guaranteed to participate in the final result.
- Second Pass: Top-Down, filtering out rows in child nodes that cannot JOIN with the parent node.
- The second pass guarantees that the remaining data in ALL tables will participate in the final result.
- Third Pass: Bottom-Up, complete the JOIN.
Pseudo-code
Input: Join Tree , Set of Relations , Root Node . Output: Result of the JOIN .
- Preprocessing: Determine Parent-Child Relationships
For each node in :
the parent node of node in tree ; - Bottom-Up Phase
Visit node of in Bottom-Up order (excluding root ):
// Filter tuples in parent relation that do not match using child relation - Top-Down Phase
Visit node of in Top-Down order (excluding root ):
// Filter tuples in child relation that do not match using parent relation - Compute and Return JOIN Result
At this point, all relations have reached “Global Consistency” (every tuple is guaranteed to appear in the final JOIN result):
Return the natural join of all relations:
New Problem: Does Every Query Correspond to a Join Tree?
Suppose we have three data tables: .
Consider the following query:
SELECT * FROM R, S, T WHERE R.a = T.a AND R.b = S.b AND S.c = T.c
Can you find the Join Tree that corresponds to it?
Notes for Foundations of Databases
Foundations of Databases (The Alice Book)
The Alice Book is helpful for understanding the basic notations in database theory. I recommend to read the first two chapters to get familiar with the notations used in this course.
That said, the book is not required. In fact, paying real attention to the first several lectures should be sufficient to survive the course, and it turns out to be interesting.
In this section, I summarize some key concepts from the first two chapters of the Alice Book.
The Relational Model
Informal Terminology
- Relation: table
- Relation Name: name of the table
- Attribute: column name
- Tuple/Record: line in the table
- Domain: set of constants that entries of tuples can take
- Database Schema: specifying the structure of the database
- Database Instance: specifying the actual content in the database
Formal Definitions
- : a countably infinite set of attributes
- assume is fixed
- : a total order on
- the elements of a set of attributes are written according to unless otherwise specified
- : a countably infinite set of domains (disjoint from )
- assume is fixed
- a constant is an element of
- for most cases, the same domain of values is used for all attributes;
- otherwise, assume a mapping on , where is a set called the domain of attribute
- : a countably infinite set of relation names (disjoint from and )
- : a function from to (the finitary powerset of )
- is inifinite for each (possibly empty) finite set of attributes
- which allows multiple relations to have the same set of attributes
- is called the of a relation name
- : the of relation name
- is inifinite for each (possibly empty) finite set of attributes
- : a relation name or a relation schema
- Alternative notations:
- : indicating
- : indicating
- Alternative notations:
- : a database schema
- a nonempty finite set of relation names
- might be written as
Examples
Table Movies
| Title | Director | Actor |
|---|---|---|
| A | X | Y |
Table Location
| Theater | Address | Phone |
|---|---|---|
| T1 | Addr1 | P1 |
Table Pariscope
| Theater | Title | Schedule |
|---|---|---|
| T1 | A | 9:00 |
- The database schema
- The sorts of the relation names
Named v.s. Unnamed Attributes/Tuples
Named Perspective
E.g.
a tuple over a finite set of attributes (or over a relation schema ) is a total mapping (viewed as a function) from to .
- : is a tuple over
- : the value of on an attribute in ()
- : the restriction of to a subset , i.e., denotes the tuple over such that for all
Unnamed Perspective
E.g.
a tuple is an ordered n-tuple () of constants (i.e., an element of the Cartesian product )
- : the -th coordinate of
Correspondence
Because of the total order , a tuple (defined as a function) can be viewed as an ordered tuple with as a first component and as a second component. Ignoring the names, this tuple can be viewed as the ordered tuple .
Conversely, the ordered tuple can be viewed as a function over the set of integers with for each .
Conventional v.s. Logic Programming Relational Model
Conventional Perspective
a relation (instance) over a relation schema is a finite set of tuples with sort .
a database instance of database schema is a mapping with domain , such that is a relation over for each .
Logic Programming Perspective
This perspective is used primarily with the ordered-tuple perspective on tuples.
A fact over is an expression of the form ,
where for .
If , we sometimes write for .
a relation (instance) over a relation schema is a finite set of facts over .
a database instance of database schema is a finite set that is the union of relation instances over for all .
Examples
Assume .
Named and Conventional
Unnamed and Conventional
Named and Logic Programming
Unnamed and Logic Programming
Variables
an infinite set of variables will be used to range over elements of .
- a free tuple over or is a function from to
- an atom over is an expression of the form , where and each term
- a ground atom is an atom with no variables, i.e., a fact (see Logic Programming Perspective)
Notations
| Object | Notation |
|---|---|
| Constants | |
| Variables | |
| Sets of Variables | |
| Terms | |
| Attributes | |
| Sets of Attributes | |
| Relation Names | |
| Database Schemas | |
| Tuples | |
| Free Tuples | |
| Facts | |
| Atoms | |
| Relation Instances | |
| Database Instances |
Conjunctive Queries
a query (mapping) is from (or over) its input schema and to its output schema.
- query: a syntactic object
- query mapping: a function defined by a query interpreted under a specified semantics
- its domain: the family of all instances of an input schema
- its range: the family of instances of an output schema
- we often blur query and query mapping when the meaning is clear from context
- input schema: a specified relation or database schema
- output schema: a relation schema or database schema
- For a relation schema, the relation name may be specified as part of the query syntax or by the context
- denotes two queries and over are equivalent, i.e., they have the same output schema and for each instance over .
Logic-Based Perspective
Three versions of conjunctive queries:
- Rule-Based Conjunctive Queries
- Tableau Queries
- Conjunctive Calculus
Rule-Based Conjunctive Queries
Definition
A rule-based conjunctive query (or often more simply called rules) over a relation schema is an expression of the form where , are relation names in ; is a relation name not in ; and are free tuples (i.e., may use either variables or constants).
- : the set of variables occurring in .
- body:
- head:
- range restricted: each variable occurring in the head also occurs in the body
- all conjunctive queries considered here are range restricted
- valuation: a valuation over , a finite subset of , is a total function from to (of constants)
- extended to be identity on (so that the domain can contain both variables and constants)
- extended to map free tuples to tuples (so that the domain/range can be in the tuple form)
Semantics
Let be the query, and let be a database instance of schema . The image of under is:
- active domain
- of a database instance , denoted , is the set of constants that occur in
- of a relation instance , denoted , is the set of constants that occur in
- of a query , denoted , is the set of constants that occur in
- is an abbreviation for
- extensional relations: relations in the body of the query, i.e.,
- because they are known/provided by the input instance
- intensional relation: the relation in the head of the query, i.e.,
- because it is not stored and its value is computed on request by the query
- extensional database (edb): a database instance associated with the extensional relations
- intensional database (idb): the rule itself
- idb relation: the relation defined by the idb
Properties
Conjunctive queries are:
- monotonic: a query over is monotonic if for each over ,
- satisfiable: a query is satisfiable if there exists a database instance such that
Tableau Queries
A tableau query is simply a pair where is a tableau and each variable in also occurs in .
This is closest to the visual form provided by Query-By-Example (QBE).
- summary: the free tuple representing the tuples included in the answer to the query
- embedding: a valuation for the variables occurring in such that
- the output of on consists of all tuples for each embedding of into
- typed: a tableau query under the named perspective, where is over relation schema and , is typed if no variable of is associated with two distinct attributes in
Examples
Table Movies
| Title | Director | Actor |
|---|---|---|
| “Bergman” |
Table Location
| Theater | Address | Phone |
|---|---|---|
Table Pariscope
| Theater | Title | Schedule |
|---|---|---|
The above tableau query is typed because each variable is associated with only one attribute:
- :
- :
- :
- :
- :
- :
However, the following tableau query is untyped:
Table Movies
| Title | Director | Actor |
|---|---|---|
Because is associated with both and .
Conjunctive Calculus
The conjunctive query
can be expressed as the following conjunctive calculus query that has the same semantics:
where are all the variables occurring in the body and not the head.
Conjunctive Calculus Formula
Let be a relation schema. A (well-formed) formula over for the conjunctive calculus is an expression having one of the following forms:
- an atom over ;
- , where and are formulas over ; or
- , where is a variable and is a formula over .
An occurrence of a variable in formula is free if:
- is an atom; or
- and the occurrence of is free in or ; or
- , , and the occurrence of is free in .
: the set of free variables in .
An occurrence of a variable that is not free is bound.
Conjunctive Calculus Query
A conjunctive calculus query over database schema is an expression of the form
where
- is a conjunctive calculus formula,
- is a free tuple, and
- the set of variables occurring in is exactly .
For named perspective, the above query can be written as:
Note: In the previous chapter, a named tuple was usually denoted as:
But here we denote the named tuple as:
Not sure which one is better or whether the author did this intentionally.
Semantics
Valuation
A valuation over is a total function from to , which can be viewed as a syntactic expression of the form:
where
- is a listing of ,
- for each .
Interpretation as a Set
If , and , then is the valuation over that agrees with on and maps to .
Satisfaction
Let be a database schema, a conjunctive calculus formula over , and a valuation over . Then , if
- is an atom and ; or
- and and ; or
- and there exists a constant such that .
Image
Let be a conjunctive calculus query over . For an instance over , the image of under is:
- active domain
- of a formula , denoted , is the set of constants that occur in
- is an abbreviation for
- If , then the range of is contained in
- to evaluate a conjunctive calculus query, one need only consider valuations with range constrained in , i.e., only a finite number of them
- equivalent: conjunctive calculus formulas and over are equivalent,
if
- they have the same free variables and,
- for each over and valuation over , if and only if
Normal Form
Each conjunctive calculus query is equivalent to a conjunctive calculus query in normal form.
Expressiveness of Query Languages
Let and be two query languages.
- : is dominated by (or is weaker than ), if for each query , there exists a query such that .
- : and are equivalent, if and .
The rule-based conjunctive queries, tableau queries, and conjunctive calculus queries are all equivalent.
Incorporating Equality
The conjunctive query
can be expressed as:
Problem 1: Infinite Answers
Unrestricted rules with equality may yield infinite answers:
A rule-based conjunctive query with equality is a range-restricted rule-based conjunctive query with equality.
Problem 2: Unsatisfiable Queries
Consider the following query:
where is a unary relation and with .
Each satisfiable rule with equality is equivalent to a rule without equality.
No expressive power is gained if the query is satisfiable.
Query Composition and Views
a conjunctive query program (with or without equality) is a sequence of rules having the form:
where
- each is a distinct relation name not in ,
- for each , the only relation names that may occur in are in .
Closure under Composition
If conjunctive query program defines final relation , then there is a conjunctive query , possibly with equality, such that on all input instances , .
If is satisfiable, then there is a without equality.
Composition and User Views
views are specified as queries (or query programs), which may be
- materialized: a physical copy of the view is stored and maintained
- virtual: relevant information about the view is computed as needed
- queries against the virtual view generate composed queries against the underlying database
Algebraic Perspectives
The Unnamed Perspective: The SPC Algebra
Unnamed conjunctive algebra:
- Selection ()
- Projection ()
- Cross-Product (or Cartesian Product, )
Selection (“Horizontal” Operator)
Primitive Forms: and
where are positive integers, and
(constants are usually surrounded by quotes).
is sometimes called atomic selection.
Projection (“Vertical” Operator)
General Form:
where and are positive integers
(empty sequence is written ).
Cross-Product (or Cartesian Product)
Formal Inductive Definition
Let be a relation schema.
There are two kinds of base SPC (algebra) queries:
- Input Relation: Expression ; with arity equal to
- Unary Singleton Constant: Expression , where ; with arity equal to 1
The family of SPC (algebra) queries contains all above base SPC queries and, for SPC queries with arities , respectively,
- Selection: and , whenever and ; these have arity .
- Projection: , where ; this has arity .
- Cross-Product: ; this has arity .
Unsatisifiable Queries
E.g. where and .
This is equivalent to .
Generalized SPC Algebra
- Intersection (): is easily simulated by selection and cross-product.
- Generalized Selection (): permits the specification of multiple conditions.
- Positive Conjunctive Selection Formula: where each is either or
- Positive Conjunctive Selection Operator: an expression of the form , where is a positive conjunctive selection formula
- can be simulated by a sequence of selections
- Equi-Join (): a binary operator that combines cross-product and selection
- Equi-Join Condition: where each is of the form
- Equi-Join Operator: an expression of the form , where is an equi-join condition
- can be simulated by , where is obtained from by replacing each condition with
Normal Form
where
- ;
- ;
- ;
- are relation names (repeats permitted);
- is a positive conjunctive selection formula.
For each (generalized) SPC query, there is an equivalent SPC query in normal form.
The proof is based on repeated application of the following equivalence-preserving SPC algebra rewrite rules (or transformations):
- Merge-Select:
- Merge-Project:
- Push-Select-Though-Project:
- Push-Select-Though-Singleton:
- Associate-Cross:
- Commute-Cross: where and
- Push-Cross-Though-Select: where is obtained from by replacing each with
- Push-Cross-Though-Project: where is obtained from by replacing each with
Notation for Rewriting
- for a set of rewrite rules and algebra expressions ,
- if is understood from the context,
- if is the result of replacing a subexpression of according to one of the rules in .
: the reflexive and transitive closure of .
A family of rewrite rules is sound if implies . If is sound, then implies .
The Named Perspective: The SPJR Algebra
Named conjunctive algebra:
- Selection ()
- Projection (); repeats not permitted
- Join (); natural join
- Renaming ()
Selection
where
These operators apply to any instance with .
Projection
where and
(Natural) Join
where and have sorts and , respectively.
- When , .
- When , .
Renaming
An attribute renaming for a finite set of attributes is a one-one mapping from to .
where
- input is an instance over
- is an attribute renaming for
- which can be described by , where
- usually written as to indicate that for each
- which can be described by , where
Formal Inductive Definition
The base SPJR algebra queries are:
- Input Relation: Expression ; with sort equal to
- Unary Singleton Constant: Expression , where ; with
The remainder of the syntax and semantics of the SPJR algebra is defined in analogy to those of the SPC algebra.
Normal Form
where
- ;
- ;
- occurs in ;
- the s are distinct;
- are relation names (repeats permitted);
- is a renaming operator for ;
- no occur in any ;
- the sorts of are pairwise disjoint;
- is a positive conjunctive selection formula.
For each SPJR query, there is an equivalent SPJR query in normal form.
Equivalence Theorem
- Lemma: The SPC and SPJR algebras are equivalent.
- Equivalence Theorem: The rule-based conjunctive queries, tableau queries, conjunctive calculus queries, satisifiable SPC algebra, and satisifiable SPJR algebra are all equivalent.
Adding Union
The following have equivalent expressive power:
- the nonrecursive datalog programs (with single relation target)
- the SPCU queries
- the SPJRU queries
- union can only be applied to expressions having the same sort
Nonrecursive Datalog Program
A nonrecursive datalog program (nr-datalog program) over schema is a set of rules:
where
- no relation name in occurs in a rule head
- the same relation name may appear in more than one rule head
- there is some ordering of the rules such that the relation name in the head of does not occur in the body of any rule with
Union and the Conjunctive Calculus
Simply permitting disjunction (denoted ) in the formula of a conjunctive calculus along with conjunction can have serious consequences.
E.g.,
The answer of on nonempty instance will be
Because decides that the first two components in are in and can be anything in , and decides that the last two components in are in and can be anything in .