Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Hongxu Xu

Hongxu Xu

PhD Student (May 2025 – Present)
Supervised by Prof. Chengnian Sun
Cheriton School of Computer Science
University of Waterloo, Canada

B.Sc. in Computer Science (Sep 2014 – Jun 2018)
Beijing Normal University, China

Research Interests

  • Software Engineering
  • Programming Languages

without a focus yet…

Socials

If you’d like to hear more about my thoughts and life, consider following me on:

WeChat MP
  • Substack: https://hongxuxu.substack.com/ (English only)
    Subscribe to my newsletter on Substack to get my latest updates.

  • 微信公众号: xuhongxu_it (Chinese only)
    Scan the QR code on the right to follow my WeChat Official Account.
    扫描右侧二维码 关注我的微信公众号。

Misc.

  • Chinese Name: 许宏旭
    • Literally means “宏: Great 旭: Rising Sun”
    • Pronunciation: Hong-shyoo (IPA: [xʊŋ ɕy])

Publications

Papers

Nothing yet…

Books

Experience

PhD Student (May 2025 – Present)

Supervised by Prof. Chengnian Sun

Cheriton School of Computer Science
University of Waterloo, Canada

Senior SDE (Sep 2022 – Mar 2025)

Seed/Data Speech Team
ByteDance, Shanghai, China

Led the development of the Text-to-Speech engine and contributed to the Doubao AI assistant application.

SDE-2 (Aug 2021 – Sep 2022)

MSAI Team
Microsoft STC-Asia, Suzhou, China

Led the development of Microsoft WordBreaker and initiated a modern NLP toolkit for Office 365.

SDE (Jul 2018 – Aug 2021)

MSAI Team
Microsoft STC-Asia
Beijing, China (Relocated to Suzhou, Jiangsu, China in May 2019)

Worked on 20-year-old Microsoft WordBreaker.

Short-Term Contributor (Nov 2020 – Jan 2021)

Windows APS Team (temporary assignment)
Microsoft STC-Asia, Suzhou, China

Contributed to the formation of the new team, and Windows 11 application development.

B.Sc. in Computer Science (Sep 2014 – Jun 2018)

Beijing Normal University, China

SDE Intern (Jul 2017 – Dec 2017)

Bing Search Relevance Team
Microsoft STC-Asia, Beijing, China

Answer triggering model development for Bing Search.

Awards

  • Outstanding Graduate, Beijing Normal University, 2018
  • Top Ten Volunteer, Beijing Normal University, 2015
  • First Prize, National Olympiad in Informatics in Provinces (NOIP), 2013

Teaching

Note

IA : Instructional Apprentice
TA : Teaching Assistant

What’s the difference between TAs and IAs?

My Thoughts and Life

Note

I share technical content on this website, including course notes, projects, and research.

For more personal thoughts and life updates, please check out my Substack newsletter and WeChat Official Account below.

Tip

Follow me on Substack and WeChat!

WeChat MP
  • Substack: https://hongxuxu.substack.com/ (English only)
    Subscribe to my newsletter on Substack to get my latest updates.

  • 微信公众号: xuhongxu_it (Chinese only)
    Scan the QR code on the right to follow my WeChat Official Account.
    扫描右侧二维码 关注我的微信公众号。

Selected Substack Posts

My 2025

My 2025 by Hongxu Xu

Read on Substack

G1 Pain Points

G1 Pain Points by Hongxu Xu

Ontario Drive Test

Read on Substack

CS 452/652: Real-Time Programming (the Train Course)

Spring 25

I dropped this course after I learned that it is really time-consuming.

However, before I dropped it, I did some research and found some useful resources, which I summarized in MarklinSim All-in-One repo.

In short, follow the readme in the repo to set up:

  1. MarklinSim (an emulator for the train kit)
  2. QEMU (an emulator for the Raspberry Pi 4B)
  3. Example Image (your real-time system).

I believe it will save you a lot of time. Good luck!

CS 480/680: Introduction to Machine Learning

Spring 25

This is not a hard course.

CS 848: Advanced Topics in Database: Algorithmic Aspects of Database Query Processing

Fall 25

https://cs.uwaterloo.ca/~xiaohu/courses/CS848/CS848-F25.html

My Notes

My course on database theory, specifically regarding Query Processing algorithms, is coming to an end soon. It has been very rewarding, so I’m writing this down to record what I’ve learned.

I originally chose this course as a challenge because I couldn’t understand a single paper mentioned in the syllabus—I didn’t even know what problems they were trying to solve. After a semester of struggling to grasp just the surface, I discovered that this is actually quite an interesting—albeit still very difficult—field.

I am writing this note to document the relatively simple and fundamental parts. I also want to organize the logical thread—what is the problem, how is it solved, what new problems arise from that solution, and how are those solved? I hope this will help me pick it up quickly if I need it in the future, and also serve as a simple introduction for readers.

Note

These notes were originally written in Chinese, and translated by Gemini.
Therefore, some expressions may seem a bit unnatural in English. I will gradually refine them over time.

From Pairwise Join to Global Perspective

Motivating Example: Different Orders of Pairwise JOINs

Table 1: Student_Teacher

StudentTeacher
S0T0
S0T1
S1T2
S2T2

Table 2: Teacher_Lab

TeacherLab
T0L0
T0L1
T1L0
T1L1
T2L2
T3L3

Table 3: Lab_Device

LabDevice
L2GPU0
L3GPU0
L3GPU1

Assume we have the 3 tables above in a database. We allow the same student to have multiple teachers, and the same teacher to guide multiple students, etc. Now, we want to see which devices each student can access via the laboratories their teachers belong to.

Any reader who has used SQL should know how to query this (the SQL below is simplified for demonstration and is not standard syntax):

SELECT * FROM Student_Teacher
    INNER JOIN Teacher_Lab ON Teacher
    INNER JOIN Lab_Device ON Lab

So, how do we obtain the execution results of this SQL statement in the database?

Since this involves three data tables, it is easy to see that we can perform Pairwise Joins. There are several options for the order:

Order 1 (Join Table 1 and Table 2 first)

Intermediate_Result = SELECT * FROM Student_Teacher INNER JOIN Teacher_Lab ON Teacher
Final_Result = SELECT * FROM Intermediate_Result INNER JOIN Lab_Device ON Lab

Order 2 (Join Table 2 and Table 3 first)

Intermediate_Result = SELECT * FROM Teacher_Lab INNER JOIN Lab_Device ON Lab
Final_Result = SELECT * FROM Student_Teacher INNER JOIN Intermediate_Result ON Teacher

Order 3 (Join Table 1 and Table 3 first)

Intermediate_Result = SELECT * FROM Student_Teacher CROSS JOIN Lab_Device
Final_Result = SELECT * FROM Intermediate_Result INNER JOIN Teacher_Lab ON Teacher, Lab

graph BT
    subgraph "Order 3"
        direction BT
        T3_A[("Student_Teacher")]:::table
        T3_C[("Lab_Device")]:::table
        T3_B[("Teacher_Lab")]:::table

        J3_1{{"CROSS JOIN (Cartesian)"}}:::op
        Res3["_Final Result_<br>JOIN ON Teacher & Lab"]:::result

        %% Cross join first
        T3_A --> J3_1
        T3_C --> J3_1
        
        %% Then join with the middle table
        J3_1 --> Res3
        T3_B --> Res3
    end

    subgraph "Order 2"
        direction BT
        T2_A[("Student_Teacher")]:::table
        T2_B[("Teacher_Lab")]:::table
        T2_C[("Lab_Device")]:::table

        J2_1{{"JOIN ON Lab"}}:::op
        Res2["_Final Result_<br>JOIN ON Teacher"]:::result

        T2_B --> J2_1
        T2_C --> J2_1
        
        T2_A --> Res2
        J2_1 --> Res2
    end
    
    subgraph "Order 1"
        direction BT
        T1_A[("Student_Teacher")]:::table
        T1_B[("Teacher_Lab")]:::table
        T1_C[("Lab_Device")]:::table
        
        J1_1{{"JOIN ON Teacher"}}:::op
        Res1["_Final Result_<br>JOIN ON Lab"]:::result

        T1_A --> J1_1
        T1_B --> J1_1
        
        J1_1 --> Res1
        T1_C --> Res1
        
        %% Labeling the final join edge for clarity
        linkStyle 2 stroke-width:2px,fill:none,stroke:green;
        linkStyle 3 stroke-width:2px,fill:none,stroke:green;
    end

Which one is better? Order 3 involves a CROSS JOIN (Cartesian product), so it can basically be ruled out immediately.

For any execution order, the input data (database/tables) is the same, and the output final result must also be the same. The only place where differences can occur is in the intermediate results. Let’s simulate the execution process.

The intermediate result for Order 1 is as follows, with a total of 6 rows:

StudentTeacherLab
S0T0L0
S0T1L1
S0T0L0
S0T1L1
S1T2L2
S2T2L2

The intermediate result for Order 2 is as follows, with a total of 3 rows:

TeacherLabDevice
T2L2GPU0
T3L3GPU0
T3L3GPU1

Regardless of the order, the final result is as follows, with a total of 2 rows:

StudentTeacherLabDevice
S1T2L2GPU0
S2T2L2GPU0

It is evident that Order 2 is superior because it outputs fewer intermediate results. These intermediate results not only take up memory but also participate in subsequent JOIN calculations, so obviously, the fewer the better.

This is exactly what modern Database Management Systems (DBMS) usually do: use Pairwise JOINs while selecting a potentially optimal JOIN order based on statistics and rules.

A More Intuitive View

We can represent this database as the path graph below, where each edge represents a row of data in a table:

graph LR
    %% Define styling to match the handwritten dot style
    classDef dot stroke-width:2px;

    %% Layer definitions for alignment
    subgraph Students
        S0(( S0 ))
        S1(( S1 ))
        S2(( S2 ))
    end

    subgraph Teachers
        T0(( T0 ))
        T1(( T1 ))
        T2(( T2 ))
        T3(( T3 ))
    end

    subgraph Labs
        L0(( L0 ))
        L1(( L1 ))
        L2(( L2 ))
        L3(( L3 ))
    end

    subgraph Devices
        GPU0(( GPU0 ))
        GPU1(( GPU1 ))
    end

    S0 --- T0
    S0 --- T1
    S1 --- T2
    S2 --- T2
    
    T0 --- L0
    T0 --- L1
    T1 --- L0
    T1 --- L1
    T2 --- L2
    T3 --- L3
    
    L2 --- GPU0
    L3 --- GPU0
    L3 --- GPU1

    %% Apply styling
    class S0,S1,S2,T0,T1,T2,T3,L0,L1,L2,L3,GPU0,GPU1 dot;

Consider Order 1, which looks for paths that can traverse all fields completely from Left to Right:

graph LR
    %% Define styling to match the handwritten dot style
    classDef dot stroke-width:2px;
    classDef thick-line stroke-width:10px;

    %% Layer definitions for alignment
    subgraph Students
        S0(( S0 ))
        S1(( S1 ))
        S2(( S2 ))
    end

    subgraph Teachers
        T0(( T0 ))
        T1(( T1 ))
        T2(( T2 ))
        T3(( T3 ))
    end

    subgraph Labs
        L0(( L0 ))
        L1(( L1 ))
        L2(( L2 ))
        L3(( L3 ))
    end

    subgraph Devices
        GPU0(( GPU0 ))
        GPU1(( GPU1 ))
    end

    S0 === T0
    S0 === T1
    S1 === T2
    S2 === T2
    
    T0 === L0
    T0 === L1
    T1 === L0
    T1 === L1
    T2 === L2
    T3 --- L3
    
    L2 === GPU0
    L3 --- GPU0
    L3 --- GPU1

    %% Apply styling
    class S0,S1,S2,T0,T1,T2,T3,L0,L1,L2,L3,GPU0,GPU1 dot;

The intermediate result of Order 1 contains three fields: “Student”, “Teacher”, and “Lab”. There are a total of 6 paths that can go from “Student” to “Lab” (Left to Right), which is exactly the number of rows in the intermediate result.

Similarly, consider Order 2, looking for paths from Right to Left:

graph RL
    %% Define styling to match the handwritten dot style
    classDef dot stroke-width:2px;
    classDef thick-line stroke-width:10px;

    %% Layer definitions for alignment
    subgraph Students
        S0(( S0 ))
        S1(( S1 ))
        S2(( S2 ))
    end

    subgraph Teachers
        T0(( T0 ))
        T1(( T1 ))
        T2(( T2 ))
        T3(( T3 ))
    end

    subgraph Labs
        L0(( L0 ))
        L1(( L1 ))
        L2(( L2 ))
        L3(( L3 ))
    end

    subgraph Devices
        GPU0(( GPU0 ))
        GPU1(( GPU1 ))
    end
    
    T0 --- S0
    T1 --- S0
    T2 === S1
    T2 === S2

    L0 --- T0
    L1 --- T0
    L0 --- T1
    L1 --- T1
    L2 === T2
    L3 === T3
    
    GPU0 === L2
    GPU0 === L3
    GPU1 === L3

    %% Apply styling
    class S0,S1,S2,T0,T1,T2,T3,L0,L1,L2,L3,GPU0,GPU1 dot;

The number of rows in the intermediate result for Order 2 is also equal to the number of paths from the “Device” field to the “Teacher” field.

In fact, whether choosing Left-to-Right or Right-to-Left, both orders produce some waste in the intermediate results:

  • When going Left-to-Right, reaching L0 and L1 is a dead end (cannot reach a Device).
  • When going Right-to-Left, reaching T3 is a dead end (cannot reach a Student).

Taking it to the Extreme

graph LR
    %% Define styling to match the handwritten dot style
    classDef dot stroke-width:2px;
    classDef thick-line stroke-width:4px;

    %% Layer definitions for alignment
    subgraph Students
        S0(( S0 ))
        S1(( S1 ))
        S2(( S2 ))
        S3(( S3 ))
        S4(( S4 ))
    end

    subgraph Teachers
        T0(( T0 ))
        T1(( T1 ))
        T2(( T2 ))
        T3(( T3 ))
    end

    subgraph Labs
        L0(( L0 ))
        L1(( L1 ))
        L2(( L2 ))
        L3(( L3 ))
    end

    subgraph Devices
        GPU0(( GPU0 ))
        GPU1(( GPU1 ))
        GPU2(( GPU2 ))
        GPU3(( GPU3 ))
        GPU4(( GPU4 ))
    end

    S0 --- T0
    S1 --- T0
    S2 --- T0
    S3 --- T0

    T0 --- L0
    T0 --- L1
    
    T1 --- L2
    T2 --- L2
    
    L2 --- GPU0
    L2 --- GPU1
    L2 --- GPU2
    L2 --- GPU3
    
    S4 --- T3 --- L3 --- GPU4

    %% Apply styling
    class S0,S1,S2,S3,S4,T0,T1,T2,T3,L0,L1,L2,L3,GPU0,GPU1,GPU2,GPU3,GPU4 dot;

The example shown above is very extreme: almost all students (S0-S3) chose the same teacher (T0), and almost all devices (GPU0-GPU3) are concentrated in the same lab (L2). However, the most popular teacher (T0) belongs to labs (L0/L1) that don’t have a single machine

Assuming this is reality, it is easy to see that after performing the same JOIN query on this database instance, the result should contain only one row:

StudentTeacherLabDevice
S4T3L3GPU4

However, if we adopt the “Pairwise JOIN” scheme, regardless of which JOIN order is chosen, the intermediate results will have 9 rows (interactions between T0’s students and L2’s devices or similar paths). When the data volume is larger and similar Data Skew exists, the waste of computing resources can be immense.

What should we do?

Global Perspective

Let’s continue to observe this extreme example. What causes the waste in intermediate results are those dead-end paths. So, is it possible to filter out these dead ends before generating intermediate results?

Yes, but the Pairwise JOIN method obviously won’t work; we need a Global Perspective.

The cause of data expansion is that data rows joining on the same field value in two tables create a Cartesian product. Therefore, we need to avoid performing a JOIN (or strictly speaking, avoid performing an INNER JOIN) before completing the filtering.

Why speak strictly? Because we actually still have to JOIN, but it is a special kind of JOIN: SEMI JOIN. It is perfectly suited for this filtering task.

First Pass: Left to Right

To filter out rows in the right table that cannot JOIN.

Teacher_Lab_1 = SELECT * FROM Teacher_Lab WHERE EXISTS(
    SELECT * FROM Student_Teacher WHERE (Student_Teacher.Teacher = Teacher_Lab.Teacher)
)

Lab_Device_1 = SELECT * FROM Lab_Device WHERE EXISTS(
    SELECT * FROM Teacher_Lab_1 WHERE (Teacher_Lab_1.Lab = Lab_Device.Lab)
)

graph LR
    %% Define styling to match the handwritten dot style
    classDef dot stroke-width:2px;
    classDef dotrm fill:#aaa,stroke:#666,stroke-width:2px;

    %% Layer definitions for alignment
    subgraph Students
        S0(( S0 ))
        S1(( S1 ))
        S2(( S2 ))
        S3(( S3 ))
        S4(( S4 ))
    end

    subgraph Teachers
        T0(( T0 ))
        T1(( T1 ))
        T2(( T2 ))
        T3(( T3 ))
    end

    subgraph Labs
        L0(( L0 ))
        L1(( L1 ))
        L2(( L2 ))
        L3(( L3 ))
    end

    subgraph Devices
        GPU0(( GPU0 ))
        GPU1(( GPU1 ))
        GPU2(( GPU2 ))
        GPU3(( GPU3 ))
        GPU4(( GPU4 ))
    end

    S0 --- T0
    S1 --- T0
    S2 --- T0
    S3 --- T0

    T0 --- L0
    T0 --- L1
    
    T1 ~~~ L2
    T2 ~~~ L2
    
    L2 ~~~ GPU0
    L2 ~~~ GPU1
    L2 ~~~ GPU2
    L2 ~~~ GPU3
    
    S4 --- T3 --- L3 --- GPU4

    %% Apply styling
    class S0,S1,S2,S3,S4,T0,T3,L0,L1,L3,GPU4 dot;
    class T1,T2,L2,GPU0,GPU1,GPU2,GPU3 dotrm;

We guarantee that the filtered data in the right-side table can definitely JOIN with all data tables to its left (including non-adjacent ones). Therefore, after completing the first pass of filtering, the data remaining in the right-most table is guaranteed to participate in the final result.

Second Pass: Right to Left

To filter out rows in the left table that cannot JOIN.

Teacher_Lab_2 = SELECT * FROM Teacher_Lab_1 WHERE EXISTS(
    SELECT * FROM Lab_Device_1 WHERE (Teacher_Lab_1.Lab = Lab_Device_1.Lab)
)

Student_Teacher_1 = SELECT * FROM Student_Teacher WHERE EXISTS(
    SELECT * FROM Teacher_Lab_2 WHERE (Student_Teacher.Teacher = Teacher_Lab_2.Teacher)
)

graph LR
    %% Define styling to match the handwritten dot style
    classDef dot stroke-width:2px;
    classDef dotrm fill:#aaa,stroke:#666,stroke-width:2px;

    %% Layer definitions for alignment
    subgraph Students
        S0(( S0 ))
        S1(( S1 ))
        S2(( S2 ))
        S3(( S3 ))
        S4(( S4 ))
    end

    subgraph Teachers
        T0(( T0 ))
        T1(( T1 ))
        T2(( T2 ))
        T3(( T3 ))
    end

    subgraph Labs
        L0(( L0 ))
        L1(( L1 ))
        L2(( L2 ))
        L3(( L3 ))
    end

    subgraph Devices
        GPU0(( GPU0 ))
        GPU1(( GPU1 ))
        GPU2(( GPU2 ))
        GPU3(( GPU3 ))
        GPU4(( GPU4 ))
    end

    S0 ~~~ T0
    S1 ~~~ T0
    S2 ~~~ T0
    S3 ~~~ T0

    T0 ~~~ L0
    T0 ~~~ L1
    
    T1 ~~~ L2
    T2 ~~~ L2
    
    L2 ~~~ GPU0
    L2 ~~~ GPU1
    L2 ~~~ GPU2
    L2 ~~~ GPU3
    
    S4 --- T3 --- L3 --- GPU4

    %% Apply styling
    class S4,T3,L3,GPU4 dot;
    class S0,S1,S2,S3,T0,L0,L1 dotrm;
    class T1,T2,L2,GPU0,GPU1,GPU2,GPU3 dotrm;

We guarantee that the filtered data in the left-side table can definitely JOIN with all data tables to its right (including non-adjacent ones). Since the first pass guaranteed the right-most data participates in the final result, the second pass ensures that the remaining data in ALL tables will participate in the final result.

Third Pass: Left to Right

To complete the JOIN.

At this point, all paths that do not participate in the final JOIN result have been cleared away. We can now safely perform Pairwise JOINs.

Summary: Yannakakis Algorithm (Simplified)

What is introduced here is the Yannakakis Algorithm specialized for linear JOIN queries. Compared to Pairwise JOIN, its advantage lies in being sensitive to the result size.

The complexity of Pairwise JOIN is independent of the JOIN result size. Therefore, even if the final result is actually very small, it may produce a massive amount of intermediate results. For the database instance mentioned above, the time complexity of Pairwise JOIN is , whereas the time complexity of the Yannakakis algorithm shown here is , where is the data size and is the size of the JOIN result.

The worst-case time complexity for both is consistent. Assuming there are data tables with no common attributes, equivalent to doing a full Cartesian product: Pairwise JOIN time complexity is , and Yannakakis time complexity is , where .

JOIN algorithms with complexity are also known as Output-Optimal algorithms. Regardless of the JOIN algorithm, one must read all the data () and output all the results (), so is already optimal for a specific output size.

Graphical Representation of Joins

In the previous note, we discussed the specialized version of the Yannakakis algorithm for linear JOIN queries. However, queries in practice go far beyond just linear structures. So, how do we generalize the Yannakakis algorithm to various types of JOIN queries?

Graphical Representation

Recap: Path Graph of a Database Instance

Previously, we showed the path graph of a database instance. This represents the specific data content of a database instance (where every edge represents a row of data in a table):

graph LR
    %% Define styling to match the handwritten dot style
    classDef dot stroke-width:2px;

    %% Layer definitions for alignment
    subgraph Students
        S0(( S0 ))
        S1(( S1 ))
        S2(( S2 ))
    end

    subgraph Teachers
        T0(( T0 ))
        T1(( T1 ))
        T2(( T2 ))
        T3(( T3 ))
    end

    subgraph Labs
        L0(( L0 ))
        L1(( L1 ))
        L2(( L2 ))
        L3(( L3 ))
    end

    subgraph Devices
        GPU0(( GPU0 ))
        GPU1(( GPU1 ))
    end

    S0 --- T0
    S0 --- T1
    S1 --- T2
    S2 --- T2
    
    T0 --- L0
    T0 --- L1
    T1 --- L0
    T1 --- L1
    T2 --- L2
    T3 --- L3
    
    L2 --- GPU0
    L3 --- GPU0
    L3 --- GPU1

    %% Apply styling
    class S0,S1,S2,T0,T1,T2,T3,L0,L1,L2,L3,GPU0,GPU1 dot;

But now, we want to represent the structure of the JOIN query itself graphically, without involving specific database instances—in other words, in a data-independent way.

Graphical Representation of Query Plans

Now, disregarding data content, the path graph above transforms into paths between data tables. Considering the Left-to-Right and Right-to-Left orders, we can draw the following two charts:

flowchart TD
    subgraph Order 2
    direction BT
    A2["Student_Teacher (Student, Teacher)"]
    B2["Teacher_Lab (Teacher, Lab)"]
    C2["Lab_Device (Lab, Device)"]
    C2 --> B2 --> A2
    end
    subgraph Order 1
    direction BT
    A["Student_Teacher (Student, Teacher)"]
    B["Teacher_Lab (Teacher, Lab)"]
    C["Lab_Device (Lab, Device)"]
    A --> B --> C
    end

In a path graph, paths can only be linked if adjacent tables share common attributes. We can mark these common attributes on the edges of the graph. Additionally, if we view this as an undirected graph, we don’t need to distinguish between the two orders.

flowchart BT
    A["Student_Teacher (Student, Teacher)"]
    B["Teacher_Lab (Teacher, Lab)"]
    C["Lab_Device (Lab, Device)"]
    A--Teacher---B 
    B--Lab---C

Now, let’s consider a slightly more complex example.

Graphical Representation of Star JOIN Queries

Star JOIN query structures are very common in practice, primarily existing in scenarios using foreign keys. Assume we have the following data tables:

  1. Orders (Order_ID, Warehouse_ID, Product_ID, Buyer_ID)
  2. Warehouses (Warehouse_ID, Warehouse_Address)
  3. Products (Product_ID, Product_Name)
  4. Buyers (Buyer_ID, Buyer_Name, Buyer_Address)

We want to retrieve all order details:

SELECT * FROM Orders
    JOIN Warehouses ON Warehouse_ID
    JOIN Products ON Product_ID
    JOIN Buyers ON Buyer_ID

The graphical representation of this query is as follows:

flowchart TB
    B[Warehouses]
    A[Orders]
    C[Products]
    D[Buyers]
    
    A --Warehouse_ID --- B
    A --Product_ID --- C
    D --Buyer_ID --- A

At first glance, this graph might not seem “star-shaped”; it looks more like a tree. However, if the Orders table had more foreign keys, I believe the reader would understand why this category of queries is called a Star Query.

Additionally, calling it a tree is also correct; a star structure is essentially a tree. In fact, the graphical representation of a JOIN query is called a Join Tree.

Hypergraphs and Join Trees

We have gained a superficial understanding of Join Trees, but we have not yet provided a formal definition because we are missing a tool—the Hypergraph. Hypergraphs can also be used to visually display JOIN queries; the difference is that Hypergraphs merely represent the structure of the JOIN query, while Join Trees imply information about the query execution plan.

Hypergraph

A Hypergraph consists of a finite number of vertices and a set of hyperedges, where represents the finite set of vertices and represents the set of hyperedges.

Unlike edges in a standard Graph which can only connect two endpoints, a Hyperedge can connect multiple vertices simultaneously, represented as a subset of vertices in the hypergraph, i.e., .

Hypergraphs map very naturally to JOIN queries:

  • Each vertex represents a column (attribute) in the data tables.
  • Each hyperedge contains multiple vertices, thus representing a data table.

Using the Star Query mentioned earlier as an example:

hypergraph

Note

In this hypergraph, a “circle” is actually a hyperedge, corresponding to a data table (the table name is connected by a dashed line); each word inside the circle is a vertex, corresponding to a column.

It took me some time to get used to the graphical representation of hypergraphs—after all, the “edges” in the graphs we usually encounter are represented by thin lines, unlike these “hyperedges” which actually have area.

Join Tree

The formal definition of a Join Tree corresponds to that of a hypergraph.

Given a hypergraph , its corresponding Join Tree is a tree that satisfies the following two conditions:

  1. The nodes of tree correspond one-to-one with the hyperedges () of hypergraph .
  2. For any hypergraph vertex , the set of tree nodes in corresponding to the hyperedges containing forms a connected subgraph. (This is often called the Running Intersection Property).

Let’s stick with the previous example. The hypergraph has 4 hyperedges, corresponding to 4 data tables: Buyers, Orders, Warehouses, Products.

Its corresponding Join Tree is as follows. We can see that the Join Tree has 4 nodes, corresponding one-to-one with the 4 hyperedges, thus satisfying the first condition.

flowchart TB
    B["Warehouses(Warehouse_ID, Address)"]
    A["Orders(Order_ID, Warehouse_ID, Product_ID, Buyer_ID)"]
    C["Products(Product_ID, Name)"]
    D["Buyers(Buyer_ID, Name, Address)"]
    
    A --Warehouse_ID --- B
    A --Product_ID --- C
    D --Buyer_ID --- A

The phrasing of the second condition is slightly more complex, but understanding it is quite easy:

  • “Hyperedges containing a common vertex simply means data tables that share a common field (e.g., “Buyers” and “Orders” share the “Buyer_ID” field).
  • “Corresponding nodes in tree refers to finding the tree nodes corresponding to these data tables in the Join Tree (since condition 1 requires a one-to-one mapping).
  • “Forms a connected subgraph”: This is a basic graph theory concept. For a tree (an undirected graph), if there is a path between two points, they are connected (the path doesn’t have to be a direct edge; it can pass through multiple edges).

In the Join Tree above, we can see that the two nodes “Buyers” and “Orders”, which share the common field “Buyer_ID”, are indeed connected. Iterating through every field can prove that this Join Tree satisfies the second condition.

Yannakakis Algorithm (Full Version)

With the Join Tree, we can now complete the Yannakakis algorithm introduced earlier on this tree. Unlike before, where we only performed JOINs on a linear sequence of tables, we can now perform JOINs on a tree, thereby supporting more types of queries.

Let’s Recap

The simplified algorithm flow mentioned in the previous note was:

  1. First Pass: Left to Right, filter out rows in the right-side table that cannot JOIN.
    • After the first pass, data remaining in the right-most table is guaranteed to participate in the final result.
  2. Second Pass: Right to Left, filter out rows in the left-side table that cannot JOIN.
    • The second pass guarantees that the remaining data in ALL tables will participate in the final result.
  3. Third Pass: Left to Right, complete the JOIN.

Generalizing to Join Trees

The characteristics of the two-pass filtering can easily be generalized to a tree structure.

  1. First Pass: Bottom-Up (from leaf nodes up to the root), filtering out rows in parent nodes that cannot JOIN with any child node.
    • After the first pass, the data remaining in the root node is guaranteed to participate in the final result.
  2. Second Pass: Top-Down, filtering out rows in child nodes that cannot JOIN with the parent node.
    • The second pass guarantees that the remaining data in ALL tables will participate in the final result.
  3. Third Pass: Bottom-Up, complete the JOIN.

Pseudo-code

Input: Join Tree , Set of Relations , Root Node . Output: Result of the JOIN .

  1. Preprocessing: Determine Parent-Child Relationships
    For each node in :
    the parent node of node in tree ;
  2. Bottom-Up Phase
    Visit node of in Bottom-Up order (excluding root ):

    // Filter tuples in parent relation that do not match using child relation
  3. Top-Down Phase
    Visit node of in Top-Down order (excluding root ):

    // Filter tuples in child relation that do not match using parent relation
  4. Compute and Return JOIN Result
    At this point, all relations have reached “Global Consistency” (every tuple is guaranteed to appear in the final JOIN result):
    Return the natural join of all relations:

New Problem: Does Every Query Correspond to a Join Tree?

Suppose we have three data tables: .

Consider the following query:

SELECT * FROM R, S, T WHERE R.a = T.a AND R.b = S.b AND S.c = T.c

Can you find the Join Tree that corresponds to it?

Notes for Foundations of Databases

Foundations of Databases (The Alice Book)

The Alice Book is helpful for understanding the basic notations in database theory. I recommend to read the first two chapters to get familiar with the notations used in this course.

That said, the book is not required. In fact, paying real attention to the first several lectures should be sufficient to survive the course, and it turns out to be interesting.

In this section, I summarize some key concepts from the first two chapters of the Alice Book.

The Relational Model

Informal Terminology

  • Relation: table
    • Relation Name: name of the table
  • Attribute: column name
  • Tuple/Record: line in the table
  • Domain: set of constants that entries of tuples can take
  • Database Schema: specifying the structure of the database
  • Database Instance: specifying the actual content in the database

Formal Definitions

  • : a countably infinite set of attributes
    • assume is fixed
  • : a total order on
    • the elements of a set of attributes are written according to unless otherwise specified
  • : a countably infinite set of domains (disjoint from )
    • assume is fixed
    • a constant is an element of
    • for most cases, the same domain of values is used for all attributes;
      • otherwise, assume a mapping on , where is a set called the domain of attribute
  • : a countably infinite set of relation names (disjoint from and )
  • : a function from to (the finitary powerset of )
    • is inifinite for each (possibly empty) finite set of attributes
      • which allows multiple relations to have the same set of attributes
    • is called the of a relation name
    • : the of relation name
  • : a relation name or a relation schema
    • Alternative notations:
      • : indicating
      • : indicating
  • : a database schema
    • a nonempty finite set of relation names
    • might be written as

Examples

Table Movies

TitleDirectorActor
AXY

Table Location

TheaterAddressPhone
T1Addr1P1

Table Pariscope

TheaterTitleSchedule
T1A9:00
  • The database schema
  • The sorts of the relation names

Named v.s. Unnamed Attributes/Tuples

Named Perspective

E.g.

a tuple over a finite set of attributes (or over a relation schema ) is a total mapping (viewed as a function) from to .

  • : is a tuple over
  • : the value of on an attribute in ()
  • : the restriction of to a subset , i.e., denotes the tuple over such that for all

Unnamed Perspective

E.g.

a tuple is an ordered n-tuple () of constants (i.e., an element of the Cartesian product )

  • : the -th coordinate of

Correspondence

Because of the total order , a tuple (defined as a function) can be viewed as an ordered tuple with as a first component and as a second component. Ignoring the names, this tuple can be viewed as the ordered tuple .

Conversely, the ordered tuple can be viewed as a function over the set of integers with for each .

Conventional v.s. Logic Programming Relational Model

Conventional Perspective

a relation (instance) over a relation schema is a finite set of tuples with sort .

a database instance of database schema is a mapping with domain , such that is a relation over for each .

Logic Programming Perspective

This perspective is used primarily with the ordered-tuple perspective on tuples.

A fact over is an expression of the form , where for .
If , we sometimes write for .

a relation (instance) over a relation schema is a finite set of facts over .

a database instance of database schema is a finite set that is the union of relation instances over for all .

Examples

Assume .

Named and Conventional

Unnamed and Conventional

Named and Logic Programming

Unnamed and Logic Programming

Variables

an infinite set of variables will be used to range over elements of .

  • a free tuple over or is a function from to
  • an atom over is an expression of the form , where and each term
  • a ground atom is an atom with no variables, i.e., a fact (see Logic Programming Perspective)

Notations

ObjectNotation
Constants
Variables
Sets of Variables
Terms
Attributes
Sets of Attributes
Relation Names
Database Schemas
Tuples
Free Tuples
Facts
Atoms
Relation Instances
Database Instances

Conjunctive Queries

a query (mapping) is from (or over) its input schema and to its output schema.

  • query: a syntactic object
  • query mapping: a function defined by a query interpreted under a specified semantics
    • its domain: the family of all instances of an input schema
    • its range: the family of instances of an output schema
    • we often blur query and query mapping when the meaning is clear from context
  • input schema: a specified relation or database schema
  • output schema: a relation schema or database schema
    • For a relation schema, the relation name may be specified as part of the query syntax or by the context
  • denotes two queries and over are equivalent, i.e., they have the same output schema and for each instance over .

Logic-Based Perspective

Three versions of conjunctive queries:

  1. Rule-Based Conjunctive Queries
  2. Tableau Queries
  3. Conjunctive Calculus

Rule-Based Conjunctive Queries

Definition

A rule-based conjunctive query (or often more simply called rules) over a relation schema is an expression of the form where , are relation names in ; is a relation name not in ; and are free tuples (i.e., may use either variables or constants).

  • : the set of variables occurring in .
  • body:
  • head:
  • range restricted: each variable occurring in the head also occurs in the body
    • all conjunctive queries considered here are range restricted
  • valuation: a valuation over , a finite subset of , is a total function from to (of constants)
    • extended to be identity on (so that the domain can contain both variables and constants)
    • extended to map free tuples to tuples (so that the domain/range can be in the tuple form)

Semantics

Let be the query, and let be a database instance of schema . The image of under is:

  • active domain
    • of a database instance , denoted , is the set of constants that occur in
    • of a relation instance , denoted , is the set of constants that occur in
    • of a query , denoted , is the set of constants that occur in
    • is an abbreviation for
  • extensional relations: relations in the body of the query, i.e.,
    • because they are known/provided by the input instance
  • intensional relation: the relation in the head of the query, i.e.,
    • because it is not stored and its value is computed on request by the query
  • extensional database (edb): a database instance associated with the extensional relations
  • intensional database (idb): the rule itself
  • idb relation: the relation defined by the idb

Properties

Conjunctive queries are:

  • monotonic: a query over is monotonic if for each over ,
  • satisfiable: a query is satisfiable if there exists a database instance such that

Tableau Queries

A tableau query is simply a pair where is a tableau and each variable in also occurs in .

This is closest to the visual form provided by Query-By-Example (QBE).

  • summary: the free tuple representing the tuples included in the answer to the query
  • embedding: a valuation for the variables occurring in such that
    • the output of on consists of all tuples for each embedding of into
  • typed: a tableau query under the named perspective, where is over relation schema and , is typed if no variable of is associated with two distinct attributes in

Examples

Table Movies

TitleDirectorActor
“Bergman”

Table Location

TheaterAddressPhone

Table Pariscope

TheaterTitleSchedule

The above tableau query is typed because each variable is associated with only one attribute:

  • :
  • :
  • :
  • :
  • :
  • :

However, the following tableau query is untyped:

Table Movies

TitleDirectorActor

Because is associated with both and .

Conjunctive Calculus

The conjunctive query

can be expressed as the following conjunctive calculus query that has the same semantics:

where are all the variables occurring in the body and not the head.

Conjunctive Calculus Formula

Let be a relation schema. A (well-formed) formula over for the conjunctive calculus is an expression having one of the following forms:

  1. an atom over ;
  2. , where and are formulas over ; or
  3. , where is a variable and is a formula over .

An occurrence of a variable in formula is free if:

  1. is an atom; or
  2. and the occurrence of is free in or ; or
  3. , , and the occurrence of is free in .

: the set of free variables in .

An occurrence of a variable that is not free is bound.

Conjunctive Calculus Query

A conjunctive calculus query over database schema is an expression of the form

where

  • is a conjunctive calculus formula,
  • is a free tuple, and
  • the set of variables occurring in is exactly .

For named perspective, the above query can be written as:

Note: In the previous chapter, a named tuple was usually denoted as:

But here we denote the named tuple as:

Not sure which one is better or whether the author did this intentionally.

Semantics

Valuation

A valuation over is a total function from to , which can be viewed as a syntactic expression of the form:

where

  • is a listing of ,
  • for each .
Interpretation as a Set

If , and , then is the valuation over that agrees with on and maps to .

Satisfaction

Let be a database schema, a conjunctive calculus formula over , and a valuation over . Then , if

  1. is an atom and ; or
  2. and and ; or
  3. and there exists a constant such that .
Image

Let be a conjunctive calculus query over . For an instance over , the image of under is:

  • active domain
    • of a formula , denoted , is the set of constants that occur in
    • is an abbreviation for
    • If , then the range of is contained in
      • to evaluate a conjunctive calculus query, one need only consider valuations with range constrained in , i.e., only a finite number of them
  • equivalent: conjunctive calculus formulas and over are equivalent, if
    • they have the same free variables and,
    • for each over and valuation over , if and only if

Normal Form

Each conjunctive calculus query is equivalent to a conjunctive calculus query in normal form.

Expressiveness of Query Languages

Let and be two query languages.

  • : is dominated by (or is weaker than ), if for each query , there exists a query such that .
  • : and are equivalent, if and .

The rule-based conjunctive queries, tableau queries, and conjunctive calculus queries are all equivalent.

Incorporating Equality

The conjunctive query

can be expressed as:

Problem 1: Infinite Answers

Unrestricted rules with equality may yield infinite answers:

A rule-based conjunctive query with equality is a range-restricted rule-based conjunctive query with equality.

Problem 2: Unsatisfiable Queries

Consider the following query:

where is a unary relation and with .

Each satisfiable rule with equality is equivalent to a rule without equality.
No expressive power is gained if the query is satisfiable.

Query Composition and Views

a conjunctive query program (with or without equality) is a sequence of rules having the form:

where

  • each is a distinct relation name not in ,
  • for each , the only relation names that may occur in are in .

Closure under Composition

If conjunctive query program defines final relation , then there is a conjunctive query , possibly with equality, such that on all input instances , .

If is satisfiable, then there is a without equality.

Composition and User Views

views are specified as queries (or query programs), which may be

  • materialized: a physical copy of the view is stored and maintained
  • virtual: relevant information about the view is computed as needed
    • queries against the virtual view generate composed queries against the underlying database

Algebraic Perspectives

The Unnamed Perspective: The SPC Algebra

Unnamed conjunctive algebra:

  • Selection ()
  • Projection ()
  • Cross-Product (or Cartesian Product, )

Selection (“Horizontal” Operator)

Primitive Forms: and
where are positive integers, and (constants are usually surrounded by quotes).

is sometimes called atomic selection.

Projection (“Vertical” Operator)

General Form:
where and are positive integers (empty sequence is written ).

Cross-Product (or Cartesian Product)

Formal Inductive Definition

Let be a relation schema.

There are two kinds of base SPC (algebra) queries:

  1. Input Relation: Expression ; with arity equal to
  2. Unary Singleton Constant: Expression , where ; with arity equal to 1

The family of SPC (algebra) queries contains all above base SPC queries and, for SPC queries with arities , respectively,

  • Selection: and , whenever and ; these have arity .
  • Projection: , where ; this has arity .
  • Cross-Product: ; this has arity .

Unsatisifiable Queries

E.g. where and .

This is equivalent to .

Generalized SPC Algebra

  • Intersection (): is easily simulated by selection and cross-product.
  • Generalized Selection (): permits the specification of multiple conditions.
    • Positive Conjunctive Selection Formula: where each is either or
    • Positive Conjunctive Selection Operator: an expression of the form , where is a positive conjunctive selection formula
    • can be simulated by a sequence of selections
  • Equi-Join (): a binary operator that combines cross-product and selection
    • Equi-Join Condition: where each is of the form
    • Equi-Join Operator: an expression of the form , where is an equi-join condition
    • can be simulated by , where is obtained from by replacing each condition with

Normal Form

where

  • ;
  • ;
  • ;
  • are relation names (repeats permitted);
  • is a positive conjunctive selection formula.

For each (generalized) SPC query, there is an equivalent SPC query in normal form.

The proof is based on repeated application of the following equivalence-preserving SPC algebra rewrite rules (or transformations):

  • Merge-Select:
  • Merge-Project:
  • Push-Select-Though-Project:
  • Push-Select-Though-Singleton:
  • Associate-Cross:
  • Commute-Cross: where and
  • Push-Cross-Though-Select: where is obtained from by replacing each with
  • Push-Cross-Though-Project: where is obtained from by replacing each with
Notation for Rewriting

  • for a set of rewrite rules and algebra expressions ,
  • if is understood from the context,
  • if is the result of replacing a subexpression of according to one of the rules in .

: the reflexive and transitive closure of .

A family of rewrite rules is sound if implies . If is sound, then implies .

The Named Perspective: The SPJR Algebra

Named conjunctive algebra:

  • Selection ()
  • Projection (); repeats not permitted
  • Join (); natural join
  • Renaming ()

Selection

where

These operators apply to any instance with .

Projection

where and

(Natural) Join

where and have sorts and , respectively.

  • When , .
  • When , .

Renaming

An attribute renaming for a finite set of attributes is a one-one mapping from to .

where

  • input is an instance over
  • is an attribute renaming for
    • which can be described by , where
      • usually written as to indicate that for each

Formal Inductive Definition

The base SPJR algebra queries are:

  1. Input Relation: Expression ; with sort equal to
  2. Unary Singleton Constant: Expression , where ; with

The remainder of the syntax and semantics of the SPJR algebra is defined in analogy to those of the SPC algebra.

Normal Form

where

  • ;
  • ;
  • occurs in ;
  • the s are distinct;
  • are relation names (repeats permitted);
  • is a renaming operator for ;
  • no occur in any ;
  • the sorts of are pairwise disjoint;
  • is a positive conjunctive selection formula.

For each SPJR query, there is an equivalent SPJR query in normal form.

Equivalence Theorem

  • Lemma: The SPC and SPJR algebras are equivalent.
  • Equivalence Theorem: The rule-based conjunctive queries, tableau queries, conjunctive calculus queries, satisifiable SPC algebra, and satisifiable SPJR algebra are all equivalent.

Adding Union

The following have equivalent expressive power:

  • the nonrecursive datalog programs (with single relation target)
  • the SPCU queries
  • the SPJRU queries
    • union can only be applied to expressions having the same sort

Nonrecursive Datalog Program

A nonrecursive datalog program (nr-datalog program) over schema is a set of rules:

where

  • no relation name in occurs in a rule head
  • the same relation name may appear in more than one rule head
  • there is some ordering of the rules such that the relation name in the head of does not occur in the body of any rule with

Union and the Conjunctive Calculus

Simply permitting disjunction (denoted ) in the formula of a conjunctive calculus along with conjunction can have serious consequences.

E.g.,

The answer of on nonempty instance will be

Because decides that the first two components in are in and can be anything in , and decides that the last two components in are in and can be anything in .