WITH ordered_events AS (
SELECT
*,
LAG(event_timestamp) OVER (
PARTITION BY user_pseudo_id
ORDER BY event_timestamp
) AS previous_timestamp,
LAG(source) OVER (
PARTITION BY user_pseudo_id
ORDER BY event_timestamp
) AS previous_source
FROM base_events
)
This query is preparing your GA4 event data for session reconstruction.
The goal is:
“For every event, look at the event immediately before it for the same user.”
That allows you to detect:
- inactivity gaps
- traffic source changes
- session boundaries
- journey continuity
You start with raw GA4 events like this:
| user_pseudo_id | event_timestamp | source |
|---|---|---|
| A | 10:00 | |
| A | 10:05 | |
| A | 11:00 | |
| B | 09:00 | direct |
But session stitching requires knowing:
“What happened BEFORE this event?”
That’s what LAG() does.
What This Query Is Doing
WITH ordered_events AS (
This creates a temporary table called:
ordered_events
Think of it like:
“A cleaned and sequenced event stream.”
Step 1: Select All Existing Fields
SELECT *,
This keeps all original columns from base_events.
So if base_events has:
- user_pseudo_id
- event_name
- event_timestamp
- source
- medium
they all remain.
Step 2: Get Previous Event Timestamp
LAG(event_timestamp) OVER (
PARTITION BY user_pseudo_id
ORDER BY event_timestamp
) AS previous_timestamp
This is the important part.
What LAG() Means
LAG() says:
“Look one row backward.”
So for each event:
- get the previous event’s timestamp
- for the SAME user
- ordered chronologically
- Exmple
Suppose user A has:
| event_timestamp |
|---|
| 10:00 |
| 10:05 |
| 11:00 |
After LAG():
| event_timestamp | previous_timestamp |
|---|---|
| 10:00 | NULL |
| 10:05 | 10:00 |
| 11:00 | 10:05 |
Now you can calculate:
11:00 - 10:05 = 55 minutes
That might indicate:
“New session started.”
Why PARTITION BY user_pseudo_id Matters
PARTITION BY user_pseudo_id
This resets the calculation for each user.
Without it:
- events from different users would mix together
- session logic would break
So this says:
“Only compare events within the same user.”
Why ORDER BY event_timestamp Matters
ORDER BY event_timestamp
This ensures events are processed chronologically.
Otherwise:
- previous events would be random
- inactivity calculations become meaningless
Step 3: Get Previous Traffic Source
LAG(source) OVER (
PARTITION BY user_pseudo_id
ORDER BY event_timestamp
)
AS previous_source
Now we do the same thing — but for traffic source.Example
Before:
| timestamp | source |
|---|---|
| 10:00 | |
| 10:05 | |
| 11:00 |
After:
| timestamp | source | previous_source |
|---|---|---|
| 10:00 | NULL | |
| 10:05 | ||
| 11:00 |
Now you can detect:
“Did acquisition source change?”
If:
source != previous_source
you may decide:
“Start a new session.”
Final Output Structure
After this query, your data becomes:
| user | timestamp | source | previous_timestamp | previous_source |
|---|---|---|---|---|
| A | 10:00 | NULL | NULL | |
| A | 10:05 | 10:00 | ||
| A | 11:00 | 10:05 |
This gives you the foundation for:
- inactivity analysis
- session stitching
- attribution continuity
- journey reconstruction
Why This Matters in GA4 BigQuery
GA4 exports:
- raw events
- NOT reconstructed sessions
So to rebuild sessions:
- order events
- compare against previous behavior
- detect boundaries
- assign session IDs
This query handles:
Step #2: behavioral comparison.
The Mental Model
Think of LAG() as:
“Attach the previous row’s value onto the current row.”
That’s it.
For analytics engineering, this becomes extremely powerful for:
- sessionization
- funnel progression
- churn analysis
- attribution modeling
- pathing analysis
Leave a Reply