Unlock the power of Google Analytics with Big Query

How to Reconstruct Accurate Sessions in BigQuery: Session Stitching in GA4

Session counts in GA4 often don’t match what you see in BigQuery exports. This is because GA4 constructs sessions using rules like inactivity timeouts, traffic source changes, and midnight splits.

For analysts and marketers, this can lead to confusion in attribution, engagement tracking, and conversion reporting. In this post, we’ll break down session stitching, provide simple and advanced SQL examples, and show how to handle common real-world scenarios.


The Problem: Why Sessions Don’t Match

GA4 defines a session as a group of user interactions within a specific time frame. However, sessions can be split or duplicated due to:

  1. Inactivity Timeout: Sessions end after 30 minutes of user inactivity.
  2. Traffic Source Changes: Switching campaigns mid-session triggers a new session.
  3. Midnight Cutoff: Sessions crossing midnight are split into two sessions.
  4. Device Changes: Optionally, a device category change can start a new session.

Impact:

Solution: Reconstruct Sessions in BigQuery

BigQuery lets you stitch events into sessions based on business rules, giving you accurate session counts and metrics.

1. Simple SQL for Session Counting

If you only need basic session counts, you can use the session_start event:

SELECT
  user_pseudo_id,
  COUNT(event_name) AS sessions
FROM
  `your_project.analytics_XXXX.events_*`
WHERE
  event_name = 'session_start'
GROUP BY user_pseudo_id
ORDER BY sessions DESC;

Explanation:

  • This counts every session_start per user.
  • Simple but doesn’t account for traffic source changes, midnight splits, or inactivity longer than 30 minutes.

2. Deep Dive SQL: Session Stitching With Multiple Scenarios

For accurate, session-level analysis, use this advanced query:

-- Step 1: Identify previous events per user
WITH events_with_lag AS (
  SELECT
    user_pseudo_id,
    user_id, -- Optional for cross-device tracking
    event_timestamp,
    event_name,
    traffic_source.source AS traffic_source,
    traffic_source.medium AS traffic_medium,
    device.category AS device_category,
    LAG(event_timestamp) OVER(PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS prev_ts,
    LAG(traffic_source.source) OVER(PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS prev_source,
    LAG(device.category) OVER(PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS prev_device
  FROM
    `your_project.analytics_XXXX.events_*`
),

-- Step 2: Define session boundaries
session_flags AS (
  SELECT
    *,
    CASE
      WHEN TIMESTAMP_DIFF(TIMESTAMP_MICROS(event_timestamp), TIMESTAMP_MICROS(prev_ts), MINUTE) > 30 THEN 1
      WHEN traffic_source != prev_source THEN 1
      WHEN EXTRACT(DATE FROM TIMESTAMP_MICROS(event_timestamp)) != EXTRACT(DATE FROM TIMESTAMP_MICROS(prev_ts)) THEN 1
      WHEN device_category != prev_device THEN 1
      ELSE 0
    END AS new_session_flag
  FROM events_with_lag
),

-- Step 3: Assign session numbers
sessionized_events AS (
  SELECT
    *,
    SUM(new_session_flag) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ROWS UNBOUNDED PRECEDING) + 1 AS session_number
  FROM session_flags
)

-- Step 4: Aggregate sessions
SELECT
  user_pseudo_id,
  user_id,
  session_number,
  MIN(TIMESTAMP_MICROS(event_timestamp)) AS session_start_time,
  MAX(TIMESTAMP_MICROS(event_timestamp)) AS session_end_time,
  COUNT(*) AS events_in_session,
  ARRAY_AGG(DISTINCT traffic_source) AS traffic_sources_in_session,
  ARRAY_AGG(DISTINCT device_category) AS devices_in_session
FROM sessionized_events
GROUP BY user_pseudo_id, user_id, session_number
ORDER BY user_pseudo_id, session_start_time;

What this query does:

Key Takeaways

TL;DR

Discover more from GA4 BigQuery

Subscribe now to keep reading and get access to the full archive.

Continue reading