Unlock the power of Google Analytics with Big Query

Calculating Sessions in BigQuery GA4: Addressing Discrepancies

Calculating Sessions in BigQuery GA4: Addressing Discrepancies

Calculating the number of sessions in BigQuery GA4 provides immense flexibility and granularity. However, it can often lead to discrepancies when compared to the GA4 user interface (UI). Understanding these differences and how to mitigate them is key to accurate analysis.

How to Calculate Sessions in BigQuery GA4

The most accurate and robust method to calculate sessions in BigQuery GA4 involves identifying unique combinations. These include user_pseudo_id (or user_id if implemented) and ga_session_id.

Basic SQL for Session Count:

SELECT
  COUNT(DISTINCT CONCAT(user_pseudo_id, (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id'))) AS total_sessions
FROM
  `your_project_id.your_dataset_id.events_*` -- Replace with your project and dataset
WHERE
  _TABLE_SUFFIX BETWEEN 'YYYYMMDD' AND 'YYYYMMDD' -- Specify your date range

Explanation of the SQL:

General Discrepancy Scenarios

Even without considering consent mode, several factors can cause differences between session counts in the GA4 UI and your BigQuery export:

  1. Unique Count Approximation (GA4 UI):
    • For very large datasets, the GA4 UI, particularly in standard reports, might use an estimation algorithm (HyperLogLog++) for unique counts (like sessions) to improve query performance. BigQuery, conversely, typically provides exact counts.
  2. Session Definition Nuances:
    • Session Start Event: In BigQuery, you explicitly query for the ga_session_id associated with events. GA4’s internal logic, however, can sometimes infer a new session from any event (e.g., a page_view or even a custom event) if no existing session is active within the 30-minute timeout window. While the session_start event is crucial, the GA4 UI’s session count might be slightly higher due to these implicit session initiations.
    • Cross-Day Sessions: In GA4, a session can span across midnight if the user remains active. This differs from Universal Analytics, which would typically create a new session at midnight.
  3. Data Processing Delays:
    • There can be a delay (typically 24-72 hours) for raw data to appear in BigQuery after it’s collected. This means very recent GA4 UI data (especially in real-time reports) will not yet be fully reflected in your BigQuery export.
  4. Sampling (GA4 UI Explorations):
    • While standard GA4 reports are generally unsampled, explorations (e.g., Funnel Explorations, Path Explorations) can be sampled for very large date ranges or complex queries. BigQuery provides access to the raw, unsampled data.
  5. Reporting Identity:
    • The GA4 UI allows you to choose different “Reporting Identities” (e.g., Blended, Observed, Device-Based). “Blended” uses Google Signals and predictive modeling, which can de-duplicate users and sessions beyond what’s available in the raw BigQuery export. BigQuery only reflects the data collected from the client.
    • If your GA4 property uses the “Blended” identity, the UI’s session counts will likely be higher due to this modeling.
  6. Time Zone Differences:
    • BigQuery’s event_timestamp is always in UTC. The GA4 UI reports typically display data based on your property’s configured reporting time zone. This can lead to apparent date-based discrepancies if not accounted for in your BigQuery queries.
Workarounds and Recommendations
  1. Use CONCAT(user_pseudo_id, ga_session_id) for Session Uniqueness:
    • This remains the most robust and recommended method for calculating unique sessions in BigQuery. It directly reflects how GA4 identifies a unique user session at the event level.
  2. Align Reporting Identity for Comparison:
    • When comparing BigQuery data to the GA4 UI, set the GA4 property’s “Reporting Identity” to “Device-based”. This will show data as close as possible to the raw data available in BigQuery, minimizing the impact of modeling and Google Signals.
  3. Account for Data Processing Latency:
    • When comparing, ensure you’re looking at data that is sufficiently old to have fully landed in BigQuery (e.g., data from at least 3 days ago).
  4. Standardize Time Zones:
    • When querying BigQuery, convert event_timestamp to your GA4 property’s reporting time zone if you’re comparing date-specific metrics.
    • SQL
-- Example for converting timestamp to a specific timezone 
SELECT 
TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, 'America/Chicago') AS event_date,
 -- Replace with your timezone 
COUNT(DISTINCT CONCAT(user_pseudo_id, (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id'))) AS total_sessions 
FROM `your_project_id.your_dataset_id.events_*` 
WHERE _TABLE_SUFFIX BETWEEN 'YYYYMMDD' AND 'YYYYMMDD' GROUP BY 1 ORDER BY 1
  1. Implement user_id for Cross-Device/Session Tracking:
    • If feasible, implement user_id as a persistent identifier for logged-in users. This allows for more accurate de-duplication of users and sessions across different devices and time periods, providing a more reliable foundation for your analysis in BigQuery.
  2. Focus on Trends over Absolute Numbers:
    • Given the inherent differences, sometimes focusing on trends and relative changes in sessions over time in BigQuery is more valuable than trying to match the GA4 UI’s absolute numbers perfectly.
  3. Clearly Document Differences:
    • If you’re presenting data derived from BigQuery alongside GA4 UI reports, always communicate the potential reasons for discrepancies to your stakeholders. Transparency builds trust.
  4. Regularly Review Your GA4 Implementation:
    • Ensure your GA4 data layer and tag configurations are correct and consistent. Errors in implementation are a common source of data discrepancies.

By understanding these common discrepancy scenarios and applying the suggested workarounds and recommendations, you can more effectively leverage your GA4 BigQuery export for in-depth and reliable analytics.

Discover more from GA4 BigQuery

Subscribe now to keep reading and get access to the full archive.

Continue reading