URL parsing comes up in every web analytics project. Referrer analysis, UTM breakdowns, domain-level aggregation, deduplication — all of them need you to go from a raw URL string to a clean domain or subdomain.
BigQuery gives you three ways to do this. One is obvious, one is underused, and one is a last resort.
Approach 1: Regex (the obvious one)
```sql
SELECT
  REGEXP_EXTRACT(url, r'https?://([^/]+)') AS host
FROM your_table
```

This works until it doesn't. Edge cases pile up fast:
- URLs without a protocol (`www.example.com/page`)
- Ports in the URL (`example.com:8080/page`)
- Subdomains you didn't anticipate (`api.staging.example.com`)
- Country-code TLDs (`example.co.uk` parsed as `co.uk`)
Every edge case is another regex clause. Three months in, you have a 200-character pattern that nobody can read and nobody wants to touch.
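To make the failure mode concrete, here is the base pattern above on a protocol-less referrer, next to the "fixed" version with one more clause bolted on (a sketch of how the pattern grows, not a recommendation):

```sql
SELECT
  -- The strict pattern finds no match without a protocol and returns NULL
  REGEXP_EXTRACT('www.example.com/page', r'https?://([^/]+)') AS strict_host,
  -- Making the protocol optional handles this case, at the cost of another clause
  REGEXP_EXTRACT('www.example.com/page', r'(?:https?://)?([^/]+)') AS loose_host
```

And the loose pattern still gets `example.co.uk` wrong, which no amount of regex tweaking fixes cleanly.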
We've inherited BigQuery projects with five different regex patterns for domain extraction, each written by a different analyst, each handling a slightly different set of edge cases. The dashboards built on top of them disagreed with each other in ways nobody noticed for months.
Approach 2: NET functions (the right one)
BigQuery's NET function family handles URL parsing natively. Two functions cover 95% of use cases:
NET.HOST(url)
Returns the full host including subdomains:
```sql
SELECT NET.HOST('https://docs.google.com/spreadsheets') AS host
-- docs.google.com
```

NET.REG_DOMAIN(url)
Returns the registrable domain — subdomains stripped, TLD handling built in:
```sql
SELECT NET.REG_DOMAIN('https://docs.google.com/spreadsheets') AS domain
-- google.com

SELECT NET.REG_DOMAIN('https://shop.example.co.uk/cart') AS domain
-- example.co.uk
```

That co.uk handling is the killer feature. NET.REG_DOMAIN understands the public suffix list. Your regex doesn't.
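A related function worth knowing, though needed less often: BigQuery also ships NET.PUBLIC_SUFFIX, which returns just the suffix portion that the public suffix list matched:

```sql
SELECT NET.PUBLIC_SUFFIX('https://shop.example.co.uk/cart') AS suffix
-- co.uk
```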
In a staging model
```sql
-- models/staging/stg_sessions.sql
SELECT
  session_id,
  raw_referrer_url,
  NET.HOST(raw_referrer_url) AS referrer_host,
  NET.REG_DOMAIN(raw_referrer_url) AS referrer_domain
FROM {{ source('analytics', 'raw_sessions') }}
WHERE raw_referrer_url IS NOT NULL
```

Two clean columns. Zero regex. Handles every TLD BigQuery knows about.
NULL behavior
Both functions return NULL for malformed input. That's intentional — you get explicit signal instead of a silent empty string or a partial match.
Test for it:
```yaml
columns:
  - name: referrer_domain
    tests:
      - not_null:
          config:
            severity: warn
```

A rising NULL rate in your staging model tells you something changed upstream — a new referrer format, a tracking bug, a bot hitting your site with garbage URLs.
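To watch that rate directly rather than waiting for a warning, a small monitoring query works (the `loaded_at` timestamp column is an assumption — substitute whatever load-time column your staging model carries):

```sql
SELECT
  DATE(loaded_at) AS day,  -- assumed load-timestamp column
  COUNTIF(referrer_domain IS NULL) AS unparseable,
  COUNT(*) AS total,
  ROUND(100 * COUNTIF(referrer_domain IS NULL) / COUNT(*), 2) AS pct_unparseable
FROM {{ ref('stg_sessions') }}
GROUP BY 1
ORDER BY 1 DESC
```

A day-over-day jump in `pct_unparseable` is the signal to go look at the raw referrer strings.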
Approach 3: UDFs (the last resort)
For genuinely exotic parsing — extracting path segments, query parameters, or fragment identifiers — you can write a JavaScript UDF:
```sql
CREATE TEMP FUNCTION parse_url_param(url STRING, param STRING)
RETURNS STRING
LANGUAGE js AS r"""
  try {
    const u = new URL(url);
    return u.searchParams.get(param);
  } catch (e) {
    return null;
  }
""";

SELECT parse_url_param('https://example.com?utm_source=linkedin', 'utm_source')
-- linkedin
```

JavaScript UDFs are slower and more expensive than native functions. Use them when NET.* and REGEXP_EXTRACT genuinely can't do the job — query parameter extraction is the main case.
For domain and subdomain extraction, NET.HOST and NET.REG_DOMAIN are always the better choice.
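If several models need the same parameter extraction, a persistent UDF keeps the definition in one place instead of repeating the TEMP function in every query (the project and dataset names below are placeholders):

```sql
CREATE OR REPLACE FUNCTION `your_project.udfs.parse_url_param`(url STRING, param STRING)
RETURNS STRING
LANGUAGE js AS r"""
  try {
    return new URL(url).searchParams.get(param);
  } catch (e) {
    return null;
  }
""";
```

Models then call `your_project.udfs.parse_url_param(page_url, 'utm_source')` directly, and a logic change is one deploy rather than a find-and-replace across the repo.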
Practical patterns
Referrer domain grouping
```sql
SELECT
  NET.REG_DOMAIN(referrer_url) AS referrer_domain,
  COUNT(*) AS sessions,
  COUNT(DISTINCT user_id) AS users
FROM {{ ref('stg_sessions') }}
WHERE referrer_url IS NOT NULL
GROUP BY 1
ORDER BY sessions DESC
```

UTM source vs. referrer reconciliation
```sql
SELECT
  COALESCE(utm_source, NET.REG_DOMAIN(referrer_url), 'direct') AS traffic_source,
  COUNT(DISTINCT session_id) AS sessions
FROM {{ ref('stg_sessions') }}
GROUP BY 1
```

Subdomain-level analysis
```sql
SELECT
  NET.HOST(page_url) AS subdomain,
  COUNT(*) AS pageviews
FROM {{ ref('stg_pageviews') }}
WHERE NET.REG_DOMAIN(page_url) = 'yourdomain.com'
GROUP BY 1
ORDER BY pageviews DESC
```

Where to put URL parsing in your stack
In staging. Not in dashboards.
Parse URLs once, in your dbt staging models. Store host, domain, and any extracted parameters as clean columns. Everything downstream — marts, dashboards, reports — reads the clean version.
This gives you:
- Consistency. One definition of "domain" across every dashboard.
- Performance. Parse once at transform time, not on every dashboard query.
- Traceability. A change to parsing logic is one PR in one file.
The alternative — every Looker explore and every ad-hoc query running its own REGEXP_EXTRACT — is how you end up with five domain definitions that disagree.
The decision matrix
| Need | Function | Notes |
|---|---|---|
| Full host with subdomains | NET.HOST() | Handles all standard URLs |
| Root domain only | NET.REG_DOMAIN() | Understands public suffix list (co.uk, com.au, etc.) |
| Specific path segment | REGEXP_EXTRACT() | Regex is fine for structured paths |
| Query parameters | JavaScript UDF | new URL().searchParams in a temp UDF |
| Custom parsing logic | JavaScript UDF | Last resort — slower, more expensive |
For 95% of analytics work, NET.REG_DOMAIN is the answer.
We build BigQuery data stacks where decisions like this are made once and applied everywhere — staging models, testing, documentation. If your analytics layer could use that kind of discipline, let's talk.