URL parsing comes up in every web analytics project. Referrer analysis, UTM breakdowns, domain-level aggregation, deduplication — all of them need you to go from a raw URL string to a clean domain or subdomain.
BigQuery gives you three ways to do this. One is obvious, one is underused, and one is a last resort.
Approach 1: Regex (the obvious one)
```sql
SELECT
  REGEXP_EXTRACT(url, r'https?://([^/]+)') AS host
FROM your_table
```

This works until it doesn't. Edge cases pile up fast:
- URLs without a protocol (`www.example.com/page`)
- Ports in the URL (`example.com:8080/page`)
- Subdomains you didn't anticipate (`api.staging.example.com`)
- Country-code TLDs (`example.co.uk` parsed as `co.uk`)
Every edge case is another regex clause. Three months in, you have a 200-character pattern that nobody can read and nobody wants to touch.
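To make the failure mode concrete, here is the base pattern above on a protocol-less referrer, next to the "fixed" version with one more clause bolted on (a sketch of how the pattern grows, not a recommendation):

```sql
SELECT
  -- The strict pattern finds no match without a protocol and returns NULL
  REGEXP_EXTRACT('www.example.com/page', r'https?://([^/]+)') AS strict_host,
  -- Making the protocol optional handles this case, at the cost of another clause
  REGEXP_EXTRACT('www.example.com/page', r'(?:https?://)?([^/]+)') AS loose_host
```

And the loose pattern still gets `example.co.uk` wrong, which no amount of regex tweaking fixes cleanly.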
We've inherited BigQuery projects with five different regex patterns for domain extraction, each written by a different analyst, each handling a slightly different set of edge cases. The dashboards built on top of them disagreed with each other in ways nobody noticed for months.
Approach 2: NET functions (the right one)
BigQuery's NET function family handles URL parsing natively. Two functions cover 95% of use cases:
NET.HOST(url)
Returns the full host including subdomains:
```sql
SELECT NET.HOST('https://docs.google.com/spreadsheets') AS host
-- docs.google.com
```

NET.REG_DOMAIN(url)
Returns the registrable domain — subdomains stripped, TLD handling built in:
```sql
SELECT NET.REG_DOMAIN('https://docs.google.com/spreadsheets') AS domain
-- google.com

SELECT NET.REG_DOMAIN('https://shop.example.co.uk/cart') AS domain
-- example.co.uk
```

That co.uk handling is the killer feature. NET.REG_DOMAIN understands the public suffix list. Your regex doesn't.
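A related function worth knowing, though needed less often: BigQuery also ships NET.PUBLIC_SUFFIX, which returns just the suffix portion that the public suffix list matched:

```sql
SELECT NET.PUBLIC_SUFFIX('https://shop.example.co.uk/cart') AS suffix
-- co.uk
```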
In a staging model
```sql
-- models/staging/stg_sessions.sql
SELECT
  session_id,
  raw_referrer_url,
  NET.HOST(raw_referrer_url) AS referrer_host,
  NET.REG_DOMAIN(raw_referrer_url) AS referrer_domain
FROM {{ source('analytics', 'raw_sessions') }}
WHERE raw_referrer_url IS NOT NULL
```

Two clean columns. Zero regex. Handles every TLD BigQuery knows about.
NULL behavior
Both functions return NULL for malformed input. That's intentional — you get explicit signal instead of a silent empty string or a partial match.
Test for it:
```yaml
columns:
  - name: referrer_domain
    tests:
      - not_null:
          config:
            severity: warn
```

A rising NULL rate in your staging model tells you something changed upstream — a new referrer format, a tracking bug, a bot hitting your site with garbage URLs.
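To watch that rate directly rather than waiting for a warning, a small monitoring query works (the `loaded_at` timestamp column is an assumption — substitute whatever load-time column your staging model carries):

```sql
SELECT
  DATE(loaded_at) AS day,  -- assumed load-timestamp column
  COUNTIF(referrer_domain IS NULL) AS unparseable,
  COUNT(*) AS total,
  ROUND(100 * COUNTIF(referrer_domain IS NULL) / COUNT(*), 2) AS pct_unparseable
FROM {{ ref('stg_sessions') }}
GROUP BY 1
ORDER BY 1 DESC
```

A day-over-day jump in `pct_unparseable` is the signal to go look at the raw referrer strings.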
Approach 3: UDFs (the last resort)
For genuinely exotic parsing — extracting path segments, query parameters, or fragment identifiers — you can write a JavaScript UDF:
```sql
CREATE TEMP FUNCTION parse_url_param(url STRING, param STRING)
RETURNS STRING
LANGUAGE js AS r"""
  try {
    const u = new URL(url);
    return u.searchParams.get(param);
  } catch (e) {
    return null;
  }
""";

SELECT parse_url_param('https://example.com?utm_source=linkedin', 'utm_source')
-- linkedin
```

JavaScript UDFs are slower and more expensive than native functions. Use them when NET.* and REGEXP_EXTRACT genuinely can't do the job — query parameter extraction is the main case.
For domain and subdomain extraction, NET.HOST and NET.REG_DOMAIN are always the better choice.
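If several models need the same parameter extraction, a persistent UDF keeps the definition in one place instead of repeating the TEMP function in every query (the project and dataset names below are placeholders):

```sql
CREATE OR REPLACE FUNCTION `your_project.udfs.parse_url_param`(url STRING, param STRING)
RETURNS STRING
LANGUAGE js AS r"""
  try {
    return new URL(url).searchParams.get(param);
  } catch (e) {
    return null;
  }
""";
```

Models then call `your_project.udfs.parse_url_param(page_url, 'utm_source')` directly, and a logic change is one deploy rather than a find-and-replace across the repo.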
Practical patterns
Referrer domain grouping
```sql
SELECT
  NET.REG_DOMAIN(referrer_url) AS referrer_domain,
  COUNT(*) AS sessions,
  COUNT(DISTINCT user_id) AS users
FROM {{ ref('stg_sessions') }}
WHERE referrer_url IS NOT NULL
GROUP BY 1
ORDER BY sessions DESC
```

UTM source vs. referrer reconciliation
```sql
SELECT
  COALESCE(utm_source, NET.REG_DOMAIN(referrer_url), 'direct') AS traffic_source,
  COUNT(DISTINCT session_id) AS sessions
FROM {{ ref('stg_sessions') }}
GROUP BY 1
```

Subdomain-level analysis
```sql
SELECT
  NET.HOST(page_url) AS subdomain,
  COUNT(*) AS pageviews
FROM {{ ref('stg_pageviews') }}
WHERE NET.REG_DOMAIN(page_url) = 'yourdomain.com'
GROUP BY 1
ORDER BY pageviews DESC
```

Where to put URL parsing in your stack
In staging. Not in dashboards.
Parse URLs once, in your dbt staging models. Store host, domain, and any extracted parameters as clean columns. Everything downstream — marts, dashboards, reports — reads the clean version.
This gives you:
- Consistency. One definition of "domain" across every dashboard.
- Performance. Parse once at transform time, not on every dashboard query.
- Traceability. A change to parsing logic is one PR in one file.
The alternative — every Looker explore and every ad-hoc query running its own REGEXP_EXTRACT — is how you end up with five domain definitions that disagree.
The decision matrix
| Need | Function | Notes |
|---|---|---|
| Full host with subdomains | NET.HOST() | Handles all standard URLs |
| Root domain only | NET.REG_DOMAIN() | Understands public suffix list (co.uk, com.au, etc.) |
| Specific path segment | REGEXP_EXTRACT() | Regex is fine for structured paths |
| Query parameters | JavaScript UDF | new URL().searchParams in a temp UDF |
| Custom parsing logic | JavaScript UDF | Last resort — slower, more expensive |
For 95% of analytics work, NET.REG_DOMAIN is the answer.
We build BigQuery data stacks where decisions like this are made once and applied everywhere — staging models, testing, documentation. If your analytics layer could use that kind of discipline, let's talk.