5G UPF: Smart Session Placement
Alexandre Cassen, <acassen@gmail.com>
The companion article described how range partitioning, flow steering policy, and CPU scheduling groups form an end-to-end pipeline in GTP-Guard. When a PFCP session is established, a scheduling algorithm elects a CPU, and that election determines the TEID range, the IP sub-prefix, and the NIC queue assignment for the entire session lifetime. The previous article introduced connection- and resource-based algorithms (wlc, sed, lc, ll, lbw, lpps) that each operate on a single metric.
This article goes further. It introduces trend-based algorithms that look at load trajectory rather than a point-in-time snapshot, a Weighted Score Composite that blends multiple metrics into a single normalized score, and a Constraint-Based Scheduler that enforces hard limits before delegating to any other algorithm. Together they give operators fine-grained control over session placement quality and SLA differentiation.
Why Single-Metric Scheduling Is Not Enough
Every scheduling algorithm from the previous article operates on one metric dimension.
Connection-based algorithms (lc, wlc, sed, nq) look at pfcp_sessions. Load-based
algorithms (ll) look at CPU utilization. Traffic-based algorithms (lbw, lpps) look at
bandwidth or packet rate.
In a real data-plane workload these dimensions are correlated but not interchangeable. A CPU can sit at low utilization while saturating its NIC queue bandwidth. It can hold many sessions that are mostly idle. It can show moderate load right now while absorbing a burst that will saturate it in five seconds.
Single-metric scheduling forces the operator to pick one dimension and hope the others follow. For a uniform workload with similar session profiles, this is fine. For a multi-service UPF handling video, VoLTE, IoT, and web browsing simultaneously, it leaves blind spots.
Smoothing the Signal
Before combining metrics or detecting trends, each input must be smoothed. Raw instantaneous values are noisy. The per-CPU metrics polling runs every 200ms, and ethtool stats arrive every 3 seconds. A single sampling interval can capture a burst that vanishes by the next tick.
GTP-Guard maintains EWMA (Exponentially Weighted Moving Average) smoothed values alongside the raw counters for all key metrics:
float load_ewma;           /* smoothed CPU utilization */
double rx_bw_bps_ewma;     /* smoothed RX bandwidth, bits/s */
double tx_bw_bps_ewma;     /* smoothed TX bandwidth, bits/s */
double total_bw_bps_ewma;  /* smoothed total bandwidth, bits/s */
double rx_pps_ewma;        /* smoothed RX packet rate */
double tx_pps_ewma;        /* smoothed TX packet rate */
These are updated at the polling tick using the formula:
smoothed = alpha * current + (1 - alpha) * previous_smoothed
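As a sketch, the update rule can be written as a small helper. The function names here are illustrative, not the actual GTP-Guard API; the real update runs inside the metrics polling path:

```c
#include <assert.h>
#include <math.h>

/* Sketch of the EWMA update rule above. */
static double ewma_update(double current, double prev_smoothed, double alpha)
{
    return alpha * current + (1.0 - alpha) * prev_smoothed;
}

/* Apply n polling ticks of a constant input, starting from prev. */
static double ewma_run(double input, double prev, double alpha, int n)
{
    while (n-- > 0)
        prev = ewma_update(input, prev, alpha);
    return prev;
}
```

With alpha = 0.2, a unit step reaches 0.2 after one tick and about 0.67 after five, which is where the five-tick convergence figure mentioned below comes from.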
Quickly on EWMA
An Exponentially Weighted Moving Average reacts to sustained changes while
ignoring short-lived spikes, which is exactly what a scheduler needs when load,
bandwidth, and packet rate all jitter from tick to tick. Each new sample
contributes a fraction alpha of its value, and the remaining 1 - alpha carries
the previous smoothed value forward. Expanding the recursion, older samples decay
geometrically as (1 - alpha)^n, so their influence fades but never disappears
abruptly. A small alpha (for example 0.1) gives long memory and heavy damping,
while a larger alpha (0.5 or more) tracks the raw signal more closely. Compared
with a plain moving window, EWMA keeps a constant memory footprint of one float
per metric and avoids the sharp edge effect that appears when a sample leaves the
window. For detailed information on EWMA, see the
NIST/SEMATECH e-Handbook section on EWMA control charts.
The default alpha is 0.2. With the 3-second ethtool interval for traffic metrics, this
converges after roughly 15 seconds (five ticks). The show cpu-sched output displays both
raw and smoothed values so operators can observe the smoothing in real time:
CPU Scheduling Group: dp-plane (algorithm: wsc)
CPU Weight Sessions Load Load~ BW(Mbps) BW~(Mbps) PPS PPS~
4 100 42 0.23 0.21 850.2 812.4 125430 121200
5 100 38 0.31 0.22 790.1 801.3 118200 119500
6 100 55 0.18 0.20 1200.5 1050.8 189400 175300
7 100 29 0.45 0.38 420.0 455.2 62100 68900
The ~ columns show the EWMA-smoothed values. CPU 5 has a raw load of 0.31 (a transient
spike) but its smoothed load is 0.22, much closer to its sustained state. CPU 6 shows a raw
bandwidth of 1200 Mbps but the smoothed value is 1050 Mbps, filtering out a recent burst.
Trend-Based Algorithms
gauge_history: capturing trajectory
Every metric except session count maintains a ring buffer of historical samples in
gtp_percpu_metrics. The load history is sampled every 200ms and stores up to 256 samples,
covering roughly 51 seconds. The bandwidth and packet rate histories are sampled every 3
seconds and store up to 256 samples, covering roughly 12 minutes.
This history data enables two trend-based scheduling algorithms.
ls (Least-Slope)
The ls algorithm computes a linear slope over a configurable window of recent load samples. The CPU with the lowest slope (most negative or least positive) wins. A CPU trending down is preferred over one trending up, even if its current absolute load is higher.
slope = (newest_sample - oldest_sample) / window_samples
The window parameter controls reactivity:
| Window | Time coverage | Behavior |
|---|---|---|
| 5-10 samples | 1-2 seconds | Reacts quickly to short bursts |
| 25 samples | 5 seconds | Balanced (default) |
| 100-256 samples | 20-51 seconds | Captures longer-term trends, ignores transients |
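A minimal sketch of the slope computation, assuming the window of samples has already been copied out of the gauge_history ring, oldest first. The names are illustrative assumptions:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* slope = (newest_sample - oldest_sample) / window_samples */
static double load_slope(const float *samples, size_t window)
{
    if (window < 2)
        return 0.0;
    return ((double)samples[window - 1] - (double)samples[0]) / (double)window;
}

/* Five 200ms load samples of a steady ramp, oldest first. */
static const float demo_ramp[5] = {0.40f, 0.43f, 0.47f, 0.51f, 0.55f};
```

For the ramp above, load_slope(demo_ramp, 5) is 0.03, i.e. load rising by three points per sample; ls would prefer any CPU with a smaller or negative slope.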
Configuration:
cpu-sched-group dp-adaptive
cpumask 0-7
algorithm ls
window 50
!
With a window of 50 (10 seconds), ls ignores sub-second jitter and catches sustained ramps. Consider two CPUs: CPU 4 is at 45% load but trending upward (slope +0.03), while CPU 5 is at 55% load but trending downward (slope -0.02). A point-in-time algorithm picks CPU 4 because it has lower load. The ls algorithm picks CPU 5 because it is cooling down while CPU 4 is heating up.
The show cpu-sched output keeps the same layout as every other algorithm. The slope is
computed internally from the load history ring buffer and drives the election, but it is
not printed as a dedicated column. Operators read the trend indirectly by comparing Load
(raw) with Load~ (EWMA-smoothed):
CPU Scheduling Group: dp-adaptive (algorithm: ls)
CPU Weight Sessions Load Load~ BW(Mbps) BW~(Mbps) PPS PPS~
4 100 42 0.45 0.42 850.2 835.1 125430 123800
5 100 38 0.55 0.57 790.1 795.6 118200 119100
CPU 4 shows a raw load (0.45) higher than its smoothed value (0.42), so the sustained trend is upward. CPU 5 shows the opposite pattern with 0.55 raw versus 0.57 smoothed, so the sustained state is cooling. The ls algorithm picks CPU 5 even though its absolute load is higher at the sampling instant.
ewma (EWMA Least-Load)
The ewma algorithm uses the pre-computed load_ewma to elect the CPU with the lowest
smoothed load. It is structurally identical to ll (least-load) but operates on the smoothed
value instead of the raw one.
The smoothing factor alpha is configurable per scheduling group:
cpu-sched-group dp-smooth
cpumask 0-7
algorithm ewma
ewma-alpha 0.1
!
A lower alpha (0.05-0.1) gives heavier smoothing and longer memory, dampening most spikes. A higher alpha (0.5-0.8) follows load closely with mild smoothing. The default of 0.2 provides a good balance for typical mobile data-plane workloads.
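To see what the alpha choice buys, consider a one-tick spike of height 1.0 on a zero baseline: the residue left on the smoothed value n ticks later is alpha * (1 - alpha)^n. A hedged sketch with illustrative names:

```c
#include <assert.h>
#include <math.h>

static double ewma_mix(double cur, double prev, double alpha)
{
    return alpha * cur + (1.0 - alpha) * prev;
}

/* Smoothed residue n ticks after a single one-tick spike of height
 * 1.0 on a zero baseline: analytically alpha * (1 - alpha)^n. */
static double spike_residue(double alpha, int n)
{
    double s = ewma_mix(1.0, 0.0, alpha);  /* the spike tick */
    while (n-- > 0)
        s = ewma_mix(0.0, s, alpha);       /* back to baseline */
    return s;
}
```

With alpha 0.1 the spike moves the smoothed value by at most 0.1 of its height; with alpha 0.5 half of it bleeds through. This is why low-alpha groups suit workloads where transient spikes must not influence placement.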
Weighted Score Composite (WSC)
WSC addresses the core limitation of single-metric algorithms by computing a composite score from four EWMA-smoothed metrics, weighted by the operator.
The four metrics
| Index | Metric | Source |
|---|---|---|
| load | CPU utilization | load_ewma |
| sessions | Session count | pfcp_sessions |
| bw | Total bandwidth | total_bw_bps_ewma |
| pps | Packet rate | rx_pps_ewma + tx_pps_ewma |
Scoring
WSC runs two passes over the cpumask.
The first pass collects metric values for each CPU and tracks the group-wide maximum for each metric. These maxima serve as normalization denominators.
The second pass computes a normalized composite score for each CPU:
score(cpu) = sum over k: weight[k] * (value[k] / max[k])
Dividing by the group-wide max normalizes each metric to [0.0, 1.0], making them comparable regardless of unit or scale. The CPU with the lowest score wins.
Normalization is relative to the current group state, not to any absolute capacity. WSC adapts automatically as load increases. It does not need to know the NIC line rate or CPU frequency.
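The two passes can be sketched as follows. The array layout and function name are illustrative assumptions, not the actual GTP-Guard structures:

```c
#include <assert.h>
#include <stddef.h>

#define WSC_NMETRICS 4 /* load, sessions, bw, pps */

/* Sketch of the two-pass WSC election: pass 1 finds group-wide
 * maxima, pass 2 scores each CPU; the lowest score wins. */
static size_t wsc_elect(double vals[][WSC_NMETRICS],
                        const double weight[WSC_NMETRICS], size_t ncpu)
{
    double max[WSC_NMETRICS] = {0};
    double best_score = -1.0;
    size_t best = 0;

    for (size_t c = 0; c < ncpu; c++)          /* pass 1: maxima */
        for (size_t k = 0; k < WSC_NMETRICS; k++)
            if (vals[c][k] > max[k])
                max[k] = vals[c][k];

    for (size_t c = 0; c < ncpu; c++) {        /* pass 2: scores */
        double score = 0.0;
        for (size_t k = 0; k < WSC_NMETRICS; k++)
            if (max[k] > 0.0)
                score += weight[k] * (vals[c][k] / max[k]);
        if (best_score < 0.0 || score < best_score) {
            best_score = score;
            best = c;
        }
    }
    return best;
}

/* Two-CPU demo: CPU 1 is lower on every metric. */
static double wsc_demo[2][WSC_NMETRICS] = {
    {0.50, 100.0, 1000.0, 500.0},
    {0.25,  50.0,  500.0, 250.0},
};
static const double wsc_w[WSC_NMETRICS] = {1.0, 1.0, 1.0, 1.0};
```

In the demo, CPU 0 normalizes to 1.0 on every metric (score 4.0) while CPU 1 normalizes to 0.5 (score 2.0), so CPU 1 is elected.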
Operator-tunable weights
The metric-weight command sets the importance of each metric. Weights do not need to sum to
1.0. Setting load 2.0 and bw 1.0 makes load twice as important as bandwidth.
Bandwidth-dominated workload. The operator knows the bottleneck is NIC throughput (video streaming, large file transfers):
cpu-sched-group upf-bw-heavy
cpumask 0-7
algorithm wsc
metric-weight load 1.0
metric-weight bw 3.0
metric-weight pps 1.0
metric-weight sessions 0.0
!
Bandwidth gets 3x weight. Sessions are disabled (weight 0.0) because each session carries high traffic volume, making session count a poor proxy for actual load.
Session-heavy IoT deployment. Many low-throughput bearers where the bottleneck is state-table pressure:
cpu-sched-group iot-pool
cpumask 0-7
algorithm wsc
metric-weight sessions 2.0
metric-weight load 1.0
metric-weight bw 0.5
metric-weight pps 0.5
!
Sessions get 2x weight because each device generates negligible traffic but consumes memory and hash-table entries.
Balanced default. When no single bottleneck dominates:
cpu-sched-group upf-balanced
cpumask 0-7
algorithm wsc
!
All four metric weights default to 1.0, giving equal importance to every dimension.
Constraint-Based Scheduling (CBS)
WSC blends metrics into a single score and always picks the least-loaded CPU. But it cannot express hard limits. If every CPU in the group exceeds 85% load, WSC still picks the least-bad one without signaling that a critical threshold has been crossed. It also cannot express trend-based policies because the composite score flattens all dimensions into a single number.
CBS solves this by separating the decision into two phases.
Phase 1: constraint filter
The algorithm evaluates every CPU against operator-defined constraints. A constraint is defined by three parameters: a metric (load, sessions, bw, pps), a mode (instant, ewma, slope), and a threshold. If a CPU's metric value exceeds the threshold, the CPU is excluded from the candidate set.
The three modes leverage different data sources:
| Mode | Source | Use case |
|---|---|---|
| instant | Raw current value | Hard real-time limits |
| ewma | EWMA-smoothed value | Filtering transient spikes |
| slope | Trend from gauge_history | Detecting ramp-ups before saturation |
The slope mode is the most interesting. Instead of asking "is this CPU busy?", it asks "is this CPU becoming busy?" A CPU at 55% load with a slope of +0.05 is a worse placement target than one at 65% with a flat or declining trend. The slope is computed from the gauge_history ring buffer over a configurable window.
Phase 2: fallback delegation
After filtering, the survivor set becomes the candidate cpumask. Any existing scheduling algorithm can serve as the fallback (wlc, wsc, ll, ewma, or any other). CBS temporarily restricts the group's cpumask to the survivors, calls the fallback algorithm, and restores the original cpumask. The fallback sees a reduced CPU set and operates normally on it.
If no CPU passes all constraints (complete overload), CBS falls back to the full cpumask. Refusing placement entirely would cause session establishment failures, which is worse than placing on a busy CPU. The operator can detect this condition through debug logging.
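The two phases can be sketched as a filter over a cpumask followed by delegation. The callback types and names here are illustrative, not the actual GTP-Guard API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef bool (*cbs_check_t)(int cpu);          /* true if cpu passes every constraint */
typedef int  (*cbs_fallback_t)(uint64_t mask); /* elects a cpu from a cpumask */

/* Phase 1 filters the cpumask; phase 2 delegates to the fallback.
 * An empty survivor set (complete overload) falls back to the full
 * cpumask rather than refusing placement. */
static int cbs_elect(uint64_t cpumask, cbs_check_t passes, cbs_fallback_t fallback)
{
    uint64_t survivors = 0;

    for (int cpu = 0; cpu < 64; cpu++)
        if ((cpumask & (1ULL << cpu)) && passes(cpu))
            survivors |= 1ULL << cpu;

    return fallback(survivors ? survivors : cpumask);
}

/* Demo plumbing: CPUs 0-2 are over threshold; fallback = lowest CPU. */
static bool cbs_demo_pass(int cpu) { return cpu > 2; }
static int cbs_demo_lowest(uint64_t m)
{
    for (int cpu = 0; cpu < 64; cpu++)
        if (m & (1ULL << cpu))
            return cpu;
    return -1;
}
```

With cpumask 0x3F (CPUs 0-5) the filter removes CPUs 0-2 and the fallback elects CPU 3; with cpumask 0x07 no CPU survives, so the fallback runs on the full mask and elects CPU 0.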
Configuration
constraint <load|sessions|bw|pps> <instant|ewma|slope> <threshold>
fallback-algorithm <algo>
The parser enforces mode compatibility: sessions only supports instant mode because session count changes discretely and has no EWMA or history ring.
Scenario 1: Capacity protection
The simplest CBS use case. Exclude overloaded CPUs, let LC distribute sessions among the healthy ones.
cpu-sched-group protected
cpumask 0-7
algorithm cbs
constraint load ewma 0.8
fallback-algorithm lc
!
The EWMA constraint at 0.8 leaves 20% headroom for burst absorption. EWMA mode avoids false exclusions on transient spikes: a CPU that briefly touches 95% on a burst but sustains 60% stays eligible.
At election time, suppose the 8 CPUs have smoothed loads of 0.45, 0.62, 0.71, 0.83, 0.55, 0.91, 0.38, 0.77. The constraint excludes CPUs 3 (0.83) and 5 (0.91). LC picks from the remaining 6 CPUs based on session count.
Scenario 2: Trend-aware gating
Detect CPUs absorbing a traffic surge before their absolute load crosses a threshold.
cpu-sched-group trend-aware
cpumask 0-7
algorithm cbs
constraint load slope 0.03
constraint bw ewma 5000000000
fallback-algorithm wsc
metric-weight load 2.0
metric-weight bw 1.0
metric-weight pps 0.5
metric-weight sessions 0.0
window 50
!
The load slope constraint uses a window of 50 samples (10 seconds at 200ms per load sample). A threshold of 0.03 means "exclude any CPU whose load is increasing by more than 3% per sample over the window." This catches CPUs actively absorbing a burst, even if their absolute load is still moderate. A CPU sitting at 50% load but rising steeply is a worse placement target than one at 65% with a flat trend.
The bandwidth EWMA constraint at 5 Gbps protects against NIC queue saturation. The WSC fallback, with load weighted 2x and sessions disabled, then picks the best candidate among the survivors.
Scenario 3: Mixed SLA tiers
Isolate premium subscribers from best-effort traffic at the CPU level.
cpu-sched-group premium
cpumask 0-7
algorithm cbs
constraint load ewma 0.7
constraint bw ewma 3000000000
fallback-algorithm wsc
metric-weight load 2.0
metric-weight sessions 1.0
metric-weight bw 1.0
metric-weight pps 0.5
!
cpu-sched-group best-effort
cpumask 8-15
algorithm wsc
!
The premium group runs CBS with two constraints. The load EWMA threshold is tighter (0.7 versus 0.8 in scenario 1) because premium traffic needs more headroom for latency-sensitive processing. The bandwidth EWMA constraint at 3 Gbps adds protection against NIC queue saturation before the load metric catches up.
The best-effort group runs plain WSC on a separate CPU set (8-15). No constraints, no hard limits. Since best-effort traffic tolerates higher latency, the WSC scoring alone provides adequate balancing.
Binding these to APNs creates full SLA differentiation:
access-point-name enterprise
cpu-sched premium
!
access-point-name consumer
cpu-sched best-effort
!
Enterprise subscribers land on CPUs 0-7 with tight capacity protection. Consumer subscribers land on CPUs 8-15 with best-effort balancing. The two traffic classes are fully isolated at the CPU level.
Scenario 4: IoT session gating
For NB-IoT or LTE-M deployments where the dominant bottleneck is session count rather than throughput.
cpu-sched-group iot
cpumask 0-7
algorithm cbs
constraint sessions instant 50000
fallback-algorithm ewma
ewma-alpha 0.15
!
IoT devices typically establish a PFCP session, send a few hundred bytes of telemetry, then go idle for hours. Each session consumes memory and state-table entries but generates negligible traffic. The constraint at 50000 sessions in instant mode prevents any single CPU from accumulating enough state to cause hash collisions or memory pressure. On 8 CPUs, this allows up to 400000 total sessions.
The EWMA fallback with a low alpha (0.15) gives heavier smoothing. This matters for IoT workloads because the traffic pattern is bursty at the device level (a device wakes up, transmits, sleeps) but smooth in aggregate when thousands of devices are staggered.
Scenario 5: CDN cache miss storm
Detect bandwidth ramp-ups caused by cache invalidation events at a mobile CDN edge.
cpu-sched-group cdn-edge
cpumask 0-15
algorithm cbs
constraint bw slope 200000000
constraint bw ewma 6000000000
constraint load ewma 0.85
fallback-algorithm wsc
metric-weight bw 3.0
metric-weight load 1.0
metric-weight pps 0.5
metric-weight sessions 0.0
window 10
!
The bandwidth slope constraint is the core of this configuration. With a window of 10 ethtool samples (30 seconds), a threshold of 200 Mbps per sample means "exclude any CPU whose bandwidth is increasing by more than 200 Mbps every 3 seconds." A cache miss storm starts gradually. The first few seconds show a gentle bandwidth rise as initial requests arrive. EWMA catches it only after the smoothed value crosses 6 Gbps, by which time the CPU is already congested. The slope constraint catches the trend 15-20 seconds earlier, when bandwidth is still at 3-4 Gbps but climbing at 200 Mbps per tick.
Window tuning
The window parameter controls the slope observation horizon. Its effect differs between
metrics because they sample at different rates:
| Window | Load coverage | BW/PPS coverage |
|---|---|---|
| 5 | 1 second | 15 seconds |
| 10 | 2 seconds | 30 seconds |
| 25 | 5 seconds (default) | 75 seconds |
| 50 | 10 seconds | 2.5 minutes |
| 100 | 20 seconds | 5 minutes |
For groups using slope constraints on both load and traffic metrics, the window is a compromise. A window of 15 covers 3 seconds of load history (short enough to catch spikes) and 45 seconds of traffic history (long enough to detect ramps). If the deployment needs very different windows for load and traffic, splitting into two groups (each with its own window) is a better approach.
Putting It All Together
This section combines all the features into a complete multi-tier UPF deployment. The server handles three traffic classes on dedicated CPU pools, each with a scheduling policy tuned to its workload.
! --- Range partitions ---
range-partition teid-main
type teid ipv4
split 0x00000000/0 count 16
!
range-partition ipv4-enterprise
type ipv4
split 10.0.0.0/12 count 8
!
range-partition ipv4-consumer
type ipv4
split 10.16.0.0/12 count 8
!
range-partition ipv6-main
type ipv6
split 2001:db8::/46 count 16
!
! --- Flow steering policies ---
flow-steering-policy fs-upstream
queue-id 0-15
queue-id bind range-partition teid-main
!
flow-steering-policy fs-downstream-enterprise
queue-id 0-7
queue-id bind range-partition ipv4-enterprise
!
flow-steering-policy fs-downstream-consumer
queue-id 8-15
queue-id bind range-partition ipv4-consumer
!
! --- Scheduling groups ---
cpu-sched-group premium
cpumask 0-7
algorithm cbs
constraint load ewma 0.7
constraint bw ewma 4000000000
fallback-algorithm wsc
metric-weight load 2.0
metric-weight sessions 1.0
metric-weight bw 1.5
metric-weight pps 0.5
window 25
cpumask bind range-partition teid-main
cpumask bind range-partition ipv4-enterprise
cpumask bind range-partition ipv6-main
!
cpu-sched-group standard
cpumask 8-15
algorithm wsc
metric-weight load 1.0
metric-weight bw 1.0
metric-weight pps 1.0
metric-weight sessions 1.0
cpumask bind range-partition teid-main
cpumask bind range-partition ipv4-consumer
cpumask bind range-partition ipv6-main
!
interface p0
flow-steering-policy fs-upstream
flow-steering-policy fs-downstream-enterprise
flow-steering-policy fs-downstream-consumer
!
! --- PFCP router and APNs ---
access-point-name enterprise
cpu-sched premium
range-partition ipv4-enterprise
!
access-point-name consumer
! inherits from pfcp-router
!
pfcp-router main
cpu-sched standard
range-partition teid-main
range-partition ipv4-consumer
range-partition ipv6-main
!
The TEID partition splits the full 32-bit space into 16 parts covering all 16 data-path CPUs. Both scheduling groups bind this same partition, but each group only uses its own slice: the premium group maps cpumask 0-7 to partitions 0-7, while the standard group maps cpumask 8-15 to partitions 8-15. There is no overlap.
The IPv4 pools are separate. Enterprise subscribers get addresses from 10.0.0.0/12, consumer
subscribers from 10.16.0.0/12. Each pool is split into 8 partitions matching its CPU group.
The premium group runs CBS with two constraints: EWMA load at 0.7 and EWMA bandwidth at 4 Gbps. These thresholds leave generous headroom for enterprise sessions. The WSC fallback weights load 2x because latency correlates with CPU utilization, sessions 1x because enterprise sessions are long-lived, and bandwidth 1.5x to track throughput. Any CPU that exceeds 70% smoothed load or is pushing more than 4 Gbps gets filtered out. The survivors enter WSC for final ranking.
The standard group runs plain WSC with equal weights. No hard limits, no filtering. Consumer traffic tolerates higher latency and occasional congestion, so the multi-metric scoring alone provides adequate balancing.
Enterprise subscribers land on CPUs 0-7, consumer subscribers on CPUs 8-15. Each class is isolated at every level: different CPUs, different TEID sub-ranges, different IP pools, different NIC queues, different scheduling policies. A consumer traffic surge cannot affect enterprise session quality because the two never share a processing core.
Debug and Observability
Each scheduling group has a debug toggle that can be enabled at runtime with
debug cpu-sched <group>. When the flag is on, every election decision is logged with the
elected CPU, its current session count, and its configured weight:
cpu-sched: group=premium algo=cbs elected=cpu2 (sessions=24 weight=100)
The trace confirms which group made the decision, which algorithm ran, and which CPU won.
For a CBS group, this line is the outcome of the constraint-filter plus fallback pipeline,
so CPU 2 is guaranteed to have passed every configured constraint. The companion
show cpu-sched <group> output gives the full per-CPU metric table used to reach that
decision, so operators can cross-check the election against live load, smoothed values, and
session counts.
Together with show range-partition and show interface <name> flow-steering, these
commands provide complete visibility into every layer of the session placement pipeline.