# Memory counters and events

Perfetto can collect a number of memory events and counters on Android and
Linux. The data comes from kernel interfaces, both ftrace and /proc, and is of
two types: counters polled at regular intervals and events pushed by the kernel
into the ftrace buffer.

## Per-process polled counters

The process stats data source polls `/proc/<pid>/status` and
`/proc/<pid>/oom_score_adj` at user-defined intervals.

See [`man 5 proc`][man-proc] for their semantics.
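
As a rough illustration of what a single poll involves, the Python sketch below
parses the text of `/proc/<pid>/status` and extracts the memory fields. The
mapping from `status` keys to `mem.*` counter names is an assumption made for
this example, and `parse_status_kb` and `read_status_kb` are hypothetical
helpers, not part of Perfetto.

```python
# Assumed mapping from /proc/<pid>/status keys to mem.* counter names.
# It is illustrative only and not taken from the Perfetto sources.
STATUS_KEY_TO_COUNTER = {
    "VmSize": "mem.virt",
    "VmRSS": "mem.rss",
    "RssAnon": "mem.rss.anon",
    "RssFile": "mem.rss.file",
    "VmSwap": "mem.swap",
    "VmHWM": "mem.rss.watermark",
}


def parse_status_kb(status_text):
    """Return {counter_name: value_in_kB} for the memory fields found."""
    counters = {}
    for line in status_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep or key not in STATUS_KEY_TO_COUNTER:
            continue
        fields = rest.split()
        # Memory fields look like "VmRSS:     85592 kB".
        if len(fields) == 2 and fields[1] == "kB" and fields[0].isdigit():
            counters[STATUS_KEY_TO_COUNTER[key]] = int(fields[0])
    return counters


def read_status_kb(pid):
    """Read and parse /proc/<pid>/status on a live Linux or Android system."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status_kb(f.read())
```

Calling `read_status_kb` in a loop with a sleep between iterations would
approximate what a poller does, at the cost of one file read per process per
interval.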

### UI

![](/docs/images/proc_stat.png "UI showing trace data collected by process stats pollers")

### SQL

```sql
select c.ts, t.name as counter_name, c.value / 1024 as value_kb, p.name as proc_name, p.pid
from counter as c left join process_counter_track as t on c.track_id = t.id
left join process as p using (upid)
where t.name like 'mem.%'
```

ts | counter_name | value_kb | proc_name | pid
---|--------------|----------|-----------|----
261187015027350 | mem.virt | 1326464 | com.android.vending | 28815
261187015027350 | mem.rss | 85592 | com.android.vending | 28815
261187015027350 | mem.rss.anon | 36948 | com.android.vending | 28815
261187015027350 | mem.rss.file | 46560 | com.android.vending | 28815
261187015027350 | mem.swap | 6908 | com.android.vending | 28815
261187015027350 | mem.rss.watermark | 102856 | com.android.vending | 28815
261187090251420 | mem.virt | 1326464 | com.android.vending | 28815

### TraceConfig

To collect process stat counters every X ms, set `proc_stats_poll_ms = X` in
your process stats config. X must be greater than 100 ms to avoid excessive CPU
usage. Details about the specific counters being collected can be found in the
[ProcessStats reference](/docs/reference/trace-packet-proto.autogen#ProcessStats).

```protobuf
data_sources: {
    config {
        name: "linux.process_stats"
        process_stats_config {
            scan_all_processes_on_start: true
            proc_stats_poll_ms: 1000
        }
    }
}
```

## Per-process memory events (ftrace)

### rss_stat

Recent versions of the Linux kernel can emit ftrace events when the
Resident Set Size (RSS) mm counters change. This is the same counter available
in `/proc/<pid>/status` as `VmRSS`. The main advantage of this event is that,
being pushed by the kernel on every change, it can detect very short memory
usage bursts that would otherwise go unnoticed when polling /proc counters.

Memory usage peaks of hundreds of MB can have a dramatically negative impact on
Android, even if they last only a few ms, as they can cause mass low memory
kills to reclaim memory.
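
The difference between polling and pushed events can be made concrete with a
toy model. The Python sketch below is not Perfetto code: it takes a
hypothetical list of `(ts_ms, rss_kb)` changes and compares the peak that a
poller with a given period would observe against the peak present in the event
stream itself.

```python
def peak_seen_by_polling(samples, poll_period_ms, start_ms=0):
    """Largest value a periodic poller would observe.

    samples: list of (ts_ms, rss_kb) sorted by ts_ms. Each value is assumed
    to stay in effect until the next sample, as in an event stream.
    """
    if not samples:
        return None
    end_ms = samples[-1][0]
    peak = None
    idx = 0
    t = start_ms
    while t <= end_ms:
        # Advance to the last sample at or before the poll time.
        while idx + 1 < len(samples) and samples[idx + 1][0] <= t:
            idx += 1
        if samples[idx][0] <= t:
            value = samples[idx][1]
            peak = value if peak is None else max(peak, value)
        t += poll_period_ms
    return peak


def peak_seen_by_events(samples):
    """Largest value present in the event stream itself."""
    return max((value for _, value in samples), default=None)


# A 5 ms burst of about 600 MB between two 1000 ms polls.
burst = [(0, 100_000), (1203, 600_000), (1208, 100_000), (3000, 100_000)]
```

With `burst`, a poller running every 1000 ms only ever sees 100000 kB, while
the event stream contains the 600000 kB peak.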

The kernel feature that supports this was introduced in the Linux kernel
in [b3d1411b6] and later improved by [e4dcad20]. Both are available upstream
since Linux v5.5-rc1. The patch has been backported to several Google Pixel
kernels running Android 10 (Q).

[b3d1411b6]: https://github.com/torvalds/linux/commit/b3d1411b6726ea6930222f8f12587d89762477c6
[e4dcad20]: https://github.com/torvalds/linux/commit/e4dcad204d3a281be6f8573e0a82648a4ad84e69

### mm_event

`mm_event` is an ftrace event that captures statistics about key memory events
(a subset of the ones exposed by `/proc/vmstat`). Unlike RSS-stat counter
updates, mm events are extremely high volume and tracing them individually would
be infeasible. `mm_event` instead reports only periodic histograms in the trace,
significantly reducing the overhead.

`mm_event` is available only on some Google Pixel kernels running Android 10 (Q)
and beyond.

When `mm_event` is enabled, the following mm event types are recorded:

* mem.mm.min_flt: Minor page faults
* mem.mm.maj_flt: Major page faults
* mem.mm.swp_flt: Page faults served by swapcache
* mem.mm.read_io: Read page faults backed by I/O
* mem.mm.compaction: Memory compaction events
* mem.mm.reclaim: Memory reclaim events

For each event type, the event records:

* count: how many times the event happened since the previous event.
* avg_lat: the average latency (the duration of the mm event) recorded since
  the previous event.
* max_lat: the highest latency recorded since the previous event.
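
Because each record only covers the interval since the previous one, totals
over a longer window have to be rebuilt by combining records. A minimal Python
sketch, assuming the records have already been extracted as
`(event_type, count, max_lat)` tuples (a hypothetical shape chosen for this
example; latency units are whatever the trace reports):

```python
from collections import defaultdict


def summarize_mm_events(records):
    """Combine periodic mm_event records into per-type totals.

    records: iterable of (event_type, count, max_lat) tuples.
    Returns {event_type: {"count": total_count, "max_lat": worst_latency}}.
    """
    summary = defaultdict(lambda: {"count": 0, "max_lat": 0})
    for event_type, count, max_lat in records:
        entry = summary[event_type]
        # Counts are per-interval, so they add up across records.
        entry["count"] += count
        # The worst latency over the window is the max of the per-interval maxima.
        entry["max_lat"] = max(entry["max_lat"], max_lat)
    return dict(summary)
```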

### UI

![rss_stat and mm_event](/docs/images/rss_stat_and_mm_event.png)

### SQL

At the SQL level, these events are imported and exposed in the same way as
the corresponding polled events. This makes it possible to collect both types
of events (pushed and polled) and treat them uniformly in queries and scripts.

```sql
select c.ts, c.value, t.name as counter_name, p.name as proc_name, p.pid
from counter as c left join process_counter_track as t on c.track_id = t.id
left join process as p using (upid)
where t.name like 'mem.%'
```

ts | value | counter_name | proc_name | pid
---|-------|--------------|-----------|----
777227867975055 | 18358272 | mem.rss.anon | com.google.android.apps.safetyhub | 31386
777227865995315 | 5 | mem.mm.min_flt.count | com.google.android.apps.safetyhub | 31386
777227865995315 | 8 | mem.mm.min_flt.max_lat | com.google.android.apps.safetyhub | 31386
777227865995315 | 4 | mem.mm.min_flt.avg_lat | com.google.android.apps.safetyhub | 31386
777227865998023 | 3 | mem.mm.swp_flt.count | com.google.android.apps.safetyhub | 31386

### TraceConfig

```protobuf
data_sources: {
    config {
        name: "linux.ftrace"
        ftrace_config {
            ftrace_events: "kmem/rss_stat"
            ftrace_events: "mm_event/mm_event_record"
        }
    }
}

# This is for getting Thread<>Process associations and full process names.
data_sources: {
    config {
        name: "linux.process_stats"
    }
}
```

## System-wide polled counters

This data source allows periodic polling of system data from:

- `/proc/stat`
- `/proc/vmstat`
- `/proc/meminfo`

See [`man 5 proc`][man-proc] for their semantics.

### UI

![System Memory Counters](/docs/images/sys_stat_counters.png
"Example of system memory counters in the UI")

The polling period and specific counters to include in the trace can be set in the trace config.

### SQL

```sql
select c.ts, t.name, c.value / 1024 as value_kb
from counter as c left join counter_track as t on c.track_id = t.id
```

ts | name | value_kb
---|------|---------
775177736769834 | MemAvailable | 1708956
775177736769834 | Buffers | 6208
775177736769834 | Cached | 1352960
775177736769834 | SwapCached | 8232
775177736769834 | Active | 1021108
775177736769834 | Inactive(file) | 351496

### TraceConfig

The set of supported counters is available in the
[TraceConfig reference](/docs/reference/trace-config-proto.autogen#SysStatsConfig):

```protobuf
data_sources: {
    config {
        name: "linux.sys_stats"
        sys_stats_config {
            meminfo_period_ms: 1000
            meminfo_counters: MEMINFO_MEM_TOTAL
            meminfo_counters: MEMINFO_MEM_FREE
            meminfo_counters: MEMINFO_MEM_AVAILABLE

            vmstat_period_ms: 1000
            vmstat_counters: VMSTAT_NR_FREE_PAGES
            vmstat_counters: VMSTAT_NR_ALLOC_BATCH
            vmstat_counters: VMSTAT_NR_INACTIVE_ANON
            vmstat_counters: VMSTAT_NR_ACTIVE_ANON

            stat_period_ms: 2500
            stat_counters: STAT_CPU_TIMES
            stat_counters: STAT_FORK_COUNT
        }
    }
}
```
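
When the list of counters changes often, the config fragment can be generated
rather than edited by hand. The Python sketch below is one possible way to do
that. It only emits the field names that appear in the config above, and
`render_sys_stats_config` is a hypothetical helper, not a Perfetto tool. The
prefix check is a simple sanity test and does not verify that a counter
actually exists in the reference.

```python
def render_sys_stats_config(meminfo_counters=(), vmstat_counters=(),
                            period_ms=1000):
    """Render a linux.sys_stats data source block as text-format protobuf."""
    for name in meminfo_counters:
        if not name.startswith("MEMINFO_"):
            raise ValueError(f"not a meminfo counter: {name}")
    for name in vmstat_counters:
        if not name.startswith("VMSTAT_"):
            raise ValueError(f"not a vmstat counter: {name}")

    indent = " " * 12
    lines = [
        "data_sources: {",
        "    config {",
        '        name: "linux.sys_stats"',
        "        sys_stats_config {",
    ]
    if meminfo_counters:
        lines.append(f"{indent}meminfo_period_ms: {period_ms}")
        lines += [f"{indent}meminfo_counters: {c}" for c in meminfo_counters]
    if vmstat_counters:
        lines.append(f"{indent}vmstat_period_ms: {period_ms}")
        lines += [f"{indent}vmstat_counters: {c}" for c in vmstat_counters]
    lines += ["        }", "    }", "}"]
    return "\n".join(lines)
```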



## Low-memory Kills (LMK)

### Background

The Android framework kills apps and services, especially background ones, to
make room for newly opened apps when memory is needed. These are known as low
memory kills (LMK).

Note that LMKs are not always the symptom of a performance problem. The rule of
thumb is that the severity (as in: user-perceived impact) is proportional to
the state of the app being killed. The app state can be derived in a trace from
the OOM adjustment score.

An LMK of a foreground app or service is typically a big concern. This happens
when the app that the user was using disappears under their fingers, or their
favorite music player service suddenly stops playing music.

An LMK of a cached app or service, instead, is frequently business as usual and
in most cases won't be noticed by the end user until they try to go back to
the app, which will then cold-start.

The situation in between these extremes is more nuanced. LMKs of cached
apps/services can still be problematic if they happen in storms (i.e. most
processes get LMK-ed in a short time frame), and are often the symptom of some
component of the system causing memory spikes.
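
What counts as a storm is a judgment call. As a starting point, the Python
sketch below groups kill timestamps into bursts and reports the ones above a
size threshold. Both `window_ns` and `min_kills` are arbitrary defaults chosen
for illustration, and `find_lmk_storms` is a hypothetical helper.

```python
def find_lmk_storms(kill_ts_ns, window_ns=1_000_000_000, min_kills=5):
    """Return (start_ts, end_ts, kill_count) for each burst of kills.

    A burst is a maximal run of kills in which each kill follows the
    previous one by at most window_ns. It is reported only if it contains
    at least min_kills kills.
    """
    storms = []
    run = []
    for ts in sorted(kill_ts_ns):
        if run and ts - run[-1] > window_ns:
            # Gap too large: close the current run and start a new one.
            if len(run) >= min_kills:
                storms.append((run[0], run[-1], len(run)))
            run = []
        run.append(ts)
    if len(run) >= min_kills:
        storms.append((run[0], run[-1], len(run)))
    return storms
```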

### lowmemorykiller vs lmkd

#### In-kernel lowmemorykiller driver
In Android, LMK used to be handled by an ad-hoc kernel driver,
Linux's [drivers/staging/android/lowmemorykiller.c](https://github.com/torvalds/linux/blob/v3.8/drivers/staging/android/lowmemorykiller.c).
This driver used to emit the ftrace event `lowmemorykiller/lowmemory_kill`
in the trace.

#### Userspace lmkd

Android 9 introduced a userspace native daemon that took over the LMK
responsibility: `lmkd`. Not all devices running Android 9 will
necessarily use `lmkd` as the ultimate choice of in-kernel vs userspace is
up to the phone manufacturer, their kernel version and kernel config.

On Google Pixel phones, `lmkd`-side killing has been used since Pixel 2 running
Android 9.

See https://source.android.com/devices/tech/perf/lmkd for details.

`lmkd` emits a userspace atrace counter event called `kill_one_process`.

#### Android LMK vs Linux oomkiller

LMKs on Android, whether from the old in-kernel `lowmemorykiller` or the newer
`lmkd`, use a completely different mechanism than the standard
[Linux kernel's OOM Killer](https://linux-mm.org/OOM_Killer).
Perfetto at the moment supports only Android LMK events (both in-kernel and
userspace) and does not support tracing of Linux kernel OOM Killer events.
Linux OOM Killer events are still theoretically possible on Android but
extremely unlikely to happen. If they do happen, they are more likely the
symptom of a misconfigured BSP.

### UI

Newer userspace LMKs are available in the UI under the `lmkd` track
in the form of a counter. The counter value is the PID of the killed process
(in the example below, PID=27985).

![Userspace lmkd](/docs/images/lmk_lmkd.png "Example of a LMK caused by lmkd")

TODO: we are working on a better UI support for LMKs.

### SQL

Both newer `lmkd` and legacy kernel-driven lowmemorykiller events are normalized
at import time and available under the `mem.lmk` key in the `instants` table.

```sql
select ts, process.name, process.pid
from instants left join process on instants.ref = process.upid
where instants.name = 'mem.lmk'
```

| ts | name | pid |
|----|------|-----|
| 442206415875043 | roid.apps.turbo | 27324 |
| 442206446142234 | android.process.acore | 27683 |
| 442206462090204 | com.google.process.gapps | 28198 |

### TraceConfig

To enable tracing of low memory kills, add the following options to the trace
config:

```protobuf
data_sources: {
    config {
        name: "linux.ftrace"
        ftrace_config {
            # For old in-kernel events.
            ftrace_events: "lowmemorykiller/lowmemory_kill"

            # For new userspace lmkds.
            atrace_apps: "lmkd"

            # This is not strictly required but is useful to know the state
            # of the process (FG, cached, ...) before it got killed.
            ftrace_events: "oom/oom_score_adj_update"
        }
    }
}
```
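
To see why the `oom/oom_score_adj_update` events help, consider how a script
could recover the state of a process at the moment it was killed: for each
kill, take the most recent score update for that PID at or before the kill
timestamp. The Python sketch below assumes both event streams have already
been extracted as plain tuples; the tuple shapes and the `score_at_kill`
helper are hypothetical.

```python
import bisect


def score_at_kill(kills, score_updates):
    """Attach the last known oom_score_adj to each kill.

    kills: list of (ts, pid).
    score_updates: list of (ts, pid, oom_score_adj).
    Returns a list of (ts, pid, score), with score set to None when no
    update for that pid was seen at or before the kill.
    """
    by_pid = {}
    for ts, pid, score in sorted(score_updates):
        times, scores = by_pid.setdefault(pid, ([], []))
        times.append(ts)
        scores.append(score)

    result = []
    for ts, pid in kills:
        times, scores = by_pid.get(pid, ([], []))
        # Index of the first update strictly after the kill.
        i = bisect.bisect_right(times, ts)
        result.append((ts, pid, scores[i - 1] if i else None))
    return result
```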

## {#oom-adj} App states and OOM adjustment score

The Android app state can be inferred in a trace from the process
`oom_score_adj`. The mapping is not 1:1: there are more states than
`oom_score_adj` value groups, and the `oom_score_adj` range for cached
processes spans from 900 to 1000.

The mapping can be inferred from the
[ActivityManager's ProcessList sources](https://cs.android.com/android/platform/superproject/+/android10-release:frameworks/base/services/core/java/com/android/server/am/ProcessList.java;l=126):

```java
// This is a process only hosting activities that are not visible,
// so it can be killed without any disruption.
static final int CACHED_APP_MAX_ADJ = 999;
static final int CACHED_APP_MIN_ADJ = 900;

// This is the oom_adj level that we allow to die first. This cannot be equal to
// CACHED_APP_MAX_ADJ unless processes are actively being assigned an oom_score_adj of
// CACHED_APP_MAX_ADJ.
static final int CACHED_APP_LMK_FIRST_ADJ = 950;

// The B list of SERVICE_ADJ -- these are the old and decrepit
// services that aren't as shiny and interesting as the ones in the A list.
static final int SERVICE_B_ADJ = 800;

// This is the process of the previous application that the user was in.
// This process is kept above other things, because it is very common to
// switch back to the previous app.  This is important both for recent
// task switch (toggling between the two top recent apps) as well as normal
// UI flow such as clicking on a URI in the e-mail app to view in the browser,
// and then pressing back to return to e-mail.
static final int PREVIOUS_APP_ADJ = 700;

// This is a process holding the home application -- we want to try
// avoiding killing it, even if it would normally be in the background,
// because the user interacts with it so much.
static final int HOME_APP_ADJ = 600;

// This is a process holding an application service -- killing it will not
// have much of an impact as far as the user is concerned.
static final int SERVICE_ADJ = 500;

// This is a process with a heavy-weight application.  It is in the
// background, but we want to try to avoid killing it.  Value set in
// system/rootdir/init.rc on startup.
static final int HEAVY_WEIGHT_APP_ADJ = 400;

// This is a process currently hosting a backup operation.  Killing it
// is not entirely fatal but is generally a bad idea.
static final int BACKUP_APP_ADJ = 300;

// This is a process bound by the system (or other app) that's more important than services but
// not so perceptible that it affects the user immediately if killed.
static final int PERCEPTIBLE_LOW_APP_ADJ = 250;

// This is a process only hosting components that are perceptible to the
// user, and we really want to avoid killing them, but they are not
// immediately visible. An example is background music playback.
static final int PERCEPTIBLE_APP_ADJ = 200;

// This is a process only hosting activities that are visible to the
// user, so we'd prefer they don't disappear.
static final int VISIBLE_APP_ADJ = 100;

// This is a process that was recently TOP and moved to FGS. Continue to treat it almost
// like a foreground app for a while.
// @see TOP_TO_FGS_GRACE_PERIOD
static final int PERCEPTIBLE_RECENT_FOREGROUND_APP_ADJ = 50;

// This is the process running the current foreground app.  We'd really
// rather not kill it!
static final int FOREGROUND_APP_ADJ = 0;

// This is a process that the system or a persistent process has bound to,
// and indicated it is important.
static final int PERSISTENT_SERVICE_ADJ = -700;

// This is a system persistent process, such as telephony.  Definitely
// don't want to kill it, but doing so is not completely fatal.
static final int PERSISTENT_PROC_ADJ = -800;

// The system process runs at the default adjustment.
static final int SYSTEM_ADJ = -900;

// Special code for native processes that are not being managed by the system (so
// don't have an oom adj assigned by the system).
static final int NATIVE_ADJ = -1000;
```
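
For trace analysis scripts, the constants above can be turned into a lookup.
The Python sketch below is an approximation: the thresholds are copied from
the listing, the labels are informal names chosen for this example, and it
assumes that a score falling between two constants belongs to the group of the
lower one. `app_state_from_oom_score_adj` is a hypothetical helper.

```python
# (threshold, label) pairs in descending order. Thresholds come from the
# ProcessList constants above; labels are illustrative, not official names.
_ADJ_BUCKETS = [
    (900, "cached"),
    (800, "service B"),
    (700, "previous app"),
    (600, "home"),
    (500, "service"),
    (400, "heavy-weight"),
    (300, "backup"),
    (250, "perceptible (low)"),
    (200, "perceptible"),
    (100, "visible"),
    (50, "perceptible recent foreground"),
    (0, "foreground"),
    (-700, "persistent service"),
    (-800, "persistent process"),
    (-900, "system"),
    (-1000, "native"),
]


def app_state_from_oom_score_adj(score):
    """Return the label of the highest threshold that is <= score."""
    for threshold, label in _ADJ_BUCKETS:
        if score >= threshold:
            return label
    raise ValueError(f"oom_score_adj out of range: {score}")
```

For example, a process killed with a score of 905 would be reported as
`cached`, while a score of 0 would be reported as `foreground`.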

[man-proc]: https://manpages.debian.org/stretch/manpages/proc.5.en.html