1 files changed, 239 insertions, 0 deletions
diff --git a/doc/data-flow.md b/doc/data-flow.md
new file mode 100644
index 0000000..1b58cbf
--- /dev/null
+++ b/doc/data-flow.md
@@ -0,0 +1,239 @@
+RAPPOR Data Flow
+================
+
+This doc explains the simulation tools and data formats in the [RAPPOR
+repository](https://github.com/google/rappor).  We'll focus on the code, and
+describe the algorithm only informally.  For details, see the [paper][].
+
+Overview
+--------
+
+Start with this command:
+
+    $ ./demo.sh run
+
+It takes a minute or so to run.  The dependencies listed in the
+[README][] must be installed.  At the end, it will say:
+
+    Wrote _tmp/report.html.  Open this in your browser.
+
+It should look like [this][example].
+
+The following diagram shows what processes and files are involved in the demo.
+Ovals represent **processes**; rectangles represent **data**.  The dotted lines
+denote components that are involved in the simulation, but wouldn't be used in
+a "real" setting.
+
+In most configurations, reporting (in blue) is done by client machines, while
+analysis (in green) is done by a server.
+
+<img src="data-flow.png" alt="Diagram of RAPPOR Data Flow" />
+
+In the simulation, reporting consists of these steps:
+
+  1. Generate simulated input data with different distributions.
+  2. Obscure each value with the RAPPOR privacy-preserving reporting mechanism.
+
+Analysis consists of these steps:
+
+  1. Aggregate the reports by summing bits (i.e. make a counting Bloom filter)
+  2. Come up with candidate strings, and hash them in the same manner as the
+  client.
+  3. Using the reports, RAPPOR parameters, and candidate strings as input,
+  infer the distribution of true values.  We don't see the values themselves.
+  We plot the true and inferred distributions side by side for comparison.
+
+This process is described in detail below.
+
+1. Generating Simulated Input
+-----------------------------
+
+The `tests/gen_sim_input.py` tool generates CSV data, like this:
+
+<!-- TODO: a realistic data set would be nice? How could we generate one?  -->
+
+**exp.csv**
+
+    client, true_value
+    1,      v6
+    1,      v3
+    1,      v3
+    1,      v5
+    1,      v13
+    1,      v1
+    1,      v8
+    2,      v2
+    2,      v3
+    2,      v1
+    2,      v8
+    2,      v1
+    2,      v30
+    2,      v10
+    3,      v4
+    ...
+
+*(spaces added for clarity)*
+
+By default we generate 700,000 rows: 7 random values from `v1` to `v50` for
+each client.  These can be thought of as a variable being reported over time.
+
+We're simulating an environment where there are many RAPPOR clients, and a
+single server does the RAPPOR analysis on the accumulated data.
+
+The `client` is represented by an integer ID.  The `true_value` should **not**
+be sent over the network because we wish to preserve the client's privacy.
+
+
+2. RAPPOR Reporting
+-------------------
+
+The `tests/rappor_sim.py` tool uses the Python client library
+(`client/python/rappor.py`) to obscure the `v1` .. `vN` strings.  We want to
+infer the distribution of these strings over the entire population, but we
+don't want to know any individual values.
+
+After the RAPPOR transformation, we get another CSV file with 700,000 rows.
+Each client is assigned a cohort.
+
+**exp_out.csv**
+
+    client, cohort, rappor
+    1,      63,     1111101011110111
+    1,      15,     1110110011111100
+    1,      12,     0110101111100101
+    1,       0,     1111100111110111
+    1,       3,     1001110111110011
+    1,      14,     1011111010110011
+    1,      33,     0111010100101011
+    2,      40,     0011011010101001
+    2,      35,     1010110101110100
+    2,      58,     1110110110111110
+    2,      38,     0010001111001010
+    2,       5,     1110111011100101
+    2,      36,     0111010100111111
+    2,      39,     0101101000101101
+    3,      32,     0011100111111110
+    ...
+
+*(spaces added for clarity)*
+
+We also get a one-row CSV file that contains the RAPPOR parameters:
+
+**exp_params.csv**
+
+    k,h,m,p,q,f
+    16,2,64,0.5,0.75,0.5
+
+These are described in the [paper][]. The parameters that the clients use
+must be known to the server, or the decoding will fail.  In addition, all
+clients must use the same parameters for a given variable.
+
+The `rappor_sim.py` process also writes these files:
+
+- `exp_hist.csv`: The true histogram of input values.  This is used only for
+  comparison.  In the real world you obviously won't have this.
+- `exp_true_inputs.txt`: A list of the unique values reported, e.g. `v1` ..
+  `v50`.  You won't have this either, in general.  To use RAPPOR, you must
+  supply *candidate strings*, described below.
+
+3. Report Aggregation
+---------------------
+
+`sum_bits.py` takes the `exp_out.csv` output, and produces the "counts" file:
+
+**exp_counts.csv**
+
+    11116,6273,6433,6347,6385,6290,6621,6359,6747,6623,6321,6696,6282,6652,6368,6286,6222
+    10861,6365,6263,6170,6258,6107,6633,6171,6226,6123,6286,6254,6408,6182,6442,6195,6187
+    ...
+
+The file has 64 rows, because the simulation has 64 cohorts by default (`m =
+64`).  This parameter should be adjusted based on the number of unique true
+values expected.  <!-- TODO: more detail -->
+
+There are 17 columns.  The left-most column is the total number of reports in
+the cohort.  The remaining 16 columns correspond to the `k = 16` bits in the
+Bloom filter.  Each column contains the number of reports with that bit set
+to 1.
+
+So, in general, the "counts" file is a `(k+1) * m` matrix.
+
+4. Candidate Strings
+--------------------
+
+In the simulation, we assume that the analyst will come up with a *superset* of
+the candidate strings.  This is done in the `more-candidates` /
+`print-candidates` functions in `demo.sh`.
+
+You can also test what happens if you omit true strings from the candidates, by
+editing the invocation of `print-candidates` in `run-dist`:
+
+    # Example of omitting true values.  Generate candidates from
+    # exp_true_inputs.txt, omitting values v1 and v2.
+
+    print-candidates $dist 'v1|v2'  > _tmp/${dist}_candidates.txt
+
+In general, coming up with candidates is an application- or metric-specific
+process.
+
+The candidates are hashed by `hash_candidates.py` to create the "map" file,
+before being passed to R for analysis.
+
+**exp_map.csv**
+
+    v1,8,13,30,22,37,37,53,53,77,67,89,86,97,97,118,128,139,136,157,<truncated>
+    v10,13,2,25,28,42,45,58,60,68,66,91,89,108,102,113,125,130,131,<truncated>
+
+The map file has one row per candidate.  In this case, there are 60 rows: 
+50 for the true values and 10 for "fake" values, which make the candidates a
+superset of the true input.
+
+The left most column is the raw candidate string.  Then there are 128 more
+columns: for `m = 64` cohorts times `k = 2` hash functions in the Bloom filter.
+
+<!-- TODO: more detail about setting params?  Examples of coming up with
+candidate strings? -->
+
+5. RAPPOR Analysis
+------------------
+
+Once you have the `counts`, `params`, and `map` files, you can pass it to the
+`tests/analyze.R` tool, which is a small wrapper around the `analyze/R`
+library.
+
+You will get a plot of the true distribution vs. the distribution recovered
+with the RAPPOR privacy algorithm.
+
+[View the example output][example].
+
+You can change the simulation parameters and RAPPOR parameters via flags, and
+compare the resulting distributions.
+
+For example, if you expect more unique values from clients, you should also use
+more cohorts (i.e. raise `m`), to prevent hash function collisions from
+degrading the result quality.
+
+<!-- TODO: 
+     - how to change flags
+     - more detail on what the various parameters do
+     - association analysis
+     - basic RAPPOR
+     - longitudinal privacy
+-->
+
+Conclusion
+----------
+
+RAPPOR allows you infer statistics about populations while preserving the
+privacy of individual clients.  In this doc, we walked through a simple demo.
+However, there are other variations of RAPPOR and settings in which you can use
+RAPPOR, which we'll write more about.
+
+Feel free to send feedback on this doc to
+[rappor-discuss@googlegroups.com](https://groups.google.com/forum/#!forum/rappor-discuss).
+
+
+[README]: https://github.com/google/rappor/blob/master/README.md
+[paper]: http://arxiv.org/abs/1407.6981
+[example]: http://google.github.io/rappor/examples/report.html
+