Rinha Dataset Benchmark

This document explains the Rinha de Backend 2026 dataset benchmark for kiss-binary.

Why This Exists

The Rinha de Backend is a Brazilian backend performance challenge. The 2026 edition provides a labeled reference dataset with 3,000,000 vectors (14 dimensions each) used for fraud detection scoring.

This dataset is useful as a real-world benchmark for kiss-binary because:

This is a kiss-binary benchmark. It is not the Rinha fraud engine. The fraud detection logic, HTTP API, and runtime are separate concerns.

Dataset Files

The benchmark reads the official Rinha files (references.json.gz, mcc_risk.json, and normalization.json) from a local directory.

Only references.json.gz is required for the binary conversion benchmark.

How to Get the Dataset

The official Rinha de Backend 2026 dataset is published in the challenge's GitHub repository: https://github.com/zanfranceschi/rinha-de-backend-2026/tree/main/resources

Download the three files and place them in a directory:

mkdir -p /path/to/rinha-dataset
cd /path/to/rinha-dataset
curl -LO https://raw.githubusercontent.com/zanfranceschi/rinha-de-backend-2026/main/resources/references.json.gz
curl -LO https://raw.githubusercontent.com/zanfranceschi/rinha-de-backend-2026/main/resources/mcc_risk.json
curl -LO https://raw.githubusercontent.com/zanfranceschi/rinha-de-backend-2026/main/resources/normalization.json

Then set RINHA_DATASET_DIR to that directory. Only references.json.gz is required for the binary benchmark.
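
A test or tool that depends on the dataset can check the variable before doing any work. The following is a minimal sketch of such a check, not the project's actual gating code; the class and method names (RinhaDatasetLocator, locateReferences) are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper: names are illustrative and not part of kiss-binary.
final class RinhaDatasetLocator {

    /** Returns the path to references.json.gz if the directory is set and the file exists. */
    static Optional<Path> locateReferences(String datasetDir) {
        if (datasetDir == null || datasetDir.isBlank()) {
            return Optional.empty();
        }
        Path references = Path.of(datasetDir, "references.json.gz");
        return Files.isRegularFile(references) ? Optional.of(references) : Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> references = locateReferences(System.getenv("RINHA_DATASET_DIR"));
        System.out.println(references
                .map(p -> "Found " + p)
                .orElse("RINHA_DATASET_DIR is not set or references.json.gz is missing"));
    }
}
```

Passing the directory as a parameter instead of reading the environment inside the method keeps the check easy to exercise from a unit test.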

The dataset contains:

Each vector record in references.json.gz has the format:

{ "vector": [0.01, 0.0833, 0.05, 0.8261, 0.1667, -1, -1, 0.0432, 0.25, 0, 1, 0, 0.2, 0.0416], "label": "legit" }
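
To make the record shape concrete, here is a small sketch that pulls the 14 components and the label out of one record string using only the JDK. It is an illustration, not the converter's parser: a real conversion of 3,000,000 records would more plausibly use a streaming JSON parser, and the class name RinhaRecordSketch is hypothetical.

```java
// Hypothetical sketch: extracts fields from a single record string without a JSON library.
final class RinhaRecordSketch {
    static final int DIMENSIONS = 14;

    /** Parses the numbers between the first '[' and the following ']'. */
    static float[] parseVector(String record) {
        int open = record.indexOf('[');
        int close = open < 0 ? -1 : record.indexOf(']', open);
        if (open < 0 || close < 0) {
            throw new IllegalArgumentException("record has no vector array");
        }
        String[] parts = record.substring(open + 1, close).split(",");
        if (parts.length != DIMENSIONS) {
            throw new IllegalArgumentException("expected " + DIMENSIONS + " components, got " + parts.length);
        }
        float[] vector = new float[DIMENSIONS];
        for (int i = 0; i < DIMENSIONS; i++) {
            vector[i] = Float.parseFloat(parts[i].trim());
        }
        return vector;
    }

    /** Returns the string value that follows the "label" key. */
    static String parseLabel(String record) {
        int key = record.indexOf("\"label\"");
        if (key < 0) {
            throw new IllegalArgumentException("record has no label");
        }
        int colon = record.indexOf(':', key);
        int start = record.indexOf('"', colon) + 1;
        int end = record.indexOf('"', start);
        return record.substring(start, end);
    }

    public static void main(String[] args) {
        String record = "{ \"vector\": [0.01, 0.0833, 0.05, 0.8261, 0.1667, -1, -1, 0.0432, "
                + "0.25, 0, 1, 0, 0.2, 0.0416], \"label\": \"legit\" }";
        System.out.println(parseVector(record).length + " components, label=" + parseLabel(record));
    }
}
```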

Binary Format (.kbin)

The conversion produces a compact binary file with this layout:

Offset                  Size                  Field
------                  ----                  -----
0                       4                     magic "KBRN"
4                       4                     version (1)
8                       4                     logical_dimensions (14)
12                      4                     physical_dimensions (16, padded for alignment)
16                      4                     vector_count
20                      4                     label_word_count (ceil(vector_count / 64))
24                      4                     reserved_1 (0)
28                      4                     reserved_2 (0)
32                      vector_count * 32     vectors: short[16] per vector (14 data + 2 zero padding)
32 + vector_count * 32  label_word_count * 8  labels: long[] bitset (set bit = fraud)
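
The 32-byte header can be written and read back with a plain ByteBuffer. The sketch below assumes the seven fields after the magic are 32-bit integers and takes the byte order as a parameter, because the layout above does not state it; the class KbinHeaderSketch and its methods are hypothetical and are not the kiss-binary API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of the header layout; byte order is an assumption supplied by the caller.
final class KbinHeaderSketch {
    static final int HEADER_BYTES = 32;
    static final byte[] MAGIC = "KBRN".getBytes(StandardCharsets.US_ASCII);

    /** Holder for the header fields that carry information (reserved words are dropped). */
    static final class Header {
        final int version;
        final int logicalDimensions;
        final int physicalDimensions;
        final int vectorCount;
        final int labelWordCount;

        Header(int version, int logicalDimensions, int physicalDimensions,
               int vectorCount, int labelWordCount) {
            this.version = version;
            this.logicalDimensions = logicalDimensions;
            this.physicalDimensions = physicalDimensions;
            this.vectorCount = vectorCount;
            this.labelWordCount = labelWordCount;
        }
    }

    static ByteBuffer encode(int vectorCount, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES).order(order);
        buf.put(MAGIC);                      // offset 0: magic
        buf.putInt(1);                       // offset 4: version
        buf.putInt(14);                      // offset 8: logical_dimensions
        buf.putInt(16);                      // offset 12: physical_dimensions
        buf.putInt(vectorCount);             // offset 16: vector_count
        buf.putInt((vectorCount + 63) / 64); // offset 20: ceil(vector_count / 64)
        buf.putInt(0);                       // offset 24: reserved_1
        buf.putInt(0);                       // offset 28: reserved_2
        buf.flip();
        return buf;
    }

    static Header decode(ByteBuffer source, ByteOrder order) {
        ByteBuffer buf = source.duplicate().order(order); // leave the caller's position untouched
        if (buf.remaining() < HEADER_BYTES) {
            throw new IllegalArgumentException("buffer shorter than header");
        }
        byte[] magic = new byte[MAGIC.length];
        buf.get(magic);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IllegalArgumentException("bad magic");
        }
        int version = buf.getInt();
        if (version != 1) {
            throw new IllegalArgumentException("unsupported version " + version);
        }
        return new Header(version, buf.getInt(), buf.getInt(), buf.getInt(), buf.getInt());
    }

    public static void main(String[] args) {
        Header h = decode(encode(3_000_000, ByteOrder.LITTLE_ENDIAN), ByteOrder.LITTLE_ENDIAN);
        System.out.println(h.vectorCount + " vectors, " + h.labelWordCount + " label words");
    }
}
```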

For 3,000,000 vectors the file is 96,375,032 bytes, about 96.4 MB (96.0 MB vectors + 0.4 MB labels + 32 B header).
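
The size follows directly from the layout: 3,000,000 vectors of 32 bytes give 96,000,000 bytes (about 91.6 MiB), and ceil(3,000,000 / 64) = 46,875 label words of 8 bytes give 375,000 bytes. A short sketch of that arithmetic, with a hypothetical class name, is:

```java
// Hypothetical sketch: offsets and total size derived from the layout table.
final class KbinLayoutSketch {
    static final long HEADER_BYTES = 32;
    static final long VECTOR_BYTES = 16 * Short.BYTES; // 16 physical dimensions, 2 bytes each

    static long labelWordCount(long vectorCount) {
        return (vectorCount + 63) / 64; // integer ceil(vectorCount / 64)
    }

    static long labelsOffset(long vectorCount) {
        return HEADER_BYTES + vectorCount * VECTOR_BYTES;
    }

    static long fileSize(long vectorCount) {
        return labelsOffset(vectorCount) + labelWordCount(vectorCount) * Long.BYTES;
    }

    public static void main(String[] args) {
        long n = 3_000_000;
        System.out.println("labels start at byte " + labelsOffset(n));
        System.out.println("file size in bytes: " + fileSize(n));
    }
}
```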

Running the Tests

Synthetic tests (always run, no dataset required)

mvn -B clean verify

This runs RinhaSyntheticDatasetTest with 1,000 synthetic vectors. It validates:

Full dataset tests (requires RINHA_DATASET_DIR)

RINHA_DATASET_DIR=/path/to/rinha-dataset mvn -B -P rinha-benchmark clean verify

This additionally runs RinhaFullDatasetTest with the real 3M vector dataset. It:

JMH benchmarks

RINHA_DATASET_DIR=/path/to/rinha-dataset mvn -B -P rinha-benchmark,benchmarks clean package
java -jar target/benchmarks.jar '.*RinhaBinaryBenchmark' -rf json -rff benchmark-results/rinha/jmh-rinha-results.json

JMH benchmarks measure:

  1. Sequential read with BinaryReader
  2. Sequential read with MappedBinaryReader
  3. Sequential read with heap ByteBuffer (baseline)
  4. Random vector access with MappedBinaryReader
  5. Random vector access with heap ByteBuffer (baseline)
  6. Header validation cost (kiss-binary vs ByteBuffer)
  7. Label bitset scan (kiss-binary vs ByteBuffer)
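
For orientation, the heap ByteBuffer baselines (items 3, 5, and 7) could look roughly like the sketch below. It is not the code in RinhaBinaryBenchmark: JMH annotations are omitted so that it runs on its own, the byte order is assumed to be little-endian, and vector i is assumed to map to bit (i % 64) of label word (i / 64). All names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of heap ByteBuffer access over the .kbin layout (no JMH harness).
final class HeapBufferBaselineSketch {
    static final int HEADER_BYTES = 32;
    static final int LOGICAL_DIMENSIONS = 14;
    static final int VECTOR_BYTES = 16 * Short.BYTES;

    /** Random access: component d of vector v via an absolute read. */
    static short component(ByteBuffer file, int v, int d) {
        return file.getShort(HEADER_BYTES + v * VECTOR_BYTES + d * Short.BYTES);
    }

    /** Sequential read: sums the 14 logical components of every vector, skipping padding. */
    static long sumAllComponents(ByteBuffer file, int vectorCount) {
        long sum = 0;
        for (int v = 0; v < vectorCount; v++) {
            for (int d = 0; d < LOGICAL_DIMENSIONS; d++) {
                sum += component(file, v, d);
            }
        }
        return sum;
    }

    /** Label scan: counts set bits (fraud) one 64-bit word at a time. */
    static int countFraud(ByteBuffer file, int vectorCount) {
        int labelsOffset = HEADER_BYTES + vectorCount * VECTOR_BYTES;
        int words = (vectorCount + 63) / 64;
        int count = 0;
        for (int w = 0; w < words; w++) {
            count += Long.bitCount(file.getLong(labelsOffset + w * Long.BYTES));
        }
        return count;
    }

    /** Builds an in-memory sample: component d of vector v is v + d, odd vectors are fraud. */
    static ByteBuffer buildSample(int vectorCount) {
        int words = (vectorCount + 63) / 64;
        int labelsOffset = HEADER_BYTES + vectorCount * VECTOR_BYTES;
        ByteBuffer file = ByteBuffer.allocate(labelsOffset + words * Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN); // assumed byte order
        long[] labels = new long[words];
        for (int v = 0; v < vectorCount; v++) {
            for (int d = 0; d < LOGICAL_DIMENSIONS; d++) {
                file.putShort(HEADER_BYTES + v * VECTOR_BYTES + d * Short.BYTES, (short) (v + d));
            }
            if (v % 2 == 1) {
                labels[v / 64] |= 1L << (v % 64);
            }
        }
        for (int w = 0; w < words; w++) {
            file.putLong(labelsOffset + w * Long.BYTES, labels[w]);
        }
        return file;
    }

    public static void main(String[] args) {
        ByteBuffer file = buildSample(100);
        System.out.println("sum=" + sumAllComponents(file, 100) + " fraud=" + countFraud(file, 100));
    }
}
```

In a real JMH benchmark the results of such loops would be returned or passed to a Blackhole so the JIT cannot eliminate the reads.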

Generated Files

Reading the Benchmark Report

The benchmark report follows a specific structure. Key sections:

What This Is Not

See Also