PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models

1Stanford University, 2Kumo AI, 3SAP
PluRel Framework Overview

PluRel synthesizes diverse multi-table relational databases using Structural Causal Models, enabling scaling laws for Relational Foundation Models.

Abstract

Relational Foundation Models (RFMs) facilitate data-driven decision-making by learning from complex multi-table databases. However, the diverse relational databases needed to train such models are rarely public due to privacy constraints.

We introduce PluRel, a framework to synthesize multi-table relational databases through three stages: modeling schemas with directed graphs, inter-table connectivity with bipartite graphs, and feature distributions via conditional causal mechanisms.

We demonstrate that RFM pretraining loss exhibits power-law scaling with both the number of synthetic databases and the number of pretraining tokens. We show that scaling synthetic databases improves generalization to real databases and provides strong foundation models for continued pretraining on real data, positioning synthetic data scaling as a promising approach for advancing relational foundation models.

Method

PluRel generates synthetic relational databases through three sequential stages, modeling databases at the schema level (table structure), connectivity level (row-level relationships), and feature level (cell values).

Three-Stage Pipeline

Stage 1: Schema → DAGs define table structure and relationships. Stage 2: Connectivity → HSBMs populate foreign key links. Stage 3: Features → SCMs generate cell values with temporal patterns.

Stage 1: Schema Generation via Directed Graphs

Database schemas are represented as directed acyclic graphs (DAGs), where nodes correspond to tables and edges represent inter-table relationships. A topological ordering of the DAG determines the table generation sequence: initial tables are synthesized independently, while subsequent tables are generated conditionally on their parent tables. Tables are categorized as entity tables (out-degree ≥ 1) or activity tables (remaining nodes). The number of rows, feature columns, and graph topology are all configurable:

from plurel import Config, DatabaseParams, Choices

config = Config(
    database_params=DatabaseParams(
        num_tables_choices=Choices(kind="range", value=[5, 10])  # sample 5–10 tables per database
    ),
    schema_file="path/to/schema.sql",  # optional: generate from SQL schema
    cache_dir="~/.cache/plurel",       # optional: cache generated databases
)
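
To make the DAG view concrete, here is a minimal, self-contained sketch (using networkx, with made-up table counts and parent choices; it is not PluRel's internal sampler) of how a random table DAG yields a generation order and the entity/activity split from the out-degree rule:

# Minimal sketch (not PluRel's internal sampler): sample a random table DAG,
# derive the generation order, and split tables by the out-degree rule.
import random
import networkx as nx

random.seed(0)
num_tables = 6                                 # made-up table count
G = nx.DiGraph()
G.add_nodes_from(range(num_tables))

# Add parent -> child edges only from earlier to later indices,
# so the graph is acyclic by construction.
for child in range(1, num_tables):
    for parent in random.sample(range(child), k=min(child, 2)):
        G.add_edge(parent, child)

order = list(nx.topological_sort(G))                    # table generation sequence
entity_tables = [n for n in G if G.out_degree(n) >= 1]  # referenced by other tables
activity_tables = [n for n in G if G.out_degree(n) == 0]
print(order, entity_tables, activity_tables)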

Stage 2: Foreign Key Generation via Bipartite Graphs

Row-level connectivity between table pairs is populated through primary-foreign key relationships. Each table contains feature columns, a primary key column indexing rows, and optional foreign key columns referencing parent table rows. A Hierarchical Stochastic Block Model (HSBM) controls row-level information locality, allowing rows to depend on many parent rows or a small subset, enabling flexible dependency modeling.
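
As a rough illustration of how block structure controls locality, the following sketch samples foreign keys from a single-level stochastic block model; the actual pipeline uses a hierarchical SBM, and the row counts, block counts, and probabilities below are illustrative only:

# Simplified, single-level sketch of block-structured foreign key sampling.
# PluRel uses a hierarchical SBM; the numbers here are placeholders.
import numpy as np

rng = np.random.default_rng(0)
num_parent, num_child, num_blocks = 100, 500, 4

parent_block = rng.integers(num_blocks, size=num_parent)  # block of each parent row
child_block = rng.integers(num_blocks, size=num_child)    # block of each child row

p_in, p_out = 0.9, 0.1  # how strongly a child row prefers parents in its own block
foreign_keys = np.empty(num_child, dtype=int)
for i in range(num_child):
    # weight parent rows by whether they share the child's block, then sample one
    weights = np.where(parent_block == child_block[i], p_in, p_out)
    foreign_keys[i] = rng.choice(num_parent, p=weights / weights.sum())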

Stage 3: Feature Generation via Structural Causal Models

Each table is associated with its own SCM defined by a causal graph encoding dependencies between variables. Feature columns correspond to a subset of SCM nodes, supporting numeric, categorical, and boolean types. Tables with foreign keys condition on feature nodes from parent table SCMs.

Temporal correlations are modeled through exogenous inputs combining trend (power-law), cycle (sinusoidal), and fluctuation (random noise) components. Node mechanisms use a projection-reconstruction design: predecessor values and parent table features project into a shared latent space via MLPs, aggregate with exogenous inputs, then reconstruct to the target data type.
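
The sketch below is a single-column caricature of this mechanism, assuming placeholder coefficients, dimensions, and random weights rather than PluRel's actual parameterization; it only shows how trend, cycle, and fluctuation combine with a projection-reconstruction step:

# Illustrative sketch of the exogenous input and projection-reconstruction
# mechanism described above. All coefficients and weights are placeholders.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1, 201, dtype=float)            # time steps for one generated column

trend = 0.5 * t ** 0.3                        # power-law trend
cycle = 0.8 * np.sin(2 * np.pi * t / 24.0)    # sinusoidal cycle
fluct = 0.1 * rng.standard_normal(t.shape)    # random fluctuation
exogenous = trend + cycle + fluct

# Projection-reconstruction flavour: predecessor values are projected into a
# shared latent space, aggregated with the exogenous input, then mapped back
# to the target data type (here a numeric column).
latent_dim = 8
predecessors = rng.standard_normal((t.size, 3))            # 3 upstream SCM nodes
W_proj = rng.standard_normal((3, latent_dim)) / np.sqrt(3)
W_recon = rng.standard_normal((latent_dim, 1)) / np.sqrt(latent_dim)

latent = np.tanh(predecessors @ W_proj)                    # project
latent = latent + exogenous[:, None] / latent_dim          # aggregate with exogenous input
column = (latent @ W_recon).squeeze(-1)                    # reconstruct numeric values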

Usage

Synthesizing a complete relational database requires only a seed and a configuration object. The resulting dataset is fully compatible with RelBench:

from plurel import SyntheticDataset, Config

# create relbench compatible dataset
dataset = SyntheticDataset(seed=0, config=Config())

# create database which can be cached via relbench APIs
db = dataset.make_db()

For large-scale generation, databases can be synthesized in parallel:

pixi run python scripts/synthetic_gen.py \
    --seed_offset 0 \
    --num_dbs 1000 \
    --num_proc 16 \
    --preprocess

Key Results

Scaling Laws for Data Diversity and Size

We investigate scaling along two axes: the number of synthetic relational databases N (diversity) and the number of pretraining tokens S (size). We find that the validation loss exhibits power-law scaling along both dimensions:

$$L(N) = A_N \cdot N^{-\alpha_N} + C_N \qquad\qquad L(S) = A_S \cdot S^{-\alpha_S} + C_S$$

We conduct a comprehensive grid of experiments across (N, S) ∈ {8, 16, 32, 64, 128, 256, 512, 1024} × {0.5B, 1B, 2B, 4B, 8B, 16B, 32B} tokens. Both N and S must be scaled simultaneously to optimize loss; scaling one dimension alone produces non-monotonic U-shaped curves indicating underfitting or overfitting.
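
For reference, such a saturating power law can be fit directly to measured validation losses with scipy; the loss values in the sketch below are placeholders, not the reported results:

# Fit the saturating power law L(N) = A * N**(-alpha) + C to validation losses.
# The loss values below are placeholders, not numbers from the paper.
import numpy as np
from scipy.optimize import curve_fit

def power_law(N, A, alpha, C):
    return A * N ** (-alpha) + C

N = np.array([8, 16, 32, 64, 128, 256, 512, 1024], dtype=float)
loss = np.array([2.10, 1.95, 1.84, 1.76, 1.70, 1.66, 1.63, 1.61])  # hypothetical

(A, alpha, C), _ = curve_fit(power_law, N, loss, p0=(1.0, 0.5, 1.5))
print(f"A={A:.3f}, alpha={alpha:.3f}, C={C:.3f}")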

Key Takeaway

RFM pretraining loss follows power-law scaling with both data diversity (N) and size (S), mirroring scaling laws observed in LLMs. Both dimensions must be scaled jointly for optimal performance.

Scaling Law Plot

Generalization to Real Datasets

We evaluate whether synthetic pretraining benefits transfer to real-world relational databases from RelBench. We compute the masked token prediction loss on the validation splits of all 18 RelBench tasks using the same synthetic scaling configurations.

Low-diversity settings (8–32 synthetic databases) scale poorly on RelBench: increasing pretraining data size tends to be suboptimal, with loss curves trending upward. This behavior is mitigated as the number of databases increases, and the benefits of scaling dataset size become evident at higher database counts. Despite the distribution mismatch between synthetic and real data, models show consistent zero-shot transfer capability.

Key Takeaway

Scaling synthetic data diversity is critical for generalization. With sufficient database diversity, synthetic pretraining enables zero-shot transfer to real-world RelBench tasks.

Generalization to real datasets

Continued Pretraining on Real Datasets

Synthetic pretraining creates effective base models for downstream tasks when combined with continued pretraining on real data. Using a leave-one-database-out protocol across six RelBench datasets, we show that synthetic + real pretraining consistently outperforms real-data-only pretraining.

Synthetic data alone is insufficient for robust zero-shot transfer, highlighting that continued pretraining on real data is critical for distribution alignment. The combination of synthetic and real pretraining yields particularly pronounced benefits for behavior-driven and continuous-valued prediction tasks.

Classification (AUROC %)

| Dataset | Task | Real only | Synthetic only | Synthetic + Real | Gain |
|---|---|---|---|---|---|
| rel-amazon | user-churn | 64.2 | 64.4 | 65.0 | +0.8 |
| rel-hm | user-churn | 67.4 | 63.7 | 66.0 | -1.4 |
| rel-stack | user-badge | 80.0 | 81.4 | 82.0 | +2.0 |
| rel-stack | user-engage | 78.9 | 82.4 | 86.2 | +7.4 |
| rel-amazon | item-churn | 67.6 | 71.0 | 72.5 | +4.9 |
| rel-avito | user-visits | 57.2 | 63.5 | 63.4 | +6.2 |
| rel-avito | user-clicks | 54.7 | 45.9 | 47.9 | -6.8 |
| rel-trial | study-out | 54.4 | 53.8 | 51.8 | -2.6 |
| rel-f1 | driver-dnf | 80.7 | 76.7 | 81.0 | +0.3 |
| rel-f1 | driver-top3 | 86.9 | 82.6 | 88.4 | +1.5 |
| Mean | | 69.2 | 68.5 | 70.4 | +1.2 |

Regression (R² %)

| Dataset | Task | Real only | Synthetic only | Synthetic + Real | Gain |
|---|---|---|---|---|---|
| rel-hm | item-sales | 16.0 | 4.4 | 20.0 | +4.0 |
| rel-amazon | user-ltv | 14.5 | 9.8 | 18.5 | +4.0 |
| rel-amazon | item-ltv | 35.3 | 10.7 | 40.5 | +5.2 |
| rel-stack | post-votes | 22.3 | 15.7 | 25.5 | +3.2 |
| rel-trial | site-succ | 33.7 | 38.3 | 38.6 | +5.0 |
| rel-trial | study-adv | 1.9 | -0.8 | 1.6 | -0.3 |
| rel-f1 | driver-pos | 54.3 | 41.3 | 55.5 | +1.2 |
| rel-avito | ad-ctr | 3.1 | 2.5 | 4.9 | +1.9 |
| Mean | | 22.6 | 15.2 | 25.7 | +3.0 |
Key Takeaway

Synthetic + real pretraining outperforms real-only pretraining, with +1.2% mean AUROC and +3.0% mean R² gains. Synthetic pretraining provides a strong initialization for continued learning on real data.

BibTeX

@article{kothapalli2026plurel,
  title={{PluRel:} Synthetic Data unlocks Scaling Laws for Relational Foundation Models},
  author={Kothapalli, Vignesh and Ranjan, Rishabh and Hudovernik, Valter and Dwivedi, Vijay Prakash and Hoffart, Johannes and Guestrin, Carlos and Leskovec, Jure},
  journal={arXiv preprint arXiv:2602.04029},
  year={2026}
}

If you use the architecture, training loop, or sampler code, please also cite the Relational Transformer paper:

@inproceedings{ranjan2025relationaltransformer,
    title={{Relational Transformer:} Toward Zero-Shot Foundation Models for Relational Data},
    author={Rishabh Ranjan and Valter Hudovernik and Mark Znidar and Charilaos Kanatsoulis and Roshan Upendra and Mahmoud Mohammadi and Joe Meyer and Tom Palczewski and Carlos Guestrin and Jure Leskovec},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026}
}