Real-world datasets come from Magellan (VLDB'16): 10 datasets in total, covering structured, textual and dirty versions. Synthetic datasets come from the SIGMOD Programming Contest 2022 (SIGMOD Record'23), including the 2 hidden datasets used in the contest evaluation and up to 18 datasets newly generated by us. We may use only a subset of the 18 newly generated datasets.

Magellan:

We choose: Amazon-Google (S, T, D), Walmart-Amazon (S, T, D), DBLP-Google (S, D), DBLP-ACM (S, D). (S: structured, T: textual and D: dirty)

SIGMOD Programming Contest:

Original two hidden datasets for evaluation: secret/1 and secret/2.
18 datasets of varying size and number of gold matches, generated by us and all listed under secret-new.

Competitors

Sparkly-Auto.
The winning blocker of the SIGMOD Programming Contest 2022, called "Champion Solution" in the results (only used on the "SIGMOD Programming Contest" datasets).

Metrics

Recall / Output Size.
Run time / Scalability.
F1 score. (For matcher only)
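
As a minimal sketch of how recall (Metric 1) and F1 (Metric 3) could be computed, assuming the blocker or matcher output and the gold matches are both available as sets of record-ID pairs. The function names and the pair representation are illustrative assumptions, not taken from any of the systems above:

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[int, int]


def canonical(pairs: Iterable[Pair]) -> Set[Pair]:
    """Order each pair so (a, b) and (b, a) count once (useful for self joins)."""
    return {(min(a, b), max(a, b)) for a, b in pairs}


def recall(candidates: Set[Pair], gold: Set[Pair]) -> float:
    """Fraction of gold matches that survive blocking."""
    if not gold:
        return 0.0
    return len(candidates & gold) / len(gold)


def f1_score(predicted: Set[Pair], gold: Set[Pair]) -> float:
    """Harmonic mean of precision and recall for the matcher's predicted matches."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    return 2 * true_positives / (len(predicted) + len(gold))


if __name__ == "__main__":
    gold = {(1, 2), (3, 4), (7, 8), (9, 10)}
    candidates = {(1, 2), (3, 4), (5, 6)}
    print(len(candidates), recall(candidates, gold))  # output size, recall
    print(f1_score(candidates, gold))
```

For the self-join datasets, both sets would be passed through `canonical` first so that pair order does not affect the counts; for the RS-join datasets the pairs are already ordered as (left ID, right ID).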

Plan

Metric 1 on all datasets (output size is the independent variable).
Metric 2 on the last 6 newly generated datasets in "SIGMOD-Programming Contest".
Orthogonal test: run TA TopK on Sparkly and TF-IDF TopK on the Similarity Join Blocker to investigate the differences in Metric 1. (This may only be run on synthetic datasets, since Sparkly already achieves excellent performance on real-world datasets.)
Metrics 1 and 3 on the matcher, taking interchangeable values into consideration.

Results

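In the tables for the Similarity Join Blocker and the Champion Solution, a column labeled K=10 means that the blocker's output size is capped at roughly the output size of Sparkly-Auto with K=10 (and likewise for K=20 and K=50). Each cell lists the output size followed by the recall.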
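
One way such an output-size cap could be applied is sketched below. It assumes the blocker emits a similarity score with every candidate pair and that the cap keeps the highest-scoring pairs; neither assumption is stated above, and `cap_output` and `budget` are hypothetical names:

```python
import heapq
from typing import Iterable, List, Tuple

ScoredPair = Tuple[int, int, float]  # (left ID, right ID, similarity score)


def cap_output(scored_pairs: Iterable[ScoredPair], budget: int) -> List[Tuple[int, int]]:
    """Keep at most `budget` pairs, preferring the highest similarity scores."""
    top = heapq.nlargest(budget, scored_pairs, key=lambda item: item[2])
    return [(left, right) for left, right, _ in top]


if __name__ == "__main__":
    scored = [(1, 2, 0.9), (1, 3, 0.2), (2, 3, 0.7), (4, 5, 0.5)]
    print(cap_output(scored, budget=2))
```

With this kind of cap, the budget would be set close to the Sparkly-Auto output size for the same K, so that recall is compared at a similar number of candidate pairs.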

Recall / Output Size on Magellan Structured Data

Sparkly-Auto

Dataset	K=10	K=20	K=50
amazon-google	32,240, 0.1323	64,434, 0.1615	160,668, 0.1830
dblp-acm	22,940, 0.9968	45,880, 0.9977	114,692, 0.9991
dblp-googlescholar	641,359, 0.9910	1,280,826, 0.9951	3,185,380, 0.9994
walmart-amazon	220,646, 0.9974	440,931, 1.0	1,095,779, 1.0

Similarity Join Blocker

Dataset	K=10	K=20	K=50
amazon-google	30,000, 0.9774 (0.9808)	60,000, 0.9843 (0.9908)	120,000, 0.9859 (150,000, 0.9946)
dblp-acm	20,000, 1.0 (0.996)	40,000, 1.0 (0.9978)	80,000, 1.0 (100,000, 0.9987)
dblp-googlescholar	600,000, 0.9942 (0.9908)	1,200,000, 0.9953 (0.9908)	2,400,000, 0.9955 (3,000,000, 0.9908)
walmart-amazon	200,000, 0.9809 (0.9913)	400,000, 0.9887 (0.9991)	800,000, 0.9939 (1,000,000, 1)

Recall / Output Size on Magellan Textual Data

Sparkly

Dataset	K=10	K=20	K=50
amazon-google	32,260, 0.9807	64,516, 0.9892	161,019, 0.9946
walmart-amazon	220,728, 0.9904	441,448, 0.9939	1,103,578, 0.9965

Similarity Join Blocker

Dataset	K=10	K=20	K=50
amazon-google	30,000, 0.9793 (0.99)	60,000, 0.9849 (0.9954)	120,000, 0.9851 (0.9969)
walmart-amazon	200,000, 0.9402 (0.9992)	400,000, 0.9459 (0.9939)	800,000, 0.9551 (0.9965)

Recall / Output Size on Magellan Dirty Data

Sparkly-Auto

Dataset	K=10	K=20	K=50
amazon-google	32,260, 0.99	64,511, 0.9953	161,090, 0.9976
dblp-acm	22,940, 0.9991	45,870, 1.0	114,602, 1.0
dblp-googlescholar	639,945, 0.9953	1,278,009, 0.9937	3,186,741, 0.9983
walmart-amazon	220,732, 1.0	441,459, 1.0	1,103,512, 1.0

Similarity Join Blocker

Dataset	K=10	K=20	K=50
amazon-google	30,000, 0.9587 (0.9838)	60,000, 0.9602 (0.9915)	120,000, 0.9602 (0.9969)
dblp-acm	20,000, 0.9974 (0.9978)	40,000, 0.9974 (0.9987)	80,000, 0.9979 (0.9991)
dblp-googlescholar	600,000, 0.9805 (0.9935)	1,200,000, 0.9809 (0.995)	2,400,000, 0.9810 (0.9957)
walmart-amazon	200,000, 0.9893	400,000, 0.9927	800,000, 0.9945

Recall / Output Size on SIGMOD Programming Contest Hidden Data

Sparkly-Auto

Dataset	K=10	K=20	K=50
secret/1	8,843,915, 0.6275	16,916,084, 0.6888	40,934,530, 0.7804
secret/2	8,823,829, 0.2549	16,916,234, 0.2751	40,814,094, 0.3130

Champion Solution

Dataset	K=10	K=20	K=50
secret/1	8,000,000, 0.7386	16,000,000, 0.7463	40,000,000, 0.7482
secret/2	8,000,000, 0.4007	16,000,000, 0.4240	40,000,000, 0.4453

Similarity Join Blocker

Dataset	K=10	K=20	K=50
secret/1	8,000,000, 0.7834	16,000,000, 0.8123	32,000,000, 0.8411
secret/2	8,000,000, 0.3531	16,000,000, 0.3824	32,000,000, 0.4150

Recall / Output Size on SIGMOD Programming Contest New Data

Sparkly-Auto

Dataset	K=10	K=20	K=50
secret/new/1	8,114,258, 0.5449	15,518,322, 0.6243	37,579,737, 0.7323
secret/new/2	8,531,622, 0.1701	16,467,383, 0.1909	39,739,353, 0.2297
secret/new/3	11,023,375, 0.7222	21,127,273, 0.7637	51,070,261, 0.8290
secret/new/4	10,983,557, 0.3609	21,146,213, 0.3764	51,120,840, 0.4055
secret/new/5	14,621,984, 0.7422	28,151,329, 0.7734	68,017,778, 0.8280
secret/new/6	14,693,455, 0.4300	28,362,519, 0.4431	68,426,150, 0.4689
secret/new/7	2,353,781, 0.5920	4,652,442, 0.6393	11,542,114, 0.7213
secret/new/8	2,264,471, 0.1051	4,421,496, 0.1242	10,735,996, 0.1667
secret/new/9	3,081,204, 0.6081	5,984,783, 0.6603	14,694,013, 0.7482
secret/new/10	2,988,633, 0.1494	5,776,893, 0.1696	13,950,824, 0.2125
secret/new/11	5,237,610, 0.5887	10,052,944, 0.6451	24,448,529, 0.7386
secret/new/12	5,169,850, 0.2188	9,929,842, 0.2403	23,945,011, 0.2826
secret/new/13	17,563,355, 0.7370	33,594,725, 0.7809
secret/new/14
secret/new/15
secret/new/16
secret/new/17
secret/new/18

Champion Solution

Dataset	K=10	K=20	K=50
secret/new/1	8,114,258, 0.6931	15,518,322, 0.6992	37,579,737, 0.7026
secret/new/2	8,531,622, 0.3650	16,467,383, 0.3879	39,739,353, 0.4093
secret/new/3	10,000,000, 0.7712	20,000,000, 0.7780	40,000,000, 0.7788
secret/new/4	10,000,000, 0.4525	20,000,000, 0.4710	40,000,000, 0.4854
secret/new/5	14,000,000, 0.7762	28,000,000, 0.7807	56,000,000, 0.7825
secret/new/6	14,000,000, 0.4745	28,000,000, 0.4906	56,000,000, 0.5024
secret/new/7	2,000,000, 0.7098	4,000,000, 0.7355	8,000,000, 0.7372
secret/new/8	2,000,000, 0.2485	4,000,000, 0.3008	8,000,000, 0.3340
secret/new/9	3,000,000, 0.7333	6,000,000, 0.7373	12,000,000, 0.7379
secret/new/10	3,000,000, 0.3290	6,000,000, 0.3631	12,000,000, 0.3868
secret/new/11
secret/new/12
secret/new/13
secret/new/14
secret/new/15
secret/new/16
secret/new/17
secret/new/18

Similarity Join Blocker

Dataset	K=10	K=20	K=50
secret/new/1
secret/new/2
secret/new/3
secret/new/4
secret/new/5
secret/new/6
secret/new/7
secret/new/8
secret/new/9
secret/new/10
secret/new/11
secret/new/12
secret/new/13
secret/new/14
secret/new/15
secret/new/16
secret/new/17
secret/new/18

Runtime / Scalability on SIGMOD Programming Contest New Data (last 6)

Similarity Join Blocker

Dataset	Elapsed Time
secret/new/11	N/A
secret/new/12	N/A
secret/new/13	N/A
secret/new/14	N/A
secret/new/15	N/A
secret/new/16	N/A

Orthogonal Test on SIGMOD Programming Contest Hidden Data

Sparkly + TA TopK

Dataset	K=10	K=20	K=50
secret/1
secret/2

Similarity Join Blocker + TF-IDF TopK

Dataset	K=10	K=20	K=50
secret/1
secret/2

Appendix

A. Dataset Statistics

RS Join

Amazon-Google: $1.3k \times 3.2k$, gold: $1.3k$.

Walmart-Amazon: $2.5k \times 22k$, gold: $1.1k$.

DBLP-Google: $2.6k \times 64k$, gold: $5.3k$.

DBLP-ACM: $2.6k \times 2.2k$, gold: $2.2k$.

Self Join

Secret/1 and Secret/2: $1.2m$.

Secret/new:

table_a: $1123166$, gold: $18314$
table_a: $1114120$, gold: $741826$
table_a: $1523166$, gold: $58890$
table_a: $1514120$, gold: $1798992$
table_a: $2023166$, gold: $119253$
table_a: $2014120$, gold: $3587374$
table_a: $323166$, gold: $137038$
table_a: $314120$, gold: $4338395$
table_a: $423166$, gold: $71572$
table_a: $414120$, gold: $2280462$
table_a: $723166$, gold: $40477$
table_a: $714120$, gold: $1237586$
table_a: $2423166$, gold: $37911$
table_a: $2414120$, gold: $1206102$
table_a: $6023166$, gold: $67501$
table_a: $6014120$, gold: $1858025$
table_a: $12023166$, gold: $117969$
table_a: $12014120$, gold: $2952282$