Real-world datasets come from Magellan (VLDB'16). There are 10 datasets in total, covering structured, textual, and dirty versions. Synthetic datasets come from the SIGMOD Programming Contest 2022 (SIGMOD Record'23), comprising the 2 hidden datasets used in contest evaluation and up to 18 datasets newly generated by us. We may use only a subset of the 18 newly generated datasets.
Magellan:
We choose Amazon-Google (S, T, D), Walmart-Amazon (S, T, D), DBLP-Google (S, D), and DBLP-ACM (S, D), where S = structured, T = textual, and D = dirty.
SIGMOD Programming Contest
- Original two hidden datasets for evaluation: secret/1 and secret/2.
- 18 datasets generated by us with varying sizes and numbers of gold matches, all listed under secret-new.
Competitors
- Sparkly-Auto.
- The winning blocker of SIGMOD Programming Contest 2022 (used only on the "SIGMOD-Programming Contest" datasets).
Metrics
- Recall / Output Size.
- Run time / Scalability.
- F1 score. (For matcher only)
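As a rough illustration of how Metrics 1 and 3 could be computed, here is a minimal sketch. It assumes candidate and gold matches are given as pairs of record ids. The function names (`normalize`, `blocker_metrics`, `f1_score`) and the `self_join` flag are hypothetical and not part of any tool named above.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def normalize(pairs: Iterable[Pair], self_join: bool = False) -> Set[Pair]:
    """Deduplicate pairs. For a self join, (a, b) and (b, a) count as one pair
    and trivial pairs (a, a) are dropped."""
    if self_join:
        return {tuple(sorted(p)) for p in pairs if p[0] != p[1]}
    return set(pairs)


def blocker_metrics(candidates: Iterable[Pair], gold: Iterable[Pair],
                    self_join: bool = False) -> Tuple[int, float]:
    """Return (output size, recall) of a blocker's candidate set."""
    cand = normalize(candidates, self_join)
    truth = normalize(gold, self_join)
    recall = len(cand & truth) / len(truth) if truth else 0.0
    return len(cand), recall


def f1_score(predicted: Iterable[Pair], gold: Iterable[Pair],
             self_join: bool = False) -> float:
    """Return the F1 score of a matcher's predicted matches."""
    pred = normalize(predicted, self_join)
    truth = normalize(gold, self_join)
    true_pos = len(pred & truth)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(truth)
    return 2 * precision * recall / (precision + recall)
```

Recall here is measured against the gold matches only, so it can be reported at any output size, which is what makes output size usable as the independent variable.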
Plan
- Metric 1 on all datasets (output size is the independent variable).
- Metric 2 on the last 6 newly generated datasets in "SIGMOD-Programming Contest".
- Orthogonal test: run TA TopK on Sparkly and TF-IDF TopK on Similarity Join Blocker to investigate the differences in Metric 1. (This may be run only on synthetic datasets, since Sparkly already achieves excellent performance on real-world datasets.)
- Metrics 1 and 3 on the matcher, taking interchangeable values into consideration.
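To make the "TF-IDF TopK" half of the orthogonal test concrete, the following is a generic sketch of top-K retrieval by TF-IDF score. It is an illustration under stated assumptions, not the scoring used by Sparkly or by Similarity Join Blocker: whitespace tokenization, the smoothed IDF `log(1 + N / df)`, and an unnormalized dot product are all assumptions, and the function names are hypothetical.

```python
import heapq
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase whitespace tokens; a real blocker may use n-grams.
    return text.lower().split()


def tfidf_topk(queries: Dict[str, str], index_docs: Dict[str, str],
               k: int) -> Dict[str, List[Tuple[str, float]]]:
    """For each query record, return up to k indexed records with the highest
    TF-IDF dot-product score, best first."""
    n_docs = len(index_docs)
    doc_tf = {doc_id: Counter(tokenize(text)) for doc_id, text in index_docs.items()}
    doc_freq = Counter(tok for tf in doc_tf.values() for tok in tf)
    idf = {tok: math.log(1 + n_docs / df) for tok, df in doc_freq.items()}

    # Inverted index: token -> [(doc_id, tf * idf)].
    postings = defaultdict(list)
    for doc_id, tf in doc_tf.items():
        for tok, count in tf.items():
            postings[tok].append((doc_id, count * idf[tok]))

    results = {}
    for query_id, text in queries.items():
        scores = defaultdict(float)
        for tok, q_count in Counter(tokenize(text)).items():
            # Tokens unseen in the index have no postings and contribute nothing.
            for doc_id, weight in postings.get(tok, ()):
                scores[doc_id] += q_count * idf[tok] * weight
        results[query_id] = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
    return results
```

With K candidates per query record, the output size grows roughly linearly in K, which matches the pattern in the Sparkly-Auto tables.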
Results
In the tables for Similarity Join Blocker and Champion Solution, the column K=10 means that the blocker's output size is capped at roughly the output size of Sparkly-Auto with K=10; the same applies to K=20 and K=50. Each cell lists the output size followed by the recall.
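One way such an output-size cap could be enforced is sketched below. The source does not say how the cap is applied, so this assumes the blocker emits scored pairs and that the globally highest-scoring pairs are kept; `cap_output` and the `(score, id_a, id_b)` layout are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

ScoredPair = Tuple[float, str, str]  # (similarity score, id in table A, id in table B)


def cap_output(scored_pairs: Iterable[ScoredPair], budget: int) -> List[Tuple[str, str]]:
    """Keep the `budget` highest-scoring pairs, best first.

    Assumption: a single global ranking by score; a per-record cap would
    instead bound the number of pairs kept for each record.
    """
    if budget <= 0:
        return []
    top = heapq.nlargest(budget, scored_pairs, key=lambda t: t[0])
    return [(id_a, id_b) for _, id_a, id_b in top]
```

The budgets in the tables are rounded rather than exact (for example, 30,000 against Sparkly-Auto's 32,240 on amazon-google), so the comparison is at approximately equal output size.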
Recall / Output Size on Magellan Structured Data
Sparkly-Auto
Dataset | K=10 | K=20 | K=50 |
amazon-google | 32,240, 0.1323 | 64,434, 0.1615 | 160,668, 0.1830 |
dblp-acm | 22,940, 0.9968 | 45,880, 0.9977 | 114,692, 0.9991 |
dblp-googlescholar | 641,359, 0.9910 | 1,280,826, 0.9951 | 3,185,380, 0.9994 |
walmart-amazon | 220,646, 0.9974 | 440,931, 1.0 | 1,095,779, 1.0 |
Similarity Join Blocker
Dataset | K=10 | K=20 | K=50 |
amazon-google | 30,000, 0.9774 (0.9808) | 60,000, 0.9843 (0.9908) | 120,000, 0.9859 (150,000, 0.9946) |
dblp-acm | 20,000, 1.0 (0.996) | 40,000, 1.0 (0.9978) | 80,000, 1.0 (100,000, 0.9987) |
dblp-googlescholar | 600,000, 0.9942 (0.9908) | 1,200,000, 0.9953 (0.9908) | 2,400,000, 0.9955 (3,000,000, 0.9908) |
walmart-amazon | 200,000, 0.9809 (0.9913) | 400,000, 0.9887 (0.9991) | 800,000, 0.9939 (1,000,000, 1) |
Recall / Output Size on Magellan Textual Data
Sparkly
Dataset | K=10 | K=20 | K=50 |
amazon-google | 32,260, 0.9807 | 64,516, 0.9892 | 161,019, 0.9946 |
walmart-amazon | 220,728, 0.9904 | 441,448, 0.9939 | 1,103,578, 0.9965 |
Similarity Join Blocker
Dataset | K=10 | K=20 | K=50 |
amazon-google | 30,000, 0.9793 (0.99) | 60,000, 0.9849 (0.9954) | 120,000, 0.9851 (0.9969) |
walmart-amazon | 200,000, 0.9402 (0.9992) | 400,000, 0.9459 (0.9939) | 800,000, 0.9551 (0.9965) |
Recall / Output Size on Magellan Dirty Data
Sparkly-Auto
Dataset | K=10 | K=20 | K=50 |
amazon-google | 32,260, 0.99 | 64,511, 0.9953 | 161,090, 0.9976 |
dblp-acm | 22,940, 0.9991 | 45,870, 1.0 | 114,602, 1.0 |
dblp-googlescholar | 639,945, 0.9953 | 1,278,009, 0.9937 | 3,186,741, 0.9983 |
walmart-amazon | 220,732, 1.0 | 441,459, 1.0 | 1,103,512, 1.0 |
Similarity Join Blocker
Dataset | K=10 | K=20 | K=50 |
amazon-google | 30,000, 0.9587 (0.9838) | 60,000, 0.9602 (0.9915) | 120,000, 0.9602 (0.9969) |
dblp-acm | 20,000, 0.9974 (0.9978) | 40,000, 0.9974 (0.9987) | 80,000, 0.9979 (0.9991) |
dblp-googlescholar | 600,000, 0.9805 (0.9935) | 1,200,000, 0.9809 (0.995) | 2,400,000, 0.9810 (0.9957) |
walmart-amazon | 200,000, 0.9893 | 400,000, 0.9927 | 800,000, 0.9945 |
Recall / Output Size on SIGMOD Programming Contest Hidden Data
Sparkly-Auto
Dataset | K=10 | K=20 | K=50 |
secret/1 | 8,843,915, 0.6275 | 16,916,084, 0.6888 | 40,934,530, 0.7804 |
secret/2 | 8,823,829, 0.2549 | 16,916,234, 0.2751 | 40,814,094, 0.3130 |
Champion Solution
Dataset | K=10 | K=20 | K=50 |
secret/1 | 8,000,000, 0.7386 | 16,000,000, 0.7463 | 40,000,000, 0.7482 |
secret/2 | 8,000,000, 0.4007 | 16,000,000, 0.4240 | 40,000,000, 0.4453 |
Similarity Join Blocker
Dataset | K=10 | K=20 | K=50 |
secret/1 | 8,000,000, 0.7834 | 16,000,000, 0.8123 | 32,000,000, 0.8411 |
secret/2 | 8,000,000, 0.3531 | 16,000,000, 0.3824 | 32,000,000, 0.4150 |
Recall / Output Size on SIGMOD Programming Contest New Data
Sparkly-Auto
Dataset | K=10 | K=20 | K=50 |
secret/new/1 | 8,114,258, 0.5449 | 15,518,322, 0.6243 | 37,579,737, 0.7323 |
secret/new/2 | 8,531,622, 0.1701 | 16,467,383, 0.1909 | 39,739,353, 0.2297 |
secret/new/3 | 11,023,375, 0.7222 | 21,127,273, 0.7637 | 51,070,261, 0.8290 |
secret/new/4 | 10,983,557, 0.3609 | 21,146,213, 0.3764 | 51,120,840, 0.4055 |
secret/new/5 | 14,621,984, 0.7422 | 28,151,329, 0.7734 | 68,017,778, 0.8280 |
secret/new/6 | 14,693,455, 0.4300 | 28,362,519, 0.4431 | 68,426,150, 0.4689 |
secret/new/7 | 2,353,781, 0.5920 | 4,652,442, 0.6393 | 11,542,114, 0.7213 |
secret/new/8 | 2,264,471, 0.1051 | 4,421,496, 0.1242 | 10,735,996, 0.1667 |
secret/new/9 | 3,081,204, 0.6081 | 5,984,783, 0.6603 | 14,694,013, 0.7482 |
secret/new/10 | 2,988,633, 0.1494 | 5,776,893, 0.1696 | 13,950,824, 0.2125 |
secret/new/11 | 5,237,610, 0.5887 | 10,052,944, 0.6451 | 24,448,529, 0.7386 |
secret/new/12 | 5,169,850, 0.2188 | 9,929,842, 0.2403 | 23,945,011, 0.2826 |
secret/new/13 | 17,563,355, 0.7370 | 33,594,725, 0.7809 | |
secret/new/14 | | | |
secret/new/15 | | | |
secret/new/16 | | | |
secret/new/17 | | | |
secret/new/18 | | | |
Champion Solution
Dataset | K=10 | K=20 | K=50 |
secret/new/1 | 8,114,258, 0.6931 | 15,518,322, 0.6992 | 37,579,737, 0.7026 |
secret/new/2 | 8,531,622, 0.3650 | 16,467,383, 0.3879 | 39,739,353, 0.4093 |
secret/new/3 | 10,000,000, 0.7712 | 20,000,000, 0.7780 | 40,000,000, 0.7788 |
secret/new/4 | 10,000,000, 0.4525 | 20,000,000, 0.4710 | 40,000,000, 0.4854 |
secret/new/5 | 14,000,000, 0.7762 | 28,000,000, 0.7807 | 56,000,000, 0.7825 |
secret/new/6 | 14,000,000, 0.4745 | 28,000,000, 0.4906 | 56,000,000, 0.5024 |
secret/new/7 | 2,000,000, 0.7098 | 4,000,000, 0.7355 | 8,000,000, 0.7372 |
secret/new/8 | 2,000,000, 0.2485 | 4,000,000, 0.3008 | 8,000,000, 0.3340 |
secret/new/9 | 3,000,000, 0.7333 | 6,000,000, 0.7373 | 12,000,000, 0.7379 |
secret/new/10 | 3,000,000, 0.3290 | 6,000,000, 0.3631 | 12,000,000, 0.3868 |
secret/new/11 | | | |
secret/new/12 | | | |
secret/new/13 | | | |
secret/new/14 | | | |
secret/new/15 | | | |
secret/new/16 | | | |
secret/new/17 | | | |
secret/new/18 | | | |
Similarity Join Blocker
Dataset | K=10 | K=20 | K=50 |
secret/new/1 | | | |
secret/new/2 | | | |
secret/new/3 | | | |
secret/new/4 | | | |
secret/new/5 | | | |
secret/new/6 | | | |
secret/new/7 | | | |
secret/new/8 | | | |
secret/new/9 | | | |
secret/new/10 | | | |
secret/new/11 | | | |
secret/new/12 | | | |
secret/new/13 | | | |
secret/new/14 | | | |
secret/new/15 | | | |
secret/new/16 | | | |
secret/new/17 | | | |
secret/new/18 | | | |
Runtime / Scalability on SIGMOD Programming Contest New Data (last 6)
Similarity Join Blocker
Dataset | Elapsed Time |
secret/new/11 | N/A |
secret/new/12 | N/A |
secret/new/13 | N/A |
secret/new/14 | N/A |
secret/new/15 | N/A |
secret/new/16 | N/A |
Orthogonal Test on SIGMOD Programming Contest Hidden Data
Sparkly + TA TopK
Dataset | K=10 | K=20 | K=50 |
secret/1 | | | |
secret/2 | | | |
Similarity Join Blocker + TF-IDF TopK
Dataset | K=10 | K=20 | K=50 |
secret/1 | | | |
secret/2 | | | |
Appendix
A. Dataset Statistics
RS Join
Amazon-Google: $1.3k \times 3.2k$, gold: $1.3k$.
Walmart-Amazon: $2.5k \times 22k$, gold: $1.1k$.
DBLP-Google: $2.6k \times 64k$, gold: $5.3k$.
DBLP-ACM: $2.6k \times 2.2k$, gold: $2.2k$.
Self Join
Secret/1 and Secret/2: $1.2m$.
Secret/new:
- table_a: $1123166$, gold: $18314$
- table_a: $1114120$, gold: $741826$
- table_a: $1523166$, gold: $58890$
- table_a: $1514120$, gold: $1798992$
- table_a: $2023166$, gold: $119253$
- table_a: $2014120$, gold: $3587374$
- table_a: $323166$, gold: $137038$
- table_a: $314120$, gold: $4338395$
- table_a: $423166$, gold: $71572$
- table_a: $414120$, gold: $2280462$
- table_a: $723166$, gold: $40477$
- table_a: $714120$, gold: $1237586$
- table_a: $2423166$, gold: $37911$
- table_a: $2414120$, gold: $1206102$
- table_a: $6023166$, gold: $67501$
- table_a: $6014120$, gold: $1858025$
- table_a: $12023166$, gold: $117969$
- table_a: $12014120$, gold: $2952282$
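To put the output sizes in the result tables in context, the sizes above can be turned into the size of the full pair space. The sketch below assumes an RS join compares every record of one table with every record of the other, and a self join compares each unordered pair of distinct records once. The function names are hypothetical, and the retained fraction is not one of the metrics listed in this plan.

```python
def rs_join_pairs(size_a: int, size_b: int) -> int:
    """Number of cross-table pairs in an RS join."""
    return size_a * size_b


def self_join_pairs(size: int) -> int:
    """Number of unordered pairs of distinct records in a self join."""
    return size * (size - 1) // 2


def retained_fraction(output_size: int, total_pairs: int) -> float:
    """Fraction of the full pair space that a blocker outputs."""
    return output_size / total_pairs
```

For example, Amazon-Google at roughly $1.3k \times 3.2k$ has about 4.16 million possible pairs, so an output of 32,240 pairs keeps under 1% of the pair space.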