Entity Matching by Similarity Join
 
Datasets:

The real-world datasets come from Magellan (VLDB'16): 10 datasets in total, covering structured, textual, and dirty versions. The synthetic datasets come from the SIGMOD Programming Contest 2022 (SIGMOD Record'23): the 2 hidden datasets used in the contest evaluation, plus up to 18 datasets newly generated by us. We may use only a subset of the 18 newly generated datasets.

Magellan:

We choose: Amazon-Google (S, T, D), Walmart-Amazon (S, T, D), DBLP-Google (S, D), DBLP-ACM (S, D). (S: structured, T: textual and D: dirty)

SIGMOD Programming Contest

  1. Original two hidden datasets for evaluation: secret/1 and secret/2.
  2. 18 datasets of varying size and number of gold matches, generated by us; all are listed under secret-new.

Competitors

  1. Sparkly-Auto.
  2. The winning blocker of the SIGMOD Programming Contest 2022. (Used only on the "SIGMOD Programming Contest" datasets.)

Metrics

  1. Recall / Output Size.
  2. Run time / Scalability.
  3. F1 score. (For matcher only)
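Metrics 1 and 3 can be computed directly from a set of candidate (or predicted) pairs and the gold matches. The following is a minimal sketch, not the evaluation code used for the results in this document; the function names and the `unordered` flag are assumptions made for illustration. Treating pairs as unordered suits the self-join datasets; for RS joins, where the two ids come from different tables, pass `unordered=False`.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[int, int]


def _canon(pairs: Iterable[Pair], unordered: bool) -> Set[Pair]:
    """Deduplicate pairs; optionally treat (a, b) and (b, a) as the same pair."""
    if unordered:
        return {(min(a, b), max(a, b)) for a, b in pairs}
    return set(pairs)


def blocker_recall(candidates: Iterable[Pair], gold: Iterable[Pair],
                   unordered: bool = True) -> float:
    """Fraction of gold matches that appear in the blocker's output."""
    cand, truth = _canon(candidates, unordered), _canon(gold, unordered)
    return len(cand & truth) / len(truth) if truth else 0.0


def output_size(candidates: Iterable[Pair], unordered: bool = True) -> int:
    """Number of distinct candidate pairs emitted by the blocker."""
    return len(_canon(candidates, unordered))


def matcher_f1(predicted: Iterable[Pair], gold: Iterable[Pair],
               unordered: bool = True) -> float:
    """Harmonic mean of precision and recall over predicted match pairs."""
    pred, truth = _canon(predicted, unordered), _canon(gold, unordered)
    true_positives = len(pred & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)
```

Recall is reported together with output size because a blocker can trivially reach recall 1.0 by emitting every pair; the two numbers only make sense side by side.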

Plan

  1. Metric 1 on all datasets. (Output size is the independent variable.)
  2. Metric 2 on the last 6 newly generated datasets in "SIGMOD Programming Contest".
  3. Orthogonal test: run TA TopK on Sparkly and TF-IDF TopK on the Similarity Join Blocker to investigate the differences in Metric 1. (This may be run only on the synthetic datasets, since Sparkly already achieves excellent performance on the real-world datasets.)
  4. Metrics 1 and 3 on the matcher, taking interchangeable values into consideration.
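To make the "TF-IDF TopK" step in item 3 concrete, here is a minimal pure-Python sketch of top-K blocking by TF-IDF cosine similarity. It is an illustration only, not Sparkly's or the Similarity Join Blocker's implementation; the whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions.

```python
import heapq
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase whitespace tokens; real blockers use richer analyzers.
    return text.lower().split()


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """L2-normalized TF-IDF vectors using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        vec = {t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
               for t, c in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def topk_blocking(table_a: List[str], table_b: List[str],
                  k: int) -> List[Tuple[int, int, float]]:
    """For each record in table_b, emit its k most similar records in table_a."""
    vecs = tfidf_vectors(table_a + table_b)
    vecs_a, vecs_b = vecs[:len(table_a)], vecs[len(table_a):]

    # Inverted index over table_a: token -> [(record id, weight)].
    index = defaultdict(list)
    for i, vec in enumerate(vecs_a):
        for tok, w in vec.items():
            index[tok].append((i, w))

    out = []
    for j, vec in enumerate(vecs_b):
        scores = defaultdict(float)
        for tok, w in vec.items():
            for i, w_a in index.get(tok, ()):
                scores[i] += w * w_a  # accumulate the cosine dot product
        for i, s in heapq.nlargest(k, scores.items(), key=lambda kv: kv[1]):
            out.append((i, j, s))
    return out
```

With this scheme the output size is at most K times the number of query records, which appears consistent with the Sparkly-Auto output sizes reported below (for example, roughly 32k pairs at K=10 against a table of about 3.2k records).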

Results

In the Similarity Join Blocker and Champion Solution tables, the column K=10 means that the blocker's output size is capped at roughly the output size of Sparkly-Auto with K=10 (likewise for K=20 and K=50). Each cell lists the output size followed by the recall.
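One simple way to hold a blocker to a fixed output budget is to keep only the highest-scoring pairs. The sketch below assumes each candidate pair carries a similarity score and that a global score cutoff is acceptable; whether the blockers compared here cap their output this way is not stated, and `cap_output` is a hypothetical helper.

```python
import heapq
from typing import Iterable, List, Tuple

ScoredPair = Tuple[int, int, float]  # (left id, right id, similarity score)


def cap_output(scored_pairs: Iterable[ScoredPair], budget: int) -> List[ScoredPair]:
    """Keep the `budget` highest-scoring pairs, best first."""
    if budget <= 0:
        return []
    return heapq.nlargest(budget, scored_pairs, key=lambda p: p[2])
```

For example, a budget of 30,000 would correspond to the K=10 column for amazon-google in the Similarity Join Blocker tables.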

Recall / Output Size on Magellan Structured Data

Sparkly-Auto

| Dataset | K=10 | K=20 | K=50 |
|---|---|---|---|
| amazon-google | 32,240, 0.1323 | 64,434, 0.1615 | 160,668, 0.1830 |
| dblp-acm | 22,940, 0.9968 | 45,880, 0.9977 | 114,692, 0.9991 |
| dblp-googlescholar | 641,359, 0.9910 | 1,280,826, 0.9951 | 3,185,380, 0.9994 |
| walmart-amazon | 220,646, 0.9974 | 440,931, 1.0 | 1,095,779, 1.0 |

Similarity Join Blocker

| Dataset | K=10 | K=20 | K=50 |
|---|---|---|---|
| amazon-google | 30,000, 0.9774 (0.9808) | 60,000, 0.9843 (0.9908) | 120,000, 0.9859 (150,000, 0.9946) |
| dblp-acm | 20,000, 1.0 (0.996) | 40,000, 1.0 (0.9978) | 80,000, 1.0 (100,000, 0.9987) |
| dblp-googlescholar | 600,000, 0.9942 (0.9908) | 1,200,000, 0.9953 (0.9908) | 2,400,000, 0.9955 (3,000,000, 0.9908) |
| walmart-amazon | 200,000, 0.9809 (0.9913) | 400,000, 0.9887 (0.9991) | 800,000, 0.9939 (1,000,000, 1) |

Recall / Output Size on Magellan Textual Data

Sparkly

| Dataset | K=10 | K=20 | K=50 |
|---|---|---|---|
| amazon-google | 32,260, 0.9807 | 64,516, 0.9892 | 161,019, 0.9946 |
| walmart-amazon | 220,728, 0.9904 | 441,448, 0.9939 | 1,103,578, 0.9965 |

Similarity Join Blocker

| Dataset | K=10 | K=20 | K=50 |
|---|---|---|---|
| amazon-google | 30,000, 0.9793 (0.99) | 60,000, 0.9849 (0.9954) | 120,000, 0.9851 (0.9969) |
| walmart-amazon | 200,000, 0.9402 (0.9992) | 400,000, 0.9459 (0.9939) | 800,000, 0.9551 (0.9965) |

Recall / Output Size on Magellan Dirty Data

Sparkly-Auto

| Dataset | K=10 | K=20 | K=50 |
|---|---|---|---|
| amazon-google | 32,260, 0.99 | 64,511, 0.9953 | 161,090, 0.9976 |
| dblp-acm | 22,940, 0.9991 | 45,870, 1.0 | 114,602, 1.0 |
| dblp-googlescholar | 639,945, 0.9953 | 1,278,009, 0.9937 | 3,186,741, 0.9983 |
| walmart-amazon | 220,732, 1.0 | 441,459, 1.0 | 1,103,512, 1.0 |

Similarity Join Blocker

| Dataset | K=10 | K=20 | K=50 |
|---|---|---|---|
| amazon-google | 30,000, 0.9587 (0.9838) | 60,000, 0.9602 (0.9915) | 120,000, 0.9602 (0.9969) |
| dblp-acm | 20,000, 0.9974 (0.9978) | 40,000, 0.9974 (0.9987) | 80,000, 0.9979 (0.9991) |
| dblp-googlescholar | 600,000, 0.9805 (0.9935) | 1,200,000, 0.9809 (0.995) | 2,400,000, 0.9810 (0.9957) |
| walmart-amazon | 200,000, 0.9893 | 400,000, 0.9927 | 800,000, 0.9945 |

Recall / Output Size on SIGMOD Programming Contest Hidden Data

Sparkly-Auto

| Dataset | K=10 | K=20 | K=50 |
|---|---|---|---|
| secret/1 | 8,843,915, 0.6275 | 16,916,084, 0.6888 | 40,934,530, 0.7804 |
| secret/2 | 8,823,829, 0.2549 | 16,916,234, 0.2751 | 40,814,094, 0.3130 |

Champion Solution

| Dataset | K=10 | K=20 | K=50 |
|---|---|---|---|
| secret/1 | 8,000,000, 0.7386 | 16,000,000, 0.7463 | 40,000,000, 0.7482 |
| secret/2 | 8,000,000, 0.4007 | 16,000,000, 0.4240 | 40,000,000, 0.4453 |

Similarity Join Blocker

| Dataset | K=10 | K=20 | K=50 |
|---|---|---|---|
| secret/1 | 8,000,000, 0.7834 | 16,000,000, 0.8123 | 32,000,000, 0.8411 |
| secret/2 | 8,000,000, 0.3531 | 16,000,000, 0.3824 | 32,000,000, 0.4150 |

Recall / Output Size on SIGMOD Programming Contest New Data

Sparkly-Auto

| Dataset | K=10 | K=20 | K=50 |
|---|---|---|---|
| secret/new/1 | 8,114,258, 0.5449 | 15,518,322, 0.6243 | 37,579,737, 0.7323 |
| secret/new/2 | 8,531,622, 0.1701 | 16,467,383, 0.1909 | 39,739,353, 0.2297 |
| secret/new/3 | 11,023,375, 0.7222 | 21,127,273, 0.7637 | 51,070,261, 0.8290 |
| secret/new/4 | 10,983,557, 0.3609 | 21,146,213, 0.3764 | 51,120,840, 0.4055 |
| secret/new/5 | 14,621,984, 0.7422 | 28,151,329, 0.7734 | 68,017,778, 0.8280 |
| secret/new/6 | 14,693,455, 0.4300 | 28,362,519, 0.4431 | 68,426,150, 0.4689 |
| secret/new/7 | 2,353,781, 0.5920 | 4,652,442, 0.6393 | 11,542,114, 0.7213 |
| secret/new/8 | 2,264,471, 0.1051 | 4,421,496, 0.1242 | 10,735,996, 0.1667 |
| secret/new/9 | 3,081,204, 0.6081 | 5,984,783, 0.6603 | 14,694,013, 0.7482 |
| secret/new/10 | 2,988,633, 0.1494 | 5,776,893, 0.1696 | 13,950,824, 0.2125 |
| secret/new/11 | 5,237,610, 0.5887 | 10,052,944, 0.6451 | 24,448,529, 0.7386 |
| secret/new/12 | 5,169,850, 0.2188 | 9,929,842, 0.2403 | 23,945,011, 0.2826 |
| secret/new/13 | 17,563,355, 0.7370 | 33,594,725, 0.7809 | |
| secret/new/14 | | | |
| secret/new/15 | | | |
| secret/new/16 | | | |
| secret/new/17 | | | |
| secret/new/18 | | | |

Champion Solution

| Dataset | K=10 | K=20 | K=50 |
|---|---|---|---|
| secret/new/1 | 8,114,258, 0.6931 | 15,518,322, 0.6992 | 37,579,737, 0.7026 |
| secret/new/2 | 8,531,622, 0.3650 | 16,467,383, 0.3879 | 39,739,353, 0.4093 |
| secret/new/3 | 10,000,000, 0.7712 | 20,000,000, 0.7780 | 40,000,000, 0.7788 |
| secret/new/4 | 10,000,000, 0.4525 | 20,000,000, 0.4710 | 40,000,000, 0.4854 |
| secret/new/5 | 14,000,000, 0.7762 | 28,000,000, 0.7807 | 56,000,000, 0.7825 |
| secret/new/6 | 14,000,000, 0.4745 | 28,000,000, 0.4906 | 56,000,000, 0.5024 |
| secret/new/7 | 2,000,000, 0.7098 | 4,000,000, 0.7355 | 8,000,000, 0.7372 |
| secret/new/8 | 2,000,000, 0.2485 | 4,000,000, 0.3008 | 8,000,000, 0.3340 |
| secret/new/9 | 3,000,000, 0.7333 | 6,000,000, 0.7373 | 12,000,000, 0.7379 |
| secret/new/10 | 3,000,000, 0.3290 | 6,000,000, 0.3631 | 12,000,000, 0.3868 |
| secret/new/11 | | | |
| secret/new/12 | | | |
| secret/new/13 | | | |
| secret/new/14 | | | |
| secret/new/15 | | | |
| secret/new/16 | | | |
| secret/new/17 | | | |
| secret/new/18 | | | |

Similarity Join Blocker

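| Dataset | K=10 | K=20 | K=50 |
|---|---|---|---|
| secret/new/1 | | | |
| secret/new/2 | | | |
| secret/new/3 | | | |
| secret/new/4 | | | |
| secret/new/5 | | | |
| secret/new/6 | | | |
| secret/new/7 | | | |
| secret/new/8 | | | |
| secret/new/9 | | | |
| secret/new/10 | | | |
| secret/new/11 | | | |
| secret/new/12 | | | |
| secret/new/13 | | | |
| secret/new/14 | | | |
| secret/new/15 | | | |
| secret/new/16 | | | |
| secret/new/17 | | | |
| secret/new/18 | | | |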

Runtime / Scalability on SIGMOD Programming Contest New Data (last 6)

Similarity Join Blocker

| Dataset | Elapsed Time |
|---|---|
| secret/new/11 | N/A |
| secret/new/12 | N/A |
| secret/new/13 | N/A |
| secret/new/14 | N/A |
| secret/new/15 | N/A |
| secret/new/16 | N/A |

Orthogonal Test on SIGMOD Programming Contest Hidden Data

Sparkly + TA TopK

| Dataset | K=10 | K=20 | K=50 |
|---|---|---|---|
| secret/1 | | | |
| secret/2 | | | |

Similarity Join Blocker + TF-IDF TopK

| Dataset | K=10 | K=20 | K=50 |
|---|---|---|---|
| secret/1 | | | |
| secret/2 | | | |

Appendix

A. Dataset Statistics

RS Join

Amazon-Google: $1.3k \times 3.2k$, gold: $1.3k$.

Walmart-Amazon: $2.5k \times 22k$, gold: $1.1k$.

DBLP-Google: $2.6k \times 64k$, gold: $5.3k$.

DBLP-ACM: $2.6k \times 2.2k$, gold: $2.2k$.

Self Join

Secret/1 and Secret/2: $1.2m$.

Secret/new:

  1. table_a: $1123166$, gold: $18314$
  2. table_a: $1114120$, gold: $741826$
  3. table_a: $1523166$, gold: $58890$
  4. table_a: $1514120$, gold: $1798992$
  5. table_a: $2023166$, gold: $119253$
  6. table_a: $2014120$, gold: $3587374$
  7. table_a: $323166$, gold: $137038$
  8. table_a: $314120$, gold: $4338395$
  9. table_a: $423166$, gold: $71572$
  10. table_a: $414120$, gold: $2280462$
  11. table_a: $723166$, gold: $40477$
  12. table_a: $714120$, gold: $1237586$
  13. table_a: $2423166$, gold: $37911$
  14. table_a: $2414120$, gold: $1206102$
  15. table_a: $6023166$, gold: $67501$
  16. table_a: $6014120$, gold: $1858025$
  17. table_a: $12023166$, gold: $117969$
  18. table_a: $12014120$, gold: $2952282$
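The sizes above determine how large the full comparison space is, which puts the blocker output sizes in context. The helpers below are a small illustrative sketch (the function names are hypothetical): an RS join compares every record of one table with every record of the other, while a self join compares each unordered pair of distinct records once.

```python
def rs_join_pairs(size_a: int, size_b: int) -> int:
    """Number of cross-table pairs in an RS join."""
    return size_a * size_b


def self_join_pairs(size: int) -> int:
    """Number of unordered pairs of distinct records in a self join."""
    return size * (size - 1) // 2


def retained_fraction(output_size: int, total_pairs: int) -> float:
    """Share of the full comparison space kept by a blocker."""
    return output_size / total_pairs
```

Using the rounded Amazon-Google sizes, `rs_join_pairs(1300, 3200)` gives 4,160,000 pairs, so an output of 30,000 candidates keeps about 0.7% of the comparison space.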