We divide the sensitivity analysis into three parts: blocker, matcher, and value matcher.
- We select the longest attribute as the most representative attribute for sample / topk (users may specify it instead when multiple long attributes have nearly the same average length).
- The attributes' types may need to be adjusted for different datasets, since py_entitymatching's attribute type auto-inference may not be precise (see the sketch below).
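A minimal sketch of overriding the inferred types, assuming the buffer tables listed under "Issues" and a key column named `id` (an assumption); the attribute name `description` and the corrected type are illustrative only:

```python
import py_entitymatching as em

A = em.read_csv_metadata('buffer/clean_A.csv', key='id')
B = em.read_csv_metadata('buffer/clean_B.csv', key='id')

# Auto-inferred types; these are the ones that may be imprecise.
atypes_A = em.get_attr_types(A)
atypes_B = em.get_attr_types(B)

# Example correction: force a long textual attribute to a longer-string
# type so that set-based similarity functions get generated for it.
atypes_A['description'] = 'str_gt_10w'
atypes_B['description'] = 'str_gt_10w'

features = em.get_features(
    A, B, atypes_A, atypes_B,
    em.get_attr_corres(A, B),
    em.get_tokenizers_for_matching(),
    em.get_sim_funs_for_matching(),
)
```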
Blocker
- We use a heap per thread to maintain up to 1e7 results for each join algorithm, keeping the pairs with the largest similarity values (smallest for edit distance); see the first sketch after this list.
- We may restrict the join result set's size further, beyond that 1e7 cap, to reduce the final output size. However, on the Magellan datasets, which are small, this loses roughly ~8% recall.
- For the final TopK, we may use either word-weighted (by idf) or unweighted similarity functions. The weighted variant performs slightly better on the Magellan datasets (roughly ~1%), but on the secret datasets the unweighted one is better (the join TopK in (2) could likewise have two versions); see the second sketch after this list.
- For the edit distance join, we cannot keep a heap to maintain the top results, since doing so takes roughly 8 hours on the secret datasets.
- The best approach in practice is to maintain values only for attributes whose type is "str_bt_5w_10w" or "str_gt_10w".
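A minimal sketch of the per-thread top-K buffer described above; the class and method names are invented. A bounded min-heap keeps the best-scoring pairs, and the sign is flipped for edit distance so the smallest distances are kept instead:

```python
import heapq

MAX_RES_SIZE = 10_000_000  # the 1e7 per-thread cap mentioned above

class TopKBuffer:
    """Keeps the pairs with the largest scores (smallest for edit distance)."""

    def __init__(self, cap=MAX_RES_SIZE, smaller_is_better=False):
        self.cap = cap
        self.sign = -1.0 if smaller_is_better else 1.0
        self.heap = []  # min-heap on the signed score

    def push(self, score, pair):
        item = (self.sign * score, pair)
        if len(self.heap) < self.cap:
            heapq.heappush(self.heap, item)
        elif item > self.heap[0]:  # strictly better than the current worst
            heapq.heapreplace(self.heap, item)

    def results(self):
        # Best first, with the original (unsigned) score restored.
        return [(self.sign * s, p) for s, p in sorted(self.heap, reverse=True)]
```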
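And a hedged sketch of the weighted-vs-unweighted choice; the exact weighting used in the project may differ, this is just one common idf-weighted Jaccard variant:

```python
import math
from collections import Counter

def build_idf(token_lists):
    """idf weight per token over the whole record collection."""
    df = Counter()
    for toks in token_lists:
        df.update(set(toks))
    n = len(token_lists)
    return {t: math.log(1.0 + n / c) for t, c in df.items()}

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def weighted_jaccard(a, b, idf):
    a, b = set(a), set(b)
    inter = sum(idf.get(t, 0.0) for t in a & b)
    union = sum(idf.get(t, 0.0) for t in a | b)
    return inter / union if union else 0.0
```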
Matcher
- Some parameters are the same for all datasets.
- Some parameters depend on the dataset's size / type (self-join or RS-join).
- Some parameters need to be adjusted per dataset.
Parameters: blocking_attr, sample_strategy, training_strategy, move_strategy, num_tree, sample_size, inmemory, ground_truth_label, cluster_tau, sample_tau, step2_tau, num_data.
- blocking_attr: as stated above, it sometimes needs to be selected manually.
- cluster_tau, sample_tau, step2_tau: we set different values for the Magellan and synthetic datasets, but all datasets within Magellan (respectively, synthetic) share the same values.
- inmemory: we set this to true if the table is small and false if it is large; if you can afford the cost of training the value matcher, it can always be set to true.
- num_data: this simply depends on whether the dataset is a self-join or an RS-join (a configuration template is sketched below).
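Putting the rules above together, a purely illustrative template; every placeholder value and the helper `make_matcher_config` are invented, not the project's real defaults:

```python
def make_matcher_config(blocking_attr, is_self_join, table_is_small):
    """Illustrative only: placeholders, not the project's actual values."""
    return {
        'blocking_attr': blocking_attr,   # longest attribute, or chosen manually
        # Same for all datasets:
        'sample_strategy': ...,
        'training_strategy': ...,
        'move_strategy': ...,
        'num_tree': ...,
        'sample_size': ...,
        'ground_truth_label': ...,
        # Depends on dataset size / type:
        'inmemory': table_is_small,       # or always True if training cost is affordable
        'num_data': ...,                  # self-join vs. RS-join
        # Per dataset collection (Magellan vs. synthetic):
        'cluster_tau': ...,
        'sample_tau': ...,
        'step2_tau': ...,
    }
```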
Value matcher
- The value matcher's threshold defaults to 0.8 for all settings.
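For concreteness, the check amounts to the following (names invented; `value_sim` stands in for whatever similarity score the value matcher produces):

```python
VALUE_MATCH_THRESHOLD = 0.8  # default for all settings

def values_match(value_sim: float, tau: float = VALUE_MATCH_THRESHOLD) -> bool:
    return value_sim >= tau
```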
Issues
Fixed at this stage
- Running the set join in parallel on large datasets with a large threshold (e.g., 0.97+) would fail.
This was caused by adaptive grouping, but further investigation is needed; see the details in "notes.md".
- There may be some mis-commented lines in the set join source files.
The set join works at this stage.
- "buffer" folder should contain: clean_A.csv, clean_B.csv, gold.csv, sample_res.csv, feature_name.txt
"buffer" folder currently is well-origanized
To be careful
- Think about the case where one attribute has different types in different datasets.
- Check all macros (MAX_PAIR_SIZE & MAX_RES_SIZE refer to exactly the same thing).
- Make the three join algorithm classes consistent, e.g., write a base class for them so the class declarations are consistent (see the sketch after this list).
- There are still some warnings.
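A minimal sketch of that base-class idea; the actual join classes are presumably C++, so this Python version only illustrates the intended shape, and all names apart from the set join and edit distance join are invented:

```python
from abc import ABC, abstractmethod

class SimJoin(ABC):
    """Common declaration shared by the join algorithm classes."""

    def __init__(self, threshold, max_res_size=10_000_000):
        self.threshold = threshold
        # One cap instead of the duplicated MAX_PAIR_SIZE / MAX_RES_SIZE macros.
        self.max_res_size = max_res_size

    @abstractmethod
    def join(self, records_a, records_b):
        """Return up to max_res_size (id_a, id_b, sim) triples."""

class SetJoin(SimJoin):
    def join(self, records_a, records_b):
        raise NotImplementedError

class EditDistanceJoin(SimJoin):
    def join(self, records_a, records_b):
        raise NotImplementedError

# ... the third join algorithm would subclass SimJoin the same way.
```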