- add a patch for the modifications to py_entitymatching
- KNN blocker, part of simjoin blocker
- interchangeable value for blocker
- the feature part duplicates some code (e.g., sim funcs); this could be eliminated in the future
- Modify the SimFuncs return value for empty sets from 1 to NaN
- group.cc should use feature_index
- the length filter is used when calculating features, but what if no pairs pass the length filter?
- make the similarity join APIs public
- word2vec & GloVe value matcher
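
As an illustration of the SimFuncs item above (returning NaN instead of 1 for empty sets), here is a minimal Python sketch. It is not the project's code: the function name `jaccard` is hypothetical, and it assumes that "empty sets" means both inputs are empty, so the similarity is the undefined ratio 0/0.

```python
import math


def jaccard(x, y):
    """Jaccard similarity over token collections (hypothetical helper).

    Assumption: when BOTH inputs are empty the ratio is 0/0, so NaN is
    returned instead of the previous convention of 1.
    """
    x, y = set(x), set(y)
    union = x | y
    if not union:
        # Both sets are empty: the similarity is undefined.
        return math.nan
    return len(x & y) / len(union)
```

Callers that aggregate features then need to treat NaN explicitly (for example with `math.isnan`), since NaN compares unequal to every value, including itself.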
Optimization
- serial string join optimization: using shared prefixes
- overlap join (except parallel self): using small/large case
- vectorized TopK algorithm
- optimization on serial set join index memory allocation
- iterative verification for all 4 string joins
- sampler could support both dlm & qgm
- Add two more sim join implementations in "simjoin.hpp"
- re-arrange the files in the "blocker" folder; should we keep extern global values in "simjoin.hpp"?
- interchangeable values in blocking
- Add namespace
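
One possible reading of the "small/large case" in the overlap join item above is to always iterate over the smaller token set and probe the larger one. The sketch below shows that idea only; the function name `overlap_size` is hypothetical and the actual optimization in the code base may differ.

```python
def overlap_size(x: set, y: set) -> int:
    """Count shared tokens by scanning the smaller set (hypothetical helper).

    Assumption: both inputs are already hash sets, so each membership
    probe is O(1) on average and the total cost is O(min(|x|, |y|)).
    """
    small, large = (x, y) if len(x) <= len(y) else (y, x)
    return sum(1 for tok in small if tok in large)
```

The result is the same whichever argument is larger; only the number of probes changes.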
Please refer to developer_notes.md for remaining issues.