Entity Matching by Similarity Join
 
General

  1. Add a patch for the modifications to py_entitymatching
  2. KNN blocker, part of simjoin blocker
  3. interchangeable value for blocker
  4. The feature part duplicates some code (e.g., sim funcs); this could be eliminated in the future
  5. Modify the SimFuncs return value for empty sets from 1 to NaN
  6. group.cc should use feature_index
  7. The length filter is used when calculating features, but what if no pairs pass the length filter?
  8. Make the similarity join APIs public
  9. word2vec & GloVe value matcher
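
Item 5 above can be illustrated with a small sketch. It assumes the "empty sets" case means both token sets are empty, where Jaccard is 0/0 and therefore undefined. The function name `jaccard_or_nan` and the use of sorted `std::vector<int>` token ids are hypothetical and are not taken from the project's SimFuncs code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper, not the project's SimFuncs API.
// Both inputs must be sorted and free of duplicates.
double jaccard_or_nan(const std::vector<int>& a, const std::vector<int>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));

    // Assumption: two empty sets give an undefined similarity,
    // so report NaN instead of 1.
    if (a.empty() && b.empty()) {
        return std::numeric_limits<double>::quiet_NaN();
    }

    // Count common elements with a linear merge over the sorted inputs.
    std::size_t i = 0, j = 0, overlap = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] == b[j]) {
            ++overlap;
            ++i;
            ++j;
        } else if (a[i] < b[j]) {
            ++i;
        } else {
            ++j;
        }
    }

    // |A ∪ B| = |A| + |B| - |A ∩ B|; nonzero here because at least one set is non-empty.
    const std::size_t union_size = a.size() + b.size() - overlap;
    return static_cast<double>(overlap) / static_cast<double>(union_size);
}
```

Callers would then need to test the result with `std::isnan` before comparing it against a threshold, since every ordered comparison with NaN is false.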

Optimization

  1. Serial string join optimization: use prefix sharing
  2. overlap join (except parallel self): using small/large case
  3. vectorized TopK algorithm
  4. optimization on serial set join index memory allocation
  5. iterative verification for all 4 string joins
  6. sampler could support both dlm & qgm
  7. Add two more sim join implementations in "simjoin.hpp"
  8. Rearrange the files in the "blocker" folder; should we keep the extern global values in "simjoin.hpp"?
  9. interchangeable values in blocking
  10. Add namespace
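
As a point of reference for item 3, a plain scalar TopK selection over scored candidate pairs might look like the sketch below. It is not vectorized and is not the project's implementation. The `ScoredPair` struct and the `top_k` function are hypothetical names introduced only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical scored candidate pair; field names are illustrative only.
struct ScoredPair {
    int left_id;
    int right_id;
    double score;
};

// Return the k highest-scoring pairs, best first, using a bounded min-heap.
std::vector<ScoredPair> top_k(const std::vector<ScoredPair>& pairs, std::size_t k) {
    auto higher = [](const ScoredPair& x, const ScoredPair& y) {
        return x.score > y.score;
    };
    // With a "greater" comparator, top() is the lowest score currently kept.
    std::priority_queue<ScoredPair, std::vector<ScoredPair>, decltype(higher)>
        heap(higher);

    for (const ScoredPair& p : pairs) {
        if (heap.size() < k) {
            heap.push(p);
        } else if (k > 0 && p.score > heap.top().score) {
            // The new pair beats the weakest kept pair, so replace it.
            heap.pop();
            heap.push(p);
        }
    }

    // Drain the heap (ascending score), then reverse to get best first.
    std::vector<ScoredPair> result;
    result.reserve(heap.size());
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    std::reverse(result.begin(), result.end());
    return result;
}
```

The heap never holds more than k entries, so memory stays bounded regardless of how many candidate pairs are scanned. A vectorized variant would need to return the same set of pairs as this baseline.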

Please refer to developer_notes.md for remaining issues.