The Entity Matching (EM) problem is to identify tuple pairs, drawn from one or two sets of instances, that correspond to the same real-world entity. A typical EM solution comprises two main steps: blocking and matching. The blocking step eliminates tuple pairs that are evidently non-matching, while the matching step evaluates the surviving pairs to reach a final decision.
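To make the two steps concrete, here is a minimal sketch of a block-then-match pipeline. It is only an illustration: the function names, the token-overlap blocker and the Jaccard threshold are assumptions chosen for the example and are not the APIs or algorithms of this system.

```python
from itertools import product


def tokens(text):
    """Lowercase whitespace tokenization into a set of tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def block(table_a, table_b):
    """Blocking: keep only the pairs that share at least one token."""
    return [
        (i, j)
        for (i, rec_a), (j, rec_b) in product(enumerate(table_a), enumerate(table_b))
        if tokens(rec_a) & tokens(rec_b)
    ]


def match(table_a, table_b, candidates, threshold=0.5):
    """Matching: accept a surviving pair if its Jaccard score reaches the threshold."""
    return [
        (i, j)
        for i, j in candidates
        if jaccard(tokens(table_a[i]), tokens(table_b[j])) >= threshold
    ]


# Toy tables (hypothetical data, one string per tuple).
table_a = ["apple iphone 13 pro", "dell xps 13 laptop"]
table_b = ["iphone 13 pro apple", "hp spectre laptop", "sony tv"]

candidates = block(table_a, table_b)
matches = match(table_a, table_b, candidates)
print(candidates)
print(matches)
```

In this toy run the blocker discards the pairs with no shared token, and the matcher only has to score what is left. A real blocker would avoid enumerating the full cross product, which is exactly what similarity joins are for.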
Modern EM solutions often prioritize the accuracy of the matching step and neglect the recall of the blocking step. Additionally, the popular packages that support various blocking techniques are usually built in Python, which ensures portability but often limits the blocker's scalability.
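Blocking recall is the fraction of true matching pairs that survive blocking. The sketch below shows one way to compute it; the function name and the toy pairs are hypothetical and are not part of this repository.

```python
def blocking_recall(candidates, gold_matches):
    """Fraction of gold matching pairs that are kept in the candidate set."""
    gold = set(gold_matches)
    if not gold:
        # No true matches to lose, so recall is trivially perfect.
        return 1.0
    return len(gold & set(candidates)) / len(gold)


# Hypothetical (id_a, id_b) pairs.
gold_matches = [(0, 0), (1, 2), (3, 3)]
candidates = [(0, 0), (1, 1), (3, 3), (2, 0)]

recall = blocking_recall(candidates, gold_matches)
print(f"blocking recall = {recall:.3f}")
```

Because the matcher only sees pairs that survive blocking, a true match dropped here can never be recovered later, so the blocking recall is an upper bound on the recall of the whole pipeline.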
Therefore, we propose an entity matching system that emphasizes recall during the blocking step. We also keep a focus on the matcher: to enhance recall in that phase, we integrate a value matcher into the system. The solution is designed to be scalable, since the blocker is assembled from state-of-the-art similarity join algorithms written in C++ and accelerated by parallelization, while remaining portable through public APIs written in Python.
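For readers unfamiliar with similarity joins, the sketch below shows the textbook prefix-filter idea for a Jaccard set-similarity join in plain Python. It is an assumption-laden illustration: the function name, the data and the choice of prefix filtering are ours, and this is not necessarily the algorithm used in the C++ blocker of this system.

```python
import math
from collections import Counter, defaultdict


def prefix_filter_join(records_a, records_b, threshold):
    """Return (i, j) pairs whose Jaccard similarity is >= threshold."""
    # Global token order: rare tokens first, ties broken alphabetically.
    freq = Counter(tok for rec in records_a + records_b for tok in set(rec))

    def ordered(rec):
        return sorted(set(rec), key=lambda tok: (freq[tok], tok))

    def prefix(toks):
        # Two sets that reach the threshold must share a token within these prefixes.
        # The small epsilon guards against floating-point round-up before ceil.
        required = math.ceil(threshold * len(toks) - 1e-9)
        return toks[: len(toks) - required + 1]

    # Inverted index over the prefixes of records_b.
    sets_b = [ordered(rec) for rec in records_b]
    index = defaultdict(set)
    for j, toks in enumerate(sets_b):
        for tok in prefix(toks):
            index[tok].add(j)

    results = []
    for i, rec in enumerate(records_a):
        toks = ordered(rec)
        candidate_ids = set()
        for tok in prefix(toks):
            candidate_ids |= index.get(tok, set())
        set_a = set(toks)
        # Verify each candidate with the exact Jaccard similarity.
        for j in sorted(candidate_ids):
            set_b = set(sets_b[j])
            if len(set_a & set_b) / len(set_a | set_b) >= threshold:
                results.append((i, j))
    return results


# Hypothetical tokenized records.
records_a = [["a", "b", "c", "d"], ["x", "y", "z"]]
records_b = [["a", "b", "c", "e"], ["x", "y", "z"], ["p", "q"]]

pairs = prefix_filter_join(records_a, records_b, threshold=0.5)
print(pairs)
```

The filter only prunes pairs that cannot reach the threshold, so the result equals that of a brute-force comparison while verifying far fewer candidates on realistic data.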
Our system encompasses five steps. Steps 1-3 (sample, block and feature extraction) are implemented in C++, and steps 4-5 are implemented in Python. All public APIs are written in Python.
The project layout is:
bin: contains the executable files for sample (step 1), block (step 2) and feature extraction (step 3).
shared_lib: contains the compiled dynamic libraries for the same three steps as in bin.
cpp: contains the C++ source code.
scripts: contains the bash wrappers for running the binary executables in bin.
simjoin_entitymatching: contains the public APIs of our system.
examples: contains example scripts for running our system.
test: contains the unit tests and experiment scripts; you should not use them.
Required Python packages are listed in requirements.txt. For py_entitymatching, please refer to its documentation for more details. We have also made several minor modifications to this package, which are included in patch/py_em.patch. (Note: the patch file has not been tested; you may want to refer to docs/modifications.md and modify the package manually.)
The default C++ compiler is g++ and the C++ version is 11; you may use any other compiler as long as it supports OpenMP. Additionally, if you would like to use the parquet format for input and output, you should install the arrow package and uncomment the macro ARROW_INSTALLED in cpp/common/config.h as well as the corresponding compile settings in all CMakeLists.txt files. Note that parquet I/O remains untested at this stage.
The build commands invoke the root CMakeLists.txt and compile all the parts (sample, block and feature) to generate the binary files as well as the dynamic libraries. The compile log is written to build/compile.log.
See the examples folder and the corresponding README.md.
Refer to the light-weight branch for more information. Coming soon...
docs/binary.md
docs/lib.md
docs/simjoin.md
docs/exp.md
docs/developer_notes.md
Linux.
Author: Yunqi Li, email: ylilo@connect.ust.hk, HKUST. Advised by Prof. Dong Deng, Rutgers University.
Refer to docs/TODO.md.