Entity Matching by Similarity Join
 
simjoin_entitymatching.blocker.extract_formula.ExtractFormula Class Reference

Public Member Functions

 __init__ (self)
 
 flush_rules (self, trigraph, path, selected=[])
 
 get_tree_nodes (self, trigraph, lower, upper, cur_rules, tree_visited, if_index=False, index=[])
 
 get_rules_cur_comb (self, trigraph, path)
 
 get_connected_tree (self, trigraph, rules)
 
 move_index_basic (self, trigraph, cur_pos, if_all_end, if_drop_left=True)
 
 move_index_greedy (self, trigraph, if_all_end, if_drop_left=True, last_move=-1)
 
 select_partial (self, trigraph, short_attrs)
 
 dfs_optimal (self, trigraph, values, idx, selected, selected_value, selected_rules, max_val, max_selected, is_found)
 
 get_optimal_rules_comb (self, trigraph, path_selected)
 
 extract (self, trigraph, if_drop_left=True, move_strategy=1)
 

Public Attributes

int partial_num_features = 0
 
list formulas = []
 
list feature_index = []
 

Protected Member Functions

 _print_one_rule (self, trigraph, i, buffer_rules)
 

Detailed Description

    The filtering formula has the following format:
        # of rules
        feature_name sign threshold
        ...
        feature_name sign threshold

    It is temporarily stored in buffer/rules.txt for immediate use.
    It is permanently stored in data/rules/datasetname/rules.txt.
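
    The format above can be read back with a few lines of Python. The sketch below is only an illustration, not the package's own reader: parse_rules is a hypothetical helper, and it assumes that a file holds a single formula, that the three fields of each rule are separated by whitespace, and that the threshold is a decimal number. The feature names in the sample are made up.

```python
from typing import List, Tuple

# One rule as (feature_name, sign, threshold). The tuple layout is an
# assumption made for this sketch, not the package's internal type.
Rule = Tuple[str, str, float]


def parse_rules(text: str) -> List[Rule]:
    """Parse one formula: a rule count followed by that many rule lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    count = int(lines[0])  # first line: number of rules
    rules: List[Rule] = []
    for ln in lines[1:1 + count]:
        # Assumes whitespace-separated fields: feature_name sign threshold
        name, sign, threshold = ln.split()
        rules.append((name, sign, float(threshold)))
    if len(rules) != count:
        raise ValueError(f"expected {count} rules, found {len(rules)}")
    return rules


# Hypothetical sample content; the feature names are invented.
sample = "2\njac_title >= 0.4\ncos_name >= 0.25\n"
rules = parse_rules(sample)
for name, sign, threshold in rules:
    print(name, sign, threshold)
```

    Checking the declared count against the number of parsed lines catches a truncated file early, which matters if the buffer file is read while it is still being written.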

Constructor & Destructor Documentation

◆ __init__()

simjoin_entitymatching.blocker.extract_formula.ExtractFormula.__init__ ( self)

Member Function Documentation

◆ _print_one_rule()

simjoin_entitymatching.blocker.extract_formula.ExtractFormula._print_one_rule ( self,
trigraph,
i,
buffer_rules )
protected

◆ dfs_optimal()

simjoin_entitymatching.blocker.extract_formula.ExtractFormula.dfs_optimal ( self,
trigraph,
values,
idx,
selected,
selected_value,
selected_rules,
max_val,
max_selected,
is_found )

◆ extract()

simjoin_entitymatching.blocker.extract_formula.ExtractFormula.extract ( self,
trigraph,
if_drop_left = True,
move_strategy = 1 )
        Open question: does the order in which the features are moved influence the final results?

        Args:
            if_drop_left: Drop the unreasonable rules, like 'jac < xxx'
            move_strategy: 0 -> Basic move
                           1 -> Greedy move

◆ flush_rules()

simjoin_entitymatching.blocker.extract_formula.ExtractFormula.flush_rules ( self,
trigraph,
path,
selected = [] )
        Save the rules and flush them to the buffer for the blocker to read.

        # Rewrite firstly
        # The key for interval in feature_range has changed!
        # The key tuple has also changed

◆ get_connected_tree()

simjoin_entitymatching.blocker.extract_formula.ExtractFormula.get_connected_tree ( self,
trigraph,
rules )
        Return the number of trees connected by a set of rules

◆ get_optimal_rules_comb()

simjoin_entitymatching.blocker.extract_formula.ExtractFormula.get_optimal_rules_comb ( self,
trigraph,
path_selected )
        Select the optimal combination of rules, i.e. pick rules so that all of them are as tight as possible.
        It can be viewed as a grouped (multiple-choice) knapsack problem: there are "num_feature" groups of
        items, and each group can contribute at most one item to the combination.
        The rule nodes connected to a feature are treated as the weights, and a tighter
        range results in a larger value.
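
        The grouped-knapsack view can be illustrated with a small depth-first search. This is a minimal sketch under stated assumptions, not the code of dfs_optimal or get_optimal_rules_comb: the function best_combination, the (weight, value) item layout, and in particular the capacity bound are hypothetical, since the description does not say which constraint limits the selection.

```python
from typing import List, Optional, Tuple

# (weight, value) of one candidate; a hypothetical layout for this sketch.
Item = Tuple[int, float]


def best_combination(
    groups: List[List[Item]], capacity: int
) -> Tuple[float, List[Optional[int]]]:
    """Pick at most one item per group, maximizing total value.

    `capacity` is an assumed bound on the total weight; the original
    description does not state the actual constraint.
    """
    best_value = float("-inf")
    best_choice: List[Optional[int]] = []

    def dfs(idx: int, weight: int, value: float, choice: List[Optional[int]]) -> None:
        nonlocal best_value, best_choice
        if idx == len(groups):
            # Every group has been decided; keep the best total value seen.
            if value > best_value:
                best_value, best_choice = value, list(choice)
            return
        # Option 1: this group contributes no item.
        choice.append(None)
        dfs(idx + 1, weight, value, choice)
        choice.pop()
        # Option 2: this group contributes exactly one item that still fits.
        for j, (w, v) in enumerate(groups[idx]):
            if weight + w <= capacity:
                choice.append(j)
                dfs(idx + 1, weight + w, value + v, choice)
                choice.pop()

    dfs(0, 0, 0.0, [])
    return best_value, best_choice


# Three groups (one per feature), made-up weights and values.
groups = [
    [(2, 3.0), (3, 6.0)],
    [(1, 2.0), (4, 6.0)],
    [(2, 4.0)],
]
value, choice = best_combination(groups, capacity=5)
print(value, choice)  # 10.0 [1, None, 0]
```

        Exhaustive search is exponential in the number of groups, which is acceptable only when the number of features is small; a dynamic-programming formulation would be the usual alternative for larger inputs.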

◆ get_rules_cur_comb()

simjoin_entitymatching.blocker.extract_formula.ExtractFormula.get_rules_cur_comb ( self,
trigraph,
path )

◆ get_tree_nodes()

simjoin_entitymatching.blocker.extract_formula.ExtractFormula.get_tree_nodes ( self,
trigraph,
lower,
upper,
cur_rules,
tree_visited,
if_index = False,
index = [] )
        Determine how many tree nodes the current feature nodes can reach.
        Args:
            lower: smallest feature number
            upper: largest feature number
            index: a list of feature ids

◆ move_index_basic()

simjoin_entitymatching.blocker.extract_formula.ExtractFormula.move_index_basic ( self,
trigraph,
cur_pos,
if_all_end,
if_drop_left = True )

◆ move_index_greedy()

simjoin_entitymatching.blocker.extract_formula.ExtractFormula.move_index_greedy ( self,
trigraph,
if_all_end,
if_drop_left = True,
last_move = -1 )
        Move the feature index using a greedy strategy.
        At each step, move the feature that introduces the most rule nodes.
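
        The greedy choice can be sketched on a toy model. Everything below is an assumption made for illustration and not the behavior of move_index_greedy itself: greedy_move_order is a hypothetical function, each feature is modeled as a list of steps where each step introduces a set of rule-node ids, and "most rule nodes" is taken to mean the most nodes not yet covered.

```python
from typing import Dict, List, Set


def greedy_move_order(steps: Dict[str, List[Set[int]]]) -> List[str]:
    """Return the order in which feature indexes are advanced.

    steps[f][p] is the set of rule nodes introduced when feature f moves
    from position p to p + 1 (a hypothetical model for this sketch).
    """
    pos = {f: 0 for f in steps}   # current index of every feature
    covered: Set[int] = set()     # rule nodes introduced so far
    order: List[str] = []
    while True:
        # Features that have not reached the end of their step list.
        candidates = [f for f in steps if pos[f] < len(steps[f])]
        if not candidates:
            break
        # Greedy step: the move that adds the most not-yet-covered nodes.
        best = max(candidates, key=lambda f: len(steps[f][pos[f]] - covered))
        covered |= steps[best][pos[best]]
        pos[best] += 1
        order.append(best)
    return order


# Made-up features and rule-node ids.
steps = {
    "jac": [{1, 2, 3}, {4}],
    "cos": [{1, 2}, {5, 6}],
}
order = greedy_move_order(steps)
print(order)  # ['jac', 'jac', 'cos', 'cos']
```

        In this toy run the first step of "cos" adds nothing new once "jac" has moved, so "jac" is advanced twice before "cos" is touched. Ties are broken by dictionary order here, which is an arbitrary choice of the sketch.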

◆ select_partial()

simjoin_entitymatching.blocker.extract_formula.ExtractFormula.select_partial ( self,
trigraph,
short_attrs )
        Try to select a subset of features that is sufficient for blocking.

        Two strategies are tried, and the shorter of the two results is kept.

Member Data Documentation

◆ feature_index

simjoin_entitymatching.blocker.extract_formula.ExtractFormula.feature_index = []

◆ formulas

list simjoin_entitymatching.blocker.extract_formula.ExtractFormula.formulas = []

◆ partial_num_features

simjoin_entitymatching.blocker.extract_formula.ExtractFormula.partial_num_features = 0

The documentation for this class was generated from the following file: