Public Member Functions
__init__ (self)
flush_rules (self, trigraph, path, selected=[])
get_tree_nodes (self, trigraph, lower, upper, cur_rules, tree_visited, if_index=False, index=[])
get_rules_cur_comb (self, trigraph, path)
get_connected_tree (self, trigraph, rules)
move_index_basic (self, trigraph, cur_pos, if_all_end, if_drop_left=True)
move_index_greedy (self, trigraph, if_all_end, if_drop_left=True, last_move=-1)
select_partial (self, trigraph, short_attrs)
dfs_optimal (self, trigraph, values, idx, selected, selected_value, selected_rules, max_val, max_selected, is_found)
get_optimal_rules_comb (self, trigraph, path_selected)
extract (self, trigraph, if_drop_left=True, move_strategy=1)
Public Attributes
int partial_num_features = 0
list formulas = []
list feature_index = []
Protected Member Functions
_print_one_rule (self, trigraph, i, buffer_rules)
The filtering formula is laid out as follows: # of rules, feature_name sign threshold, ..., feature_name sign threshold. It is stored temporarily in buffer/rules.txt for immediate use and permanently in data/rules/datasetname/rules.txt.
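To make the layout concrete, here is a minimal sketch that writes and reads a rules file. It assumes the rule count sits on the first line and that each following line holds the whitespace-separated `feature_name sign threshold` triples of one rule; the exact line breaks are not stated in this reference. `write_rules` and `read_rules` are hypothetical helpers, not members of ExtractFormula, and the feature name `cos` is illustrative.

```python
import io


def write_rules(rules, stream):
    """Write the rule count, then one line of triples per rule (assumed layout)."""
    stream.write(f"{len(rules)}\n")
    for rule in rules:
        parts = [f"{name} {sign} {threshold}" for name, sign, threshold in rule]
        stream.write(" ".join(parts) + "\n")


def read_rules(stream):
    """Parse the assumed layout back into lists of (name, sign, threshold)."""
    count = int(stream.readline())
    rules = []
    for _ in range(count):
        tokens = stream.readline().split()
        # Tokens come in groups of three: feature_name, sign, threshold.
        rule = [
            (tokens[i], tokens[i + 1], float(tokens[i + 2]))
            for i in range(0, len(tokens), 3)
        ]
        rules.append(rule)
    return rules


# Round trip through an in-memory buffer instead of buffer/rules.txt.
rules = [[("jac", ">=", 0.5), ("cos", ">=", 0.25)], [("jac", ">=", 0.75)]]
buffer = io.StringIO()
write_rules(rules, buffer)
buffer.seek(0)
restored = read_rules(buffer)
```

Feature names containing spaces would need a different separator under this assumed layout.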
simjoin_entitymatching.blocker.extract_formula.ExtractFormula.__init__(self)
protected
simjoin_entitymatching.blocker.extract_formula.ExtractFormula.dfs_optimal(self, trigraph, values, idx, selected, selected_value, selected_rules, max_val, max_selected, is_found)
simjoin_entitymatching.blocker.extract_formula.ExtractFormula.extract(self, trigraph, if_drop_left=True, move_strategy=1)
Open question: does the order in which the features are moved influence the final result?
Args:
    if_drop_left: drop unreasonable rules, such as 'jac < xxx'.
    move_strategy: 0 -> basic move, 1 -> greedy move.
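One way to picture the effect of if_drop_left is as a filter over rule predicates. The sketch below assumes each predicate is a `(feature_name, sign, threshold)` tuple and that "unreasonable" means an upper bound on a similarity feature (`<` or `<=`). Both points are assumptions, and `drop_left_predicates` is a hypothetical helper rather than part of the class.

```python
def drop_left_predicates(predicates):
    """Keep only predicates that bound a similarity score from below."""
    return [p for p in predicates if p[1] in (">", ">=")]


predicates = [("jac", "<", 0.3), ("jac", ">=", 0.6)]
kept = drop_left_predicates(predicates)
```

Under these assumptions, a predicate such as 'jac < 0.3' is discarded because a blocker normally wants pairs that are similar enough, not pairs that are dissimilar.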
simjoin_entitymatching.blocker.extract_formula.ExtractFormula.flush_rules(self, trigraph, path, selected=[])
Save the rules and flush them to the buffer for the blocker to read. Developer notes: rewrite first; the key for the interval in feature_range has changed; the key tuple has also changed.
simjoin_entitymatching.blocker.extract_formula.ExtractFormula.get_connected_tree(self, trigraph, rules)
Return the number of trees connected by a set of rules
simjoin_entitymatching.blocker.extract_formula.ExtractFormula.get_optimal_rules_comb(self, trigraph, path_selected)
Find the optimal rule combination, that is, select rules so that all of them are as tight as possible. The problem can be viewed as a multiple-choice knapsack: there are "num_feature" groups of items, and each group can contribute at most one item to the combination. The rule nodes that a feature is connected to act as the weights, and a tighter range yields a larger value.
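The knapsack view can be sketched as a depth-first search over the groups. This is an illustration only: the feasibility constraint used by dfs_optimal is not stated in this reference, so the sketch assumes a simple weight budget. `best_combination`, `groups`, and `budget` are hypothetical names.

```python
def best_combination(groups, budget):
    """Pick at most one (weight, value) item per group, maximizing total value.

    Returns (best_value, choice), where choice[g] is the index of the item
    taken from group g, or None if that group contributes nothing.
    Weights are assumed to be non-negative.
    """
    best = {"value": float("-inf"), "choice": None}

    def dfs(g, weight, value, choice):
        if weight > budget:
            return  # Prune: adding more items cannot reduce the weight.
        if g == len(groups):
            if value > best["value"]:
                best["value"] = value
                best["choice"] = list(choice)
            return
        # Option 1: skip this group entirely.
        choice.append(None)
        dfs(g + 1, weight, value, choice)
        choice.pop()
        # Option 2: take exactly one item from this group.
        for i, (w, v) in enumerate(groups[g]):
            choice.append(i)
            dfs(g + 1, weight + w, value + v, choice)
            choice.pop()

    dfs(0, 0, 0, [])
    return best["value"], best["choice"]


# Two features (groups), each offering two candidate ranges.
groups = [[(2, 3), (3, 5)], [(1, 2), (4, 6)]]
result = best_combination(groups, budget=5)
```

Exhaustive search is exponential in the number of groups, which is acceptable only when the number of features and candidate ranges is small.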
simjoin_entitymatching.blocker.extract_formula.ExtractFormula.get_rules_cur_comb(self, trigraph, path)
simjoin_entitymatching.blocker.extract_formula.ExtractFormula.get_tree_nodes(self, trigraph, lower, upper, cur_rules, tree_visited, if_index=False, index=[])
Determine how many tree nodes the current feature nodes can reach.
Args:
    lower: smallest feature number.
    upper: largest feature number.
    index: a list of feature ids.
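A rough sketch of this kind of reachability count is shown below. It assumes the graph has three layers (feature nodes, rule nodes, tree nodes) stored as two adjacency dictionaries, that `upper` is inclusive, and that a tree counts as reached when any of its rules is reachable. None of this is confirmed by the reference, and `count_reachable_trees` is a hypothetical stand-in rather than the actual method.

```python
def count_reachable_trees(feature_to_rules, rule_to_trees, lower, upper, index=None):
    """Count tree nodes reachable from a range or an explicit list of features."""
    # Use the explicit feature ids if given, otherwise the range [lower, upper].
    features = index if index is not None else range(lower, upper + 1)
    trees = set()
    for feature in features:
        for rule in feature_to_rules.get(feature, ()):
            trees.update(rule_to_trees.get(rule, ()))
    return len(trees)


feature_to_rules = {0: {"r0"}, 1: {"r1", "r2"}, 2: {"r3"}}
rule_to_trees = {"r0": {"t0"}, "r1": {"t0"}, "r2": {"t1"}, "r3": {"t2"}}

by_range = count_reachable_trees(feature_to_rules, rule_to_trees, 0, 1)
by_index = count_reachable_trees(feature_to_rules, rule_to_trees, 0, 0, index=[2])
```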
simjoin_entitymatching.blocker.extract_formula.ExtractFormula.move_index_basic(self, trigraph, cur_pos, if_all_end, if_drop_left=True)
simjoin_entitymatching.blocker.extract_formula.ExtractFormula.move_index_greedy(self, trigraph, if_all_end, if_drop_left=True, last_move=-1)
Move the feature index greedily: at each step, move the feature whose move introduces the most rule nodes.
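The greedy step can be sketched as follows. The sketch assumes each feature has an integer position that advances by one up to a per-feature limit, and that a caller-supplied `gain(feature, new_position)` reports how many rule nodes the move would introduce. `greedy_move`, `positions`, `limits`, and `gain` are hypothetical names, and ties are broken in favor of the lowest feature id, which the original may or may not do.

```python
def greedy_move(positions, limits, gain):
    """Advance the movable feature with the largest gain by one position.

    Returns the id of the feature that moved, or -1 if every feature
    has already reached its limit.
    """
    best_feature, best_gain = -1, -1
    for feature, pos in enumerate(positions):
        if pos >= limits[feature]:
            continue  # This feature cannot move any further.
        candidate = gain(feature, pos + 1)
        if candidate > best_gain:
            best_feature, best_gain = feature, candidate
    if best_feature >= 0:
        positions[best_feature] += 1
    return best_feature


# new_nodes[f][p]: rule nodes introduced when feature f moves to position p.
new_nodes = [[0, 1, 4], [0, 3, 1]]
positions, limits = [0, 0], [2, 2]

order = []
while True:
    moved = greedy_move(positions, limits, lambda f, p: new_nodes[f][p])
    if moved == -1:
        break
    order.append(moved)
```

A greedy choice is locally optimal only, so the final positions may depend on the order of moves.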
simjoin_entitymatching.blocker.extract_formula.ExtractFormula.select_partial(self, trigraph, short_attrs)
Try to select a subset of the features that is sufficient for blocking. Two strategies are tried, and the shorter result is kept.
list simjoin_entitymatching.blocker.extract_formula.ExtractFormula.feature_index = []
list simjoin_entitymatching.blocker.extract_formula.ExtractFormula.formulas = []
int simjoin_entitymatching.blocker.extract_formula.ExtractFormula.partial_num_features = 0