F-MWSP/: Code for solving minimum weight set packing problem with column generation and flexible dual optimal inequalities. Code in this directory is available in matlab. A python version is in preparation. F-MWSP is sometimes referred as ZEUS in the scripts.dedupe/: A python library which provides APIs for blocking and scoring in the entity resolution pipeline. We provide a customized dedupe with blocking and scoring functions separated.patent_sample/,csv_example/,affiliations/,settlements/,music20k/: The datasets on which experiments were conducted.
The code requires a customized dedupe. Please install it in the following way. Refer to the dedupe page for further assistance.
It is recommended to install dedupe in a virtual environment.
conda create --name myenv python=3.6
source activate myenv
cd dedupe
pip install "numpy>=1.9"
pip install -r requirements.txt
cython src/*.pyx
pip install -e .The F-MWSP code requires IBM CPLEX. Please refer to this documentation for it's installation.
Here we provide an example for the patent_sample dataset. Follow the same procedure for other datasets.
- Perform blocking and scoring on patent_sample. Adjust the experiment name within the main.py file.
cd patent_sample
python main.py
cd ..The code takes about a minute to execute.
This step creates F_sample.mat which contains a sparse graph with nodes representing the observations and the edges bearing a similarity measure between the nodes.
- Run F-MWSP algorithm. The code for this step is currently available in matlab. Adjust the dataset name within the main.m file appropriately.
cd F-MWSP/examples/entity_resolution/exec/
matlab --nodesktop --nosplash --nodisplay -r "main; exit"
cd ../../../../This step generates H_sample.mat file in the dataset directory and contains a list of tuples of the form (observationID, clusterID).
- Run the evaluation script.
cd patent_sample
python evaluation.py
cd ..In this step we match the true cluster id's provided in the dataset with the cluster id's obtained by running F-MWSP algorithm. We compare the performance against hierarchical clustering baseline.
The affiliations and music20k dataset take very long to run. Please use main_low_memory.m file for these datasets.
Every dataset directory contains a results folder. This contains the outputs of each step of Quickstart.