Overview

  • F-MWSP/: Code for solving the minimum weight set packing problem with column generation and flexible dual optimal inequalities. The code in this directory is written in MATLAB; a Python version is in preparation. F-MWSP is sometimes referred to as ZEUS in the scripts.
  • dedupe/: A Python library that provides APIs for blocking and scoring in the entity resolution pipeline. We provide a customized version of dedupe in which the blocking and scoring functions are separated.
  • patent_sample/, csv_example/, affiliations/, settlements/, music20k/: The datasets on which experiments were conducted.
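
For readers unfamiliar with the problem name, the textbook form of minimum weight set packing can be written as the integer program below. This is a sketch of the standard formulation, not necessarily the exact one implemented in F-MWSP/. The symbols are assumptions introduced here: \mathcal{D} is the set of observations, \mathcal{G} the set of candidate clusters, \Gamma_g the weight (cost) of cluster g, G_{dg} = 1 if observation d belongs to cluster g, and \gamma_g = 1 if cluster g is selected.

```latex
\min_{\gamma \in \{0,1\}^{|\mathcal{G}|}} \; \sum_{g \in \mathcal{G}} \Gamma_g \gamma_g
\quad \text{subject to} \quad
\sum_{g \in \mathcal{G}} G_{dg} \gamma_g \le 1 \quad \forall d \in \mathcal{D}
```

The constraint says that each observation is covered by at most one selected cluster. Because \mathcal{G} is typically far too large to enumerate, column generation works with a small subset of clusters and adds new ones only when they can improve the relaxed objective.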

Installation

The code requires a customized dedupe. Please install it in the following way. Refer to the dedupe page for further assistance.

It is recommended to install dedupe in a virtual environment.

conda create --name myenv python=3.6
source activate myenv
cd dedupe
pip install "numpy>=1.9"
pip install -r requirements.txt
cython src/*.pyx
pip install -e .

The F-MWSP code requires IBM CPLEX. Please refer to this documentation for its installation.

QuickStart

Here we provide an example for the patent_sample dataset. Follow the same procedure for other datasets.

  1. Perform blocking and scoring on patent_sample. Adjust the experiment name within the main.py file.
cd patent_sample
python main.py
cd ..

The code takes about a minute to execute. This step creates F_sample.mat, which contains a sparse graph whose nodes represent the observations and whose edges carry a similarity measure between the nodes they connect.
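
To illustrate the kind of structure this step produces, here is a minimal plain-Python sketch of a sparse similarity graph. It does not call dedupe or read F_sample.mat. The toy records, the blocking rule (first token) and the scoring rule (Jaccard similarity of tokens) are all hypothetical stand-ins for what main.py actually does.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: observation ID -> name string.
records = {
    0: "acme corp",
    1: "acme corporation",
    2: "acme corp ltd",
    3: "zenith labs",
    4: "zenith laboratories",
}


def block_key(text):
    """Toy blocking rule (an assumption): group records by their first token."""
    return text.split()[0]


def jaccard(a, b):
    """Toy scoring rule (an assumption): Jaccard similarity of token sets."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def build_sparse_graph(records):
    """Score only pairs that share a block, so most pairs get no edge."""
    blocks = defaultdict(list)
    for obs_id, text in records.items():
        blocks[block_key(text)].append(obs_id)

    edges = {}
    for ids in blocks.values():
        for i, j in combinations(sorted(ids), 2):
            edges[(i, j)] = jaccard(records[i], records[j])
    return edges


edges = build_sparse_graph(records)
for (i, j), weight in sorted(edges.items()):
    print(f"{i} -- {j}: {weight:.2f}")
```

Blocking is what keeps the graph sparse: of the 10 possible pairs among the 5 toy records, only the 4 pairs that share a block receive a similarity score.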

  2. Run the F-MWSP algorithm. The code for this step is currently available only in MATLAB. Adjust the dataset name within the main.m file appropriately.
cd F-MWSP/examples/entity_resolution/exec/
matlab --nodesktop --nosplash --nodisplay -r "main; exit"
cd ../../../../

This step generates the file H_sample.mat in the dataset directory. The file contains a list of tuples of the form (observationID, clusterID).
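
Once such tuples are available in Python, grouping them into clusters is straightforward. The sketch below assumes the tuples have already been loaded into a list. Loading is not shown because the variable name stored inside H_sample.mat is not stated here, and the sample assignments are hypothetical.

```python
from collections import defaultdict


def group_by_cluster(assignments):
    """Turn (observationID, clusterID) tuples into clusterID -> sorted member list."""
    clusters = defaultdict(list)
    for obs_id, cluster_id in assignments:
        clusters[cluster_id].append(obs_id)
    return {cid: sorted(members) for cid, members in clusters.items()}


# Hypothetical assignments in the (observationID, clusterID) form described above.
assignments = [(0, 1), (1, 1), (2, 1), (3, 2), (4, 2), (5, 3)]
clusters = group_by_cluster(assignments)

for cluster_id, members in sorted(clusters.items()):
    print(cluster_id, members)
```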

  3. Run the evaluation script.
cd patent_sample
python evaluation.py
cd ..

In this step we match the true cluster IDs provided in the dataset against the cluster IDs obtained by running the F-MWSP algorithm. We compare the performance against a hierarchical clustering baseline.
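
One common way to compare a predicted clustering with the true one is pairwise precision and recall: a pair of observations counts as correct if both clusterings place the two observations together. Whether evaluation.py computes exactly this metric is not stated here, so treat the following as an illustrative sketch with hypothetical labels.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of observation pairs that share a cluster ID."""
    return {
        (a, b)
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }


def pairwise_scores(true_labels, pred_labels):
    """Pairwise precision, recall and F1 between two clusterings."""
    true_pairs = same_cluster_pairs(true_labels)
    pred_pairs = same_cluster_pairs(pred_labels)
    true_positives = len(true_pairs & pred_pairs)

    precision = true_positives / len(pred_pairs) if pred_pairs else 1.0
    recall = true_positives / len(true_pairs) if true_pairs else 1.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical labels: observation ID -> cluster ID.
true_labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b"}
pred_labels = {0: 1, 1: 1, 2: 2, 3: 3, 4: 3}

precision, recall, f1 = pairwise_scores(true_labels, pred_labels)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Cluster IDs only need to be consistent within each clustering. The true and predicted IDs are never compared directly, which is why string and integer labels can be mixed in the example.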

Notes

The affiliations and music20k datasets take a very long time to run. Please use the main_low_memory.m file for these datasets.

Every dataset directory contains a results folder, which holds the outputs of each QuickStart step.