Want to find all relevant documents?

HiCAL is a system for efficient high-recall retrieval. It lets you retrieve and assess relevant documents, and it combines fast data processing with a user-friendly document assessment interface.



The system is dockerized into several Docker images. Make sure Docker and docker-compose are installed on your machine.
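The commands in this guide call `docker` and `docker-compose`. As a quick sanity check before starting, a minimal Python sketch like the following can confirm that both are on your PATH. It is not part of HiCAL, and the helper name `missing_tools` is made up for this example.

```python
import shutil


def missing_tools(tools=("docker", "docker-compose")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("docker and docker-compose are available")
```

This only checks that the executables exist; it does not verify versions or that the Docker daemon is running.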

Corpus and How to Install

Before you start installing, make sure your corpus is ready and in the right format. This step is necessary to be able to run the system. We will work with a toy corpus from https://github.com/hical/sample-dataset.

# Checkout the repo
git clone https://github.com/HiCAL/HiCAL.git
cd HiCAL
# Checkout the sample dataset
git clone https://github.com/hical/sample-dataset.git
cd sample-dataset
python process.py athome4_sample.tgz
# Create the data directory which is mounted to the docker containers
mkdir ../data
cp athome4_sample.tgz athome4_sample_para.tgz ../data/
cd ..

process.py is an example of how one might clean the corpus and generate excerpts. We will use the athome4_sample.tgz and the newly generated athome4_sample_para.tgz to generate document features.
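For a corpus of your own, the excerpt step could look roughly like the sketch below. It is not the actual process.py: it assumes plain-text documents inside the tgz, treats chunks separated by blank lines as paragraphs, and uses hypothetical function names and a hypothetical member naming scheme. Check what corpus_parser and functions.py expect before relying on it.

```python
import io
import re
import tarfile


def split_paragraphs(text):
    """Split text on blank lines and drop empty chunks."""
    chunks = re.split(r"\n\s*\n", text)
    return [c.strip() for c in chunks if c.strip()]


def write_paragraph_archive(src_path, dst_path):
    """Write one archive member per paragraph of every file in src_path.

    Members are named <original name>.<paragraph index>. This naming is an
    assumption made for the sketch, not necessarily what process.py does.
    Returns the number of paragraphs written.
    """
    count = 0
    with tarfile.open(src_path, "r:gz") as src, \
            tarfile.open(dst_path, "w:gz") as dst:
        for member in src:
            if not member.isfile():
                continue  # skip directories and links
            raw = src.extractfile(member).read()
            text = raw.decode("utf-8", errors="replace")
            for i, para in enumerate(split_paragraphs(text)):
                data = para.encode("utf-8")
                info = tarfile.TarInfo(name=f"{member.name}.{i}")
                info.size = len(data)
                dst.addfile(info, io.BytesIO(data))
                count += 1
    return count
```

A call such as `write_paragraph_archive("my_corpus.tgz", "my_corpus_para.tgz")` (file names are placeholders) would then produce the paragraph archive next to the original one.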

# Build and access the shell from the cal container
docker-compose -f HiCAL.yml run cal bash
root@container-id:/# cd src && make corpus_parser
# Generate features
root@container-id:/# ./corpus_parser  --in /data/athome4_sample.tgz --out /data/athome4_sample.bin --para-in /data/athome4_sample_para.tgz --para-out /data/athome4_para_sample.bin
# Exit the shell with Ctrl+D

We will now generate the document and paragraph files which will be shown to the assessors. Access and parsing logic for a corpus is described in the functions.py file. By default, HiCAL supports the XML documents from the New York Times corpus. We have provided an alternative functions.py which works for athome4.

# Extract the tgz files
cd data
tar xvzf athome4_sample.tgz
mv athome4_test docs
tar xvzf athome4_sample_para.tgz
mv athome4_test para
cd ..
# Use the modified functions.py
cp sample-dataset/functions.py HiCALWeb/hicalweb/interfaces/DocumentSnippetEngine/functions.py

# We are all set! Let's fire up the containers
DOC_BIN=/data/athome4_sample.bin PARA_BIN=/data/athome4_para_sample.bin docker-compose -f HiCAL.yml up -d
# Visit localhost:9000

If you get a 502 Bad Gateway error, please wait a few seconds while the containers finish processing.
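If you would rather script the wait than refresh the browser, a minimal sketch along these lines polls the web interface until it stops returning server errors. The function name `wait_until_ready`, the timeout, and the polling interval are assumptions for this example, not HiCAL settings.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_until_ready(url="http://localhost:9000/", timeout=120.0, interval=2.0):
    """Poll url until it answers without a 5xx status.

    Returns True once the server responds, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True  # 2xx or followed redirect
        except urllib.error.HTTPError as err:
            if err.code < 500:
                return True  # server is up, even if the page is e.g. 404
            # 502 and other 5xx: containers are still starting, retry
        except (urllib.error.URLError, OSError, http.client.HTTPException):
            pass  # connection refused or reset: nothing listening yet
        time.sleep(interval)
    return False
```

Calling `wait_until_ready()` right after `docker-compose ... up -d` returns True as soon as http://localhost:9000/ answers normally.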

Ports 9000 and 9001 will be used by the system. Make sure these ports are not being used by other applications on your machine. If you would like to change these ports, please read the configuration section below.
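One way to check the two ports before starting the containers is to try connecting to them. This is a small sketch with a hypothetical helper `port_is_free`; it only tests whether something is already listening on the local machine.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 only when the connection succeeds,
        # which means another application already owns the port.
        return sock.connect_ex((host, port)) != 0


if __name__ == "__main__":
    for port in (9000, 9001):
        state = "free" if port_is_free(port) else "in use"
        print(f"port {port}: {state}")
```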

We strongly recommend installing via Docker. If you prefer not to use Docker, please look at the documentation page for more details.

How to run

Once your Docker containers are up and running, open your browser to http://localhost:9000/. You should be able to access the system's web interface.

Figure 1: Login page of HiCAL. You can also use the practice account by clicking on the "click to practice" button.
Figure 2: After logging in to HiCAL, create or select a search topic. This will start a session and train the machine learning model.
Figure 3: After initiating a topic, you can select either of the retrieval components (CAL or Search) from the left sidebar.


Most of the configuration can be performed through these two files:

Configuring Search

The system allows integration of a search engine. Any search engine can be integrated with minimal effort.

To integrate a search engine, please read the documentation.

Bugs, issues or feature requests?

Please report here.