Ongoing (ends 28 Mar 2025, 17:00:00 UTC)

Questionnaire Parsing Challenge

A competition to help researchers extract and tag survey questions from PDFs accurately 📄

Harmony Questionnaire Parsing Challenge



Discord | Twitter | LinkedIn | Website


The Harmony project aims to develop a tool to help researchers make better use of existing data by harmonising questionnaire items and measures across different studies, potentially in different languages, through an approach based on natural language processing (NLP).

Harmony is a collaboration between researchers at Ulster University, University College London, the Universidade Federal de Santa Maria and Fast Data Science. The Harmony project has been funded by Wellcome as part of the Wellcome Data Prize in Mental Health and by the Economic and Social Research Council (ESRC).

Interested in Harmony? Also check out the matching algorithm improvement challenge!

Challenge 💡

Your challenge is to develop an improved algorithm for identifying mental health survey questions and selectable answers in plain text.

To get started, take a look at our tutorial notebook:

https://github.com/DoxaAI/harmony-parsing-getting-started/blob/main/getting-started.ipynb

For a more minimal example, take a look at our GitHub.

Dataset 💻

The text dataset for this challenge is based on PDFs that have been converted to plain text and on (imperfect) scans of paper documents. Mental health survey questions and answer options have been manually tagged by the Harmony team.

Several different versions of the text dataset have been prepared. In the raw version, questions have been encased in <q> and </q> tags, and potential question answers have been encased in <a> and </a>.

Here is an example:

<q>Which of the following best describes your child's current situation?</q>

a) <a>In full time education</a>
b) <a>In full time employment</a>
c) <a>In part time education only</a>
d) <a>In part time employment only</a>
e) <a>In part time education and part time employment</a>
f) <a>Not in education or employment due to health reasons</a>
g) <a>Not in education or employment due to personal choice</a>
h) <a>None of the above</a>

Alternatively, a clean version is also available, in which the tags have been removed and the text index ranges for questions and answers are provided separately.
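To illustrate how the raw and clean versions relate, here is a minimal sketch that strips the <q> and <a> tags from raw text and records where each tagged span falls in the resulting clean text. The function name `strip_tags`, the `(label, start, end)` tuple layout and the use of end-exclusive character offsets are assumptions made for this example; the clean version distributed with the challenge may store its index ranges differently.

```python
import re

# Matches <q>, </q>, <a> and </a>, the tags described for the raw dataset.
TAG_PATTERN = re.compile(r"<(/?)([qa])>")


def strip_tags(raw_text):
    """Return (clean_text, spans), where spans is a list of (label, start, end).

    Offsets are character indices into clean_text, with end exclusive.
    This layout is an assumption for illustration only.
    """
    clean_parts = []
    spans = []
    clean_length = 0   # length of the clean text built so far
    cursor = 0         # position in raw_text just after the last tag
    open_label = None
    open_start = None

    for match in TAG_PATTERN.finditer(raw_text):
        # Copy the untagged text that precedes this tag.
        chunk = raw_text[cursor:match.start()]
        clean_parts.append(chunk)
        clean_length += len(chunk)
        cursor = match.end()

        is_closing = match.group(1) == "/"
        label = match.group(2)
        if not is_closing:
            open_label, open_start = label, clean_length
        elif open_label == label:
            spans.append((label, open_start, clean_length))
            open_label = open_start = None

    # Copy whatever text follows the final tag.
    clean_parts.append(raw_text[cursor:])
    return "".join(clean_parts), spans


raw = (
    "<q>Which of the following best describes your child's current situation?</q>\n"
    "\n"
    "a) <a>In full time education</a>\n"
    "b) <a>In full time employment</a>\n"
)
clean, spans = strip_tags(raw)
for label, start, end in spans:
    print(label, repr(clean[start:end]))
```

Slicing the clean text with each recorded range recovers the original question or answer, which is a quick way to check that the offsets line up.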

Evaluation

When you upload your work to the platform, it will be evaluated against an unseen test set. You will be ranked on the scoreboard based on the multi-class classification accuracy of your submission's predictions.
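As a rough illustration of the metric, classification accuracy is the fraction of predictions that match the true labels. The sketch below assumes one label per unit of text (for example per token) drawn from the hypothetical classes "question", "answer" and "other"; the exact unit and label names used by the platform's scorer are not specified here, so treat this as a way to sanity-check your own validation split rather than a copy of the official scoring.

```python
def classification_accuracy(true_labels, predicted_labels):
    """Return the fraction of positions where the prediction equals the truth."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true_labels and predicted_labels must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty list of labels")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels, for illustration only.
y_true = ["question", "other", "answer", "answer", "other"]
y_pred = ["question", "other", "answer", "other", "other"]

score = classification_accuracy(y_true, y_pred)
print(f"accuracy: {score:.2f}")
```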

By default, submissions use the CPU evaluation environment, which has 8 GiB of RAM. If you are running low on memory in evaluation, you may wish to try decreasing your batch size. Reach out to us on the Discord server if you have any questions! 😎
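One way to keep memory use bounded is to run inference in fixed-size batches rather than over the whole input at once. The sketch below is framework-agnostic; `predict_in_batches`, `iter_batches` and `dummy_predict` are hypothetical names introduced for this example, and `dummy_predict` stands in for whatever model call your submission makes.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(texts, predict_fn, batch_size=8):
    """Apply predict_fn to texts one batch at a time and collect the results.

    A smaller batch_size lowers peak memory use at the cost of more calls.
    """
    predictions = []
    for batch in iter_batches(texts, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions


def dummy_predict(batch):
    """Placeholder for a real model: returns the length of each text."""
    return [len(text) for text in batch]


results = predict_in_batches(["a", "bb", "ccc"], dummy_predict, batch_size=2)
print(results)
```

If evaluation runs out of memory, lowering `batch_size` is usually the first thing to try before changing the model itself.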

The evaluation environment will come with a number of Python packages pre-installed, including the following:

axial-attention==0.6.1, blis==0.7.11, dm-reverb==0.13.0, dm-tree==0.1.8, einops==0.7.0, fastai==2.7.13, keras==2.14.0, kornia==0.7.0, numba==0.58.1, numpy==1.26.2, opt-einsum==3.3.0, pandas==2.1.3, pytorch-lightning==2.1.2, scikit-learn==1.4.0, scikit-video==1.1.11, scipy==1.11.4, tensorflow==2.14.0, tensorflow-addons==0.22.0, torch==2.1.1, torchmetrics==1.2.0, transformers==4.35.2, trove-classifiers==2023.11.22, wwf==0.0.16

If you need a package that is not available in the evaluation environment, ping us on the Discord, and we will see what we can do for you! If you only need a small package (e.g. one that provides a PyTorch model class definition), you might want to consider bundling it as part of your submission instead.

Next steps 🚨

Choosing a robust model that is usable in a live production application

Harmony runs as a web application that is open for public use, so it is important that the footprint of your solution is as small as possible to maximise performance.

We suggest also stress-testing your model by adding it to your own fork of the Harmony Python library at https://github.com/harmonydata/harmony.

Prizes 🏆

This competition will have the following prizes as Amazon vouchers:

  • First place: £1,000
  • Second place: £500