Questionnaire Parsing Challenge
A competition to help researchers extract and tag survey questions from PDFs accurately 📄
The Harmony project aims to develop a tool to help researchers make better use of existing data by harmonising questionnaire items and measures across different studies—potentially in different languages—through an approach based on natural language processing (NLP).
Harmony is a collaboration between researchers at Ulster University, University College London, the Universidade Federal de Santa Maria and Fast Data Science. The Harmony project has been funded by Wellcome as part of the Wellcome Data Prize in Mental Health and by the Economic and Social Research Council (ESRC).
Interested in Harmony? Also check out the matching algorithm improvement challenge!
Your challenge is to develop an improved algorithm for identifying mental health survey questions and selectable answers in plain text.
To get started, take a look at our tutorial notebook:
https://github.com/DoxaAI/harmony-parsing-getting-started/blob/main/getting-started.ipynb
For a more minimal example, take a look at our GitHub.
The text dataset for this challenge is based both on PDFs that have been converted to plain text and on (imperfect) scans of paper documents. Mental health survey questions and answer options have been manually tagged by the Harmony team.
Several different versions of the text dataset have been prepared. In the raw version, questions have been encased in <q> and </q> tags, and potential question answers have been encased in <a> and </a> tags.
Here is an example:
<q>Which of the following best describes your child's current situation?</q>
a) <a>In full time education</a>
b) <a>In full time employment</a>
c) <a>In part time education only</a>
d) <a>In part time employment only</a>
e) <a>In part time education and part time employment</a>
f) <a>Not in education or employment due to health reasons</a>
g) <a>Not in education or employment due to personal choice</a>
h) <a>None of the above</a>
Alternatively, a clean version is also available, in which the tags have been removed and the text index ranges for questions and answers are provided separately.
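To get a feel for how the two versions relate, here is a minimal parsing sketch (not part of the official tooling; the helper name and the exact span format are our own assumptions, and the clean dataset may structure its index ranges differently). It strips the <q>/<a> tags from the raw text and records the character index ranges of each tagged span within the resulting clean text:

```python
import re

# Matches the four markers used in the raw dataset: <q>, </q>, <a>, </a>.
TAG_RE = re.compile(r"</?(q|a)>")

def strip_tags_with_spans(raw: str):
    """Return (clean_text, spans), where spans maps 'q'/'a' to lists of
    (start, end) character index ranges in the clean (tag-free) text."""
    clean_parts = []
    spans = {"q": [], "a": []}
    open_starts = {}  # tag name -> start index of the currently open span
    clean_len = 0
    last_end = 0

    for match in TAG_RE.finditer(raw):
        # Copy the text between the previous tag and this one.
        segment = raw[last_end:match.start()]
        clean_parts.append(segment)
        clean_len += len(segment)
        last_end = match.end()

        tag = match.group(1)
        if match.group(0).startswith("</"):
            # Closing tag: the span runs from the opening tag to here.
            spans[tag].append((open_starts.pop(tag), clean_len))
        else:
            # Opening tag: remember where this span starts in the clean text.
            open_starts[tag] = clean_len

    clean_parts.append(raw[last_end:])
    return "".join(clean_parts), spans

raw_example = (
    "<q>Which of the following best describes your child's current situation?</q>\n"
    "a) <a>In full time education</a>\n"
    "b) <a>In full time employment</a>\n"
)
clean_text, spans = strip_tags_with_spans(raw_example)
print(clean_text)
print(spans)  # (start, end) index pairs for questions and answers in clean_text
```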
When you upload your work to the platform, it will be evaluated against an unseen test set. You will be ranked on the scoreboard based on the multi-class classification accuracy of your submission's predictions.
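The exact label scheme used for scoring is set out in the getting-started materials, but as a rough illustration: assuming each text unit (for example, each character or token) is assigned one of a small number of classes such as question, answer, or other, multi-class classification accuracy is simply the fraction of units predicted correctly. scikit-learn, which is pre-installed in the evaluation environment, computes this directly:

```python
from sklearn.metrics import accuracy_score

# Hypothetical per-unit labels: "q" = question, "a" = answer, "o" = other.
# The real evaluation format is defined by the getting-started notebook;
# this only illustrates how multi-class accuracy is computed.
y_true = ["o", "q", "q", "q", "o", "a", "a", "o"]
y_pred = ["o", "q", "q", "o", "o", "a", "a", "a"]

print(accuracy_score(y_true, y_pred))  # 6 of 8 units correct -> 0.75
```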
By default, submissions use the CPU evaluation environment, which has 8 GiB of RAM. If you are running low on memory in evaluation, you may wish to try decreasing your batch size. Reach out to us on the Discord server if you have any questions! 😎
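For instance, if your model runs inference with PyTorch (torch is pre-installed, as listed below), processing the test text in smaller chunks trades a little speed for a lower peak memory footprint. A minimal sketch, where the model and the pre-processed inputs are placeholders for whatever your submission uses:

```python
import torch

def predict_in_batches(model, inputs, batch_size=8):
    """Run inference in fixed-size chunks to keep peak memory low.

    `model` and `inputs` stand in for your own model and pre-processed data;
    shrinking `batch_size` reduces the memory needed per forward pass.
    """
    model.eval()
    outputs = []
    with torch.no_grad():  # no gradients are needed at evaluation time
        for start in range(0, len(inputs), batch_size):
            batch = inputs[start:start + batch_size]
            outputs.append(model(batch))
    return torch.cat(outputs)
```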
The evaluation environment will come with a number of Python packages pre-installed, including the following:
axial-attention==0.6.1, blis==0.7.11, dm-reverb==0.13.0, dm-tree==0.1.8, einops==0.7.0, fastai==2.7.13, keras==2.14.0, kornia==0.7.0, numba==0.58.1, numpy==1.26.2, opt-einsum==3.3.0, pandas==2.1.3, pytorch-lightning==2.1.2, scikit-learn==1.4.0, scikit-video==1.1.11, scipy==1.11.4, tensorflow==2.14.0, tensorflow-addons==0.22.0, torch==2.1.1, torchmetrics==1.2.0, transformers==4.35.2, trove-classifiers==2023.11.22, wwf==0.0.16
If you need a package that is not available in the evaluation environment, ping us on the Discord, and we will see what we can do for you! If you only need a small package (e.g. one that provides a PyTorch model class definition), you might want to consider bundling it as part of your submission instead.
Harmony runs as a web application that is open for public use, so it is important that the footprint of your solution is as small as possible to maximise performance.
We suggest also stress-testing your model by adding it to your own fork of the Harmony Python library at https://github.com/harmonydata/harmony.
This competition will offer the following prizes in the form of Amazon vouchers: