This study evaluates four NLP models (LDA, Word2Vec, GloVe, and BERT) for representing medical concepts in Electronic Health Record (EHR) databases, using the MIMIC-IV and eICU-CRD datasets. EHRs contain detailed, coded information on patient diagnoses, procedures, and medications, and these codes hold essential knowledge for tasks such as diagnosis prediction and medication recommendation. NLP techniques, which model the codes as words within a sentence-like structure of patient visits, have shown promise in creating vector representations that capture the implicit relationships among codes. However, prior research lacks a comprehensive comparison of these methods on EHR data. Traditional NLP approaches such as Word2Vec and GloVe emphasize distributional semantics, while newer models like BERT offer contextual embeddings that capture more nuanced language patterns. In the setting of clinical code embedding pre-training, the results show that GloVe outperforms the other models in retaining medical concept semantics and improving prediction tasks, suggesting the need for models that capture both global co-occurrence statistics and nuanced relationships in medical data.
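To make the visit-as-sentence analogy concrete, here is a minimal sketch (not the paper's actual pipeline): each patient visit is treated as a "sentence" of clinical codes, and a skip-gram Word2Vec model, via gensim as one common choice, learns embeddings in which codes that co-occur across visits land close together. The visits, codes, and hyperparameters below are purely illustrative.

```python
# Minimal sketch: train code embeddings by treating each patient visit
# as a "sentence" of clinical codes. Data and settings are hypothetical.
from gensim.models import Word2Vec

# Hypothetical visits: each inner list is one patient visit, with
# ICD-style diagnosis codes standing in for words in a sentence.
visits = [
    ["I10", "E11.9", "N18.3"],   # hypertension, type 2 diabetes, CKD
    ["E11.9", "E78.5", "I10"],   # diabetes, hyperlipidemia, hypertension
    ["J44.1", "J96.01", "I10"],  # COPD exacerbation, respiratory failure
]

model = Word2Vec(
    sentences=visits,
    vector_size=128,  # embedding dimension
    window=5,         # context window within a visit
    min_count=1,      # keep every code in this toy corpus
    sg=1,             # skip-gram, a common choice for sparse code vocabularies
)

# Codes that frequently co-occur end up nearby in embedding space.
print(model.wv.most_similar("I10"))
```

A GloVe variant of the same idea would instead build a global code co-occurrence matrix over all visits before fitting embeddings, which is consistent with the paper's finding that capturing global co-occurrence helps retain medical concept semantics.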

Original publication

DOI

10.1145/3698587.3701491

Type

Conference paper

Publication Date

16/12/2024