Clinical Text Mining for Entity Extraction Using Classical Machine Learning Approaches

Authors

  • Chiluka Soujanya
  • Sarala Sandhya Rani

Keywords:

Medical Entity Recognition, Clinical Text Mining, Machine Learning, TF-IDF Vectorization, Logistic Regression, Natural Language Processing (NLP), Medical Text Classification, Healthcare Informatics, Electronic Health Records (EHR), Named Entity Recognition

Abstract

Digital healthcare generates large volumes of unstructured medical text every day, including prescriptions, clinical notes, and diagnostic reports. Extracting structured information from this text is essential for building intelligent healthcare applications such as clinical decision support systems and automated diagnosis tools. This project presents a machine learning-based approach to Medical Entity Recognition (MER) that identifies key entities in medical text and assigns them to categories such as medications, diseases, procedures, dosages, and administration routes. The system converts text into numerical features using TF-IDF vectorization and applies a Logistic Regression classifier to perform entity classification. The model is trained and evaluated on a sample dataset, and the results indicate that accurate entity extraction is feasible with traditional machine learning methods. The prototype is designed with scalability in mind, allowing future integration of larger datasets and more advanced deep learning models such as BERT or BioBERT. The proposed approach shows potential to improve the accuracy and efficiency of medical text analysis, which in turn could support better clinical decision-making and patient care.
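
The pipeline described in the abstract (TF-IDF features followed by a Logistic Regression classifier) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the toy phrases, the label names, and the choice of character n-gram TF-IDF are all assumptions made here for the example, since the abstract does not specify the dataset or the vectorizer settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy examples for illustration only; this is NOT the paper's dataset.
samples = [
    ("metformin", "medication"),
    ("amoxicillin", "medication"),
    ("type 2 diabetes", "disease"),
    ("pneumonia", "disease"),
    ("appendectomy", "procedure"),
    ("chest x-ray", "procedure"),
    ("500 mg", "dosage"),
    ("10 ml", "dosage"),
    ("oral", "route"),
    ("intravenous", "route"),
]
texts = [text for text, _ in samples]
labels = [label for _, label in samples]

# Assumption: character n-grams within word boundaries, chosen here because
# entity mentions are short phrases. The abstract only says "TF-IDF".
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Classify unseen candidate mentions into one of the entity categories.
predictions = model.predict(["250 mg", "ibuprofen"])
for mention, category in zip(["250 mg", "ibuprofen"], predictions):
    print(f"{mention!r} -> {category}")
```

With a dataset this small the predictions are not meaningful; the sketch only shows how the two stages fit together. A realistic evaluation would hold out a test split and report per-category precision and recall.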

Published

2026-02-05

How to Cite

1.
Soujanya C, Sandhya Rani S. Clinical Text Mining for Entity Extraction Using Classical Machine Learning Approaches. J Neonatal Surg [Internet]. 2026 Feb. 5 [cited 2026 May 24];15(1):88-94. Available from: https://jneonatalsurg.com/index.php/jns/article/view/9964