
Professor of Biomedical Informatics
Chair, Department of Biomedical Informatics, Columbia University
My research focuses on
understanding and using the clinical information stored in the electronic
medical record. This theme has several components:
1. Data mining and knowledge discovery. Machine
learning and visualization are examples of techniques for uncovering knowledge from
vast clinical databases. My work focuses on testing and extending existing
discovery methods to improve their performance on clinical databases. Important
issues include training set size, data accuracy, data completeness, and
representation (e.g., how to accommodate diagnostic data, which is nominal with
many categories).
2. Natural language processing. In most
institutions, the vast majority of the richly detailed clinical information is
stored as narrative text, which is not generally amenable to automated
analysis. Natural language processing can parse the narrative text, converting
it to a structured and coded format. At present, natural language processors
can do an excellent job in domains such as radiology, which have fairly focused
language. In broader domains such as admission notes, natural language
processing can do very well if the problem is known ahead of time and the
processor can be tailored to the task.
3. Knowledge and data representation. With the
advent of natural language processing and the improvement in the direct
collection of structured data, we are overrun with complex coded information.
Methods are needed to organize the information for visualization (so human
beings can understand it) and analysis (so data mining tools can derive useful
knowledge). It has been shown, for example, that the representation of the training
set is more important to machine learning accuracy than the particular choice
of learning algorithm.
4. Evaluation methodology. The complexity of
clinical data, the presence of inaccurate and missing values, and the large but
heterogeneous collection of patients conspire to make it difficult to draw
conclusions using traditional statistical methods. Bias that would not affect a
traditional randomized trial can overwhelm the true effect in a retrospective
study of the electronic medical record.
5. Clinical demonstration. Demonstrating the
usefulness of the above methods is critical to gathering support and focusing new
work in important areas. The methods can be applied to clinical research
(largely hypothesis refinement) and clinical care (by generating timely advice
and monitoring patient safety).
The above work is carried out
within Columbia University’s Data Mining Group, which includes faculty
and students from several departments.
In a separate area of research, I
have focused on the use of new technology such as wireless networks and
handheld computers to improve communications among health care participants.
Examples include community health information networks, portable computers for
providers, home monitoring, and wearable computers for patients.
· Assessing the suitability of new data mining techniques for clinical data
· Issues in the use of machine learning training sets containing clinical data (accuracy, completeness, size)
· Issues in the representation of clinical data for data mining (complexity, nesting, etc.)
· Formal models of data accuracy
· Issues in de-identifying and scrubbing patient data for clinical research
· Evaluation methodology
o Application of reliability theory to the Delphi technique and to binomial models
o Use of the bootstrap to assess variability (e.g., in critical incident technique)
o Analysis of clustered data
o Characterizing performance (ROC curve; kappa and prevalence)
o Sample size analysis
· Use of admit diagnosis to predict the patient state
· Formal characterization of diagnostic uncertainty
· Mapping clinical states to practice guidelines
· Use of data mining in patient safety research (medical errors)
· Linking of the clinical database to genome knowledge bases and databases
· Use of data mining to assess the breadth of residency training
· Use of data mining to study community-acquired pneumonia
· Enhancing communication through wearable computers for patients and providers
· Temporal natural language processing: temporal tagging demo and toolkit
· G4003 Theory and Methods in Biomedical Informatics (lecturer 2005- )
· G4060 Evaluation Methods in Medical Informatics (1997-2004)
· Research Elective in Medical Informatics (1995-1998)
· G4001 Introduction to Computer Applications in Health Care and Biomedicine (formerly W4501) (1993-1995)
o Online lecture notes (no longer maintained)
I designed and manage WebCIS, the Web-based clinical information system for the Columbia University Medical Center and NewYork-Presbyterian Hospital’s Columbia-Presbyterian campus. It is used by more than 7,000 health care providers to access and enter data for 2,500,000 patients, and it contains data collected since 1979.
· WebCIS Web-based clinical information system
George Hripcsak, MD, MS
622 West 168th Street, VC-5
New York, NY 10032
hripcsak@columbia.edu