
EHR data harnessed to spot new risk factors for early-onset CRC



Machine-learning models that use routine data from the electronic health record (EHR) have identified new risk factors for early-onset colorectal cancer (CRC), according to a new study.

The models found that hypertension, cough, and asthma, among other factors, were important in explaining the risk of early-onset CRC. For some factors, associations emerged up to 5 years before diagnosis.

These findings were reported at the AACR Virtual Special Conference: Artificial Intelligence, Diagnosis, and Imaging (Abstract PR-10).

“The incidence of early-onset CRC has been rising 2% annually since 1994,” noted Michael B. Quillen, one of the study authors and a medical student at the University of Florida, Gainesville.

Inherited genetic syndromes and predisposing conditions such as inflammatory bowel disease account for about half of early-onset cases, but the factors explaining the other half remain a mystery.

To shed light on this area, the investigators undertook a study of patients aged 50 years or younger from the OneFlorida Clinical Research Consortium who had at least 2 years of EHR data. The study population comprised 783 patients with CRC and 8,981 incidence density-matched controls, with both groups having a mean age of 36 years.

The patients were split into colon cancer and rectal cancer cohorts, and then further divided into four prediction windows, Mr. Quillen explained. Each prediction window started with the patient’s first recorded encounter date in the EHR and ended at 0, 1, 3, or 5 years before the date of diagnosis.
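As a rough illustration of how such windows could be derived, the sketch below computes the start and end date of each window for one patient. It is not the investigators' code: the function names, the leap-day handling, and the choice to skip a window that would end before the first encounter are all assumptions made for the example.

```python
from datetime import date

# Offsets, in years before the diagnosis date, as described in the article.
WINDOW_OFFSETS_YEARS = (0, 1, 3, 5)


def years_before(d: date, years: int) -> date:
    """Return the date `years` years before `d` (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        # Only reached when d is Feb 29 and the target year is not a leap year.
        return d.replace(year=d.year - years, day=28)


def prediction_windows(first_encounter: date, diagnosis: date) -> dict:
    """Map each offset to a (start, end) pair of dates.

    Assumption for this sketch: a window that would end on or before the
    first recorded encounter holds no data and is skipped.
    """
    windows = {}
    for offset in WINDOW_OFFSETS_YEARS:
        end = years_before(diagnosis, offset)
        if end > first_encounter:
            windows[offset] = (first_encounter, end)
    return windows


# Hypothetical patient: first encounter in March 2012, diagnosis in June 2016.
example = prediction_windows(date(2012, 3, 1), date(2016, 6, 15))
```

For this hypothetical patient, the 0-, 1-, and 3-year windows exist, while the 5-year window is dropped because it would end before the first encounter.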

The investigators used machine-learning models to determine what features (e.g., diagnoses, procedures, demographics) were important in determining risk.

Results were expressed in charts that ranked the features by their SHAP (SHapley Additive exPlanations) values, which reflect the average impact of a feature on the magnitude of model output.
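A ranking of this kind is commonly built by averaging the absolute SHAP value of each feature across patients. The sketch below shows that calculation on a small invented matrix; the feature names and numbers are placeholders, not study data, and the function name is hypothetical. It assumes the per-patient SHAP values have already been produced by some explainer.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value, largest first.

    shap_rows: one list of per-feature SHAP values for each patient.
    """
    n_patients = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            # The sign shows direction; the magnitude shows impact.
            totals[j] += abs(value)
    scores = {name: total / n_patients for name, total in zip(feature_names, totals)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented values for two patients and three placeholder features.
toy_shap = [
    [0.5, -0.25, 0.0],
    [-0.5, 0.25, 0.25],
]
ranking = rank_by_mean_abs_shap(toy_shap, ["feature_a", "feature_b", "feature_c"])
```

Because absolute values are averaged, a feature that pushes predictions up for some patients and down for others still ranks highly, which is why such charts show importance rather than direction of effect.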

Results: Top models and features

The top-performing models had areas under the curve of 0.61-0.75 for colon cancer risk, and 0.62-0.73 for rectal cancer risk, reported T. Maxwell Parker, another study author and medical student at the University of Florida, Gainesville.
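To put those numbers in context, the sketch below assumes the reported values are areas under the receiver operating characteristic curve (the article does not specify). Under that assumption, the area equals the probability that a randomly chosen case receives a higher model score than a randomly chosen control, with ties counted as half. The scores are invented for illustration.

```python
def pairwise_auc(case_scores, control_scores):
    """ROC AUC computed as the share of case-control pairs ranked correctly."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(case_scores) * len(control_scores))


# Invented scores: two cases and three controls.
toy_auc = pairwise_auc([0.9, 0.4], [0.5, 0.3, 0.4])
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so values between 0.61 and 0.75 indicate modest to moderate discrimination.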

For colon cancer, the top features for the 0-year cohort included some highly specific symptoms that would be expected in patients close to the diagnostic date: abdominal pain, anemia, blood in the stool, and various procedures such as CT scans. “These do not need a machine learning algorithm to identify,” Mr. Parker acknowledged.

However, there were also two noteworthy features present – cough and primary hypertension – that became the top features in the 1-year and 3-year cohorts, then dropped out in the 5-year cohort.

Other features that became important moving farther out from the diagnostic date of colon cancer, across the windows studied, were chronic sinusitis, atopic dermatitis, asthma, and upper-respiratory infection.

For rectal cancer, some previously identified factors – immune conditions related to infectious disease (HIV and anogenital warts associated with human papillomavirus) as well as amoxicillin therapy – were prominent in the 0-year cohort and became increasingly important going farther out from the diagnostic date.

Obesity was the top feature in the 3-year cohort, and asthma became important in that cohort as well.

None of the rectal cancer models tested performed well at identifying important features in the 5-year cohort.

The investigators are exploring hypotheses to explain how the identified features, especially the new ones such as hypertension and cough, might contribute to CRC carcinogenesis in young adults, according to Mr. Parker. As inclusion of older patients could confound associations, research restricted to those aged 50 years and younger may be necessary.

“We would like to validate these model findings in a second independent data set, and if they are validated, we would consider a prospective cohort study with those features,” Mr. Parker said. The team also plans to refine the models with the aim of improving their areas under the curve.

Thereafter, the team hopes to explore ways to implement the findings clinically to support screening, which will require consideration of the context, Mr. Parker concluded. “Should we use high-sensitivity or high-specificity models for screening, or do we use the balance of both? Also, different models may be suitable for different situations,” he said.
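The trade-off behind that question can be seen by applying different cutoffs to the same risk scores. The sketch below uses invented scores and labels, not study data, and a hypothetical helper function; it simply counts how sensitivity and specificity shift as the cutoff moves.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are flagged."""
    tp = fn = tn = fp = 0
    for score, is_case in zip(scores, labels):
        flagged = score >= threshold
        if is_case and flagged:
            tp += 1
        elif is_case:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Invented risk scores; label 1 marks a case, 0 a control.
toy_scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]

low_cutoff = sensitivity_specificity(toy_scores, toy_labels, 0.35)
high_cutoff = sensitivity_specificity(toy_scores, toy_labels, 0.8)
```

In this toy example the low cutoff flags every case but also flags one control, while the high cutoff flags no controls but misses two of the three cases.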

Mr. Parker and Mr. Quillen disclosed no conflicts of interest. The study did not receive specific funding.
