Big Models Enter the Medical Field: Can AI Replace Doctors?

Author: Dong Hui

Published in China Newsweek magazine on August 21, 2023

"In the past year, I have always been hungry, eaten more, and yet lost weight," Ling Ken, an anesthesiologist at Wuhan Union Medical College Hospital, typed on his computer. For now he is playing a patient whose only job is to test the skill of a "doctor".

"Do you have any past medical conditions, such as diabetes or thyroid disease? Are there any similar cases in your family? Do you have a history of drug allergies or surgeries?" the "doctor" on the other end of the screen replied.

Ling Ken's conversation partner is not a real person but MedGPT, a large language model consultation AI developed by the Internet medical company Medical Union. Since the release of ChatGPT, companies in China and abroad have joined the wave of research and development on medical large language models. Internet giants such as Tencent and Baidu, technology companies such as Huawei, iFLYTEK, and SenseTime, and Internet medical companies such as Medical Union and Chunyu Doctor have one after another announced plans for vertical-domain large models.

In July, the research team behind Google's medical consultation AI Med-PaLM published its results in the journal Nature; clinicians judged 92.6% of Med-PaLM's long-form answers to be consistent with scientific consensus. The strong performance of "AI doctors" has also prompted more discussion and concern: Has AI reached the level of replacing doctors? How can its accuracy be ensured? If an AI diagnosis goes wrong, who is to blame for the error?

The conversation between Ling Ken and MedGPT continued. After asking about his medical history, family history, and allergy history, the "doctor" asked about the extent of his weight loss, other symptoms, sleep quality, eating habits, blood pressure and other information, and finally issued an examination plan requiring Ling Ken to have his blood sugar and thyroid function checked. Ling Ken entered the test results he had prepared, and a dozen or so seconds later MedGPT gave its diagnosis: hyperthyroidism. The answer was correct.

In the face of non-medical information, "the more you talk, the more crooked"

Doctors are no strangers to AI. In 2017, the first batch of domestic medical AI products entered hospitals through scientific research cooperation, and since 2018 these products have been approved one after another by the State Food and Drug Administration. As of the end of May this year, the State Food and Drug Administration has approved 5 medical AI auxiliary diagnosis software products for market. Liu Shiyuan, director of the Department of Diagnostic Radiology of Shanghai Changzheng Hospital, once said that the most mature applications are auxiliary diagnosis of pulmonary nodules and coronary imaging, while AI auxiliary diagnosis software for areas such as orthopedics and the brain is not yet in routine use.

Take coronary CT angiography (coronary CTA) as an example: a single examination produces hundreds of images, in which the doctor must check whether the blood vessels are narrowed or contain plaque. AI can reduce the processing time of each image from 45 minutes to 5 minutes.

In hospitals that have introduced clinical decision support systems (hereinafter referred to as CDSS), AI can also help medical staff make clinical decisions. CDSS is a computer-aided information system that comprehensively analyzes medical knowledge and patient information and provides a variety of help for medical staff in clinical diagnosis and treatment. From April to May 2020, the Hospital Management Institute of the National Health Commission surveyed 4,5 medical institutions in 31 provinces across the country, of which 1013.19% had CDSS.

But these products do little to improve doctors' diagnoses. A number of doctors and resident physicians interviewed told China Newsweek that because the types of patients admitted to a department are relatively fixed and the handling process is mature, they basically do not use CDSS as a reference; when they run into an uncertain problem, they consult senior doctors directly or discuss it within the department. Moreover, current CDSS is still very "rigid": when it automatically reviews medical orders, it "corrects" off-label medication. "But often we stick to our medication," said a resident physician at a tertiary hospital.

You Mao, deputy director of the Health Development Research Center of the National Health Commission, said at the National Medical Device Safety Publicity Week and Artificial Intelligence Standard Publicity Conference in July that one of the current difficulties in AI medical care is the serious homogeneity of technological development, with the advantages of data and algorithms yet to show. 7% of the research or output of AI medical devices in China is in medical imaging, while research on "medical robots", "knowledge bases" and "natural language processing" is relatively insufficient, and research on "decision rules" is almost blank.

"In fact, it is not a research gap; there are simply many restrictions on turning the research into products," a university scholar who has studied natural language processing in the medical field for ten years told China Newsweek. She said that imaging devices such as X-ray machines, CT equipment, and magnetic resonance machines are essential to medical institutions, and AI auxiliary diagnosis software can be installed on imaging equipment, so it enters medical institutions more easily than software that processes text data. In addition, image data is more self-contained and easier to desensitize than medical text data, and more image databases have been made public, whereas publicly available high-quality medical text data is very limited, which holds back research in areas such as "natural language processing".

The emergence of ChatGPT has shown companies the new opportunities that large language models bring to AI consultation.

Wang Shirui, founder and CEO of Medical Union, said that the company had previously developed medical AI products including oral image recognition and psychiatric digital therapeutics (DTx), but could not achieve AI diagnosis and treatment across the whole process. "There was an insurmountable gap: the recognition of natural semantics." Wang Shirui said that before large language models appeared, technologies such as knowledge graphs could also support human-machine dialogue, but the reasoning and context understanding of dialogue robots were still insufficient, and it was difficult to translate between ordinary human language and medical terminology.

MedGPT entered development in January this year and launched in May, with a parameter count on the order of 1 billion. It is positioned to break out of the "human asks, AI answers" mode: like a real doctor, it can actively question patients over multiple rounds about symptoms and other information, infer the type of disease the patient may have, and issue a list of tests. After the patient enters the examination data, the AI can go on to read the data and suggest a treatment plan.

Currently, MedGPT is not open to the public. Ling Ken, who took part in the internal test, spent an hour interacting with MedGPT, asking whether anesthesia would affect a patient's IQ and walking through a complete diagnosis of hyperthyroidism, among other questions. Ling Ken told China Newsweek that MedGPT asked detailed questions and replied more amiably than a real doctor, "but it is far from replacing a doctor."

He explained that the most prominent problem in his experience was that MedGPT did not handle non-medical information well. If a patient confides non-medical information such as family circumstances, as happens in real consultations, MedGPT cannot extract the core information, and "the more you talk, the more crooked." Wang Shirui said that the patient's language need not be concise, but the AI can give an accurate response only if the patient answers the medical questions it raises.

In contrast, Chunyu Doctor has moved more cautiously. In May, Chunyu Doctor opened its large-model online consultation product Chunyuhuiwen for free use. Unlike MedGPT, which issues examination orders and diagnoses, Chunyuhuiwen tells patients, after fewer rounds of inquiry, about the various diseases that may correspond to their symptoms and how to respond, and then concludes with "If you are in a serious condition, it is recommended that you seek medical attention in time and seek the help of a professional doctor".

"Just like automatic driving, it is difficult to achieve full automatic driving right away, but can we have automatic parking and assisted reversing? These functions are very useful in themselves, the difficulty of research and development is much lower, and the safety requirements are much lower," Zeng Baiyi, CTO of Chunyu, said, explaining why the company is not attempting precise diagnosis and treatment for the time being.

Zeng Baiyi said frankly that Huiwen is more of an experiment in Chunyu's exploration of application scenarios for large models, and its positioning is not yet clear. "We also want to see what users in the market want, how they are willing to use AI diagnostic products, and what kinds of questions they will ask the AI." Background data shows that from the launch in May to the end of July, more than 5,7 people used Huiwen, of which about 5000% turned to real doctors for help during use. Zeng Baiyi said that Chunyu is developing AI consultation products with more detailed inquiry processes, which are planned for use in consultations with real doctors.

Another way to bring medical large language models into use is to cooperate directly with hospitals and integrate with offline diagnosis and treatment. Tian Feng, president of SenseTime's Intelligent Industry Research Institute, told China Newsweek that SenseTime cooperates with the First Affiliated Hospital of Zhengzhou University and Xinhua Hospital affiliated to Shanghai Jiao Tong University School of Medicine; its medical large language model "Great Doctor" comes in sizes ranging from one billion to hundreds of billions of parameters and is already used in the follow-up process of some hospitals. Tian Feng said that, compared with traditional AI telephone follow-up robots, the follow-up system based on the large model understands better, interacts in a more human way, and collects information more comprehensively.

The hardest thing to obtain is real consultation data

How to make consultation AI err less, or not at all, is the first problem every R&D team must solve.

In essence, a large language model predicts the next likely word in a conversation through statistical analysis, so it can generate inaccurate or misleading information. In medicine, where accuracy is strictly required, an AI error puts the patient at risk.
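The idea of predicting the next word from statistics can be illustrated with a toy sketch. The three-sentence corpus and the function names `train_bigram` and `next_word_probs` are invented for illustration; real large language models use neural networks trained on enormous text collections, not simple word-pair counts, so this only conveys the principle.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    follow = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follow[prev][nxt] += 1
    return follow

def next_word_probs(follow, word):
    """Turn the follower counts for `word` into probabilities."""
    counts = follow.get(word.lower(), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # word never seen with a follower: nothing to predict
    return {w: c / total for w, c in counts.items()}

# Hypothetical training text, made up for this example.
corpus = [
    "weight loss suggests hyperthyroidism",
    "weight loss suggests diabetes",
    "weight loss suggests hyperthyroidism",
]
model = train_bigram(corpus)
probs = next_word_probs(model, "suggests")
print(probs)
```

The model prefers "hyperthyroidism" only because that word followed "suggests" more often in its training text, not because it has checked the patient. That is why a statistically likely answer can still be a wrong one.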

In 2021, researchers at the University of Michigan School of Medicine found that the sepsis early warning AI developed by Epic Systems, an American electronic health record company, failed to identify 67% of hospitalized sepsis patients and flagged only 7% of the sepsis patients whom doctors had missed. According to Epic, the missed detections are related to the system's thresholds: an alert threshold must be set that balances false negatives against false positives.
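The trade-off behind an alert threshold can be shown with a small sketch. The risk scores and labels below are invented, and `confusion_at` is a hypothetical helper; nothing here reflects how Epic's system actually works. The sketch assumes only that a warning system outputs a risk score per patient and raises an alert when the score reaches the threshold.

```python
def confusion_at(scores, labels, threshold):
    """Return (missed_cases, false_alarms) when alerting at score >= threshold."""
    missed = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    false_alarms = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return missed, false_alarms

# Made-up risk scores for eight patients; label 1 means the patient has sepsis.
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.5, 0.7):
    missed, false_alarms = confusion_at(scores, labels, threshold)
    print(f"threshold={threshold}: missed={missed}, false alarms={false_alarms}")
```

On this toy data, raising the threshold from 0.3 to 0.7 removes the false alarms but causes two of the four sepsis patients to be missed, while lowering it does the opposite. No threshold eliminates both kinds of error at once.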

High-quality data is the basis of accuracy. Medical large language models are additionally "fed" with medical books, clinical guidelines, medical papers and other professional knowledge. The most important and the hardest to obtain is excellent real consultation data, including top experts' diagnostic records for a disease and multi-dimensional information such as patients' physical characteristics, test data, family history, and environment; it also needs to cover patients of all ages, genders, and regions.

A number of experts and practitioners interviewed said that the existing consultation data could not fully meet the research and development needs. Liu Guoliang, chairman of the Medical Artificial Intelligence Expert Committee of the National Telemedicine and Internet Medicine Center and an expert in respiratory diseases, told China Newsweek that even if the current clinical data of the hospital can be collected, its quality has not reached the level that can be used for AI training, and it is necessary to specifically produce clinical consultation data that meets the AI training standards.

Still more clinical experience may never be documented. "Especially for difficult diseases, a lot of knowledge is in doctors' heads. Even hospitals may not have it; it is passed on by word of mouth," Zeng Baiyi said.

Wang Shirui said that Medical Union uses three types of real consultation data: public data, Medical Union's own consultation data, and data collected through a dedicated data platform. Medical Union collects the third type from associations, hospitals, and experts. "This is like moving oil from the ground into the tank, which involves a long and complex process."

The aforementioned university scholar stressed that data quality is very important for research, but the precondition is data security. Data collection and screening must rest on protecting data security, and desensitizing personal information and protecting patient privacy are the first steps. Medical Union, Chunyu Doctor and SenseTime all said they desensitize the data and obtain patient consent before use.

In addition to data, model design can also improve the accuracy of medical AI. Tian Feng said that SenseTime has set up a team of nearly 100 medical experts to participate in data annotation, model training and testing to ensure that AI can complete multiple rounds of consultation and does not answer patients' non-medical questions. SenseTime has also trained an "intelligent evaluation system" to evaluate the answers output by large language models, so that the model outputs answers that are more in line with clinical professional requirements and human values.

However, there are limits to how far medical AI can be tuned. Liu Guoliang believes that the most fundamental difference between AI and real doctors is that the two may follow different principles during diagnosis and treatment. It is uncertain whether the AI's key measure is the patient's length of life, a better quality of life, or something unrelated to human well-being altogether. A good doctor attends to the treatment plan while also taking the patient's emotions, expenses, and family situation into account, which is currently difficult for medical AI to do.

In addition, medical AI relies mainly on what patients report during consultation and lacks a physical examination. On the one hand, somatic diseases may affect how patients feel, so the feelings they express do not match the severity of the disease; on the other hand, different diseases have similar symptoms, and it is difficult to reach an accurate result by asking questions alone.

Xue Feng, chief physician of the Department of Orthopedics at Peking University People's Hospital, told China Newsweek that many medical questions still have no clear answers, and many doctors also rely on experience and cannot achieve 100% accuracy, let alone an AI that reasons from human experience. "At this stage, having it see patients can only serve as a consultation, an aid. The final judgment must be handed to a real doctor, and AI needs to keep learning and optimizing."

Many of the practitioners and experts interviewed said that AI cannot replace doctors and should not have the right to prescribe. Once diagnosis and prescription are involved, a real doctor must take part; otherwise there is the problem of who is responsible when the AI is wrong: the AI, the company that developed it, or the hospital or doctor that purchased the product. Ethical issues may also arise when the AI and the doctor disagree, for example when a patient wants to follow the AI's advice and take a test that is very expensive and not reimbursed by medical insurance, while the doctor considers it unnecessary.

According to a Wall Street Journal report in June, nurse Melissa Bibby has worked with cancer patients for 6 years in the oncology department of UC Davis Medical Center. When the AI early warning system flagged one of her patients as having sepsis, she was convinced the alarm was false, because the AI did not know that leukemia patients can also show sepsis-like symptoms.

Under hospital rules, Bibby can override the AI's diagnosis after obtaining a doctor's approval, but if she is wrong, she faces punishment. In the end, she had to draw the patient's blood in line with the AI's diagnosis, even though doing so could expose the patient to further infection and make treatment more expensive.

How will future clinical practice ensure that physicians take part in supervising AI? Xue Feng said there are two scenarios: in the first, doctors remain responsible for prescribing, and AI handles only the preliminary inquiry and information collection; in the second, AI writes the prescription, but a doctor must review the treatment plan, at least confirm that the drugs are harmless, and sign off, and if there is a problem, the signing doctor is still responsible.

A new tripartite relationship

At the end of June, Medical Union held a "double-blind experiment" in Chengdu, in which MedGPT and 6 attending doctors of West China Hospital in Sichuan Province diagnosed more than 10 patients to evaluate the consistency between the AI and real doctors; in the end 120 valid cases were reviewed by multiple experts. Liu Guoliang and Xue Feng, who both took part in the review, said that MedGPT performed slightly better than expected and made few errors, but that some problems remained.

Xue Feng said that MedGPT's consultation logic is still very simple when it faces complex conditions. He explained that a disease usually has a set of symptoms, a single symptom may correspond to dozens or hundreds of diseases, and patients often mention only the one or two most serious symptoms when stating their chief complaint. When diagnosing by exclusion, a real doctor keeps asking about possible associated symptoms and finally narrows the possibilities according to the patient's answers, whereas MedGPT still falls short in comprehensively linking different symptoms.

Wang Shirui said that beyond improving accuracy, Medical Union's next step is to integrate multimodal capabilities to make up for the inability to perform physical examinations. For example, MedGPT could be "equipped with eyes", using video-based motion trajectory recognition to handle orthopedic examinations. In late July, Google launched a new general-purpose biomedical AI model, Med-PaLM M, which in addition to answering medical questions can also examine X-ray images and even scan DNA sequences for mutations.

Consultation AI also faces the question of regulation. Previously, the Guidelines for the Registration and Review of Artificial Intelligence Medical Devices (Draft for Comments) issued by the Device Review Center of the State Food and Drug Administration stipulated that medical devices that are based on medical device data and use artificial intelligence technology to achieve their intended use must be approved by the Food and Drug Administration before they are marketed. Medical device data includes image data, physiological parameters, in vitro diagnostic data and the like, while electronic medical records and the text of medical examination reports are non-medical device data.

Take MedGPT as an example: although it relies mainly on the patient's own account, it also issues examination orders to the patient and recommends treatment based on blood glucose, blood pressure and other data. Wang Shirui said that under the current regulatory system it is difficult to determine whether it is a medical device, and the relevant departments may introduce a new regulatory framework for such new products.

On July 13, the Cyberspace Administration of China and six other departments jointly announced the Interim Measures for the Management of Generative Artificial Intelligence Services (hereinafter referred to as the Measures). The Measures, which came into force on August 15, 2023, encourage the innovative development of generative AI and require products with "public opinion attributes or social mobilization capabilities" to undergo security assessment and complete algorithm filing before providing services to the public. Companies differ on whether consultation products based on generative AI should apply for security assessment and algorithm filing. The aforementioned scholar said that the Measures set a legal and compliant framework for medical AI, but it is not yet clear how supervision of medical AI will be implemented or how standards will be formulated.

"The most critical and essential purpose of standardization is to establish the best order." The scholar said that setting standards for innovative products is a slow process, and how to set them, and how high, needs constant exploration. Many of the practitioners interviewed said that medical large language models still have a long way to go from research and development to clinical practice, but they also agreed that AI will certainly be part of the future medical landscape.

AI can shift the medical model toward community and family doctors. Xue Feng said that more than 90% of outpatient visits involve common diseases that family doctors can handle, but medical resources are currently unbalanced, and the gap in medical standards between tertiary hospitals and grassroots hospitals is too wide, which leads patients to distrust community hospitals.

Xue Feng said that if AI becomes a family doctor for patients, patients can reduce the burden on medical institutions by consulting AI in advance, and at the same time, they can increase their initial understanding of the condition and find the direction of medical treatment. "Such a medical model helps standardize medical care and reduce over-treatment or medical fraud." Xue Feng said.

In doctor-oriented scenarios, AI can do more. A number of experts interviewed said that AI can become an assistant to help doctors learn cutting-edge treatment options for incurable diseases, reduce the rate of misdiagnosis, and participate in medical training to help young doctors and grassroots doctors with insufficient medical ability grow. A medical institution in Boston, USA, has started using ChatGPT to train trainees. "Because medical training sometimes does not have right or wrong, but exercises the doctor's way of thinking, interpretation of results, communication, etc., these abilities can be trained alone (with AI)." Liu Guoliang said.

More immediately, AI could free doctors from paperwork. A resident physician at a tertiary hospital in Zhejiang told China Newsweek that when admitting new patients, writing up the initial diagnosis takes a lot of time. Since February this year, he has tried having ChatGPT help him write the differential diagnosis, "because sometimes the diagnosis is clear, and it is annoying to rack your brain over the differential diagnosis. I just throw the question to ChatGPT and tell it that I want a concise differential diagnosis of two diseases, and it lists several points for me."

Peter Lee, a senior vice president at Microsoft, and two co-authors depict a new doctor-patient relationship in "Beyond the Imagination of GPT Healthcare": in traditional medicine, doctor and patient form a two-way relationship, but now we should move to a three-way relationship in which AI is the third pillar.

China Newsweek, 2023, Issue 31

Disclaimer: China Newsweek manuscripts may be republished only with written authorization.