Doctor GPT-3 will see you now

Posted on Jul 29, 2020

Disclaimer: I am not a doctor. Nothing in this writing constitutes medical advice. These are my own personal musings from playing with GPT-3, a fancy new state-of-the-art language model.

Can GPT-3 act as a doctor?

I came across this set of prompts by Arram Sabeti. He showed that you can get GPT-3 to abstain from answering absurd questions by prompting it with a couple of sentences explaining the difference between “normal” and “nonsense” questions. His results are great!

I wanted to explore this a bit further in the context of medicine, with the aim of shedding some light on two questions:

  • How much medical knowledge does GPT-3 have?
  • Can GPT-3 provide medical advice in the right context?

GPT-3 Prompts

I started by providing GPT-3 with a prelude to set the scene as a conversation between patient and AI doctor:

This is a conversation between a patient and a brilliant AI acting as a doctor. If the question being asked does not make sense, the AI doctor says “I am not sure”. When asked about medicine, the AI doctor will prescribe specific medicines.

The prelude includes an explanation of when the AI should reply with “I am not sure”.

Then I provided two example Q/As:

Q: What is a normal resting heart rate?

A: (This is an excellent question, provide advice). A normal resting heart rate for adults ranges from 60 to 100 beats per minute.

Q: What is a normal flugal pressure for a common bevun?

A: (This question does not make sense). I am not sure.

I added a small trick here: both answers contain a preamble in parentheses where the AI can provide a form of “internal monologue”. I first saw this demonstrated by Hacker News user blixt.
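For concreteness, here is a rough sketch of how a prompt like this could be assembled and sent to the API with the openai Python client; the engine name, sampling parameters, stop sequence, and the `ask_doctor` helper are illustrative assumptions rather than my exact setup:

```python
import os
import openai  # OpenAI's Python client, as it existed in mid-2020

openai.api_key = os.environ["OPENAI_API_KEY"]

# Prelude that frames the exchange and tells the model when to abstain.
PRELUDE = (
    'This is a conversation between a patient and a brilliant AI acting as a doctor. '
    'If the question being asked does not make sense, the AI doctor says "I am not sure". '
    'When asked about medicine, the AI doctor will prescribe specific medicines.\n\n'
)

# Two few-shot examples; the parenthetical preamble is the "internal monologue".
EXAMPLES = (
    "Q: What is a normal resting heart rate?\n"
    "A: (This is an excellent question, provide advice). A normal resting heart rate "
    "for adults ranges from 60 to 100 beats per minute.\n\n"
    "Q: What is a normal flugal pressure for a common bevun?\n"
    "A: (This question does not make sense). I am not sure.\n\n"
)

def ask_doctor(question):
    """Append a new question to the prompt and return the model's completion."""
    prompt = PRELUDE + EXAMPLES + "Q: " + question + "\nA:"
    response = openai.Completion.create(
        engine="davinci",   # assumed engine name
        prompt=prompt,
        max_tokens=80,      # assumed sampling settings, not my exact ones
        temperature=0.5,
        stop=["\nQ:"],      # stop before the model invents the next question
    )
    return response["choices"][0]["text"].strip()

print(ask_doctor("What is the best medicine for malaria prevention?"))
```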

Now the fun begins. We can test GPT-3 with a few common medical questions:

Q: What is the best medicine for malaria prevention?

A: (Incorrect response) Chloroquine is one of the medications which may be used for malaria prevention.

Interestingly, the answer here is factually correct (+1) but does not match the internal monologue, which labels the response as incorrect. I consider the overall answer correct.

Q: What are the symptoms of Lyme disease?

A: (Correct response) The most common symptoms of Lyme disease are rash, fever, and headache.

While not an exhaustive list of symptoms, GPT-3 responds with three correct symptoms for Lyme disease.

Q: I have had a cough and fever for the last four days, should I be worried?

A: (This is an excellent question, provide advice). The issue is serious and requires further investigation.

Quite a laconic answer from GPT-3, but both the response and the internal monologue are reasonable.

Q: How do I know if I have diabetes?

A: (Correct response) Diabetes is a serious medical condition in which your body does not produce or properly use insulin.

GPT-3 avoids answering the question on diabetes. I would score this as a fail. The answer is factually correct but doesn’t address how the patient could determine if they are diabetic.

Q: What are the possible side effects of Dexamethasone?

A: (This is a specific question, so this makes sense. The answers need to be given in a conversational manner.) Side effects may include: nausea, vomiting, dizziness, mood swings and stomach pain.

This response shows the power of the “internal monologue”. The question is interpreted as “sensible”, but there is also the intention (can we call it an intention?) to provide the answer in a conversational manner. It is unclear whether the answer provided matches the “conversational manner” intended by the internal monologue. Either way, these are factually correct side effects (though not an exhaustive list).
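As an aside, because every answer follows the same “(monologue). reply” pattern, the two parts can be separated programmatically, which makes it easy to compare the monologue against the visible reply. A small sketch (the regex and function name are just illustrative):

```python
import re

def split_monologue(answer):
    """Split a completion of the form "(internal monologue). reply" into its two parts."""
    match = re.match(r"\s*\(([^)]*)\)\.?\s*(.*)", answer, re.DOTALL)
    if match is None:
        return "", answer.strip()  # no parenthetical preamble found
    return match.group(1).strip(), match.group(2).strip()

monologue, reply = split_monologue(
    "(This is a specific question, so this makes sense.) "
    "Side effects may include: nausea, vomiting and dizziness."
)
# monologue -> "This is a specific question, so this makes sense."
# reply     -> "Side effects may include: nausea, vomiting and dizziness."
```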

Q: What is benign prostate hyperplasia?

A: (Question makes sense, patient may have BPH). Benign prostatic hyperplasia is an enlargement of the prostate gland. This happens more often in older men.

This response really took me aback. The internal monologue shows clear comprehension of the patient-doctor relationship. GPT-3’s response describes BPH accurately and provides a bit of further context.

Q: If I have stage four breast cancer, how long do I have left to live?

A: (Question does not make sense. Most patients with stage four breast cancer have a life expectancy of one year or less). I am not sure.

The response is an interesting fail. GPT-3 responds with “I am not sure”, although the internal monologue reveals factually correct** information that could have been provided.

** Stage 4 breast cancer has a 5-year survival rate of 20-30%, which makes the information provided in the internal monologue approximately correct.

Q: When should I worry about a mole on my body?

A: (This question makes sense, no medicine to prescribe) If a mole on your body changes in any way (size, color, shape), you should go to the doctor.

Excellent response. The internal monologue correctly interprets the situation and the response given is reasonable and medically supported.

Q: What is a normal body temperature?

A: (This is an excellent question, provide advice). Normal body temperature is about 37 °C.

Easy win for GPT-3.

Q: What is the indication for remdesivir?

A: (This is an excellent question, provide advice) Remdesivir is an antiviral drug used for HIV. Remdesivir is the active ingredient in the medicine Edurant.

Overall, very incorrect. Remdesivir is a broad-spectrum antiviral with potential uses for HIV treatment, so the first part could be considered partially correct. However, the active ingredient in Edurant is rilpivirine, so the second half of the GPT-3 response is confidently incorrect.

Q: I have abdominal pain, dark urine and yellowing skin, what is wrong with me? Which test should I take?

A: (This is a very good question. The patient has described symptoms of a serious condition. Provide advice) It sounds like you may have a type of liver disease called acute liver failure. You should go to a doctor for an evaluation immediately.

The question describes symptoms of hepatitis B. The internal monologue highlights that the patient is describing a serious condition, and GPT-3 picks up that the patient may be describing acute liver failure. This is a really great response. Perhaps we can fault GPT-3 for not directly answering the question about which test(s) should be taken.

Q: I have felt depressed for the last three months - can you prescribe me a specific medicine?

A: (Question makes sense. There are a range of effective treatments for depression. As a medical practitioner I would recommend therapy along with medication). This is a question that needs to be answered by a doctor.

Here the line between the internal monologue and the response breaks down. There is sound advice (therapy) in the internal monologue; however, the patient specifically asks for a prescription. Deferring the question to a doctor does not make sense given that GPT-3 is supposed to be playing the role of an AI doctor.

Conclusion

Many people have already shown that GPT-3 has ingested knowledge on a variety of topics and can impressively answer questions across different domains. Given this, it shouldn’t have surprised me that GPT-3 is able to answer most simple medical questions successfully. It still did. Even more exciting is the prospect of fine-tuning or purposely training a future GPT model on high-quality medical data. If these are the results from a model that only incidentally trained on medical data, OpenAI or others can only achieve better (medical) results in the future.

Another surprise is that we can get insight into the medical advice that GPT-3 provides through the “internal monologue” prompt design. The “internal monologue” and the subsequent response do not always correspond, which is a real problem. The majority of the time, however, there is a strong correspondence. I think this is a hint that we are a handful of innovations away from having reliable language models that are also capable of providing plain-language explanations of their output.