Reza Hosseini Ghomi, MD, MS on Detecting and Tracking Depression Through Voice Samples

A study recently published in the journal Depression and Anxiety suggests automatic language processing may be able to play a role in detecting and tracking depression.

In this video, researcher and neuropsychiatrist Reza Hosseini Ghomi, MD, MSE discusses what led up to the study, its design and key findings, and the possible clinical applications of the findings.

Dr. Ghomi focuses on neurodegenerative disorders in his work at EvergreenHealth, a health care system in the Seattle, Washington metropolitan area. He is also a partner at Avicenna Telepsychiatry, chief medical officer for Braincheck, a faculty member in the University of Washington's Department of Neurology & Institute for Neuroengineering, and an affiliate at the university's eScience Institute.

Additional Resources:

Zhang L, Duvvuri R, Chandra KKL, Nguyen T, Ghomi RH. Automated voice biomarkers for depression symptoms using an online cross-sectional data collection initiative. Depression and Anxiety. 2020 May 7;[Epub ahead of print].


My name is Reza Hosseini Ghomi. I am a neuropsychiatrist practicing clinically about half-time and doing research the other half. My practice is primarily a memory clinic, with some general psychiatry, at my local public health hospital, EvergreenHealth, here in the Seattle area. I also practice telepsychiatry across the state of Montana.

I'm involved in research across a few different areas, including at the University of Washington and in some private sector projects I have going. The work I'm speaking about today was done with NeuroLux, a voice diagnostics company. Although I'm not working with them day-to-day now, I previously did research with them for 4 years.

What led you and your colleagues to conduct this research?

When we first started this, the goal was to provide a larger and more sensitive data set, and to introduce some voice technology that could actually help us as clinicians, and potentially patients directly, with tracking and detecting depression symptoms.

There've been a number of papers that have come out looking at various data sets, all with limitations, of course, as we all know in research, and many of them very small. A lot of the research results have been hard to interpret with such small data sets.

There are also limitations in terms of generalizability: research that's collected in a very controlled environment versus naturally in the real world. The idea for this study came about when we wanted to capture what people provide in terms of data from their homes, from their own environments, not in a controlled, clinical setting.

Also, could we actually find a way that is somewhat easier and less costly to scale, to collect larger data sets? That's how the idea came about. We were also able to build a relationship with an organization that has a huge footprint in the depression world, and that's Mental Health America, a very large mental health nonprofit that operates nationally.

They happened to have the website that, at least from what I could find at the time, had the most traffic related to depression. When we spoke, their traffic was pretty consistently on the order of 40,000 to 60,000 completions of a PHQ‑9 online every month. By far, that was the largest data stream I was aware of. Compare that with PHQ‑9 studies, which typically have a few dozen people, maybe 100 or 200 people.

As we talk about machine learning and artificial intelligence, look at how much things have progressed in voice technology. Whether it's Siri data, DataTalk, Google Voice, Microsoft, or any of those options, they've been able to improve because they have access to millions and millions, and actually billions, of data points.

In health care, and mental health specifically, it's a lot harder to get data of that size. One of our goals was, how do we get access to larger data sets, so that we can actually use some of these tools in a reliable fashion?

Please briefly describe how the study was conducted and your key findings.

We conducted the study in as simple a way as possible. Our IRB, of course, was involved.

We designed it to be anonymous, and we also wanted to keep it as low-cost as possible. The way we did it was to build on what Mental Health America already has: a great website with traffic bringing people in, mainly from search results. If you're on a search engine and you type in "I'm feeling depressed," you will frequently end up at Mental Health America, depending on what you type in. That can bring you to their PHQ‑9 page that says, "Fill out this questionnaire."

They collect some optional demographics, they collect the PHQ‑9, and at the end of that, what made the most sense was having a little advertisement, a little banner that says, "Hey, if you're willing, think about enrolling in this study to donate your voice, so we can do this research."

We saw, of course, that a percentage of the people who got to that point would click on that and, after reading the consent and signing up, give their voice sample. The process was pretty simple on the patient side. We didn't require a phone call. We didn't require interaction.

We kept them anonymous. All we ended up having was their demographics and PHQ‑9 score and subscores from the Mental Health America website, and then the voice samples we collected, which were primarily 30 seconds of free speech. That's how we conducted the study, and we let it run for close to a year, on and off. We had to do some redesigns.

We had some months where it was a little more active than others. We wanted to make a couple of changes, see if we could improve participation, but overall, it ran over a course of about a year.

We wanted to build models. We actually did some very basic, introductory linguistic modeling with something called N‑grams. We looked at prosody features, which we commonly comment on in our mental status exam as clinicians, in terms of the fluidity of how someone's talking. Then we also looked at straight voice features. We had a lot of features related to the acoustic measures, things like the tone of the voice and how to mathematically represent that, various frequency measures of the voice.
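As a rough illustration of the N‑gram idea mentioned above, here is a minimal sketch that counts word N‑grams in a transcribed speech sample. The function name `word_ngrams` and the example sentence are assumptions made up for illustration; the interview does not describe how the study's features were actually computed.

```python
from collections import Counter


def word_ngrams(text, n):
    """Count word n-grams (as tuples of n consecutive words) in a transcript."""
    tokens = text.lower().split()
    # Slide a window of length n across the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Made-up transcript for illustration only, not study data.
sample = "i feel tired and i feel alone"
bigrams = word_ngrams(sample, 2)
print(bigrams[("i", "feel")])  # 2
```

Counts like these can then be used as input features for a statistical model, alongside prosody and acoustic measures.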

Those were the three main areas we looked at. Across all three, we found very good predictive value and good correlation with both the total depression score (the PHQ‑9 severity) and the suicidality question, question 9. Those were the two main areas.

We also looked at psychomotor disturbance, which one of the questions does ask about. In that case, the findings were not significant, although we have some theories. That may primarily have to do with the fact that the question captures both psychomotor retardation and agitation. It's clustering two physically different things into one question, and that may have been why. It's actually reassuring to see that it was not significant because, as we can all imagine, for a patient who's agitated versus a patient who's just not moving and is very depressed, the physical impact on their voice is probably very different.

Those were some of the key findings, and they suggest that there is a role for automatic language processing in depression tracking and detection.

Our speech algorithms outperformed some of the previous studies we had looked at, both in terms of accuracy and predictive value. A lot of those previous papers had access to full transcripts, full interviews of patients, so much larger samples. We were only using 30 seconds. Part of what we were validating was, again, to make it more naturalistic: can you grab a piece of data from a patient in the course of their regular day, maybe 30 seconds, maybe less, maybe a little more?

The point was, is that realistic? Can that actually have some predictive value? This paper is the first step to showing that there is some signal there.

Were any of the outcomes particularly surprising?

Nothing was all that surprising. Again, the psychomotor disturbance finding was, somewhat expectedly, not significant. That, I think, is one of the limitations of using the PHQ‑9, because agitation is clustered with psychomotor retardation. That's something that's probably relevant. As clinicians, it's really left up to us to tease out which category a patient falls into.

Agitated depression, or depression on the bipolar spectrum, is very different from depression on the unipolar side. It's biologically different, and we treat it differently. That's a pretty important thing to separate out. Of course, the PHQ‑9 is a very old tool. There are many reasons why it's designed the way it is, and so, that's one of the limitations. That wasn't so surprising.

I was actually a little bit pleasantly surprised with how well the algorithm was able to predict the suicidality question, question 9, and the total depression severity. We reached ROCs well into the 80s in terms of percent accuracy. When you do the sensitivity-specificity curve, we had pretty good performance, a pretty good curve. That was pleasantly surprising.
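For readers less familiar with ROC analysis, the following minimal sketch shows one standard way to compute sensitivity, specificity, and the area under the ROC curve from model scores. The labels, scores, and function names are made up for illustration; the interview does not say how the study's values were calculated.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive case scores higher than a randomly chosen negative case
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = screened positive on a questionnaire, 0 = negative.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
auc = roc_auc(labels, scores)
print(round(sens, 3), round(spec, 3), round(auc, 3))
```

Sweeping the threshold from high to low and plotting sensitivity against one minus specificity traces out the ROC curve; the area under it summarizes performance across all thresholds.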

I guess I didn't have huge expectations either way, but I was surprised to see that it was as good as and, in fact, better than those previous studies that had used larger transcripts and had more voice and transcript data. It performed better than I expected.

What are the possible clinical applications of these findings?

Clinically, many of us likely agree that PHQ‑9 has a lot of limitations. In mental health, we still use and have to rely on, for many reasons, subjective questionnaires. We want to be able to move things into more of an objective arena. We want to, like many of our colleagues in other fields, have tests or these questionnaires that are capturing somewhat more objective data to help us track things.

Also, depression is such a multifaceted and diverse disease that we need something that's a little more sensitive to these different aspects of it. Clinically, when we get the PHQ‑9, of course, it gives us a general view. Mostly, our eyes are drawn to the total score.

Then, if there are any outliers (sometimes people have 3s on a couple of things and lower scores on others), it draws you to those big outliers. Of course, the suicidality question is a conversation starter. Clinically, I don't see the use of these questions changing necessarily anytime soon.
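The way a clinician reads a PHQ‑9, as described above, can be sketched in a few lines: total score first, then any items scored 3, then question 9. The severity bands below are the commonly published PHQ‑9 cutoffs; the function name `summarize_phq9` and the example responses are hypothetical and are not taken from the study.

```python
def summarize_phq9(items):
    """Summarize nine PHQ-9 item responses, each scored 0 to 3."""
    if len(items) != 9 or any(v not in (0, 1, 2, 3) for v in items):
        raise ValueError("expected nine item scores, each 0, 1, 2, or 3")
    total = sum(items)  # possible range: 0 to 27
    # Commonly published severity bands, given as (upper bound, label).
    bands = [(4, "minimal"), (9, "mild"), (14, "moderate"),
             (19, "moderately severe"), (27, "severe")]
    severity = next(label for upper, label in bands if total <= upper)
    return {
        "total": total,
        "severity": severity,
        # 1-based question numbers answered with the maximum score.
        "items_scored_3": [i + 1 for i, v in enumerate(items) if v == 3],
        # Question 9 is the suicidality item; any non-zero answer is flagged.
        "item9_positive": items[8] > 0,
    }


# Hypothetical responses for illustration only.
example = [3, 3, 1, 1, 0, 1, 0, 0, 1]
summary = summarize_phq9(example)
print(summary)
```

A summary like this is only a reading aid; as noted above, the question 9 flag is a conversation starter rather than a conclusion.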

Especially now, which is quite interesting, during COVID, which, of course, none of us anticipated, with so much push to telehealth and so much more remote care, it's even more important. It's speeding up the timeline for the need to track people remotely. How do I have someone fill out a PHQ‑9 at home? It's a little bit harder. Of course, there are ways. It's not, of course, impossible.

Like many of you, I'm sure you've had this experience. In my clinic, all of a sudden, we've had a huge reduction in the number of questionnaires completed because they're online, and that workflow is just a little more challenging.

This brings up the whole point of, if I can have my patient complete a questionnaire online or even better, just a few seconds of voice and I can get that into my clinical report or see that data, that's going to really replace and also streamline a lot.

Also, the goal is that once you have a voice sample, models built on it can actually be more biologically related to that person's depression than a PHQ‑9 can ever be, because a PHQ‑9 is only asking these simple questions. It's never going to biologically characterize what's going on in someone's brain and body, whereas a voice sample is an actual sample from that individual.

The goal with this was to start with the PHQ‑9, but eventually go beyond it. It's a starting point, but I actually want to see if I can track that person's unique depression.

In clinical practice going forward, I have no doubt that over the next several years we are going to see a huge influx of products and options harnessing voice technology. We already are. Several of these are already on the market. None are ones I'm directly related to, but they are using this research to grab a patient's voice sample, provide clinical decision‑making support, and provide some feedback directly on the content, while also using the voice signal to predict the depression or, more importantly, track the depression.

There's something that might be more important than even saying, "Oh, this is really good at just detecting depression," meaning what we call depression in general, which is hard to define. Maybe the best place to start, and what we are already seeing, is this: Here's an individual. Here's their baseline. Here's their little voice sample now and their depression scores. And here's their progression with treatment, with intervention. That's probably going to be the most important thing, rather than what we see in other fields, where we can say, "A brain scan is this good at detecting a brain tumor," or something that's a little more consistent across populations.

What we know is, in mental health, with many of our diseases, including depression, it's just not consistent enough. The big reason we've been unable to find extremely effective treatments is because it's so different for the individual. This is going to provide us a way to provide more customized and patient‑centered care.

Are there other studies planned on this topic, or further research you feel is needed?

In terms of other studies planned, personally, my focus has been transitioning to tracking cognitive changes. Dementia is my clinical focus, but still using voice actually to augment that. There's a big body of evidence though behind using voice in mental health. There's great evidence in schizophrenia, tracking psychosis, depression, and anxiety.

There are groups all over the world looking at this. I've worked with colleagues from Turkey, China, all over Europe, South America, Brazil, and all over the US. This is quite a big topic. In fact, if you look at some of the big conferences, like the Audio/Visual Emotion Challenge every year, which is an international conference, and other voice conferences specifically, there's usually a few dozen. We're talking a few dozen, each of these focused on detecting mental health symptoms. It's a big area of research. There are a number of studies ongoing. One of the big challenges is research that can harness bigger data sets. How do we get more data? There are, of course, a number of challenges. You're asking somebody to record their voice. It's very personal. It's identifiable information.

It is really considered to be like a thumbprint. It's protected health information. It's a harder thing to handle. We have to be very deliberate about the ethical implications of this. If we can detect someone's mood, of course there are consequences that we have to be careful of. I'm just going to throw out some ideas that I worry about.

For example, if insurance companies start running your voice through their little models when you call in, and they think, "Oh, you're pretty depressed, I'm going to raise your fees every month," that's concerning. We have to really be deliberate about where this goes. I don't think that's going to happen anytime soon, but it is, of course, a concern.

There are studies from multiple angles. We also still need to do lots of studies that use those larger data sets. We have to be really careful that we're answering a discrete clinical question, and that we're using the results of any of these products or algorithms correctly. We don't want to make incorrect assumptions.

Artificial intelligence is only as good as the data you give it and is only as good as how well you know that data. We can't just say, "Here's 1,000 voice samples, tell me if the person is depressed." We have to be really deliberate about what exactly that machine is learning.

That can sometimes get us in trouble. Nothing, of course, has been implemented clinically, but there are examples where a machine can learn the wrong thing and give you some of the right answers, so you think it's working correctly, but then it turns out it was actually using the wrong pattern, and that can be very harmful. We have to be very, very careful when implementing this.

Of course, with COVID, I would point out, again, right now the FDA is, all of a sudden, very, very open and fast‑tracking products that could be helpful and beneficial in the short term. I'm already seeing a few studies out there. The group I was working with previously is actually doing a study to track COVID symptoms using just your voice sample. Many, many applications with this technology.

Especially because of COVID, we're going to see an acceleration. We're going to see lots of different options using voice technology. Significantly further research is needed, of course, to really bring this into clinic and make it useful. We're going to see this over these next few years. It's still going to be very much in the research and development stage.

Expect to see this plugged into trials. Already, a lot of big pharma companies are plugging voice collection into their trials. It's well on its way now. Certainly in our lifetimes, we're going to see this as part of our practice.