Using Supervised Learning To Identify Descriptions Of Personal Experiences Related To Chronic Disease On Social Media

Open Access
Murphy, William P
Graduate Program:
Information Sciences and Technology
Master of Science
Document Type:
Master Thesis
Date of Defense:
March 26, 2014
Committee Members:
  • John Yen, Thesis Advisor
Keywords:
  • machine learning
  • social media
  • data mining
  • health informatics
Patients are increasingly turning to online communities for health information and emotional support. A 2012 study by the Pew Research Center found that more than 70% of Internet users in the United States, or 180 million adults, have searched the web for medical information. According to the same study, 18% of Internet users have sought out others online with similar medical conditions, and 3-4% have posted about their medical treatments. Healthcare providers are also using the Internet to deliver various types of health interventions, including stress management courses, breast cancer coping groups, anti-smoking treatments, and weight loss therapy. These trends have produced an abundance of patient data on the web, including patients’ descriptions of their experiences with different ailments and the effects of treatment. Sentiment analysis and social network analysis are powerful computational tools for making sense of the ever-growing corpus of medical data accumulating in online communities and social media. With sentiment classification algorithms, researchers can aggregate thousands or even millions of pieces of text to perform tasks such as predicting stock market movements, summarizing product reviews, and even gauging national mood. The same methods can be applied to healthcare to improve the quality of healthcare services. Some researchers already advocate for more data mining in the healthcare domain, arguing that it will create a new “digital epidemiology” that will improve the healthcare system. Nevertheless, mining social media data poses significant technical challenges. Such data is often difficult for text mining systems to parse because of its disorganized nature and the presence of slang, and developing features that accurately classify texts in this domain remains an open problem.
Additionally, before measuring the sentiment of online texts about healthcare, it is important to understand whether these messages represent attitudes or descriptions of personal experiences. This thesis examines a relatively unexplored supervised machine learning task in the healthcare domain: automatic identification of social media messages pertaining to cancer-related personal experiences. Using four datasets of tweets about breast cancer, lung cancer, prostate cancer, and diabetes, we demonstrate that supervised learning methods can accurately predict whether Twitter posts contain descriptions of personal experiences. Despite the unbalanced nature of this classification problem (fewer than 20% of the 4,821 labeled tweets contain descriptions of personal experiences), these methods classify with a high F-measure (>70%). We also show that content-based features are more effective than context-based features. This thesis also discusses novel data filtering techniques and natural language processing-based feature engineering methods that significantly improve classification of these short Twitter messages. These features take advantage of slang and other information that is typically ignored by text mining systems. Finally, this thesis demonstrates that the personal experience identification task is amenable to a transfer learning approach, as knowledge about social media post content from one type of cancer can be transferred to another type of cancer or another type of chronic disease. This technology has a number of applications in today’s information-driven healthcare industry, including aggregating experiences with different treatments and medications, which could lead to more patient-centric delivery of healthcare.
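
As a rough illustration of why an unbalanced problem like this one is usually scored with the F-measure and not with accuracy, the sketch below computes both for the minority (positive) class. The function names and all confusion-matrix counts are hypothetical and chosen only for illustration; they are not results from the thesis.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F_beta) for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


def accuracy(tp, fp, fn, tn):
    """Fraction of all examples that were labeled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts (NOT from the thesis): 1,000 tweets, 200 of which
# describe a personal experience (the positive, minority class).

# A trivial baseline that labels every tweet "not a personal experience".
baseline = {"tp": 0, "fp": 0, "fn": 200, "tn": 800}

# A hypothetical classifier that finds most of the positive tweets.
classifier = {"tp": 150, "fp": 50, "fn": 50, "tn": 750}

for name, counts in [("baseline", baseline), ("classifier", classifier)]:
    p, r, f1 = f_measure(counts["tp"], counts["fp"], counts["fn"])
    acc = accuracy(**counts)
    print(f"{name}: accuracy={acc:.2f} precision={p:.2f} "
          f"recall={r:.2f} F1={f1:.2f}")
```

With these toy counts the trivial baseline still reaches 80% accuracy while its F-measure is zero, which is why the F-measure on the positive class is the more informative score when fewer than 20% of the examples are positive.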