Data Mining on Microblogged Information: Gender Recognition and Suicide Prevention

Open Access
Kim, Hyun-Woo
Graduate Program:
Information Sciences and Technology
Master of Science
Document Type:
Master Thesis
Date of Defense:
May 13, 2010
Committee Members:
  • John Yen, Thesis Advisor
  • Bernard James Jansen, Thesis Advisor
Keywords:
  • data mining
  • suicide prevention
  • gender classification
  • gender recognition
  • feature selection
  • attribute selection
  • support vector machine
  • SVM
Abstract:
With the prosperity of Web 2.0 technologies, microblogging has become one of the most popular services on the Internet. Twitter is currently the most popular microblogging service in the world. Millions of people's thoughts, opinions, and emotions melt into billions of short posts, or tweets, on Twitter. Most tweets are accessible through the web and Twitter's application programming interface (API). Thanks to its enormous volume of tweets, Twitter has become a living history and a repository of human thoughts. Analyzing microblogged messages is therefore helpful in understanding and predicting human behavior; computational data mining techniques can be used to recognize common patterns of certain groups of people. Support Vector Machine (SVM) is a nonparametric supervised learning method that constructs a hyperplane in a high-dimensional space with the largest functional margin, achieving a good separation of multiple groups of microbloggers. In general, the larger the margin, the lower the generalization error of the classifier. This thesis adopts Support Vector Machine together with several feature selection methods, including SVM-RFE (SVM Recursive Feature Elimination), Relief, and InfoGain, to systematically recognize the genders of microbloggers. It shows which features are gender-specific and how the feature selection process affects the overall classification accuracy.
The gender classifier can be helpful in preventing a serious social problem: suicide. It is known that risk factors for suicidal thoughts vary with gender and age. Suicide is the third leading cause of death among people aged 10 to 24, according to the Centers for Disease Control and Prevention (CDC). Also, 15 percent of high school students have seriously considered suicide. There is a consistent finding that more than 90 percent of people who committed suicide had shown a diagnosable psychiatric disorder. Mental health services can help people at high risk for suicide find relief. However, we cannot rely solely on mental health treatment to solve this problem, given that two thirds of the people who committed suicide had not received any appropriate treatment. The tweets that people posted over the course of a year before they committed suicide clearly showed that they were suffering from profound depression. Moreover, some of them posted their final messages to the world on their microblogs. We study how a theoretical model describes the nature of many types of suicide, investigate the microblogging behavior of teenagers who killed themselves, and discuss how future research may build a statistical model to measure the degree of a person's depression, which may be used for suicide prevention.
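The abstract names InfoGain as one of the feature selection methods but does not describe how it is applied. As a rough illustration only, the Python sketch below ranks features by information gain, assuming each feature is the presence or absence of a token in a microblogger's posts. The tokens `w1` to `w3`, the class labels, and all function names are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def info_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Entropy remaining after splitting the users on the feature.
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional


def rank_terms(docs, labels):
    """Return the vocabulary sorted from most to least informative."""
    vocab = sorted(set().union(*docs))
    return sorted(vocab, key=lambda t: info_gain(docs, labels, t), reverse=True)


# Toy data (an assumption for illustration): one token set per user,
# with a made-up class label for each user.
TOY_DOCS = [{"w1", "w3"}, {"w1"}, {"w2", "w3"}, {"w2"}]
TOY_LABELS = ["f", "f", "m", "m"]

if __name__ == "__main__":
    for term in rank_terms(TOY_DOCS, TOY_LABELS):
        print(term, round(info_gain(TOY_DOCS, TOY_LABELS, term), 3))
```

In this toy data, `w1` and `w2` each split the two classes perfectly, while `w3` occurs equally in both and carries no information. A feature selection step would keep the top-ranked tokens before training the classifier.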