AN API FOR AUTHOR NAME DISAMBIGUATION
Open Access
- Author:
- Dudhbhate, Gauravi Uday
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- June 26, 2017
- Committee Members:
- Dr. Lee Giles, Thesis Advisor/Co-Advisor
- Keywords:
- API
Disambiguation
Author Name Disambiguation
Machine Learning
scholarly big data
Web Service
Information Extraction
Random Forest
Clustering
- Abstract:
- In digital libraries, author names are ambiguous primarily when one name has multiple variations, when multiple authors share the same name, and when data is entered incorrectly or extracted incorrectly by automated software. When this ambiguity persists, it is inconvenient for users: in the absence of author name disambiguation techniques, they must manually sort through the search results to find a scholarly document or article written by a particular author. A great deal of research on author name disambiguation is underway, and existing techniques achieve roughly 90-95 percent accuracy in identifying the articles written by a particular author, but such algorithms have high query latency and return results slowly. Further, although such algorithms exist, there are very few, if any, end-to-end services that provide a web-accessible platform for submitting a query and obtaining the disambiguated articles at the click of a button. In this thesis, we propose a hierarchical approach that attains lower query latency while maintaining the same accuracy of 90-95 percent. We further propose an end-to-end service that provides an API to make the algorithm easy to use and to improve user satisfaction. We show that our hierarchical method outperforms the most accurate random forest approach. Finally, we provide a comparative analysis of the query latency and classification accuracy of the two methods.