Access to data for research has never been easier than in the modern age.
With the introduction of online sources such as social media APIs, Amazon Mechanical Turk, and Kaggle, we ask how the prevalence of data gathering methodologies has changed over time and what these longitudinal changes imply for common data collection practices.
Existing literature points to potential issues that arise from online data sources and to how such data may affect the scientific findings that society relies on to make informed decisions.
We performed a longitudinal mixed-methods analysis of the data gathering methodologies used in research papers published at the ACM Conference on Human Factors in Computing Systems (CHI).
To achieve this, we employed a large language model to analyze the research papers and classify them according to our taxonomy of data methodologies.
Based on these results, we provide insight into the current state of data gathering methodologies and what it may mean for science, and we raise awareness of the importance of scientists being conscious of how they collect data.