Within MICO, we use text analysis as part of the broader effort to help web community administrators better understand their users. In particular, we are looking at the texts that volunteers write in the Zooniverse Snapshot Serengeti forums, where they discuss the images they have looked at and their respective classifications.
In an effort to go beyond traditional sentiment analysis, we are trying to use the textual data as an indicator of classification proficiency. To this end, we have labelled users as proficient or less proficient, using the Zooniverse databases and looking at the majority classifications. We then gave each forum post the label of its author and trained classifiers to predict a user's proficiency based only on the comments they have written. For this we have used tools such as Apache OpenNLP and RapidMiner.
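To make the labelling step more concrete, here is a minimal Python sketch of one way it could work. It is not our actual pipeline: the record layout, the rule "agreement rate with the majority classification", the 0.8 threshold, and all names and sample data are assumptions chosen purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical records: (user, image_id, species). Not real Zooniverse data.
classifications = [
    ("ann", "img1", "zebra"), ("bob", "img1", "zebra"), ("cat", "img1", "gazelle"),
    ("ann", "img2", "lion"), ("bob", "img2", "lion"), ("cat", "img2", "lion"),
    ("ann", "img3", "wildebeest"), ("bob", "img3", "wildebeest"), ("cat", "img3", "buffalo"),
]

# Hypothetical forum posts: (author, text).
posts = [
    ("ann", "Looks like a zebra foal"),
    ("cat", "No idea what this is"),
    ("dan", "Hello everyone"),  # no classifications, so no label is available
]


def majority_labels(records):
    """Map each image id to the species most users chose for it."""
    votes = defaultdict(Counter)
    for _user, image, species in records:
        votes[image][species] += 1
    return {image: counts.most_common(1)[0][0] for image, counts in votes.items()}


def user_proficiency(records, threshold=0.8):
    """Label users by how often they agree with the majority (assumed rule)."""
    majority = majority_labels(records)
    agree, total = Counter(), Counter()
    for user, image, species in records:
        total[user] += 1
        if species == majority[image]:
            agree[user] += 1
    return {
        user: "proficient" if agree[user] / total[user] >= threshold else "less_proficient"
        for user in total
    }


def label_posts(forum_posts, proficiency):
    """Give every post the label of its author; skip authors without a label."""
    return [
        (proficiency[author], text)
        for author, text in forum_posts
        if author in proficiency
    ]


proficiency = user_proficiency(classifications)
training_data = label_posts(posts, proficiency)
```

The resulting (label, text) pairs are the kind of input a document categorizer can be trained on. Note that every post inherits its author's label, regardless of what the individual post says.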
Thus far, the results are mixed. The OpenNLP maximum entropy classifier we trained achieved an accuracy of 0.70 and an F1-score of 0.81. On closer inspection, however, these numbers are partly explained by the classifier's bias towards high proficiency, combined with the fact that most of the forum posts in the test data were written by high-proficiency users. Still, the results are encouraging and we will continue developing these techniques. In particular, we will try to get more training data, test new tools, and hone the methods we use. In the end, the proficiency classifier will be integrated as a MICO extractor.
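The caveat about bias can be illustrated with a small Python sketch. It scores a trivial "classifier" that always answers high proficiency on a test set where 70 of 100 posts come from high-proficiency users. The 70/30 split and the always-high predictor are illustrative assumptions, not our actual test data or model.

```python
def accuracy(y_true, y_pred):
    """Fraction of posts whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Assumed test set: 70 posts by high-proficiency users, 30 by low-proficiency users.
y_true = ["high"] * 70 + ["low"] * 30
# A maximally biased predictor that never answers "low".
y_pred = ["high"] * 100

acc = accuracy(y_true, y_pred)
_, _, f1_high = precision_recall_f1(y_true, y_pred, positive="high")
_, recall_low, _ = precision_recall_f1(y_true, y_pred, positive="low")
```

Under these assumptions, the biased predictor reaches an accuracy of 0.70 and an F1-score of about 0.82 for the high-proficiency class, while its recall on the low-proficiency class is 0. This is why headline accuracy and F1 need to be read alongside per-class results when the classes are imbalanced.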
Photo: Copyrighted free use, https://commons.wikimedia.org/w/index.php?curid=685747