Exploring the design and evaluation of systems that leverage crowdsourcing and online communities to solve complex tasks.
My research in Human Computation and Crowdsourcing explores the design and evaluation of systems that leverage the collective intelligence of online communities to solve complex tasks. I have investigated various applications of crowdsourcing, contributing specific techniques and empirical results as well as methodological advances.
Key Contributions:
Exploration of Task Designs for Popular Crowdsourcing Tasks: Investigating task designs and quality control mechanisms to effectively leverage crowdsourcing for complex classification, labelling, and diversity-aware paraphrase generation.
Methodological Advancements in Crowdsourcing Experimentation: Developing tools like ‘CrowdHub’ and guidelines for improving the rigor and reporting of controlled crowdsourcing experiments.
Crowdsourcing for AI Training Data Generation: Developing techniques for crowdsourcing diverse paraphrases for chatbot training and creating datasets for classification tasks, directly supporting the development of robust and effective AI models.
Democratizing Research Support: Exploring and implementing crowdsourcing applications to provide feedback to researchers, especially early-stage researchers, and to support cognitively intensive research tasks like systematic literature reviews.
Code and Tools
Publications
Innovation cockpit: a dashboard for facilitators in idea management
Marcos Baez, and Gregorio Convertino
In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work Companion, Seattle, Washington, USA, Jul 2012
We present the design of a dashboard for facilitators in Idea Management Systems (IMS), an emerging class of collaborative software for business organizations or local geographic communities. In these systems, users can generate, share, judge, refine, and select ideas as part of a grassroots process. However, one class of users that lacks adequate support in current IMS is the facilitators. Their role is to help the best ideas emerge and grow, while balancing the judgments of the crowd with those of the managers or the community leaders. We show how the dashboard helps facilitators make more efficient and effective decisions in situations where selection and judgment become prohibitively lengthy and time-consuming.
Designing a facilitator’s cockpit for an idea management system
Marcos Baez, and Gregorio Convertino
In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work Companion, Seattle, Washington, USA, Jul 2012
We present the design of a dashboard for facilitators in Idea Management Systems (IMS). IMS are an emerging class of collaborative software tools aimed at business organizations or local geographic communities. Through these systems, users can generate, share, judge, refine, and select ideas as part of a grassroots process. However, one class of users that lacks adequate support in current IMS is the facilitators. Their role is to help the best ideas emerge and grow, while balancing the judgments of the crowd with those of the managers or the community leaders. In this paper we point to the unmet needs of these users, describe the design of a system prototype, and report on the evaluation of a first version of this prototype to test our design.
Idea Management Communities in the Wild: An Exploratory Study of 166 Online Communities
Jorge Saldivar, Marcos Baez, Carlos Rodriguez, and 2 more authors
In 2016 International Conference on Collaboration Technologies and Systems (CTS), Oct 2016
Idea Management (IM) communities have the potential to transform business and communities through innovation. However, building successful communities is a difficult endeavor that requires a significant amount of both community management and technological support. Doing this requires a good understanding of how IM systems are used and how users behave, as these are fundamental aspects for the design of effective technological support as well as devising community management strategies. In this paper, we study 166 IM communities in the “wild” — communities openly available on Ideascale, one of today’s leading IM software platforms — to better understand how they are used in practice, and by whom. We do this via i) a qualitative analysis of community properties to identify community archetypes; ii) a quantitative analysis of user activity logs to identify patterns of collective and individual user behavior.
Investigating Crowdsourcing as a Method to Collect Emotion Labels for Images
Olga Korovina, Fabio Casati, Radoslaw Nielek, and 2 more authors
In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal QC, Canada, Oct 2018
Labeling images is essential for enabling the search and organization of digital media. This is true both for "factual", objective tags such as time, place, and people, and for subjective labels, such as the emotion a picture generates. Indeed, the ability to associate emotions with images is one of the key functionalities most image analysis services today strive to provide. In this paper we study how emotion labels for images can be crowdsourced and uncover limitations of the approach commonly used to gather training data today, that of harvesting images and tags from social media.
CrowdRev: A platform for Crowd-based Screening of Literature Reviews
Jorge Ramírez, Evgeny Krivosheev, Marcos Baez, and 3 more authors
In this paper and demo we present a crowd and crowd+AI based system, called CrowdRev, supporting the screening phase of literature reviews and achieving the same quality as author classification at a fraction of the cost, and near-instantly. CrowdRev makes it easy for authors to leverage the crowd, and ensures that no money is wasted even in the face of difficult papers or criteria: if the system detects that the task is too hard for the crowd, it simply gives up (for that paper, for that criterion, or altogether), without wasting money and never compromising on quality.
Combining Crowd and Machines for Multi-predicate Item Screening
Evgeny Krivosheev, Fabio Casati, Marcos Baez, and 1 more author
This paper discusses how crowd and machine classifiers can be efficiently combined to screen items that satisfy a set of predicates. We show that this is a recurring problem in many domains, present machine-human (hybrid) algorithms that screen items efficiently, and estimate the gain over human-only or machine-only screening in terms of performance and cost. We further show how, given a new classification problem and a set of classifiers of unknown accuracy for the problem at hand, we can manage the cost-accuracy trade-off by progressively determining whether we should spend budget to obtain test data (to assess the accuracy of the given classifiers), train an ensemble of classifiers, or leverage the existing machine classifiers with the crowd, and in the latter case how to efficiently combine them based on their estimated characteristics to obtain the classification. We demonstrate that the techniques we propose obtain significant cost/accuracy improvements with respect to the leading classification algorithms.
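To give a flavor of the general idea (an illustrative sketch, not the algorithms evaluated in the paper), the snippet below combines a machine classifier with crowd votes for multi-predicate screening: confident machine predictions are accepted directly, uncertain ones are escalated to a crowd majority vote, and an item is screened out as soon as any predicate fails. The `machine_score` and `crowd_votes` callables are hypothetical placeholders for whatever classifier and crowdsourcing backend are in use.

```python
from statistics import mean

def screen_item(item, predicates, machine_score, crowd_votes, threshold=0.9):
    """Return True if the item appears to satisfy every predicate (i.e., is included)."""
    for pred in predicates:
        p = machine_score(item, pred)       # estimated P(predicate holds), from a trained classifier
        if p >= threshold:
            continue                        # machine is confident the predicate holds
        if p <= 1 - threshold:
            return False                    # machine is confident it does not hold
        votes = crowd_votes(item, pred)     # e.g. [1, 0, 1] from three crowd workers
        if mean(votes) < 0.5:               # simple majority aggregation
            return False
    return True

# Toy usage with stand-in scorers (a real setup would plug in a trained model and a task platform).
include = screen_item(
    item="paper-42",
    predicates=["adults_only", "intervention_is_physical_exercise"],
    machine_score=lambda item, pred: 0.95 if pred == "adults_only" else 0.6,
    crowd_votes=lambda item, pred: [1, 1, 0],
)
print(include)  # True: one predicate accepted by the machine, the other by crowd majority
```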
Crowdsourcing for reminiscence chatbot design
Svetlana Nikitina, Florian Daniel, Marcos Baez, and 1 more author
In HCOMP 2018 Works in Progress and Demonstration Papers, Nov 2018
In this work-in-progress paper we discuss the challenges in identifying effective and scalable crowd-based strategies for designing content, conversation logic, and meaningful metrics for a reminiscence chatbot targeted at older adults. We formalize the problem and outline the main research questions that drive the research agenda in chatbot design for reminiscence and for relational agents for older adults in general.
Understanding the impact of text highlighting in crowdsourcing tasks
Jorge Ramírez, Marcos Baez, Fabio Casati, and 1 more author
In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, Nov 2019
Text classification is one of the most common goals of machine learning (ML) projects, and also one of the most frequent human intelligence tasks in crowdsourcing platforms. ML has mixed success in such tasks depending on the nature of the problem, while crowd-based classification has proven to be surprisingly effective, but can be expensive. Recently, hybrid text classification algorithms, combining human computation and machine learning, have been proposed to improve accuracy and reduce costs. One way to do so is to have ML highlight or emphasize portions of text that it believes to be more relevant to the decision. Humans can then rely only on this text, or read the entire text if the highlighted information is insufficient. In this paper, we investigate if and under what conditions highlighting selected parts of the text can (or cannot) improve classification cost and/or accuracy, and more generally how it affects the process and outcome of the human intelligence tasks. We study this through a series of crowdsourcing experiments running over different datasets and with task designs imposing different cognitive demands. Our findings suggest that highlighting is effective in reducing classification effort but does not improve accuracy; in fact, low-quality highlighting can decrease it.
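As a rough illustration of the highlighting setup (a sketch under simplifying assumptions, not the models or datasets used in the experiments), the snippet below extracts the top-scoring sentences of a document as the excerpt shown to workers; `term_overlap` is a hypothetical stand-in for any ML relevance scorer.

```python
def highlight(text, relevance, k=2):
    """Return the k highest-scoring sentences as the excerpt shown to workers."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    keep = set(sorted(sentences, key=relevance, reverse=True)[:k])
    # Preserve the original sentence order so the excerpt still reads naturally.
    return ". ".join(s for s in sentences if s in keep) + "."

# Hypothetical relevance scorer: number of matched query terms per sentence.
def term_overlap(query_terms):
    return lambda sentence: sum(t.lower() in sentence.lower() for t in query_terms)

excerpt = highlight(
    "The study enrolled 40 adults. Participants followed a 12-week program. Costs were not reported.",
    term_overlap(["adults", "12-week"]),
)
print(excerpt)  # keeps the two sentences relevant to the classification decision
```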
CrowdHub: Extending crowdsourcing platforms for the controlled evaluation of task designs
Jorge Ramírez, Simone Degiacomi, Davide Zanella, and 3 more authors
We present CrowdHub, a tool for running systematic evaluations of task designs on top of crowdsourcing platforms. The goal is to support the evaluation process, avoiding potential experimental biases that, according to our empirical studies, can amount to a 38% loss in the utility of the collected dataset in uncontrolled settings. Using CrowdHub, researchers can map their experimental design and automate the complex process of managing task execution over time while controlling for returning workers and crowd demographics, thus reducing bias, increasing the utility of collected data, and making more efficient use of a limited pool of subjects.
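A minimal sketch of the kind of control CrowdHub automates (not its actual implementation): assign each incoming worker to the least-filled experimental condition and turn away returning workers, so conditions stay balanced and carry-over effects are avoided.

```python
from collections import Counter

class ConditionAssigner:
    """Assign each new worker to the least-filled condition; reject returning workers."""
    def __init__(self, conditions):
        self.counts = Counter({c: 0 for c in conditions})
        self.seen = {}                      # worker_id -> assigned condition

    def assign(self, worker_id):
        if worker_id in self.seen:
            return None                     # returning worker: exclude to avoid carry-over effects
        condition = min(self.counts, key=self.counts.get)
        self.counts[condition] += 1
        self.seen[worker_id] = condition
        return condition

assigner = ConditionAssigner(["highlighting", "no_highlighting"])
print(assigner.assign("w1"), assigner.assign("w2"), assigner.assign("w1"))
```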
Reliability of crowdsourcing as a method for collecting emotion labels on pictures
In this paper we study if and under what conditions crowdsourcing can be used as a reliable method for collecting high-quality emotion labels on pictures. To this end, we run a set of crowdsourcing experiments on the widely used IAPS dataset, using the Self-Assessment Manikin (SAM) emotion collection instrument, to rate pictures on valence, arousal and dominance, and explore the consistency of crowdsourced results across multiple runs (reliability) and the level of agreement with the gold labels (quality). In doing so, we explore the impact of targeting populations of different levels of reputation (and cost) and collecting varying numbers of ratings per picture.
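For illustration only (not the paper's analysis pipeline, and with made-up ratings), reliability across runs can be summarized as the correlation between per-picture mean ratings obtained in two independent runs:

```python
from statistics import mean, stdev

def mean_ratings(run):
    """run: dict mapping picture_id -> list of SAM ratings (e.g., valence on a 1-9 scale)."""
    return {pic: mean(scores) for pic, scores in run.items()}

def pearson(xs, ys):
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((len(xs) - 1) * sx * sy)

def run_to_run_reliability(run_a, run_b):
    a, b = mean_ratings(run_a), mean_ratings(run_b)
    pics = sorted(set(a) & set(b))          # pictures rated in both runs
    return pearson([a[p] for p in pics], [b[p] for p in pics])

# Tiny example with made-up ratings for three pictures.
run_1 = {"p1": [7, 8, 6], "p2": [2, 3, 2], "p3": [5, 5, 6]}
run_2 = {"p1": [8, 7, 7], "p2": [3, 2, 2], "p3": [5, 6, 5]}
print(run_to_run_reliability(run_1, run_2))
```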
Idea spotter and comment interpreter: Sensemaking tools for idea management systems
Gregorio Convertino, A. Sándor, and Marcos Baez
In ACM Communities and Technologies Workshop: Large-Scale Idea Management and Deliberation Systems Workshop, Nov 2013
Regular contributors and facilitators using current idea management systems face the problem of information overload. With large numbers of ideas to be assessed or refined, they lack adequate support to efficiently make sense of unstructured idea descriptions and comments. We propose two sensemaking tools as enhancements of current idea management systems: the idea spotter and the comment interpreter. Both integrate interactive user interfaces that use the output of automatic linguistic analysis of ideas and comments. In this workshop paper we present the prototypes, their preliminary evaluation, and the next steps in this research.
Crowdsourced dataset to study the generation and impact of text highlighting in classification tasks
Jorge Ramírez, Marcos Baez, Fabio Casati, and 1 more author
Text classification is a recurrent goal in machine learning projects and a typical task in crowdsourcing platforms. Hybrid approaches, leveraging crowdsourcing and machine learning, work better than either in isolation and help to reduce crowdsourcing costs. One way to mix crowd and machine efforts is to have algorithms highlight passages from texts and feed these to the crowd for classification. In this paper, we present a dataset to study text highlighting generation and its impact on document classification.
🏆DREC: towards a Datasheet for Reporting Experiments in Crowdsourcing
Jorge Ramírez, Marcos Baez, Fabio Casati, and 2 more authors
In Companion Publication of the 2020 Conference on Computer Supported Cooperative Work and Social Computing, Virtual Event, USA, Nov 2020
Factors such as instructions, payment schemes, platform demographics, along with strategies for mapping studies into crowdsourcing environments, play an important role in the reproducibility of results. However, inferring these details from scientific articles is often a challenging endeavor, calling for the development of proper reporting guidelines. This paper makes the first steps towards this goal, by describing an initial taxonomy of relevant attributes for crowdsourcing experiments, and providing a glimpse into the state of reporting by analyzing a sample of CSCW papers.
On the impact of predicate complexity in crowdsourced classification tasks
Jorge Ramírez, Marcos Baez, Fabio Casati, and 4 more authors
In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual Event, Israel, Nov 2021
This paper explores and offers guidance on a specific and relevant problem in task design for crowdsourcing: how to formulate a complex question used to classify a set of items. In micro-task markets, classification is still among the most popular tasks. We situate our work in the context of information retrieval and multi-predicate classification, i.e., classifying a set of items based on a set of conditions. Our experiments cover a wide range of tasks and domains, and also consider crowd workers alone and in tandem with machine learning classifiers. We provide empirical evidence into how the resulting classification performance is affected by different predicate formulation strategies, emphasizing the importance of predicate formulation as a task design dimension in crowdsourcing.
Challenges and strategies for running controlled crowdsourcing experiments
Jorge Ramírez, Marcos Baez, Fabio Casati, and 2 more authors
In 2020 XLVI Latin American Computing Conference (CLEI), Oct 2020
This paper reports on the challenges and lessons we learned while running controlled experiments on crowdsourcing platforms. Crowdsourcing is becoming an attractive technique for engaging a diverse and large pool of subjects in experimental research, allowing researchers to achieve levels of scale and completion times that would otherwise not be feasible in lab settings. However, this scale and flexibility come at the cost of multiple and sometimes unknown sources of bias and confounding factors that arise from technical limitations of crowdsourcing platforms and from the challenges of running controlled experiments in the “wild”. In this paper, we take our experience in running systematic evaluations of task design as a motivating example to explore, describe, and quantify the potential impact of running uncontrolled crowdsourcing experiments and derive possible coping strategies. The challenges we identify include sampling bias, controlling the assignment of subjects to experimental conditions, learning effects, and the reliability of crowdsourcing results. According to our empirical studies, the impact of potential biases and confounding factors can amount to a 38% loss in the utility of the data collected in uncontrolled settings, and it can significantly change the outcome of experiments. These issues ultimately inspired us to implement CrowdHub, a system that sits on top of major crowdsourcing platforms and allows researchers and practitioners to run controlled crowdsourcing projects.
🏆On the State of Reporting in Crowdsourcing Experiments and a Checklist to Aid Current Practices
Jorge Ramírez, Burcu Sayin, Marcos Baez, and 4 more authors
Crowdsourcing is being increasingly adopted as a platform to run studies with human subjects. Running a crowdsourcing experiment involves several choices and strategies to successfully port an experimental design into an otherwise uncontrolled research environment, e.g., sampling crowd workers, mapping experimental conditions to micro-tasks, or ensuring quality contributions. While several guidelines inform researchers in these choices, guidance on how and what to report from crowdsourcing experiments has been largely overlooked. If under-reported, implementation choices constitute variability sources that can affect the experiment’s reproducibility and prevent a fair assessment of research outcomes. In this paper, we examine the current state of reporting of crowdsourcing experiments and offer guidance to address associated reporting issues. We start by identifying sensible implementation choices, relying on existing literature and interviews with experts, and then extensively analyze the reporting of 171 crowdsourcing experiments. Informed by this process, we propose a checklist for reporting crowdsourcing experiments.
Understanding How Early-Stage Researchers Perceive External Research Feedback
Yuchao Jiang, Marcos Baez, and Boualem Benatallah
In ACM Collective Intelligence Conference 2021, Oct 2021
In this paper, we report on an online survey to answer two questions: (i) What types of external feedback do ESRs need most and perceive to be most useful? and (ii) What are the top challenges and barriers for ESRs in getting and adopting external feedback on their research?
Crowdsourcing Diverse Paraphrases for Training Task-oriented Bots
Jorge Ramírez, Auday Berro, Marcos Baez, and 2 more authors
A prominent approach to building datasets for training task-oriented bots is crowd-based paraphrasing. Current approaches, however, assume the crowd will naturally provide diverse paraphrases or focus only on lexical diversity. In this work-in-progress paper we address an overlooked aspect of diversity, introducing an approach for guiding the crowdsourcing process towards paraphrases that are syntactically diverse.
Crowdsourcing syntactically diverse paraphrases with diversity-aware prompts and workflows
Jorge Ramírez, Marcos Baez, Auday Berro, and 2 more authors
In Advanced Information Systems Engineering, Oct 2022
Task-oriented bots (or simply bots) enable humans to perform tasks in natural language, for example, booking a restaurant or checking the weather. Crowdsourcing has become a prominent approach to building datasets for training and evaluating task-oriented bots, where the crowd grows an initial seed of utterances through paraphrasing, i.e., reformulating a given seed into semantically equivalent sentences. In this context, the resulting diversity is a relevant dimension of high-quality datasets, as diverse paraphrases capture the many ways users may express an intent. Current techniques, however, either assume that crowd-powered paraphrases are naturally diverse or focus only on lexical diversity. In this paper, we address an overlooked aspect of diversity and introduce an approach for guiding the crowdsourcing process towards paraphrases that are syntactically diverse. We introduce a workflow and novel prompts that are informed by syntax patterns to elicit paraphrases avoiding or incorporating desired syntax. Our empirical analysis indicates that our approach yields higher syntactic diversity, syntactic novelty, and a more uniform pattern distribution than state-of-the-art baselines, albeit at the cost of higher task effort.
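As a toy illustration of what measuring the syntactic diversity of a paraphrase set can look like (the paper relies on proper syntax patterns; the `crude_pattern` word-shape template below is only a hypothetical stand-in), one can count distinct patterns and the entropy of their distribution:

```python
from collections import Counter
from math import log2

def syntactic_diversity(paraphrases, pattern_of):
    """Summarize the diversity of syntax patterns found in a paraphrase set."""
    counts = Counter(pattern_of(p) for p in paraphrases)
    total = sum(counts.values())
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return {"distinct_patterns": len(counts), "pattern_entropy": entropy}

# Hypothetical pattern abstraction: a coarse template built from word shape,
# standing in for a real POS- or parse-based syntax pattern.
def crude_pattern(sentence):
    return " ".join("WH" if w.lower() in {"what", "how", "when", "where"} else "W"
                    for w in sentence.split())

paraphrases = ["What is the weather today", "Tell me today's forecast", "How warm is it outside"]
print(syntactic_diversity(paraphrases, crude_pattern))
```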
Understanding how early-stage researchers leverage socio-technical affordances for distributed research support
Early-stage researchers (ESRs) often struggle to learn research skills with only the support of a small circle of advisors and colleagues. Meanwhile, emerging socio-technical systems (STSs) are now available for social interactions among the general public and among people interested in particular topics, such as research. However, how STSs can effectively support ESRs in developing research skills is not yet well understood. In this paper, we report on a series of interviews and surveys with ESRs. We found that online research communities hold the potential for ESRs to learn from diverse perspectives and experiences, but their adoption for learning is still limited. We identified unmet needs in the design of these systems that limit adoption, and we provide design implications for future STSs to support learning research skills through socio-technical affordances.
Effective feedback is crucial for early-stage researchers (ESRs) to develop their research skills. While feedback from supervisors and colleagues is important, additional feedback from external helpers can be beneficial. However, obtaining diverse and high-quality feedback outside of a research group can be challenging. In this work, we designed and prototyped Rsourcer, a crowdsourcing-based pipeline that simplifies the process of requesting, offering, evaluating, and adopting feedback. We evaluated Rsourcer with a concept validation study and a pilot study, which showed its potential. This work contributes insights into crowdsourcing support with social technologies and extends research on scaling support for skills development.
Towards Scaling External Feedback for Early-Stage Researchers: A Survey Study
Yuchao Jiang, Marcos Baez, and Boualem Benatallah
In Cooperative Information Systems: 29th International Conference, CoopIS 2023, Groningen, The Netherlands, October 30-November 3, 2023, Proceedings, Groningen, The Netherlands, Oct 2023
Feedback on research artefacts from people beyond local research groups, such as researchers in online research communities, has the potential to bring additional support to early-stage researchers and complementary viewpoints to research projects. While current literature has focused primarily on early-stage researchers seeking or getting support for research skills development in general, less is known, empirically, about how early-stage researchers exchange feedback with external researchers. In this paper, we focus on understanding the critical types of external feedback that early-stage researchers desire and the prevalent challenges they face in exchanging feedback with external helpers. We report on a large-scale survey conducted with early-stage researchers of diverse backgrounds. Our findings lay the empirical foundation for informing the design of socio-technical systems for research feedback exchange.