Commonsense knowledge about object properties, human behavior and general concepts is crucial for robust AI applications. However, automatic acquisition of this knowledge is challenging because of sparseness and bias in online sources.
This website hosts demos of several Commonsense Knowledge Base Construction projects developed by Tuan-Phong Nguyen, Simon Razniewski, and co-authors during their research at the Max Planck Institute for Informatics and TU Dresden.
In the following, we list the projects and their corresponding demos and publications.
Ascent
Ascent (Advanced Semantics for Commonsense Knowledge Extraction) is a pipeline for automatically collecting, extracting and consolidating commonsense knowledge (CSK) from the web. Ascent is capable of extracting facet-enriched assertions, overcoming the common limitations of the triple-based knowledge model in traditional knowledge bases (KBs). Ascent also captures composite concepts with subgroups and related aspects, supplying even more expressiveness to CSK assertions.
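To make the idea of facet-enriched assertions concrete, here is a minimal sketch of what such an assertion could look like as a data structure, compared to a plain subject-predicate-object triple. The field names are illustrative assumptions, not Ascent's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Facet:
    # A facet qualifies an assertion, e.g. with a temporal or spatial context.
    label: str   # e.g. "for two years"
    type: str    # e.g. "TEMPORAL"

@dataclass
class Assertion:
    # Hypothetical structure for a facet-enriched CSK assertion;
    # a plain triple would stop after (subject, predicate, obj).
    subject: str                      # may be a subgroup, e.g. "baby elephant"
    predicate: str                    # open phrase, e.g. "drink"
    obj: str                          # e.g. "milk"
    facets: list[Facet] = field(default_factory=list)
    frequency: int = 0                # support in the source corpus

a = Assertion("baby elephant", "drink", "milk",
              facets=[Facet("for two years", "TEMPORAL")], frequency=12)
print(a.subject, a.predicate, a.obj, [f.label for f in a.facets])
```

The facet list is what lets an assertion carry context that a bare triple cannot express.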
Publications:
- Tuan-Phong Nguyen, Simon Razniewski, Gerhard Weikum. Advanced Semantics for Commonsense Knowledge Extraction. WWW 2021. [pdf]
- Tuan-Phong Nguyen, Simon Razniewski, Gerhard Weikum. Inside ASCENT: Exploring a Deep Commonsense Knowledge Base and its Usage in Question Answering. ACL 2021 - System Demonstrations. [pdf]
Ascent++
Ascent++, the successor of the original Ascent method, is a pipeline for automatically collecting, extracting, and consolidating commonsense knowledge (CSK) from any English text corpus. Like its predecessor, Ascent++ extracts facet-enriched assertions, overcoming the common limitations of the triple-based knowledge model in traditional knowledge bases (KBs), and captures composite concepts with subgroups and related aspects, supplying even more expressiveness to CSK assertions.
The Ascent++ KB is a CSKB extracted from the C4 web crawl using the Ascent++ pipeline. It consists of 2 million CSK assertions about 10K popular concepts. The CSKB comes in two variants: one with open predicates (e.g., "be", "have", "live in", etc.) and one with the established ConceptNet schema of 19 pre-specified predicates (e.g., AtLocation, CapableOf, HasProperty, etc.).
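To illustrate the difference between the two variants, the sketch below maps an open-predicate triple to a ConceptNet-style predicate. The mapping rules here are hypothetical examples, not the pipeline's actual mapping.

```python
# Hypothetical examples of open-phrase predicates mapped to
# ConceptNet-style predicates; not the actual Ascent++ mapping rules.
OPEN_TO_CONCEPTNET = {
    "live in": "AtLocation",
    "be able to": "CapableOf",
    "be": "HasProperty",
}

def to_conceptnet(subject, open_pred, obj):
    """Map an open-predicate triple to a ConceptNet-schema triple,
    returning None when no mapping rule applies."""
    pred = OPEN_TO_CONCEPTNET.get(open_pred)
    return (subject, pred, obj) if pred else None

print(to_conceptnet("elephant", "live in", "savanna"))
# -> ('elephant', 'AtLocation', 'savanna')
```

The open-predicate variant preserves the original phrasing, while the ConceptNet variant trades expressiveness for compatibility with the established schema.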
Publications:
- Tuan-Phong Nguyen, Simon Razniewski, Julien Romero, Gerhard Weikum. Refined Commonsense Knowledge from Large-Scale Web Contents. In IEEE Transactions on Knowledge and Data Engineering, 2022. [pdf]
Materialized Comet
Starting from the COMET methodology by Bosselut et al. (2019), generating commonsense knowledge from commonsense transformers has recently received significant attention. Surprisingly, up to now no materialized resource of commonsense knowledge generated this way has been publicly available. This project fills that gap and uses the materialized resources to perform a detailed analysis of the potential of this approach in terms of precision and recall. Furthermore, we identify common problem cases and outline use cases enabled by materialized resources. We posit that the availability of these resources is important for the advancement of the field, as it enables off-the-shelf use of the resulting knowledge, as well as further analyses of its strengths and weaknesses.
Publications:
- Tuan-Phong Nguyen and Simon Razniewski. Materialized Knowledge Bases from Commonsense Transformers. CSRR Workshop @ ACL 2022. [pdf]
Candle
Structured knowledge is important for many AI applications. Commonsense knowledge, which is crucial for robust human-centric AI, is covered by only a few structured knowledge projects, and these lack knowledge about human traits and behaviors conditioned on socio-cultural contexts, which is crucial for situative AI. In this project, we present Candle, an end-to-end methodology for extracting high-quality cultural commonsense knowledge (CCSK) at scale. Candle extracts CCSK assertions from a huge web corpus and organizes them into coherent clusters, for 3 domains of subjects (geography, religion, occupation) and several cultural facets (food, drinks, clothing, traditions, rituals, behaviors). Candle includes judicious techniques for classification-based filtering and scoring of interestingness. Experimental evaluations show the superiority of the Candle CCSK collection over prior works, and an extrinsic use case demonstrates the benefits of CCSK for the GPT-3 language model.
The output of Candle is a set of 1.1M CCSK assertions, organized into 60K coherent clusters. The set is organized by 3 domains of interest (geography, religion, occupation) with a total of 386 instances, referred to as subjects (or cultural groups). Per subject, the assertions cover 5 facets of culture: food, drinks, clothing, rituals, and traditions (for geography and religion) or behaviors (for occupations). In addition, each assertion is annotated with its salient concepts.
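As a rough illustration of the domain/subject/facet organization described above, the following sketch filters a small toy collection along those dimensions. The records and field names are hypothetical and do not reflect the released data format.

```python
# Toy CCSK records organized like Candle's output
# (domain -> subject -> facet -> clustered assertions).
# Records and field names are hypothetical, not the released format.
ccsk = [
    {"domain": "geography", "subject": "Japan", "facet": "food",
     "cluster_id": 17, "assertion": "rice is often eaten with every meal"},
    {"domain": "geography", "subject": "Japan", "facet": "clothing",
     "cluster_id": 42, "assertion": "kimonos are worn on formal occasions"},
    {"domain": "religion", "subject": "Buddhism", "facet": "rituals",
     "cluster_id": 88, "assertion": "meditation is a daily practice"},
]

def query(data, domain=None, subject=None, facet=None):
    """Filter assertions by any combination of domain, subject, and facet."""
    return [r for r in data
            if (domain is None or r["domain"] == domain)
            and (subject is None or r["subject"] == subject)
            and (facet is None or r["facet"] == facet)]

japan_food = query(ccsk, subject="Japan", facet="food")
print(japan_food[0]["assertion"])
```

Filtering by subject and facet is exactly the kind of access pattern the domain/subject/facet organization is meant to support.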
Publications:
- Tuan-Phong Nguyen, Simon Razniewski, Aparna Varde, and Gerhard Weikum. Extracting Cultural Commonsense Knowledge at Scale. WWW 2023. [pdf]
Mango
Despite recent progress, large language models (LLMs) still face the challenge of appropriately reacting to the intricacies of social and cultural conventions.
We propose Mango, a methodology for distilling high-accuracy, high-recall assertions of cultural knowledge. For this purpose, we judiciously and iteratively prompt LLMs from two entry points: concepts and cultures. Outputs are consolidated via clustering and generative summarization.
Running the Mango method with GPT-3.5 as underlying LLM yields:
- 167K assertions
- 30K concepts
- 11K cultures.
Our resource surpasses prior resources by a large margin in both quality and size.
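The consolidation step mentioned above can be sketched with a simple greedy grouping of near-duplicate LLM outputs. This toy version uses token-overlap (Jaccard) similarity; Mango's actual pipeline uses stronger clustering plus generative summarization of each cluster.

```python
# Minimal sketch of consolidating near-duplicate LLM-generated assertions
# by greedy clustering on token overlap. Illustrative only; Mango's real
# pipeline uses stronger clustering and generative summarization.
def jaccard(a: str, b: str) -> float:
    sa = {w.strip(".,") for w in a.lower().split()}
    sb = {w.strip(".,") for w in b.lower().split()}
    return len(sa & sb) / len(sa | sb)

def cluster(assertions, threshold=0.5):
    """Add each assertion to the first cluster whose representative
    (first member) is similar enough; otherwise start a new cluster."""
    clusters = []
    for text in assertions:
        for c in clusters:
            if jaccard(text, c[0]) >= threshold:
                c.append(text)
                break
        else:
            clusters.append([text])
    return clusters

outputs = [
    "In Japan, people bow when greeting each other.",
    "People in Japan bow when greeting.",
    "Tea ceremonies are an important tradition in Japan.",
]
groups = cluster(outputs)
print(len(groups))  # -> 2: the two bowing assertions merge into one group
```

Greedy first-fit clustering is order-dependent and crude, but it conveys why consolidation shrinks a raw pool of LLM outputs into a much smaller set of distinct assertions.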
Publication:
- Tuan-Phong Nguyen, Simon Razniewski, and Gerhard Weikum. Cultural Commonsense Knowledge for Intercultural Dialogues. CIKM 2024. [pdf]