Meta Hopes to Increase Accuracy of Wikipedia with New AI Model
Jul 26, 2022
Meta AI’s research and advancements team recently developed a neural-network-based system, called SIDE, that can scan hundreds of thousands of Wikipedia citations at once and check whether they actually support the corresponding content.
Wikipedia is a multilingual, free online encyclopedia written and maintained by volunteers through open collaboration and a wiki-based editing system. It has some 6.5 million articles. Because Wikipedia is crowdsourced, it usually requires that facts be corroborated: quotations, controversial statements, and contentious material about living people must include a citation. Volunteers double-check Wikipedia’s footnotes, but as the site continues to grow it is hard to keep pace with the more than 17,000 new articles added each month, and readers commonly wonder about the accuracy of the entries they read. Human editors need technological help to identify gibberish or statements that lack citations, yet determining whether a source actually backs up a claim is a complex task for AI, because it requires deep language understanding to perform an accurate analysis.
For this purpose, the Meta AI research team created a new dataset of 134 million public webpages (split into 906 million passages of 100 tokens each), an order of magnitude more data than the knowledge sources considered in current NLP research and significantly more than has ever been used for this kind of work. The next-largest dataset in terms of passages/documents feeds the Internet Augmented Dialog generator, which pulls data from 250 million passages and 109 million documents.
This new dataset is the knowledge source of the neural-network model, which finds citations that seem irrelevant and suggests a more applicable source instead, pointing to the specific passage that supports the claim. Natural-language-understanding (NLU) techniques are used to perform the tasks that allow the system to evaluate a citation. In NLU, a model translates human sentences (or words, phrases, or paragraphs) into complex mathematical representations. The tool is designed to compare these representations in order to determine whether one statement supports or contradicts another.
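The representation-and-compare idea can be sketched in a few lines of Python. Everything below is illustrative rather than Meta’s model: the hashed bag-of-words encoder is a toy stand-in for the learned NLU encoder, the example sentences are invented, and raw vector similarity is only a crude proxy for the support-or-contradict judgment a trained model would make.

```python
import numpy as np

def embed(text: str, dim: int = 128) -> np.ndarray:
    # Toy stand-in for a learned NLU encoder: hashed bag of words, L2-normalized.
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

claim = "The tower is 330 metres tall."
evidence = "At 330 metres, the tower is the tallest structure in Paris."

# Higher similarity is read here as a rough "support" signal.
print(f"support signal: {embed(claim) @ embed(evidence):.2f}")
```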
The new dataset also serves as one of the system’s main components: Sphere, which is a web-scale retrieval library and is already open-sourced.
The decision flow of SIDE, from a claim on Wikipedia to a suggestion for a new citation, works as follows:
[Figure: SIDE workflow, from the paper “Improving Wikipedia Verifiability with AI”]
The claim is sent to the Sphere Retrieval Engine, which produces a list of candidate documents from the Sphere corpus. The sparse retrieval sub-system uses a seq2seq model to translate the citation context into query text, and then matches the resulting query (a sparse bag-of-words vector) against a BM25 index of Sphere. The seq2seq model is trained on data from Wikipedia itself: the target queries are the web-page titles of existing Wikipedia citations. The dense retrieval sub-system is a neural network that learns from Wikipedia data to encode the citation context into a dense query vector. This vector is then matched against the vector encodings of all passages in Sphere, and the closest ones are returned.
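As a rough illustration of this retrieval step, here is a minimal, self-contained Python sketch. Everything in it is a simplification: the three-sentence PASSAGES list stands in for Sphere’s 906 million passages, a plain TF-IDF scorer stands in for the BM25 index, a hashed bag of words stands in for the trained dense encoder, the score fusion is naive, and all names are invented for illustration.

```python
import math
from collections import Counter

import numpy as np

# Toy passage index standing in for the Sphere corpus.
PASSAGES = [
    "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    "Gustave Eiffel's company designed and built the tower.",
    "The Statue of Liberty was dedicated in 1886 in New York Harbor.",
]

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def sparse_scores(query: str, passages: list[str]) -> np.ndarray:
    # Sparse path: a simple TF-IDF scorer standing in for the BM25 index.
    docs = [Counter(tokenize(p)) for p in passages]
    scores = np.zeros(len(passages))
    for term in set(tokenize(query)):
        df = sum(1 for d in docs if term in d)
        if df == 0:
            continue
        idf = math.log(1.0 + len(passages) / df)
        for i, d in enumerate(docs):
            scores[i] += d[term] * idf
    return scores

def encode(text: str, dim: int = 256) -> np.ndarray:
    # Dense path: a hashed bag of words stands in for the learned encoder.
    v = np.zeros(dim)
    for tok in tokenize(text):
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(claim: str, k: int = 2) -> list[str]:
    sparse = sparse_scores(claim, PASSAGES)
    dense = np.array([encode(p) @ encode(claim) for p in PASSAGES])
    combined = sparse / (sparse.max() or 1.0) + dense  # naive score fusion
    top = np.argsort(-combined)[:k]
    return [PASSAGES[i] for i in top]

print(retrieve("The Eiffel Tower was finished in 1889."))
```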
The verification engine then ranks the candidate documents and the original citation with reference to the claim. A neural network takes the claim and a document as input and predicts how well the document supports the claim. For efficiency reasons, it operates at the passage level and calculates a document’s verification score as the maximum over its per-passage scores. The verification scores are produced by a fine-tuned BERT transformer that takes the concatenated claim and passage as input.
In other words, the model creates and compares mathematical representations of the meanings of entire statements rather than of individual words. Because webpages can contain long stretches of text, the models assess content in chunks and consider only the most relevant passage when deciding whether to recommend a URL.
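That scoring scheme, a support score per (claim, passage) pair with a max over a document’s passages, can be sketched as follows. The token-overlap scorer is only a placeholder for the fine-tuned BERT cross-encoder that actually consumes the concatenated claim and passage, and the function names are invented.

```python
def score_pair(claim: str, passage: str) -> float:
    # Placeholder for the fine-tuned BERT cross-encoder: token overlap only.
    c, p = set(claim.lower().split()), set(passage.lower().split())
    return len(c & p) / max(len(c), 1)

def split_into_passages(document: str, size: int = 100) -> list[str]:
    # The corpus is chunked into passages of roughly 100 tokens.
    tokens = document.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def verification_score(claim: str, document: str) -> float:
    # A document's score is the maximum over its per-passage scores.
    return max(
        (score_pair(claim, p) for p in split_into_passages(document)),
        default=0.0,
    )
```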
The indices pass potential sources to an evidence-ranking model, which compares the new text with the original citation. Using fine-grained language comprehension, the model ranks the cited source and the retrieved alternatives according to the likelihood that they support the claim. If the original citation is not ranked above the candidate documents, then a new citation from the retrieved candidates is suggested.
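The final decision rule is simple to state in code. The sketch below assumes the verification scores have already been computed (the scores and URLs are made up); if no retrieved candidate outranks the existing citation, nothing is suggested.

```python
def suggest_citation(original_score: float,
                     candidate_scores: dict[str, float]) -> str | None:
    # Suggest the best-scoring candidate only if it outranks the original.
    best_url, best_score = max(candidate_scores.items(), key=lambda kv: kv[1])
    return best_url if best_score > original_score else None

# Hypothetical scores for an existing citation and two retrieved candidates.
print(suggest_citation(0.42, {"https://example.org/a": 0.61,
                              "https://example.org/b": 0.37}))
# -> https://example.org/a
```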
Sphere was tested on the Knowledge Intensive Language Tasks (KILT) benchmark and surpassed the state of the art on two of its tasks.
No computer system yet has human-level comprehension of language, but projects like this one, which teach algorithms to understand dense material with an ever-higher degree of sophistication, help AI make sense of the real world. Meta AI’s research and advancements team says the goal of this work is to build a platform that helps Wikipedia editors systematically spot citation issues and quickly fix the citation, or correct the content of the corresponding article, at scale. SIDE is open-sourced and can be tested online.