Wikipedia is releasing a dataset for training AI models in a bid to deter bots from scraping its online encyclopedia. On Wednesday, the Wikimedia Foundation announced that it has partnered with Google-owned platform Kaggle to publish a beta dataset tailored for machine learning applications.
According to the organisation, the dataset comprises “structured Wikipedia content in English and French” and, as of 15 April, includes openly licensed research summaries, short descriptions, image links, infobox data, and article sections. It is intended to give developers easy access to machine-readable article data for modeling, fine-tuning, benchmarking, alignment, and analysis.
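For developers who want to experiment with the release, the snippet below is a minimal sketch of pulling the dataset from Kaggle and reading a record in Python. The dataset slug, file name, and record fields are assumptions based on the description above rather than confirmed details of the release; check the Kaggle listing for the exact identifiers.

```python
import json

import kagglehub  # pip install kagglehub

# Download the Wikimedia dataset from Kaggle. The slug below is an
# assumption; confirm the exact identifier on the Kaggle listing.
path = kagglehub.dataset_download(
    "wikimedia-foundation/wikipedia-structured-contents"
)

# The beta release reportedly ships structured articles per language;
# the JSON Lines file name and field names here are illustrative.
with open(f"{path}/enwiki_structured_contents.jsonl", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        # Each record should carry fields such as the article title,
        # short description, infobox data, and sections.
        print(article.get("name"), "-", article.get("description"))
        break  # inspect a single record
```

Because the data arrives as structured, machine-readable records rather than rendered pages, a loop like this replaces the HTML scraping and parsing that bot crawlers currently perform against Wikipedia itself.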

Previously, Wikimedia reported that AI bot crawlers had been bombarding Wikipedia since the beginning of last year, straining the online encyclopedia’s servers. A dataset specifically optimised for AI development offers a potential solution, since it gives developers a better alternative to scraping raw article text. The partnership with Kaggle should also make the data more accessible to independent developers and smaller companies.
For the uninitiated, Kaggle is a platform for machine learning practitioners, researchers, and data enthusiasts. Its partnerships lead, Brenda Flynn, said the company is eager to play a role in keeping the Wikimedia Foundation’s data accessible.
(Source: The Verge)