Monday, November 25, 2024

GPT-4o’s Chinese token-training data is polluted by spam and porn websites


The new tokenizer has 200,000 tokens in total, and about 25% are in non-English languages, says Deedy Das, an AI investor at Menlo Ventures. He used language filters to count the number of tokens in different languages, and the top languages, besides English, are Russian, Arabic, and Vietnamese.
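A per-language count of this kind can be approximated by bucketing each token by the Unicode script of its letters. The sketch below is only an illustration of that idea, not Das's actual method: the function names and the tiny sample vocabulary are hypothetical, and script is a rough proxy for language (Cyrillic covers more than Russian, and Vietnamese falls under Latin).

```python
import unicodedata
from collections import Counter


def script_of(token: str) -> str:
    """Return a coarse script label for a token, based on its letters."""
    scripts = Counter()
    for ch in token:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation, combining marks
        name = unicodedata.name(ch, "")
        if name:
            # The first word of a Unicode character name is usually its script,
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC".
            scripts[name.split(" ")[0]] += 1
    if not scripts:
        return "OTHER"
    return scripts.most_common(1)[0][0]


def count_by_script(vocab: list[str]) -> Counter:
    """Count how many tokens in the vocabulary fall under each script."""
    return Counter(script_of(token) for token in vocab)


# Hypothetical sample vocabulary; a real run would use the tokenizer's full token list.
sample_vocab = ["hello", " world", "привет", "مرحبا", "中文", "123", "नमस्ते"]
counts = count_by_script(sample_vocab)
for script, n in counts.most_common():
    print(script, n)
```

On a full 200,000-token vocabulary, the share of non-Latin buckets would give a rough version of the 25% figure, subject to the caveats above.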

“So the tokenizer’s main impact, in my opinion, is you get the cost down in these languages, not that the quality in these languages goes dramatically up,” Das says. When an LLM has better and longer tokens in non-English languages, it can analyze the prompts faster and charge users less for the same answer. With the new tokenizer, “you’re looking at almost four times cost reduction,” he says.
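The arithmetic behind that claim is simple if usage is billed per token: the same text split into fewer tokens costs proportionally less. The sketch below assumes an unchanged per-token price; the price and the token counts are made-up numbers for illustration, not measurements of any tokenizer.

```python
def prompt_cost(num_tokens: int, price_per_million: float) -> float:
    """Cost of a prompt when billing is proportional to token count."""
    return num_tokens * price_per_million / 1_000_000


def cost_reduction_factor(old_tokens: int, new_tokens: int) -> float:
    """How many times cheaper the same text is under the new tokenizer,
    assuming the per-token price stays the same."""
    return old_tokens / new_tokens


# Hypothetical example: one sentence that took 400 tokens under an older
# tokenizer and takes 100 tokens under a newer one.
old_tokens, new_tokens = 400, 100
price = 5.0  # assumed price in dollars per million tokens

print(prompt_cost(old_tokens, price))
print(prompt_cost(new_tokens, price))
print(cost_reduction_factor(old_tokens, new_tokens))  # 4.0
```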

Das, who also speaks Hindi and Bengali, took a look at the longest tokens in those languages. The tokens reflect discussions happening in those languages, so they include words like “Narendra” or “Pakistan,” but common English terms like “Prime Minister,” “college,” and “worldwide” also come up frequently. They also don’t exhibit the issues surrounding the Chinese tokens.

That likely reflects the training data in those languages, Das says: “My working theory is the websites in Hindi and Bengali are very rudimentary. It’s like [mostly] news articles. So I would expect this to be the case. There are not many spam bots and porn websites trying to happen in these languages. It’s mostly going to be in English.”

Polluted data and a lack of cleaning

However, things are drastically different in Chinese. According to multiple researchers who have looked into the new library of tokens used for GPT-4o, the longest tokens in Chinese are almost exclusively spam words used in pornography, gambling, and scamming contexts. Even shorter tokens, like three-character-long Chinese words, reflect those topics to a significant degree.

“The problem is clear: the corpus used to train [the tokenizer] is not clean. The English tokens seem fine, but the Chinese ones are not,” says Cai from Princeton University. It is not rare for a language model to crawl spam when collecting training data, but usually there will be significant effort taken to clean up the data before it’s used. “It’s possible that they didn’t do proper data clearing when it comes to Chinese,” he says.

The content of these Chinese tokens could suggest that they have been polluted by a specific phenomenon: websites hijacking unrelated content in Chinese or other languages to boost spam messages.

These messages are often advertisements for pornography videos and gambling websites. They could be real businesses or merely scams. And the language is inserted into content farm websites or sometimes legitimate websites so they can be indexed by search engines, circumvent the spam filters, and come up in random searches. For example, Google indexed one search result page on a US National Institutes of Health website, which lists a porn site in Chinese. The same site name also appeared in at least five Chinese tokens in GPT-4o.
