Start Up No.2232: Google search algorithm leaks, the AI disinformation problem, spying on the ICC, Chinese GPT spam?, and more


iFixit and Samsung have broken off their device repair partnership after a couple of increasingly stormy years. CC-licensed photo by spline splinson on Flickr.

You can sign up to receive each day’s Start Up post by email. You’ll need to click a confirmation link, so no spam.


There’s another post coming this week at the Social Warming Substack on Friday at 0845 UK time. Free signup.


A selection of 10 links for you. Can’t fix this. I’m @charlesarthur on Twitter. On Threads: charles_arthur. On Mastodon: https://newsie.social/@charlesarthur. Observations and links welcome.


Google won’t comment on a potentially massive leak of its search algorithm documentation • The Verge

Mia Sato:

»

an explosive leak that purports to show thousands of pages of internal documents appears to offer an unprecedented look under the hood of how Search works — and suggests that Google hasn’t been entirely truthful about it for years. So far, Google hasn’t responded to multiple requests for comment on the legitimacy of the documents.

Rand Fishkin, who worked in SEO for more than a decade, says a source shared 2,500 pages of documents with him with the hopes that reporting on the leak would counter the “lies” that Google employees had shared about how the search algorithm works. The documents outline Google’s search API and break down what information is available to employees, according to Fishkin.

The details shared by Fishkin are dense and technical, likely more legible to developers and SEO experts than the layperson. The contents of the leak are also not necessarily proof that Google uses the specific data and signals it mentions for search rankings. Rather, the leak outlines what data Google collects from webpages, sites, and searchers and offers indirect hints to SEO experts about what Google seems to care about, as SEO expert Mike King wrote in his overview of the documents.

The leaked documents touch on topics like what kind of data Google collects and uses, which sites Google elevates for sensitive topics like elections, how Google handles small websites, and more. Some information in the documents appears to be in conflict with public statements by Google representatives, according to Fishkin and King.

“‘Lied’ is harsh, but it’s the only accurate word to use here,” King writes. “While I don’t necessarily fault Google’s public representatives for protecting their proprietary information, I do take issue with their efforts to actively discredit people in the marketing, tech, and journalism worlds who have presented reproducible discoveries.”

«

The documentation seems to have been exposed accidentally by Google itself, when it was posted to GitHub and left public for a couple of months. It appears to be the biggest leak of documents ever from inside Google search. Hell of a thing: for all that the search algorithm has been kept fiercely secret for more than 25 years, the “hacking” was done by… Google.
unique link to this extract


Google researchers say AI now leading disinformation vector (and are severely undercounting the problem) • 404 Media

Emanuel Maiberg:

»

As an endless stream of entirely wrong and sometimes dangerous AI-generated answers from Google goes viral on social media, new research from Google researchers and several fact-checking organizations has found that most image-based disinformation is now AI-generated, but the way the researchers collected their data suggests that the problem is even worse than they claim.

The paper, first spotted by the Faked Up newsletter, measures the rise of AI-generated image-based disinformation by looking at what fact checkers at Snopes, Politifact, and other sites have claimed were image-based disinformation. Overall, the study looks at a total of 135,838 fact checks which date back to 1995, but the majority of the claims were created after 2016 and the introduction of ClaimReview, a tagging system that allows fact checkers and publishers to flag disinformation for platforms like Google, Facebook, Bing, and others.

The most telling chart in the study shows the “prevalence of content manipulation types as a function of overall content manipulations.” In other words, it shows the different types of image-based disinformation and how common they are over time.  

As you can see from the chart, AI-generated image-based disinformation was simply not a thing until late 2023, when AI image generators became widely available and popular, at which point it all but replaced every other form of image-based disinformation. The chart also shows that the total number of image-based disinformation samples rose as AI images took off, but only slightly.

“Interestingly, the rise of AI images did not produce a bump in the overall proportion of misinformation claims that depend on images during this period, and image-based misinformation continued to decline on a relative basis as video-based misinformation grew,” the paper says.

«

The problem is in reality far worse, because fact-checked content is only a small fraction of what circulates.
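The paper’s headline chart is essentially a share-of-total calculation over fact-checked image claims. As a rough illustration only – the researchers’ actual pipeline isn’t described here, and the file and column names below are invented – the tally might look something like this in pandas:

import pandas as pd

# Hypothetical dataset: one row per image-based fact check, with the date of
# the check and the manipulation type the fact checker assigned
# (e.g. "ai_generated", "photoshopped", "out_of_context").
checks = pd.read_csv("image_fact_checks.csv", parse_dates=["date"])

counts = (
    checks
    .assign(quarter=checks["date"].dt.to_period("Q"))
    .groupby(["quarter", "manipulation_type"])
    .size()
    .unstack(fill_value=0)
)

# Prevalence of each manipulation type as a share of all image manipulations
# fact-checked in that quarter.
prevalence = counts.div(counts.sum(axis=1), axis=0)
print(prevalence.tail())

On that kind of normalised view, AI-generated images can come to dominate the mix even while the absolute number of image fact checks barely moves – which is exactly the pattern the paper reports.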
unique link to this extract


Surveillance and interference: Israel’s covert war on the ICC exposed • +972 Magazine

Yuval Abraham and Meron Rapoport:

»

For nearly a decade, Israel has been surveilling senior International Criminal Court officials and Palestinian human rights workers as part of a secret operation to thwart the ICC’s probe into alleged war crimes, a joint investigation by +972 Magazine, Local Call, and the Guardian can reveal.

The multi-agency operation, which dates back to 2015, has seen Israel’s intelligence community routinely surveil the court’s current chief prosecutor Karim Khan, his predecessor Fatou Bensouda, and dozens of other ICC and UN officials. Israeli intelligence also monitored materials that the Palestinian Authority submitted to the prosecutor’s office, and surveilled employees at four Palestinian human rights organizations whose submissions are central to the probe.

According to sources, the covert operation mobilized the highest branches of Israel’s government, the intelligence community, and both the civilian and military legal systems in order to derail the probe.

The intelligence information obtained via surveillance was passed on to a secret team of top Israeli government lawyers and diplomats, who traveled to The Hague for confidential meetings with ICC officials in an attempt to “feed [the chief prosecutor] information that would make her doubt the basis of her right to be dealing with this question.” The intelligence was also used by the Israeli military to retroactively open investigations into incidents that were of interest to the ICC, to try to prove that Israel’s legal system is capable of holding its own to account.

Additionally, as the Guardian reported earlier today, the Mossad, Israel’s foreign intelligence agency, ran its own parallel operation which sought out compromising information on Bensouda and her close family members in an apparent attempt to sabotage the ICC’s investigation. The agency’s former head, Yossi Cohen, personally attempted to “enlist” Bensouda and manipulate her into complying with Israel’s wishes, according to sources familiar with his activities, causing the then-prosecutor to fear for her personal safety.

«

Hell of an operation for a court that Israel feigns not to be concerned about. Also, perhaps we should simply assume that all operations are being surveilled all the time.
unique link to this extract


GPT-4o’s Chinese token-training data is polluted by spam and porn websites • MIT Technology Review

Zeyi Yang:

»

Soon after OpenAI released GPT-4o on Monday, May 13, some Chinese speakers started to notice that something seemed off about this newest version of the chatbot: the tokens it uses to parse text were full of spam and porn phrases.

On May 14, Tianle Cai, a PhD student at Princeton University studying inference efficiency in large language models like those that power such chatbots, accessed GPT-4o’s public token library and pulled a list of the 100 longest Chinese tokens the model uses to parse and compress Chinese prompts. 

Humans read in words, but LLMs read in tokens, which are distinct units in a sentence that have consistent and significant meanings. Besides dictionary words, they also include suffixes, common expressions, names, and more. The more tokens a model encodes, the faster the model can “read” a sentence and the less computing power it consumes, thus making the response cheaper.

Of the 100 results, only three are common enough to be used in everyday conversation; everything else consists of words and expressions used specifically in the context of either gambling or pornography. The longest token, running to 10.5 Chinese characters, literally means “_free Japanese porn video to watch.” Oops.

«

This was so, so predictable.
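GPT-4o’s tokenizer is public, so Cai’s inspection is easy to approximate. Here is a rough sketch using OpenAI’s tiktoken library and its o200k_base encoding (the one GPT-4o uses); the crude “mostly CJK characters” filter is a simplification for illustration, not Cai’s exact method:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the encoding used by GPT-4o

def mostly_chinese(text: str) -> bool:
    # Treat a token as Chinese if over half its characters fall in the
    # CJK Unified Ideographs block.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return len(text) > 0 and cjk / len(text) > 0.5

chinese_tokens = []
for token_id in range(enc.n_vocab):
    try:
        text = enc.decode_single_token_bytes(token_id).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        continue  # skip unused ids and byte fragments that aren't valid UTF-8
    if mostly_chinese(text):
        chinese_tokens.append(text)

# The 100 longest Chinese tokens, longest first.
for token in sorted(chinese_tokens, key=len, reverse=True)[:100]:
    print(len(token), token)

Long strings only get their own token if they turned up very frequently in the tokenizer’s training data, which is why a listing like this says so much about the Chinese web text that was scraped.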
unique link to this extract


Analysis: monthly drop hints that China’s CO2 emissions may have peaked in 2023 • Carbon Brief

Lauri Myllyvirta:

»

China’s carbon dioxide (CO2) emissions fell by 3% in March 2024, ending a 14-month surge that began when the economy reopened after the nation’s “zero-Covid” controls were lifted in December 2022.

The new analysis for Carbon Brief, based on official figures and commercial data, reinforces the view that China’s emissions could have peaked in 2023.

The drivers of the CO2 drop in March 2024 were expanding solar and wind generation, which covered 90% of the growth in electricity demand, as well as declining construction activity.

Oil demand growth also ground to a halt, indicating that the post-Covid rebound may have run its course.

A 2023 peak in China’s CO2 emissions is possible if the buildout of clean energy sources is kept at the record levels seen last year.

However, there are divergent views across the industry and government on the outlook for clean energy growth. How this gap gets resolved is the key determinant of when China’s emissions will peak – if they have not done so already.

Other key findings from the analysis include:
• Wind and solar growth pushed fossil fuels’ share of electricity generation in China down to 63.6% in March 2024, from 67.4% a year earlier, despite strong growth in demand
• The ongoing contraction of real-estate construction activity in China saw steel production fall by 8% and cement output by 22% in March 2024
• Electric vehicles (EVs) now make up around one-in-10 vehicles on China’s roads, knocking around 3.5 percentage points off the growth in petrol demand
• Some 45% of last year’s record solar additions were smaller-scale “distributed” systems, creating an illusory “missing data problem”. 

«

So fossil fuels are still a very, very big source of generation – lots of it coal – but the picture keeps improving. And the EV stat helps too.
unique link to this extract


What does the public in six countries think of generative AI in news? • Reuters Institute for the Study of Journalism

Dr Richard Fletcher and Professor Rasmus Kleis Nielsen:

»

Based on an online survey focused on understanding if and how people use generative artificial intelligence (AI), and what they think about its application in journalism and other areas of work and life across six countries (Argentina, Denmark, France, Japan, the UK, and the USA), we present the following findings.

Findings on the public’s use of generative AI: ChatGPT is by far the most widely recognised generative AI product – around 50% of the online population in the six countries surveyed have heard of it. It is also by far the most widely used generative AI tool in the six countries surveyed. That being said, frequent use of ChatGPT is rare, with just 1% using it on a daily basis in Japan, rising to 2% in France and the UK, and 7% in the USA. Many of those who say they have used generative AI have used it just once or twice, and it is yet to become part of people’s routine internet use.

In more detail, we find:
• While there is widespread awareness of generative AI overall, a sizable minority of the public – between 20% and 30% of the online population in the six countries surveyed – have not heard of any of the most popular AI tools
• In terms of use, ChatGPT is by far the most widely used generative AI tool in the six countries surveyed, two or three times more widespread than the next most widely used products, Google Gemini and Microsoft Copilot
• Younger people are much more likely to use generative AI products on a regular basis. Averaging across all six countries, 56% of 18–24s say they have used ChatGPT at least once, compared to 16% of those aged 55 and over
• Roughly equal proportions across the six countries say that they have used generative AI for getting information (24%) as for creating various kinds of media, including text but also audio, code, images, and video (28%)
• Just 5% across the six countries covered say that they have used generative AI to get the latest news.

«

That last 5% worries me, since generative AI has no business offering news. Notable too how it doesn’t seem to stick: people try it out and give it up.
unique link to this extract


iFixit ends Samsung deal as oppressive repair shop requirements come to light • Ars Technica

Ron Amadeo:

»

iFixit and Samsung were once leading the charge in device repair, but iFixit says it’s ending its repair partnership with Samsung because it feels Samsung just isn’t participating in good faith. iFixit says the two companies “have not been able to deliver” on the promise of a viable repair ecosystem, so it would rather shut the project down than continue. The repair site says “flashy press releases and ambitious initiatives don’t mean much without follow-through.”

iFixit’s Scott Head explains: “As we tried to build this ecosystem we consistently faced obstacles that made us doubt Samsung’s commitment to making repair more accessible. We couldn’t get parts to local repair shops at prices and quantities that made business sense. The part prices were so costly that many consumers opted to replace their devices rather than repair them. And the design of Samsung’s Galaxy devices remained frustratingly glued together, forcing us to sell batteries and screens in pre-glued bundles that increased the cost.”

…Samsung has also reportedly been on the attack against repair, even while it partners with iFixit. On the same day that iFixit announced it was dropping the partnership, 404 Media reported that Samsung requires independent repair shops to turn over customer data and “immediately disassemble” any device found to be using third-party parts. Imagine taking your phone to a shop for repair and finding out it was destroyed by the shop as a requirement from Samsung. The report also says Samsung’s contracts require that independent companies “daily” upload to a Samsung database (called G-SPN) the details of each and every repair “at the time of each repair.”

«

(Thanks G for the link.)
unique link to this extract


Apple wants to know if you’re hearing things because of tinnitus • The Verge

Justine Calma:

»

More than 77% of people who participated in a big Apple-sponsored study have experienced tinnitus at some point in their lives, according to preliminary data. Around 15% say they’re affected daily by tinnitus, perceiving ringing or other sounds that other people can’t hear.

In one of the largest surveys of its kind, researchers at the University of Michigan gathered data from more than 160,000 participants who responded to survey questions and completed hearing assessments on Apple’s Research app since 2019. The goal is to study the effects of sound exposure through headphones, how tinnitus impacts people, and perhaps develop new methods for managing the symptoms.

“The trends that we’re learning through the Apple Hearing Study about people’s experience with tinnitus can help us better understand the groups most at risk, which can in turn help guide efforts to reduce the impacts associated with it,” University of Michigan environmental health sciences professor Rick Neitzel said in a press release.

«

unique link to this extract


Forget retirement. Older people are turning to gig work to survive • Rest of World

Laís Martins, Kimberly Mutandiro, Lam Le and Zuha Siddiqui:

»

Most gig workers around the world are relatively young: Research published in 2021 by the International Labour Organization (ILO), a United Nations agency focused on improving working conditions, puts the average age for delivery workers at 29 and the average age for ride-hailing drivers at 36. But Rest of World reporting suggests that older individuals are turning to gig work, too — and their numbers are expected to grow in the coming years.

Over the past three months, Rest of World spoke to 52 gig workers between the ages of 50 and 75 years in Latin America, Africa, South Asia, and Southeast Asia. Some chose gig work to keep up with rising living costs or to make up for threadbare social security systems; others say it’s impossible to find employment elsewhere once they near retirement age. Still others say that gig work is a second job, one they can transition into full-time once they retire or are no longer employed. Many reported low earnings, long hours, and health complications from their work. Without being able to put enough savings aside for retirement, others communicated the feeling that there was no alternative. But for all its challenges and pitfalls, gig work can represent an accessible, flexible, low-barrier activity that enables older workers to offset expenses and continue to be active. 

… Older gig workers are a demographic that’s expected to grow in the coming decades. The global population of people 65 or older is expected to double by 2050, surpassing 1.6 billion, according to the UN. At the same time, family units around the world are transforming, often requiring older people to support themselves for longer.

«

unique link to this extract


How researchers cracked an 11-year-old password to a $3m crypto wallet • WIRED

Kim Zetter:

»

Two years ago, when “Michael,” an owner of cryptocurrency, contacted Joe Grand to help recover access to about $2m worth of bitcoin he had stored in encrypted format on his computer, Grand turned him down.

Michael, who is based in Europe and asked to remain anonymous, stored the cryptocurrency in a password-protected digital wallet. He generated a password using the RoboForm password manager and stored that password in a file encrypted with a tool called TrueCrypt. At some point, that file got corrupted and Michael lost access to the 20-character password he had generated to secure his 43.6 BTC (worth a total of about €4,000, or $5,300, in 2013). Michael used the RoboForm password manager to generate the password but did not store it in his manager. He worried that someone would hack his computer and obtain the password.

“At [that] time, I was really paranoid with my security,” he laughs.

…last June he approached Grand again, hoping to convince him to help, and this time Grand agreed to give it a try, working with a friend named Bruno in Germany who also hacks digital wallets.

Grand and Bruno spent months reverse engineering the version of the RoboForm program that they thought Michael had used in 2013 and found that the pseudo-random number generator used to generate passwords in that version—and subsequent versions until 2015—did indeed have a significant flaw that made the random number generator not so random. The RoboForm program unwisely tied the random passwords it generated to the date and time on the user’s computer—it determined the computer’s date and time, and then generated passwords that were predictable. If you knew the date and time and other parameters, you could compute any password that would have been generated on a certain date and time in the past.

If Michael knew the day or general time frame in 2013 when he generated it, as well as the parameters he used to generate the password (for example, the number of characters in the password, including lower- and upper-case letters, figures, and special characters), this would narrow the possible password guesses to a manageable number. Then they could hijack the RoboForm function responsible for checking the date and time on a computer and get it to travel back in time, believing the current date was a day in the 2013 time frame when Michael generated his password. RoboForm would then spit out the same passwords it generated on the days in 2013.

There was one problem: Michael couldn’t remember when he created the password.

«

So, so many tales of bitcoin people who lose their password.
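For anyone wondering why a time-seeded generator is so fatal: the sketch below is not RoboForm’s actual algorithm, just a minimal illustration of the anti-pattern the piece describes – once the only entropy is the clock, an attacker who can bound the generation time simply replays every candidate second.

import random
import string
from datetime import datetime, timedelta

ALPHABET = string.ascii_letters + string.digits + "!@#$%^&*"

def generate_password(moment: datetime, length: int = 20) -> str:
    # The fatal design choice: the only entropy is the timestamp.
    rng = random.Random(int(moment.timestamp()))
    return "".join(rng.choice(ALPHABET) for _ in range(length))

# The "victim" generates a password at some moment in 2013 and forgets when.
secret_moment = datetime(2013, 5, 15, 14, 3, 27)
lost_password = generate_password(secret_moment)

# An attacker who knows the rough window (say, mid-May 2013) and the password
# parameters just walks through every second in that window.
start = datetime(2013, 5, 12)
for offset in range(14 * 24 * 3600):  # a two-week window, one guess per second
    candidate = start + timedelta(seconds=offset)
    if generate_password(candidate) == lost_password:
        print("Password regenerated; it was created at", candidate)
        break

Grand and Bruno’s real job was harder – they had to reverse engineer the 2013 version of RoboForm to recover the exact generator and parameters – but the principle is the same, and it is why password generators should draw on the operating system’s cryptographic randomness rather than the clock.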
unique link to this extract


• Why do social networks drive us a little mad?
• Why does angry content seem to dominate what we see?
• How much of a role do algorithms play in affecting what we see and do online?
• What can we do about it?
• Did Facebook have any inkling of what was coming in Myanmar in 2016?

Read Social Warming, my latest book, and find answers – and more.


Errata, corrigenda and ai no corrida: none notified
