Google DeepMind has a new way to look inside an AI’s “mind”

AI has led to breakthroughs in drug discovery and robotics, and it is in the process of entirely revolutionizing how we interact with machines and the web. The only problem is that we don’t know exactly how it works, or why it works so well. We have a fair idea, but the details are too complex to unpick. That matters: it could lead us to deploy an AI system in a highly sensitive field like medicine without realizing it has critical flaws embedded in its workings.

A team at Google DeepMind that studies something called mechanistic interpretability has been working on new ways to let us peer under the hood. At the end of July, it released Gemma Scope, a tool to help researchers understand what is happening when AI is generating an output. The hope is that if we have a better understanding of what is happening inside an AI model, we’ll be able to control its outputs more effectively, leading to better AI systems in the future.

“I want to be able to look inside a model and see if it’s being deceptive,” says Neel Nanda, who runs the mechanistic interpretability team at Google DeepMind. “It seems like being able to read a model’s mind should help.”

Mechanistic interpretability, also known as “mech interp,” is a new research field that aims to understand how neural networks actually work. At the moment, very basically, we put inputs into a model in the form of a lot of data, and we get a bunch of model weights out at the end of training. These are the parameters that determine how a model makes decisions. We have some idea of what’s happening between the inputs and the model weights: Essentially, the AI is finding patterns in the data and drawing conclusions from those patterns, but these patterns can be incredibly complex and often very hard for humans to interpret.

It’s like a teacher reviewing the answers to a complex math problem on a test. The student—the AI, in this case—wrote down the correct answer, but the work looks like a bunch of squiggly lines. This example assumes the AI is always getting the correct answer, but that’s not always true; the AI student may have found an irrelevant pattern that it’s assuming is valid. For example, some current AI systems will give you the result that 9.11 is bigger than 9.8. Different methods developed in the field of mechanistic interpretability are beginning to shed a little bit of light on what may be happening, essentially making sense of the squiggly lines.

“A key goal of mechanistic interpretability is trying to reverse-engineer the algorithms inside these systems,” says Nanda. “We give the model a prompt, like ‘Write a poem,’ and then it writes some rhyming lines. What is the algorithm by which it did this? We’d love to understand it.”

To find features, or categories of data that represent a larger concept, in its AI model, Gemma, DeepMind ran a tool known as a “sparse autoencoder” on each of its layers. You can think of a sparse autoencoder as a microscope that zooms in on those layers and lets you look at their details. For example, if you prompt Gemma about a chihuahua, it will trigger the “dogs” feature, lighting up what the model knows about “dogs.” It is considered “sparse” because it limits the number of neurons used, basically pushing for a more efficient and generalized representation of the data.
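
To make the idea concrete, here is a minimal sketch of a sparse autoencoder in PyTorch. This is not Gemma Scope’s actual code; the layer width, number of features, and sparsity penalty are illustrative assumptions.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Decomposes one layer's activations into a wider, mostly-zero set of "features."
    def __init__(self, d_model=2304, n_features=16384):  # illustrative sizes, not Gemma Scope's
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> feature strengths
        self.decoder = nn.Linear(n_features, d_model)  # feature strengths -> reconstruction

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # ReLU keeps only firing features; most stay at zero
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
layer_output = torch.randn(1, 2304)  # stand-in for one token's activations at one layer
features, reconstruction = sae(layer_output)
# Training (not shown) trades off reconstruction error against a sparsity penalty,
# so each input lights up only a handful of hopefully interpretable features.
loss = nn.functional.mse_loss(reconstruction, layer_output) + 1e-3 * features.abs().sum()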

The tricky part of sparse autoencoders is deciding how granular you want to get. Think again about the microscope. You can magnify something to an extreme degree, but it may make what you’re looking at impossible for a human to interpret. But if you zoom too far out, you may be limiting what interesting things you can see and discover. 

DeepMind’s solution was to run sparse autoencoders of different sizes, varying the number of features they wanted the autoencoder to find. The goal was not for DeepMind’s researchers to thoroughly analyze the results on their own. Gemma and the autoencoders are open-source, so this project was aimed more at spurring interested researchers to look at what the sparse autoencoders found and hopefully gain new insights into the model’s internal logic. Since DeepMind ran autoencoders on each layer of its model, a researcher could map the progression from input to output to a degree we haven’t seen before.

“This is really exciting for interpretability researchers,” says Josh Batson, a researcher at Anthropic. “If you have this model that you’ve open-sourced for people to study, it means that a bunch of interpretability research can now be done on the back of those sparse autoencoders. It lowers the barrier to entry to people learning from these methods.”

Neuronpedia, a platform for mechanistic interpretability, partnered with DeepMind in July to build a demo of Gemma Scope that you can play around with right now. In the demo, you can test out different prompts and see how the model breaks up your prompt and what activations your prompt lights up. You can also mess around with the model. For example, if you turn the feature about dogs way up and then ask the model a question about US presidents, Gemma will find some way to weave in random babble about dogs, or the model may just start barking at you.
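
As a rough sketch of what that kind of steering looks like under the hood (assuming a sparse autoencoder like the one above), you can add a multiple of one feature’s decoder direction back into the model’s activations at the layer it was trained on. The feature index and steering strength below are made up for illustration.

import torch

def steer(activations, decoder_weight, feature_index, strength=10.0):
    # Push the activations along one SAE feature's direction, e.g. a "dogs" feature.
    direction = decoder_weight[:, feature_index]  # that feature's direction in activation space
    return activations + strength * direction

d_model, n_features = 2304, 16384                   # illustrative sizes
decoder_weight = torch.randn(d_model, n_features)   # stand-in for a trained SAE decoder matrix
activations = torch.randn(1, d_model)               # one token's activations at the hooked layer
steered = steer(activations, decoder_weight, feature_index=123)  # 123: hypothetical "dogs" feature
# Feeding `steered` back into the rest of the forward pass is what makes the
# model weave dogs into otherwise unrelated answers when the strength is high.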

One interesting thing about sparse autoencoders is that they are unsupervised, meaning they find features on their own. That leads to surprising discoveries about how the models break down human concepts. “My personal favorite feature is the cringe feature,” says Joseph Bloom, science lead at Neuronpedia. “It seems to appear in negative criticism of text and movies. It’s just a great example of tracking things that are so human on some level.” 

You can search for concepts on Neuronpedia and it will highlight what features are being activated on specific tokens, or words, and how strongly each one is activated. “If you read the text and you see what’s highlighted in green, that’s when the model thinks the cringe concept is most relevant. The most active example for cringe is somebody preaching at someone else,” says Bloom.

Some features are proving easier to track than others. “One of the most important features that you would want to find for a model is deception,” says Johnny Lin, founder of Neuronpedia. “It’s not super easy to find: ‘Oh, there’s the feature that fires when it’s lying to us.’ From what I’ve seen, it hasn’t been the case that we can find deception and ban it.”

DeepMind’s research is similar to what another AI company, Anthropic, did back in May with Golden Gate Claude. It used sparse autoencoders to find the parts of Claude, its model, that lit up when discussing the Golden Gate Bridge in San Francisco. It then amplified the activations related to the bridge to the point where Claude identified not as Claude, an AI model, but as the physical Golden Gate Bridge and would respond to prompts as the bridge.

Although it may just seem quirky, mechanistic interpretability research may prove incredibly useful. “As a tool for understanding how the model generalizes and what level of abstraction it’s working at, these features are really helpful,” says Batson.

For example, a team led by Samuel Marks, now at Anthropic, used sparse autoencoders to find features showing that a particular model was associating certain professions with a specific gender. They then turned off these gender features to reduce bias in the model. The experiment was done on a very small model, so it’s unclear whether the work will apply to a much larger one.

Mechanistic interpretability research can also give us insights into why AI makes errors. In the case of the assertion that 9.11 is larger than 9.8, researchers from Transluce saw that the question was triggering the parts of an AI model related to Bible verses and September 11. The researchers concluded the AI could be interpreting the numbers as dates, asserting the later date, 9/11, as greater than 9/8. And in a lot of books like religious texts, section 9.11 comes after section 9.8, which may be why the AI thinks of it as greater. Once they knew why the AI made this error, the researchers tuned down the AI’s activations on Bible verses and September 11, which led to the model giving the correct answer when prompted again on whether 9.11 is larger than 9.8.
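
The corrective step described here can be sketched in the same framework: scale down or zero out the activations of the offending features before the reconstruction is passed back to the model. This assumes a sparse-autoencoder setup like the one sketched earlier, and the feature indices below are placeholders, not the ones the Transluce researchers actually identified.

import torch

def suppress(features, indices, scale=0.0):
    # Turn down chosen feature activations, e.g. hypothetical "Bible verse" and "September 11" features.
    adjusted = features.clone()
    adjusted[:, indices] *= scale
    return adjusted

features = torch.relu(torch.randn(1, 16384))       # stand-in feature activations for one token
fixed = suppress(features, indices=[4242, 911])    # made-up feature ids for illustration
# Decoding `fixed` instead of `features` removes those concepts' influence on the
# model's next-token prediction, which is how the wrong comparison was repaired.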

There are also other potential applications. Currently, a system-level prompt is built into LLMs to deal with situations like users who ask how to build a bomb. When you ask ChatGPT a question, the model is first secretly prompted by OpenAI to refrain from telling you how to make bombs or do other nefarious things. But it’s easy for users to jailbreak AI models with clever prompts, bypassing any restrictions. 

If the creators of the models are able to see where in an AI the bomb-building knowledge is, they can theoretically turn off those nodes permanently. Then even the most cleverly written prompt wouldn’t elicit an answer about how to build a bomb, because the AI would literally have no information about how to build a bomb in its system.

This kind of granular, precise control is easy to imagine but extremely hard to achieve with the current state of mechanistic interpretability.

“A limitation is the steering [influencing a model by adjusting its parameters] is just not working that well, and so when you steer to reduce violence in a model, it ends up completely lobotomizing its knowledge in martial arts. There’s a lot of refinement to be done in steering,” says Lin. The knowledge of “bomb making,” for example, isn’t just a simple on-and-off switch in an AI model. It most likely is woven into multiple parts of the model, and turning it off would probably involve hampering the AI’s knowledge of chemistry. Any tinkering may have benefits but also significant trade-offs.

That said, if we are able to dig deeper and peer more clearly into the “mind” of AI, DeepMind and others are hopeful that mechanistic interpretability could represent a plausible path to alignment—the process of making sure AI is actually doing what we want it to do.
