News Every Day

ISIS Executions and Non-Consensual Porn Are Powering AI Art

Some of the image-generating AI tools that have taken over the internet in recent months are powered in part by some of the worst images that have ever been posted to the internet, including images of the Islamic State executing people, photoshopped nudes of celebrities, and real nudes that were hacked from celebrities’ phones in the 2014 incident that came to be known as “The Fappening.”

AI text-to-image tools like DALL-E 2, Midjourney, and Stable Diffusion have all gone mainstream in recent months, allowing people to generate images in a matter of seconds. These tools (and other, less popular ones) rely on training data, which comes in the form of massive datasets of images scraped from the internet. These datasets are usually impossible for an average person to audit because they contain hundreds of millions and in some cases billions of images, with no easy way to sort through them. 

Recently, however, a site called Have I Been Trained allowed people to search the LAION-5B open source dataset, which contains 5.8 billion images scraped from the internet. LAION-5B and LAION-400M (which contains 400 million images) are not used by DALL-E 2 or Midjourney, but are used in part by several other projects, including the text-to-image AI Stable Diffusion and Google’s similar tool Imagen, which is not yet public.

Have I Been Trained was created by artist and musician Holly Herndon to help artists know whether their work is included in the datasets these AI models were trained on, and in theory request these images be removed. These datasets often include artwork made by artists who are not compensated or credited for their work in any way; artists are worried that some of their work will be replaced by these AI tools, which are ultimately fueled by photos and artworks unwittingly contributed by a large swath of humanity. 

The Have I Been Trained site allows you to search through the LAION-5B dataset using a method called clip-retrieval. Though LAION has its own search function, a spokesperson for Spawning, the group that created Have I Been Trained, told Motherboard that the site opted for simplicity, with the purpose of focusing on helping people “opt out of, or opt into, the dataset for future model training.”

To check whether their art is included in the LAION dataset, all an artist needs to do is type in their name or the title of an image they made and see if it turns up any of their work. Motherboard used the same search function with terms like “ISIS execution” and celebrity names and immediately found photos of extreme violence and non-consensual pornography included in the same datasets.

An example of images Stable Diffusion generated when given the prompt "ISIS."

On the LAION Discord, users have noted the amount of NSFW and violent content included in the database, and have discussed adding a "violence detector" to it. Users there have expressed concerns that some of the dataset could include child sex abuse imagery, and have at least attempted to remove "illegal" content. A Discord user who works on the project said that "content that was marked as being likely to be both about children and unsafe was completely filtered out at collection time."

A user asked "is it possible CP [child porn] is still in the dataset?" 

"It is possible," a person who works on the project responded. "As for now, nobody reported any such sample and I saw none while trying to find unsafe samples in LAION."

The existence of Have I Been Trained clearly shows the many problems with the large datasets that power various AI projects. There are copyright issues, privacy issues, bias issues, and problems with scraping in illegal content, stolen content, and non-consensual content. The images in the dataset raise questions about what this means for people whose images are not only being used without their knowledge, but also transformed into AI-generated images.

Google, which used the LAION-400M dataset to train its Imagen image-generating AI, told Motherboard that it has several systems in place to minimize—but not eliminate—the risk of using violent or abusive images.

“We’ve designed our systems to automatically detect and filter out words or phrases that violate our policies, which prohibit users from knowingly generating content that is sexually explicit; hateful or offensive; violent, dangerous, or illegal; or divulges personal information,” a Google spokesperson told Motherboard in an email. “Additionally, Imagen relies on two types of filters--safety filters for prompts entered by users, and state-of-the-art machine vision models to identify unsafe images after generation. We also eliminate risks of exposing personally identifiable information, by avoiding images with identifiable human faces.”

Stability AI, which develops Stable Diffusion, a model trained on the LAION-5B dataset (and which Motherboard previously reported is already generating porn), similarly told Motherboard that it took several steps to reduce potential harms included in the dataset.

“Stable Diffusion does not regurgitate images but instead crafts wholly new images of its own just like a person learns to paint their own images after studying for many years,” a Stability AI spokesperson told Motherboard. “The Stable Diffusion model, created by the University of Heidelberg, was not trained on the entire LAION dataset. It was created from a subset of images from the LAION database that were pre-classified by our systems to filter out extreme content from training. In addition to removing extreme content from the training set, the team trained the model on a synthetic dataset to further reduce potential harms.”

When presented with specific images of ISIS executions and non-consensual pornography, Stability AI said it could not say whether Stable Diffusion was trained on them, and distanced itself from the LAION dataset.

“The [Heidelberg] Lab undertook multiple steps in training the models and we did not have visibility or input on those steps—so we are not in a position to confirm (or deny) which images made it through each filter. Those decisions were made by the Lab, not Stability AI,” the spokesperson said. “From [the] LAION dataset (which is also not Stability AI), we understand that the dataset pointers are labeled for unsafe nudity and violence and we believe these were part of the filtering process.”

Christoph Schuhmann, the organizational lead for the group developing the LAION dataset, said that the group takes safety and privacy very seriously, and that while the dataset is available to the public, it is specifically designed for the “research community” rather than the general public, and that it is merely pointing to images that are publicly available elsewhere on the internet.

“First of all, it is important to know that we as LAION, are not providing any images, but just links to images already available in the public internet on publicly available websites,” Schuhmann told Motherboard in an email. “In general, our data set is targeted at users from the research community and should always, as stated in our disclaimer, be used responsibly and very thoughtfully.”

Schuhmann said that to avoid any kind of unwanted content, LAION implemented filters to not only tag potential NSFW content, but instantly remove images that could potentially contain such content.

“However, since we are providing data for scientific use, it is important to know that NSFW-images are very valuable for developing better detectors for such undesirable contents and can contribute to AI safety research,” he said. “In our opinion, any image generation model must ALWAYS ONLY be trained on filtered subsets of our LAION datasets.”

Schuhmann said that LAION can’t say that people who used its dataset actually used the filters it has, “but we can say with certainty that we have always been safety oriented & transparent about the nature of our data sets.”

LAION-5B was released in March, expanding upon its original LAION-400M. According to a press release about LAION-5B, the new dataset was “collected using a three-stage pipeline.” First, machines gathered data from Common Crawl, which is an open repository of web crawl data composed of over 50 billion web pages, to collect all HTML image tags that had alt-text attributes. Then, language detection was performed on the alt-text. Raw images were downloaded from the URLs and sent to a CLIP model, which is an AI model that connects text and images to calculate the similarity between the image and text. Duplicate images and pairs with low similarity were discarded. If the text was less than five characters or the image resolution was too large, the image was also removed.
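To make the filtering stage described above more concrete, here is a minimal sketch in Python. It is not LAION's code. The `Pair` record, the `filter_pairs` function, the pixel limit, and the similarity threshold are all hypothetical names and illustrative values, the short vectors stand in for real CLIP embeddings, and duplicates are detected by URL only, which is a simplification.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Pair:
    """One scraped image/alt-text pair (hypothetical record layout)."""
    url: str
    alt_text: str
    width: int
    height: int
    image_vec: tuple  # stand-in for a CLIP image embedding
    text_vec: tuple   # stand-in for a CLIP text embedding


MIN_ALT_TEXT_CHARS = 5        # "less than five characters" per the article
MAX_PIXELS = 4096 * 4096      # assumption: the article gives no number
MIN_SIMILARITY = 0.28         # assumption: illustrative threshold only


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(pairs):
    """Keep pairs that pass the length, size, similarity and duplicate checks."""
    seen_urls = set()
    kept = []
    for pair in pairs:
        if len(pair.alt_text.strip()) < MIN_ALT_TEXT_CHARS:
            continue  # alt-text too short
        if pair.width * pair.height > MAX_PIXELS:
            continue  # image resolution too large
        if cosine(pair.image_vec, pair.text_vec) < MIN_SIMILARITY:
            continue  # image and text do not match well enough
        if pair.url in seen_urls:
            continue  # duplicate (by URL only, a simplification)
        seen_urls.add(pair.url)
        kept.append(pair)
    return kept


if __name__ == "__main__":
    sample = [
        Pair("a", "a red bicycle", 640, 480, (1.0, 0.0), (0.9, 0.1)),
        Pair("b", "cat", 640, 480, (1.0, 0.0), (1.0, 0.0)),
    ]
    print([pair.url for pair in filter_pairs(sample)])
```

Nothing in this sketch looks at what an image depicts. A violent or explicit image whose alt-text describes it accurately would score well on similarity and pass every check, which is why content filtering has to be a separate step.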

The LAION team is aware of the copyright issues that it may face. LAION claims that all data falls under Creative Commons CC-BY 4.0, which allows people to share and adapt material as long as they attribute the source, provide a link to the license, and indicate if changes were made. On its FAQ, it says that it’s simply indexing existing content and that after computing similarity scores between pictures and texts, all photos are discarded. The FAQ also states that if your name is in the alt-text data and the corresponding image does not contain your image, it is not considered personal data. However, there is a takedown form on the LAION site that people can fill out, and the team will remove the image from all data repositories that it owns if it violates data protection regulations. LAION supports the efforts of the Spawning team, which has been sending it bulk lists of links to images to remove.

LAION’s site does not go into detail about the NSFW and violent images that appear in the dataset. It mentions that “links in the dataset can lead to images that are disturbing or discomforting depending on the filter or search method employed,” but does not address whether or not action would be taken to change this. The FAQ says, “We cannot act on data that are not under our control, for example, past releases that circulate via torrents.” This sentence could potentially apply to something such as Scarlett Johansson’s leaked nudes, which already exist across the internet, and it absolves the dataset creators of responsibility.

Have I Been Trained aims to empower people to see whether there are images they want to be removed from the datasets that power AI, but Motherboard’s reporting shows that it’s more difficult to find anyone who will take responsibility for scraping these images from the internet in the first place. When reached for comment, Google and Stability AI referred us to the maintainers of the LAION dataset. LAION, in turn, pointed to Common Crawl. Everyone we contacted explained that they have methods to filter abusive content, but admitted that they are not perfect. 

All of this points to an increasingly obvious problem: AI is progressing at an astonishing speed, but we still don’t have a good understanding of the datasets that power AI, and little accountability for whatever abusive images they contain.

Motherboard has written extensively about internet platforms’ inability to stop the spread of non-consensual pornography and how new forms of machine learning introduce a new form of abuse, especially for women and people in marginalized communities. Image-text pairings that perpetuate exploitation and bias, and that use images without explicit permission, form a harmful foundation for AI art models.

Google has acknowledged these shortcomings in Imagen and cites them as the reason why it has not allowed public usage of the program: “While a subset of our training data was filtered to removed noise and undesirable content, such as pornographic imagery and toxic language, we also utilized LAION-400M dataset which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, and thus inherits the social biases and limitations of large language models. As such, there is a risk that Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place.”

Stable Diffusion also acknowledges the potential harm of the results, saying “Despite how impressive being able to turn text into image is, beware to the fact that this model may output content that reinforces or exacerbates societal biases, as well as realistic faces, pornography and violence.”

Jason Koebler contributed reporting.








