Sound2Space

Auditory Vistas: From Soundwave to Neural Imagery

Team Members:

Selin Dursun

Merve Akdogan

Yinghou Wang

Type: Project for Enactive Design

Instructor: Jose Luis Garcia del Castillo Lopez

Time Frame: November 2023

Keywords: Audio Classification, Human-Computer Interaction, Artificial Intelligence, Deforum, Stable Diffusion, YAMNet, AudioSet, Text2Light, ChatGPT-API, sound-to-space

Overview

Sound2Space is a translation service that turns the language of sound into the language of imagery. It lets us experience our sonic environment in a new way: instead of only hearing the bustle of a city street or the tranquil sounds of nature, we can see these sounds as vivid visuals.

It transforms the ephemeral and often overlooked background noises of daily life into something we can appreciate visually, offering a new perspective on what it means to experience our environment.

1. Environmental Sound Recording: The journey begins with collecting audio samples. These are the environmental sounds that form the dataset for the entire project.

2. YAMNet Sound Classification: YAMNet is a pretrained deep-learning audio classifier, trained on AudioSet, that recognizes 521 classes of sound events. The recorded sounds are fed into YAMNet, which analyzes the audio and scores each class, yielding the types of sounds present in the recording.

3. ChatGPT API for Descriptive Prompt Generation: The output from YAMNet, a list of detected sound labels, is then sent to the ChatGPT API. ChatGPT uses these labels to write a text prompt that describes the sounds in a way that evokes imagery, painting a picture in words of the spatial and environmental context in which the sounds occur.

4. Deforum Stable Diffusion for Spatial Video Generation: With the descriptive prompt from ChatGPT, Deforum Stable Diffusion then gets to work. Deforum is an extension of the original Stable Diffusion model geared toward generating videos with spatial coherence. It interprets the ChatGPT prompt and renders a sequence of images, assembled into a video, that visually represents the described sound environment over time.

5. Text2Light for HDR Image Creation: The project doesn't stop with video. The same prompts used for video generation are also used for creating static High Dynamic Range (HDR) images with Text2Light. This model generates detailed and lighting-rich pictures that can be used as environments in 3D graphics or simply as standalone high-quality images that reflect the essence of the sounds in a single frame.
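The hand-off between steps 2 and 3 can be sketched in a few lines. YAMNet produces one score per class for each short audio frame, so the frames have to be reduced to a handful of labels before they are passed to a language model. The sketch below is illustrative only: the function names `top_labels` and `build_prompt_request`, the averaging strategy, the toy class list, and the wording of the request are all assumptions, not the project's actual code.

```python
def top_labels(frame_scores, class_names, k=3):
    """Average per-frame class scores and return the k highest-scoring labels.

    frame_scores: list of frames, each a list with one score per class
                  (the shape YAMNet returns, shown here with toy data).
    class_names:  list of label strings, one per class.
    """
    if not frame_scores:
        return []
    totals = [0.0] * len(class_names)
    for frame in frame_scores:
        for i, score in enumerate(frame):
            totals[i] += score
    means = [total / len(frame_scores) for total in totals]
    ranked = sorted(range(len(class_names)), key=lambda i: means[i], reverse=True)
    return [(class_names[i], means[i]) for i in ranked[:k]]


def build_prompt_request(labels):
    """Turn the detected labels into a request for a descriptive image prompt.

    The wording is hypothetical; the project's real instructions to ChatGPT
    are not given in the source.
    """
    names = ", ".join(name for name, _ in labels)
    return (
        "Describe, in one vivid sentence, the physical space in which "
        f"these sounds would be heard: {names}."
    )


# Toy stand-in for YAMNet output: 3 frames x 4 classes (the real model has 521).
class_names = ["Speech", "Traffic noise", "Bird", "Rain"]
frame_scores = [
    [0.10, 0.80, 0.05, 0.05],
    [0.20, 0.70, 0.05, 0.05],
    [0.60, 0.30, 0.05, 0.05],
]

labels = top_labels(frame_scores, class_names, k=2)
request = build_prompt_request(labels)
print(request)
```

Averaging over frames favors sounds that persist throughout the recording; taking the per-class maximum instead would favor brief but distinctive events.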

Each stage of "Sound2Space" uses AI to transform raw sound data into complex visual outputs, engaging multiple senses and blurring the line between hearing and seeing.
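For the video stage, Deforum animations are typically driven by prompts keyed to frame numbers, so a recording whose character changes over time can be expressed as a schedule of prompts. The sketch below assumes the recording has already been split into timed segments, each with its own generated prompt; the segment format, the `prompt_schedule` helper, the example prompts, and the frame rate are hypothetical and are not taken from the project.

```python
def prompt_schedule(segments, fps=12):
    """Map (start_seconds, prompt) pairs to a {frame_index: prompt} dict.

    segments: iterable of (start time in seconds, prompt text) tuples.
    fps:      assumed frame rate of the generated video.
    """
    schedule = {}
    for start_seconds, prompt in sorted(segments):
        frame = int(round(start_seconds * fps))
        schedule[frame] = prompt
    return schedule


# Hypothetical prompts for two consecutive sound environments.
segments = [
    (0.0, "a busy street at dusk, cars passing, wet asphalt"),
    (5.0, "a quiet park, birds in the trees, soft morning light"),
]

schedule = prompt_schedule(segments, fps=12)
for frame, prompt in schedule.items():
    print(frame, prompt)
```

Keying prompts to frames rather than seconds keeps the schedule independent of playback speed, at the cost of having to regenerate it whenever the frame rate changes.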

Useless Machine: "Blob"
