Last weekend, with the help of AI tools, we created a video for the song RageEhLittleBit in about 15 hours and for approximately $55.00.
The irony of creating an AI-augmented video for a song that rages about the impacts of AI on artists illustrates both the challenges and benefits of the recent explosion of generative AI.
As both an AI practitioner and the father of a skilled musician like Tori, I need a deep understanding not only of how these tools work under the hood, but also of their impacts, positive and negative, on those in the creative domain. As a blog post I wrote on the challenges of generative AI points out (The Challenge of Generative AI to the Artistic Ecosystem | Arcurve), some of the impacts will be negative, particularly if the data used to train the models is not factored into the monetization and copyright aspects of these sorts of tools.
There are many positives to the rise of Generative AI tools though, which I will focus on in this post.
Runway (https://runwayml.com) is the developer of a powerful AI-augmented content creation suite. Runway's video editing capabilities are extremely impressive, particularly given the price and the fact that they run in a browser.
What is more impressive though is how fast and easy it is to do things that used to be expensive and hard.
The 'AI Magic Tools' are indeed like magic, or would be if I didn't happen to work in the domain of AI/ML. The toolset is remarkable, allowing a person with little or no training to do things such as background removal from virtually any full-motion video, removal of unwanted objects from video, upscaling, background replacement, alpha matting and motion tracking, all with ease and speed.
Tools are also provided for image generation from a prompt, image expansion, prompt-based image transformation, and even training your own AI image generator from portraits or objects. Overall, the Runway suite is really nicely set up to allow a seamless workflow from asset generation to incorporation of assets on the editing timeline.
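As a rough illustration of what a tool like Remove Background is doing, here is a minimal sketch using the open-source rembg library on a single extracted frame. This is an analogue I'm assuming purely for illustration, not Runway's actual implementation, and the filenames are placeholders.

```python
# Minimal sketch: background removal on one extracted video frame with rembg
# (an open-source stand-in; applying it frame by frame approximates what a
# video background-removal tool automates for you).
from PIL import Image
from rembg import remove

frame = Image.open("frame_0001.png")   # placeholder frame exported from the clip
cutout = remove(frame)                 # RGBA result with the background made transparent
cutout.save("frame_0001_matte.png")
```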
Some of the tools are a bit rough in spots, and the browser does start to lag out once the video reaches sufficient complexity. I had up to 10 individual overlapping clips in spots for the RageEhLittleBit video, which did cause things to slow down quite a bit. I can't complain too much though as I was deliberately pushing the tools hard to find their effective limits.
While many of the tools perform functions that were possible with more traditional methods, the GEN1 prompt-based video generation and augmentation tool is something quite different. The GEN1 tool can generate new video segments from a prompt, or it can take an existing video and re-imagine it based on the prompt and the settings it's given. This opens a whole new world of creative possibilities.
In the case of this video, I used it to take footage of myself playing guitar in the basement and transform it into a 'TRON' style.
The amazing thing here is that I didn't need a green screen, good lighting or really any other setup besides a stand to put my iPad on. While I have a lighting rig and a green screen, I deliberately went with the most basic setup to see what I could do with a minimal amount of equipment.
The GEN1 tool worked extremely well, giving results that very much matched what I was attempting to accomplish.
The only downside to this tool currently is its 5-second length limit, which meant I had to break my original 2-minute video into 5-second segments and then run the generation process for each segment individually, which was time consuming. This will likely be addressed with GEN2, which also looks to bring even more powerful capabilities to the table.
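For the splitting step itself, here is a minimal sketch of one way to cut a long clip into roughly 5-second pieces with ffmpeg's segment muxer; the filenames are placeholders, and this is simply how I would prepare the chunks outside of Runway, not something the tool provides.

```python
# Hypothetical helper: cut a long clip into ~5 second chunks for GEN1's length
# limit using ffmpeg's segment muxer (assumes ffmpeg is installed and on PATH).
import subprocess

subprocess.run([
    "ffmpeg", "-i", "basement_guitar.mp4",            # source footage (placeholder name)
    "-f", "segment", "-segment_time", "5",            # emit a new file every ~5 seconds
    "-force_key_frames", "expr:gte(t,n_forced*5)",    # keyframe every 5 s so the cuts land exactly
    "-reset_timestamps", "1",                         # each chunk starts at t=0
    "-c:v", "libx264", "-c:a", "aac",                 # re-encode so splitting is frame-accurate
    "segment_%03d.mp4",
], check=True)
```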
I leveraged the fact that I had to cut the video into small chunks to help support the story associated with the video and the song. For the initial segments, I locked the randomization seed to keep the style from evolving for a certain number of segments. I also gradually stepped the style setting up so that the generator would factor more 'reality' into the output as the segments progressed, gradually bringing out facial and other features as the song went on.
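To make the seed-locking and style-stepping idea concrete, below is a rough per-frame analogue sketched with the open-source diffusers img2img pipeline. GEN1 itself is only exposed through Runway's UI, so the model, prompt and parameter values here are illustrative assumptions rather than Runway's actual settings.

```python
# Analogue sketch: a locked seed keeps the look consistent across segments,
# while a gradually reduced strength lets more of the source frame ('reality')
# show through. Filenames, prompt and values are placeholders.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "guitarist in TRON style, neon grid, glowing circuit suit"
seed = 1234                                    # locked seed keeps the style from drifting
strengths = [0.8, 0.7, 0.6, 0.5]               # step toward 'reality' segment by segment

for i, strength in enumerate(strengths):
    frame = Image.open(f"segment_{i:03d}_frame.png").convert("RGB")
    out = pipe(
        prompt=prompt,
        image=frame,
        strength=strength,                     # lower strength = more of the source shows through
        generator=torch.Generator("cuda").manual_seed(seed),
    ).images[0]
    out.save(f"segment_{i:03d}_styled.png")
```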
I also had to buy more tokens to get the full guitar solo and bridge rendered, but even then it was still only about 25 bucks for the whole thing, and I had tokens left over.
For segments later in the video, I turned the randomizer on so that it would start to evolve the character and the surroundings.
You can clearly see the effect:
For the scenes where I play guitar in 'reality' as the TRON-styled avatar, I used the Remove Background magic tool to isolate the avatars generated by the GEN1 tool.
Even though it was somewhat time consuming to break this part of the video down into 5-second segments and then re-align them with the audio, it's still amazing what is already possible in terms of time, cost and resources. I made this video over a few days, with no additional support, tools or staff. The footage was shot with an iPad. I used my laptop to edit it.
When I did this sort of computer animation work back in the late 90s at an ad agency that specialized in corporate videos, even one 30-second segment of this level of complexity would take several weeks. Rendering alone required on the order of $200,000 worth of equipment, including a high-end Betamax deck, a specialized video memory buffer card called a PAL card, a special controller card, a dedicated high-cost controller PC and a bank of rendering computers to process the animation frames. A 30-second segment could take as long as a week just to render! It was brutal: if you made even one small mistake with a texture or keyframe, it meant a full week before you could get it fixed.
The video below is my animation demo reel, with the 8 minutes shown here representing a full 2 years of work!
The generative segments, while not quite as sophisticated yet as some of these animations, were rendered in seconds. Even more impressive, the green screen background removal and other AI Magic Tools can be used directly in the editing software in real time. While control over GEN1's generative output is currently limited, it is clear where this is heading, and it will no doubt evolve extremely rapidly.
Going beyond the actual tools, Runway also publishes the research associated with its work on the 'AI Magic Tools', with full explanations, the scientific papers and the source code. They also have an area inside the Runway tool itself where you can experiment with your own models, which is really nice.
I also made use of several other current AI tools, including DALL-E and Midjourney, for the generation of various background elements from prompts. I combined their outputs by running variations in sequence through the Runway interpolation tool to make animated sequences. This method has already been used by a few bands, for example King Gizzard and the Lizard Wizard for Iron Lung (https://www.youtube.com/watch?v=Njk2YAgNMnE).
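As a toy illustration of the interpolation idea, even a plain crossfade between two generated variations shows how a sequence of stills becomes motion. Runway's frame interpolation uses a learned model, so treat this as a deliberate simplification with placeholder filenames.

```python
# Simplified stand-in for frame interpolation: generate in-between frames by
# linearly blending two still variations (a real tool synthesizes motion, not
# just a crossfade).
from PIL import Image

a = Image.open("variation_01.png").convert("RGB")
b = Image.open("variation_02.png").convert("RGB").resize(a.size)

frames = 24                                    # one second of in-betweens at 24 fps
for i in range(frames + 1):
    t = i / frames                             # 0.0 .. 1.0 blend factor
    Image.blend(a, b, t).save(f"tween_{i:03d}.png")
```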
It's also possible to create custom models with Runway. The Tori avatar base images were created using the Runway custom model training tool. It only requires about 15-20 images in various poses and lighting, cropped to the area of interest, to train a well-functioning model, and training takes about 30 minutes to process.
Once the model is trained, you can use it any time to generate specific outputs from a prompt, although, as with many of these types of generators currently, it can take a number of iterations to dial the image in the way you want.
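For a sense of what "use it any time to generate outputs from a prompt" looks like outside a UI, here is a hedged sketch using the diffusers library with a hypothetical locally exported fine-tuned model. Runway's hosted custom generators are driven from its interface, so the path, prompt token and settings below are illustrative only.

```python
# Illustrative only: generate a candidate avatar image from a hypothetical
# fine-tuned Stable Diffusion checkpoint exported to a local folder.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "./tori-avatar-model",                     # hypothetical fine-tuned weights
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "front-facing portrait of @tori, clearly AI-generated look, studio lighting",
    num_inference_steps=30,                    # more steps = more refinement, slower
    guidance_scale=7.5,                        # how strongly to follow the prompt
).images[0]
image.save("tori_avatar_candidate.png")
```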
In the case of the RageEhLittleBit video, I wanted to have a few different versions of a front-facing headshot that was recognizably Tori, but with a clearly AI-generated look.
I then used the Avatar tool from D-ID to create the animated versions of the avatars. D-ID even has a free trial that gives you 18 or so tokens, which was sufficient for this particular use case, as I only needed a few short segments. It took a bit of work to synchronize the audio with the song track in Runway, as the browser lag made it hard to align the track. The Runway tool also doesn't allow for full waveform display, so I wasn't able to use the waveform to align the avatar video with the main audio segment. There are a few other ways I could have dealt with this, but I was able to get it working fairly well.
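One of those other ways could be estimating the offset programmatically: the sketch below (placeholder filenames, assuming librosa and scipy are installed) cross-correlates the avatar clip's audio against the master track to find how far to shift the clip on the timeline.

```python
# Estimate how far to shift the avatar clip so its audio lines up with the
# master track, using simple cross-correlation of the two waveforms.
import librosa
import numpy as np
from scipy import signal

sr = 22050
master, _ = librosa.load("rage_eh_little_bit_master.wav", sr=sr, mono=True)
avatar, _ = librosa.load("d_id_avatar_clip.wav", sr=sr, mono=True)

corr = signal.correlate(master, avatar, mode="full")
lag = np.argmax(corr) - (len(avatar) - 1)      # sample offset of the best alignment
print(f"shift avatar clip by {lag / sr:.3f} seconds")
```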
D-ID also allows for streaming generation of avatars using their developer tools, which is something I will be looking into, as I can see several use cases for this type of thing.
There are, for example, a number of adjacent advancements occurring in AI-augmented 3D animation, blending the ability to create custom human avatars with other AI capabilities (see MetaHuman from Unreal Engine).
For the next video we do, I will likely explore this avenue to build an understanding of what is now possible.
These impressive capabilities are now available to anyone. It's a lot like what happened with audio, where it used to be necessary to have expensive hardware and a specially equipped space to make a decent recording.
For all the clear ethical and employment concerns, these powerful and rapidly evolving AI-enhanced tools very much democratize the ability to produce high-quality video, so that anyone with a good idea can bring it to reality very quickly.