Episode 5: “More Experiments and Thoughts + More Visual Tools”
Episode description:
In this episode I go over some of my own recent generative AI experiments with readily available tools, ask some more provocative questions, and review a sampling of generative AI visual tools currently available.
My experiment with ElevenLabs voice synthesis and D-ID photo animation:
https://studio.d-id.com/share?id=7c16e6fe728afa5ab8deb9604ebca1aa
D-ID:
https://www.d-id.com
My Midjourney experiments on Instagram:
https://www.instagram.com/imscotpansing
Getty Images Claims Photo Theft, Demands $1.8 Trillion from AI Generator:
https://metanews.com/getty-images-claims-photo-theft-demands-1-8-trillion-from-ai-generator
Ben’s Bites:
https://bensbites.beehiiv.com
Cleanup.pictures:
https://cleanup.pictures
Illustroke:
https://illustroke.com
PatternedAI:
https://www.patterned.ai
Stockimg.ai:
https://stockimg.ai
Iconify.ai:
https://www.iconifyai.com
Looka:
https://looka.com
Gen-1 by Runway:
https://research.runwayml.com/gen1
Episode script:
Hello everyone. My name is Scot, and I’ve been sharing what I’m learning about generative artificial intelligence. I’m taking a deep plunge into these revolutionary new tools and trying to distill the practical information for others. I hope you’ve found it helpful.
In episode three I briefly experimented with a voice synthesis tool from ElevenLabs and created a voice “profile” that mimics my own speech. I then had it read a brief overview of the debate on free will vs. determinism, as explained to me by ChatGPT. Basically, I “deep faked” myself putting out content: I didn’t write it, and that’s not me actually speaking. At times, the voice, in my opinion, does not sound like me and is awkward. Where there should be a bit of a pause, there isn’t, and where there should not be a pause, there is sometimes a long one. But at other times in the episode, it sounded 100% convincing.
Steffen, a listener from Denmark with a keen ear, noticed an occasional faint distortion artifact in the background. Still, this is the cheap, mass-market version of voice synthesis, and in that context the quality is pretty wild.
I experimented with another generative AI tool, this one from a company called D-ID. It creates a video of a speaking avatar after you provide a still image and type in what you want it to say. I uploaded a photo of myself, and instead of typing in a script and choosing from their preset voices, I uploaded a quick audio sample of my own voice that I created with ElevenLabs. The result is fascinating, not because it is a convincing representation of me speaking on video (it’s actually not; it looks like an “unconvincing deep fake” version of me, and the mouth and teeth are noticeably odd), but because the whole process took less than a minute. Seriously, it was about as easy as attaching a file to an email. A link to the video, as well as to D-ID, is in the episode description.
There are so many interesting plotlines and stories currently underway in the generative AI revolution, and one that I keep coming back to is the democratization and proliferation of these tools. Not long ago, to play with things like this you had to be at a visual effects company, likely sitting at a powerful workstation, and maybe even messing with a bit of code, or at least a complicated software tool. Now, in many cases, it’s as easy as 1-2-3, and the cost ranges from cheap to free. Back in 2010, a lot of money and time was spent on the movie Tron: Legacy to have a computer-generated young Jeff Bridges speak with weird teeth. I just did it in 60 seconds for free. We are just around the corner from a time when talented creative people will produce compelling content with quality visual effects using minimal hardware, at almost no cost other than their time. There is a creative explosion on the horizon.
Something else I did recently was start an Instagram account where I have posted many of the experiments I have run with the prompt-to-image tool Midjourney. That link is also in the episode description. Midjourney has been an interesting, well, journey so far. I am finding that even after days of leaving it alone, I am always compelled at some point to revisit it and throw it another prompt. I can even do this on the go, since I added the Discord app to my phone. Remember, Midjourney requires using Discord as its interface. With Discord on my phone, any time I have the urge, I can open the app, go to my personal server where things are quiet and less confusing, type “/imagine” followed by my prompt, and watch the images come into focus. As a reminder, I provided a quick walkthrough of how to use Midjourney in my first episode. Lately I’ve been getting interesting results with prompts like “1950s people eating on the beach in Italy, black and white” and “Darth Vader on a beach vacation.” Eventually, it seemed only natural to place Darth Vader on the Italian beach with the 1950s people.
Using Midjourney lately has me pondering another major plot thread in this wild mega-tech rock opera: what should, and should not, be allowed to train these models? Broadly, my understanding is that training data is meant to be public information not subject to any conflict of ownership or use. I’m not an intellectual property attorney, but things are moving so quickly here that I expect those attorneys will be all over this space.
For example, I mentioned in a previous episode that Getty Images was suing Stability AI, the company behind the text prompt-to-image tool Stable Diffusion, for alleged copyright violations. This case comes down to copyright infringement vs. fair use. Getty Images even has a price tag for the infringements, based on some simple math: 12 million allegedly stolen copyrighted images, multiplied by $150,000 for each violation, equals $1.8 trillion. Yes, trillion.
Getty Images can point to examples of images coming out of Stable Diffusion that contain a garbled “Getty Images” watermark. Again, I’m not a lawyer, but that can’t help the optics for Stability AI. It’s like a smoking gun. Or is it? Let’s do a quick thought exercise. Say a human spends time walking through museums, flipping through magazines, and in general consuming imagery both in the public domain and subject to copyright, trademark, and other forms of ownership, including watermarked Getty Images. After some time, this human begins to create their own images. The human’s output is influenced not only by the imagery they saw, but also, in some way, by their entire life experience. Are these new machines drawing on other information and learning beyond just a set of image training materials?
Back to the human. After viewing all of these images, the human is permitted to create something that depicts someone else’s intellectual property. The police don’t break down the door if the human paints a portrait of Darth Vader. Again, not a lawyer, but my understanding is that the law gets applied if the human tries to sell or otherwise make money from their portrait of Darth Vader.
If you play with Midjourney for more than a little bit, it costs money. Not that much: the basic plan is $10 per month, and the most expensive is $60 per month. Money is being made. And I’m able to have it create images that contain Darth Vader and other content that is in some way related to someone’s, or some entity’s, intellectual property. Have I mentioned that I’m not a lawyer? I can’t provide answers here, but as you can see, I can’t stop asking questions.
Big picture: are the entities who own the content used to train these models going to be compensated? If so, how? I say “entities” because this could be an individual working artist or a mega corporation that owns the rights to something. And public sentiment always plays a part in these debates. It’s one thing to highlight living artists in this situation; it’s an entirely different narrative to debate whether a work is derivative or not, and whether a multi-billion-dollar company should receive additional money for its trademarked property.
Okay, that’s enough with the big questions for now. Let’s get into some visual generative AI tools that are currently available. Everything I mention will have a link in the episode description.
Like in some of my other episodes, I’m only going to give a sampling of what is currently out there, mainly because the space is changing so quickly. Also, many of the tools are duplicative. I’ll tell you one of the places I learn about all of these tools: a newsletter called Ben’s Bites. Ben publishes regularly, about every few days, and each newsletter contains dozens of links to new tools. I can barely keep up!
Let’s start simple. You’ve got an image, and you want some element, object, or person removed. What do you do? You hit up Jimmy or Sally, your friends who are pretty good at Photoshop, right? Wrong. You visit Cleanup.pictures, upload your image, click a few times, and you’re done. That beach shot that just needed the dude in the background removed is all set, in seconds.
For any graphic designers out there who would be interested in a text-to-vector graphics tool, look no further than Illustroke. It’s not free, but an initial bundle of 50 “tokens” to get started is only $6. The quality of the examples is pretty high; I can imagine this tool being a huge help for projects that require scalable vector graphics.
Things can get even more niche. Stockimg.ai creates stock images from text prompts; Iconify.ai creates app icons from text prompts; you get the idea. There is a tool called PatternedAI which, as the name suggests, generates only patterns from text prompts. Their pricing does include a free option, so if you are looking for royalty-free patterns to use on your products, this could be an interesting resource.
A tool called Looka asks for a company name, an industry, and answers to a few more questions, then generates corporate logos and marketing materials.
On the video side of things, there is one tool I have to mention that is not yet available as of this recording. I can’t wait to see it in action, or use it myself. It’s called Gen-1, from a company called Runway, and you have to see the demo video. Essentially, the tool accepts words and images as prompts and generates new videos out of existing ones. You take your source video, which can have low-quality lighting or set design (even a phone video shot in low light), and transform the entire thing into a visual style based on a text or image prompt. The new video retains all of the motion and cuts, but the overall visual style is completely changed. As of this recording, Runway is giving access to a few creators, and everyone else gets to sign up on a waiting list. Runway, if you’re listening, throw me a bone here!
If you’ve made it this far, maybe you’re wondering, “Hey, I thought this guy was going to teach me how to use tools in each episode.” Or maybe you’re not, but regardless, I want to address this. Yes, when I started the podcast this was my intention. But as I record these episodes, it’s become obvious that the vast majority of these tools are ridiculously easy to use. Just click a link in the episode description and try one: usually you type in a box or upload an image, click a couple of times here and there, and you’re done. And you may have noticed that I can’t help asking macro questions about the effects of these tools on society, policy, economics, and much more.
Basically, what I’m saying is that the format is going to be a little fluid. I still intend to cover categories of tools, but I want to keep pushing beyond just curating what’s out there. This podcast has already led to introductions to all kinds of interesting people, and I’d like to throw in a few interviews as well. I hope you’ll continue with me on my journey. Buckle up!