Hello guys,
I have created the free app PiccyBot, which speaks a description of the photo or image you give it. You can then ask detailed questions about it.
I have adjusted the app to make it as low vision friendly as I could, but I would love to receive feedback on how to improve it further!
The App Store link can be found here:
https://apps.apple.com/us/app/piccybot/id6476859317
I am really hoping it will be of use to some. I earlier created the app 'Talking Goggles', which was well received by the low vision community, but PiccyBot is a lot more powerful and hopefully more useful!
Thanks and best regards,
Martijn van der Spek
Comments
LaBoheme - Orientation
Regarding the auto orientation, it should work. Could you have turned on the orientation lock on your iPhone?
Reka should work, but it is not a huge company and the model can face slower responses now and then. Still, it is good to have smaller players around as an option.
Auto orientation seems to work now; I don't know what happened.
Wouldn't it be better if the camera function invoked the native iOS camera when taking photos? It seems a better interface for VoiceOver users: one can adjust zoom, exposure time, focus lock, etc. I don't know if that is possible from the current interface, but it's certainly not doable for VO users.
Variety of models
The availability of multiple models is great and I found the price reasonable, thanks. If one model is less good or refuses to describe something, it's handy being able to switch to another, and the refusal can absolutely be a legitimate issue when shopping for clothing, whether for yourself or a friend/partner. GPT-4o is very quick to say "let's talk about something else."
Native camera
LaBoheme: You are right that the native iOS camera offers great features. But the custom camera helps keep the app's workflow simple and efficient. It lets you take photos instantly with the volume button, avoids the extra "retake/use" screen, and ensures front-camera images aren't flipped by default. But I'll keep it in mind for updates, as new iOS versions might further enhance the default camera.
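For those curious about the difference: a custom AVFoundation pipeline hands the photo straight to a delegate, with no confirmation screen in between. A minimal sketch of that general approach (illustrative only, not PiccyBot's actual code; the `InstantCamera` name is made up):

```swift
import AVFoundation

// Minimal sketch of a custom capture pipeline: AVCapturePhotoOutput
// delivers the photo straight to a delegate, so there is no
// "retake/use" confirmation screen as with the system picker.
final class InstantCamera: NSObject, AVCapturePhotoCaptureDelegate {
    private let session = AVCaptureSession()
    private let output = AVCapturePhotoOutput()

    func configure() throws {
        guard let device = AVCaptureDevice.default(for: .video) else { return }
        let input = try AVCaptureDeviceInput(device: device)
        if session.canAddInput(input) { session.addInput(input) }
        if session.canAddOutput(output) { session.addOutput(output) }
        // In production this should run on a background queue.
        session.startRunning()
    }

    // Called, for example, from a volume-button handler for instant capture.
    func snap() {
        output.capturePhoto(with: AVCapturePhotoSettings(), delegate: self)
    }

    func photoOutput(_ output: AVCapturePhotoOutput,
                     didFinishProcessingPhoto photo: AVCapturePhoto,
                     error: Error?) {
        guard let data = photo.fileDataRepresentation() else { return }
        // The image data is available immediately and can be sent off for description.
        _ = data
    }
}
```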
Audio description merged with video when sharing
Hi guys,
I released an update today that adds the audio description to the video when sharing it, using the share button on the home screen (for subscribed users only). This was an often-requested feature, and I feel it will be quite helpful.
I am looking at adding a live video mode as well, either with OpenAI or Gemini or both, but I have to figure out if this is feasible; OpenAI's live speech mode was horribly expensive for a third-party developer. Now that they have competition, it may be more economical from the start.
I am also working on a WhatsApp service to describe videos, so PiccyBot can be used indirectly with Meta Ray-Bans and possibly other smart glasses.
Exciting times indeed!
This is fantastic. I've not…
This is fantastic. I've not been out a great deal, hibernating, so not got much content to parse.
I'll have a play. And great call re the WhatsApp solution; they do seem to be moving in lockstep with the video service, so it should become more competitive cost-wise. It might be worth looking at the frames per second; I'm assuming that is the defining cost factor.
To be honest, if you included a means of adding one's own API key, for now at least, in the paid version, that's something I'd be happy with. Even if you switched to a subscription model (I think I had a lifetime purchase), give those with lifetime a break in terms of free for a year, etc.
I think we all understand that there are continuous background costs here. I know few of us have much disposable income, but this tech is life changing. I'm comfortable with having a premium plus service that's a few bucks a month, if it works and is easy to use.
Also, I voted for your cool app in the AppleVis awards. You deserve it.
Hey! I've been using the app…
Hey!
I've been using the app more frequently, and I've been considering purchasing a lifetime license. What models and configurations are available in the paid version? Those of you who paid, did you like it? Is it worth it?
Absolutely!
Yes, it's more than worth it, and you get almost all the models you can think of as far as visual processing is concerned. And if you find a really good model that isn't there, the dev has been really, really responsive so far.
Gaming with Meta Ray-Bans and PiccyBot!
So I finally gave this app a whirl, used my Meta Smart glasses to record myself playing a round of Mortal Kombat 11 on my PC, and then used PiccyBot to describe the video. It was awesome!
I am using the monthly subscription, but will absolutely be purchasing the lifetime access. This is a marvelous application!
Also, and I hope I do not get in trouble for this, but I may or may not have voted for PiccyBot for the Golden Apples of 2024.
PS: I went with Gemini Pro for my AI engine. Not sure if that one is any better or worse than the others, but it is what I went with.
Yes it is worth it
The lifetime subscription is a bargain, so yes, it's definitely worth it, if only to be able to disable personality mode. The dev deserves all the credit he gets for his dedication to the app.
I'm also really excited by the thought of being able to use this via WhatsApp on the Meta Ray-Bans.
Video recording bug and a suggestion
Sometimes the "Record video" button is greyed out in the video recording interface. I tried to find a pattern in when this happens, but to no avail; it seems entirely random. I first noticed this with 2.6, but waited to see whether it would go away. But no, today I updated to 2.8 and still saw this multiple times.
When this happens, the "Record video" button stays greyed out and is stuck in this state regardless of how many times I press "Cancel" to return to the main screen and retry recording video. The only reliable method to bring it back to life is to close PiccyBot from the app switcher and start it again.
Furthermore, I suggest that there should be a "Retry" button on the main screen near the description area. Sometimes the "server is overloaded, please try again in some time" error message appears instead of the description, and then there is no way to resend the video or image I have just taken to the server. It is only possible to take another video/image and try with that, but the interesting moment captured beforehand is lost this way. For images it also sometimes happens that no error is displayed, but simply no description is presented. The "Retry" facility would be an immense help in all these scenarios.
Gemini experimental 1206 now seems quite stable, that is, the image doesn't get rerouted to some other "inferior" model due to overloading, which was quite often the case 2-3 weeks ago. So now I especially like this model, as it provides all the details I am interested in: people, shapes, actions, spatial positioning of each content element, colours, lighting, atmosphere etc. And all this in vivid detail, but in a balanced and not "overdone" way, and what's more, practically hallucination-free.
What I particularly like about PiccyBot is that it is extremely light on battery. Three weeks ago I watched a famous soccer match with the help of PiccyBot and took about half an hour of video altogether, in several pieces of course, to get it described. During all this it consumed only about 15% of battery charge. This is very impressive! So keep up the good work!
Retry button and models
I second Laszlo's idea of a "retry" button. It doesn't happen often for me that the request doesn't go through, but when it does such a button would be great.
As I've said before, one of the coolest things about PiccyBot is the number of models to choose from, both the "pro" and "fast" ones. There are some models that I'd like to try out that are not in the app now (or maybe they are, but not under those names):
OpenAI: Chatgpt-4o-latest (2024-11-20) (I mentioned this before, but there was some bug with it if I recall correctly). We also have the o1 model getting image support on the horizon, but that might be too expensive to be practical.
Meta: In the app there is a Llama 3, but Llama 3.2 Vision has been released (maybe 3.3 too, but I'm not sure if that has vision support).
Anthropic: There is a Claude 3.5 Haiku model out now that maybe could replace (or already has replaced) the Claude 3 Haiku model in the app.
Mistral: Pixtral-large-2411 (might also already be in the app under the "Mixtral pixtral" name)
Sharing TikToks to PiccyBot
Like the subject says, I want to share TikToks to PiccyBot. I thought I read somewhere in this thread that I could choose Share while viewing a TikTok and share it directly to PiccyBot. This isn't working for me. I see Messages, WhatsApp, and other options, but not PiccyBot. I don't even see an option to view a list of additional apps. Saving the TikTok to my Photos works, and I can share to PiccyBot from there, but I thought you could share directly to PiccyBot from TikTok. What am I doing wrong, or am I just misunderstanding how this is supposed to work? Any help is appreciated. I'm loving PiccyBot and it's worth every penny.
I'd like to know this as well
Never even thought of describing TikTok vids, but if we can, then cool.
Describing TikToks
Eerie, Brian, you can share the TikTok video to PiccyBot and it should describe it. PiccyBot is usually a bit hidden in the share sheet under 'more', but it is there.
Describing TikToks
Note that TikTok sharing is not 100% stable for sure, they seem to change the format on a regular basis. But most of the time it works.
New update: Retry added, plus audio mixer for share
Laszlo, Blindpk, thanks for the suggestion of the retry button. I added it in the latest update, and I have to say it has helped me as well, since sometimes, for whatever network or model reason, the result doesn't come the first time.
I have also added an audio mixer option for subscribed users. In Settings, you can set a percentage for how loud you want the original audio and the PiccyBot description audio of a video to be. PiccyBot will now combine the two audio streams when you share the video. This should give complete freedom over whether you want a description-only video, some of the original sound, or whatever you like.
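Conceptually the mixer is a two-track volume blend. A minimal sketch of how such a mix could be expressed with AVFoundation (an assumption about the general approach, not the app's actual code; `makeAudioMix` is a made-up name):

```swift
import AVFoundation

// Sketch: combine a video's original audio track with a description track,
// each at a user-chosen volume (0.0 to 1.0), the way a setting like
// "original 30% / description 100%" might be applied.
func makeAudioMix(originalTrack: AVAssetTrack,
                  descriptionTrack: AVAssetTrack,
                  originalVolume: Float,
                  descriptionVolume: Float) -> AVAudioMix {
    let originalParams = AVMutableAudioMixInputParameters(track: originalTrack)
    originalParams.setVolume(originalVolume, at: .zero)

    let descriptionParams = AVMutableAudioMixInputParameters(track: descriptionTrack)
    descriptionParams.setVolume(descriptionVolume, at: .zero)

    let mix = AVMutableAudioMix()
    mix.inputParameters = [originalParams, descriptionParams]
    return mix
}
```

The returned mix would then be applied when writing the shared video, for example via an export session's audioMix property.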
I know it adds to an already complex settings screen, so if you have any recommendations, please let me know.
Thanks for the retry button
Thank you very much for the fast implementation. As for the settings screen, I personally don't find it that cluttered. You could of course put some settings on separate screens, like "Video settings" and "Voice settings", with the obvious drawback that it would take longer when the user wants to change a setting.
I would like to see shortcuts for this app
Like Speakaboo, it would be great if we could assign a shortcut to the Action button that could directly capture and describe the scene.
using only English and without relying on visual references?
When describing video using Claude 3.5 Sonnet and the "ask more" function, it always starts the answer with "using only English and without relying on visual references".
For example, when asked to describe the fingernails of the person, it said, "Okay, let's focus on the fingernails of the hand, using only English and without relying on visual references."
Why is it doing this? How can the model describe anything without any visual references? And I didn't ask the model to describe in other languages.
So since
So since the ability to share the new audio-described videos with yourself or another device has been implemented, I have been able to do something very special. Last week, my wife unfortunately lost her battle with leukemia. This has been an extremely difficult and trying time for me. It has been, and still is, very difficult for me to come to terms with this. But I found something that may make it a little less painful. I have taken all of the videos that my wife and I ever made together on my phone, run them through PiccyBot, had them audio described, and then saved the new audio-described videos to my device, which now include the original audio alongside. So now I can look back on all of our videos and remember each good memory as if it were happening all over again. So I'd like to say a personal thank you to the developers of this app.
To Firefly
First, I just want to say that, while I will never know what you are going through, nor could I ever know just what your significant other meant to you, I am truly sorry for your loss, for what it's worth.
Second, I think it is really awesome that you are using software such as PiccyBot, to enhance your digital memories of the life you shared with your wife.
May they bring you some semblance of joy in your darkest hour.
Firefly
Thanks for sharing your experience. I cannot imagine how hard it must be, but I appreciate you sharing this feedback, and thanks, despite it all. It is very rewarding for me to know that the effort on PiccyBot can be so impactful. It definitely motivates me to keep improving the app further. Thank you.
merging audio fail
I am mainly using Mistral Pixtral with videos shared from YouTube; not sure if that makes a difference. But when I get the audio description, it is like a summary of the video rather than scene by scene in sequential order. So I go back and ask it to do the description in sequential, scene-by-scene order, and after processing, it says something like "merging audio failed" and just displays the text on the screen with the new result.
update to fix merging audio fail
Privateai, I have just released an update that should fix the merging audio failure. Please try it out and let me know if it works. How did the scene-by-scene description work out?
after trying the new update
First, let me say that I am truly amazed at how quickly issues get addressed. Now on to the testing result. Yes, the merge audio error message is gone, but the new result is not being read out; it is just displayed on the screen. No problem, I thought to myself, it should be OK if I save the video with the new description, which is better because it goes scene by scene sequentially. When I saved the video, it had the new description in the video file name, but when I played the video, the audio track doing the description was still the original description that sounds like a summary rather than describing each scene. So it looks like the audio merge is done initially when the video is first described, but when you ask for a new description, it does not merge the audio again?
another audio merging glitch, maybe?
Following up on my previous post, today I came across another possible glitch, or maybe it's intentional? I generated an audio description for a short documentary I did, and the audio for the description turned out to be shorter than the actual video, so at the end of the audio track, it started the audio over again. But here's the problem: the video is not long enough for the audio track to play through completely a second time, so about a minute in, it just ends. I am guessing the setting is for the audio track to keep looping until it matches the video length? Is there a way to not do that, and instead insert silent blocks between paragraphs so the lengths would match? For example, if the video is one minute long, but the description is only 40 seconds long, just insert five-second pauses between paragraphs to make it the right length. I want to use this audio mixing and description for some of my earlier documentary works, but if it is going to loop and end incompletely, it's going to be a lot of editing.
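The arithmetic behind that suggestion is straightforward; a hypothetical sketch, just to spell out the idea:

```swift
// Sketch of the suggested "pad with silence" approach: spread the
// leftover time evenly across the gaps between paragraphs instead of
// looping the narration.
func pauseBetweenParagraphs(videoSeconds: Double,
                            narrationSeconds: Double,
                            paragraphCount: Int) -> Double {
    let leftover = max(0, videoSeconds - narrationSeconds)
    let gaps = max(1, paragraphCount - 1)
    return leftover / Double(gaps)
}

// Example from the post: a 60-second video with a 40-second description
// and five paragraphs leaves 20 s to spread over 4 gaps -> 5 s per gap.
let pause = pauseBetweenParagraphs(videoSeconds: 60, narrationSeconds: 40, paragraphCount: 5)
```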
Here is the link to the file I experimented on, so you can see what I'm talking about:
https://youtu.be/ib8BC7HEqHM?si=D57Id7FeNCT7TraQ
Incidentally, I am noticing that in cases where the generated description is longer than the actual video, it doesn't do the audio mixing?
Question and Possible Bugs
Hi there,
As a Christmas gift, I treated myself to the lifetime subscription to PiccyBot, and I've been having so much fun exploring its features! I've spent a good amount of time experimenting with the various AI models, and it's fascinating to see how their outputs can vary based on the same photo. The ability to choose from different voices and personalities is a fantastic touch. I also really appreciate how easily you can adjust the description length; whether you want something minimal or more elaborate, the choice is entirely up to you. Well done!
I do have a few questions I'd like to share:
Personality Toggle in Settings:
The personality toggle in the Settings menu doesn't seem to be working as expected. When I double-tap the option, nothing happens the first time, but on the second double-tap, it switches between "on" and "off." However, after closing the Settings menu, the generated descriptions still appear as if the personality is enabled. When I return to Settings, the personality toggle I just turned off has switched back on. Is this a bug?
Issue with Llama Model:
There seems to be a potential bug when switching to the Llama AI model. If I select Llama in the Settings while a photo is being described, I sometimes get an error message like, "There seems to be a hiccup," followed by phrases like, "Please rephrase that" or "What personality did you want?" Interestingly, if Llama is already set as the AI model before taking or selecting a photo, it works just fine. The issue only arises when switching from another model to Llama. Could you look into this?
AI Model Variations in Descriptions:
This might be related to the models themselves rather than the app, but I've noticed that some models do an excellent job describing both me (in a selfie, for example) and the background, while others focus entirely on the background and seem to ignore me altogether. Is this kind of behavior typical for certain models?
Thank you for the incredible work you've done with this app; it's absolutely amazing, and I love using it!
please hear me out, I know many of you would agree with me!
Hello, I'm Labron, another blind user of AI apps like yours, and I have to say, your app is amazing, but there is one major problem you need to deal with pronto! Just like with every other AI app out there, we're paying money for your app, and we have the same issue: all things bad and negative are not described to us at all! It's not fair and you know it! So, what you need to do now is make your AI describe things that AI just likes to avoid, like all things bad and wrong. You can't just limit us to the positive; we want to see the negative also, like regular sighted people. If I have a video of someone getting killed, your AI will try its best not to describe the bad aspects of the video, and that is really wrong, and you know it! It's really not fair for people like us to have to deal with things of that sort! So please make your AI describe everything and don't keep the negative hidden from us! Please, and thank you!
PS: I also left this same message on your YouTube! Check it out and write back to me as soon as you see this! Thank you!
@labron3
This has quite literally nothing to do with the developers of the apps. It's the limits set by model providers such as OpenAI and Anthropic. Unfortunately, in their current form we can't have everything described to us; perhaps that will change in future, but, like I say, it's nothing to do with the dev.
@inforover
Oh wow, that is so bad! Why don't they fix this stuff? They like being positive too much! This is not the way life should work, especially for people like us!
Agents are the Solution I think
Well, we should have AI agents built specifically to help visually impaired people, and it should be done with the full support of, and in collaboration with, one of the LLM providers. That way, the provider would know to relax several of these current limitations for this particular agent, given its purpose and justification.
Love this app.
First off, I want to thank the developer for creating such an amazing app; I use it almost every day to describe my random videos. This app has definitely come a long way since it was first released. I have some observations and feedback. I started using the app around late August. Sometime around late October to early November, I think, there was an update that changed a few things, one of them being the processing (waiting) sound and the other the personality. Back with version 2.4, which I believe was the one I was using in August, I feel like the voice description had a lot more personality when the personality toggle was turned on; for example, the responses were a lot more dramatic, if that makes sense, and it seems they are not as much anymore. I'm not sure if that change was on the AI model side of things or on the PiccyBot side, but I liked those opinionated responses it had before. I also think it would be nice if there were a feature in Settings to change the waiting sound effect, so you could pick between different ones, like the sound from earlier versions of the app or the sound there is now.
To labron3
Well, yesterday I used it to describe a plane crash video and it did quite well with the traumatic scenes.
Alternative video models
I have added the new Amazon models Nova Lite and Nova Pro to PiccyBot. They don't seem to be as descriptive as some of the other models, but they may be able to describe videos that are rejected by others. Another possible alternative is Reka, which follows a different description approach. Again, these models don't seem great at the moment, but we can count on them to improve in the coming months.
adding my voice to requesting longer video and description time
I have not tried out the new AI models yet, but I'm definitely looking forward to it. Thank you so much for the constant updates. I do want to add my voice to those who have mentioned needing longer video processing and descriptions. As I mentioned before, I have begun using this app to generate descriptions for some of my documentaries, and I have run into a problem: most of my documentaries are between 6 and 10 minutes, right above the cut-off point as it stands. If the processing limit could be extended to 10 minutes rather than the five-minute mark where it seems to be capped right now, it would be a lot more helpful for projects such as what I am attempting. I understand that there is a cost associated with longer processing, but it may be worth looking into a professional membership fee for that kind of thing?
Censorship, models and a New Year's wish
In my experience, for images Mistral Pixtral is the go-to model when I want to have an image described whose content all other models would reject. It is stated on their site that this model has virtually no censorship in place. Its descriptions are of course not on par with much larger models, e.g. Gemini experimental 1206, but quite good nevertheless, and with follow-up questions it does the job very well indeed.
For videos I have no such experience, because I haven't dealt with such a video yet.
By the way, Martijn, could you please list those models in PiccyBot that have both image and video description abilities? It would be quite useful to know. E.g. I highly doubt that if I set the model to Claude (any version) or Mistral Pixtral, that would affect video descriptions, as these models process only images and not video, don't they?
Last, but not least, Martijn, you were among the people who made 2024 a really special year for me. You know, there were 17 years in my life when I had some vision: not too strong by far, but quite usable, and that produced tons of memories and imagery that I still live on many, many years later. And now PiccyBot is among the pieces of technology that brings me much closer to that part of my life when I had some eyesight. And all this without complicated, risky, invasive and very expensive surgery. I cannot thank your commitment in this field enough! So I wish you a very merry New Year from the bottom of my heart, and let's walk further interesting roads in 2025!
Mixer problem
Hey guys, I activated the audio description mixer mode for videos, but when it processes or shares the video, it only plays the voice audio, not the video's own audio alongside it. What should I do? I have a Galaxy S23. Sorry for posting in the iPhone thread, but since the discussions are here, I thought I'd comment. Once again, I apologize if this is the wrong place.
prompt customization
This app is cool, but I would like to be able to modify the initial prompt to say something other than "what is in this video".
Additionally, I'd like to be able to stop it from mentioning what is said in the audio if possible and maybe allow for haptic feedback?
Otherwise, this app is quite intriguing and I do appreciate your engagement with the community.
It completely changed a story!
So I wanted to try this app. I'm an avid gamer, so I had it describe the opening movie of a game I was playing. It was awkward to record the movie and then listen to the descriptions after the fact, but it did a decent job. It inserted its own commentary, which was... interesting and kind of weird, but it was helpful.
Then I went into the in-game menu to the "story so far" summary page, just to see if it could read me the text of the summary. It's a long multi-paragraph document you can scroll, so I recorded a video of me slowly scrolling the text until the end. I wish I could share a comparison of what it actually said with what PiccyBot came up with, because it was hilarious, creative and so very, VERY inaccurate. I'm completely flabbergasted at how it came up with what it did. If anyone wants, I could actually try it again and show the comparison.
Video feedback
Laszlo, thanks so much for including me among the people who made 2024 special for you. It really means so much to me. I have been creating software all my life, but I have never received the type of feedback people like yourself are giving me. As for the video-capable models, they are currently Nova Lite, Nova Pro, Gemini Flash and Reka. This will change in the coming weeks, I am sure.
Diego, thanks for using the Android version. When you have subscribed, PiccyBot will include the description audio track when sharing the video from the main view. The audio mixing is not done yet on Android, however; I hope to include that in an update soon.
Quinton, you can interrupt the processing of the video and change the prompt for it. But you are right, initially it defaults to 'what is in this video' even if you have written something in the field before. I will adjust that.
Remy, sorry about the hallucinating model. If you suspect something is amiss, try different models or take multiple shots. The recording may not have been clear enough; the model will proceed anyway, and it won't admit that it is speculating.
Hope you all have a fabulous New Year and looking forward to improving PiccyBot further in 2025!
how long can a video be?
What's the limit on the length of a video for PiccyBot to describe? I would love for it to describe my wedding, but it only worked with the first 5 minutes or so.
I believe it's five minutes.
Pretty sure that is the current limit.
Images and videos problem
Guys, I've been trying for two days to use the app to recognize photos and videos using various models, but none of them work. Is anyone else having this problem? What can I do?
Working fine here
If it helps, these are my settings:
Nova voice, with personality enabled.
Speech rate is at 120%.
Using AI model Google Gemini Flash 2.0.
Length equals 60.
Video quality equals medium.
bugs, observations and questions
So, I've had much more time to play with this app and love it.
I'd like to mention what I've noticed, as well as ask a couple of questions.
How is it able to understand audio nuances, like voice genders?
It's been correct every time I've used it.
Would it be possible to have the app send a notification once video processing completes, rather than needing to keep the app in the foreground? ElevenLabs GenFM has this feature. (See the sketch after these questions for roughly what I mean.)
Selecting the "none option in the voices menu does not appear to work.
Could we have an option in settings to change the prompt from always saying "what is in this video?"
I know that has been previously discussed, but I didn't know if it was something that would actually be coming in the future.
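On the notification question: a standard local notification would presumably cover it. A minimal sketch, assuming notification permission has already been granted (illustrative only, not the app's actual code):

```swift
import UserNotifications

// Hypothetical sketch: fire a local notification when processing finishes,
// so the app would not need to stay in the foreground. Assumes permission
// was already granted via requestAuthorization(options:).
func notifyProcessingDone() {
    let content = UNMutableNotificationContent()
    content.title = "PiccyBot"
    content.body = "Your video description is ready."
    content.sound = .default

    // A nil trigger delivers the notification immediately.
    let request = UNNotificationRequest(identifier: "processing-done",
                                        content: content,
                                        trigger: nil)
    UNUserNotificationCenter.current().add(request)
}
```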
I must reiterate, this app is fantastic.
It's been great for understanding videos, as well as extracting text from them.
Thank you for doing what you're doing.
I, and many others really do appreciate the effort you've put into this.
edit: none option seemed to work this time
I'm not sure what happened, but almost every time I tried using the none option it wouldn't select, but of course, it works now I've mentioned it on here lol.
mixing audio: why does it fail?
I've been experimenting a lot this last month with different videos and mixing audio with descriptions. I used to believe that when it fails to mix, it means the generated description is too long, or longer than the video. To figure out if that is the right assumption, I started requesting a total word count at the end of each description, and I am starting to think that the length of the description has nothing to do with why it fails to mix. Quite often, a description that is 650 words long mixes fine one minute, and the next time a description of equal word count fails. What have you guys discovered? Can anyone shine a light on why it fails to mix so often? Also, I have noticed that some AI models will produce descriptions according to the word count you specify and give you an accurate word count, while others will totally disregard word count altogether or act like they're unable to count how many words were used.
Also, has anyone successfully come up with a prompt that will generate a description of a specified length, for example, "give me a description that is four minutes and 30 seconds long"? I find that by tailoring word count, or using prompts like "describe each scene of this video using 80 words", I am able to somewhat tailor the length of the description, but the result is very inconsistent.
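One rough way to aim for a duration is to convert it into a word budget first, assuming narration runs near 150 words per minute at normal speech rate (that figure is a generic estimate, not anything PiccyBot documents):

```swift
// Turn a target narration duration into a word budget for the prompt.
// The ~150 words-per-minute figure is a common estimate for TTS at
// normal speed; adjust it to match your chosen speech rate.
func targetWordCount(minutes: Double, wordsPerMinute: Double = 150) -> Int {
    Int(minutes * wordsPerMinute)
}

// Example: a 4 minute 30 second description works out to about 675 words,
// so the prompt could be "describe this video in about 675 words".
let words = targetWordCount(minutes: 4.5)  // 675
```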
DeepSeek
Today, just on a hunch, I peeked into the model list in PiccyBot. And my intuition didn't fail me, as there was a surprise there: a new name on the list! A name the world is learning hyper-fast these days, after they made quite a big noise with their announcement, timed just before the Chinese lunar new year of course. It's DeepSeek itself!
Martijn, do you happen to know which model this is: DeepSeek V3, or maybe R1, or some other model of theirs? They have lots on e.g. Hugging Face. Can it process images only, or video too?
By the way, Martijn, and anyone who reads this and just feels like it: have a happy and lucky Chinese lunar new year! I am in style with this wish, as I am writing this post with ZDSR, a Chinese development in the field of screen readers. ZDSR has been my daily driver since last Christmas, after an almost 15-year "marriage" with NVDA, and I simply love it!
@Laszlo, I sent you some emails.
I'd like to know more about the screen reader, thanks.