I'm about to show you something that breaks every rule about how vision AI is "supposed" to work.
And when I say breaks the rules, I mean completely flips the whole thing upside down.
Here's What's Wrong With Every Vision AI App You've Ever Used
You point your camera.
You wait.
The AI speaks: "It's a living room with a couch and a table."
Cool story. But where's the couch? What color? How close? What's on it? What about that corner over there? That thing on the wall?
Want to know? Point again. Wait again. Ask again.
The AI decides what you need to know. You're stuck listening to whatever it decides to tell you. You don't get to choose. You don't get to dig deeper. You don't get to explore.
You're just a passenger.
So I built something that does the exact opposite.
What If Photos Were Like Video Games Instead of Books?
Forget books. Think video games.
In a game, you don't wait for someone to describe the room. You walk around and look at stuff yourself. You check the corners. You examine objects. You go back to things that interest you. You control what you explore and when.
That's what I built. But for photos. And real-world spaces.
You're not listening to descriptions anymore.
You're exploring them.
Photo Explorer: Touch. Discover. Control.
Here's how it works:
Upload any photo. The AI instantly maps every single object in it.
Now drag your finger across your phone screen.
Wherever you touch? That's what the AI describes. Right there. Instantly.
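For the technically curious, the idea can be sketched in a few lines. Assume the vision model returns labelled regions with normalized bounding boxes; a touch is then resolved to whichever region it lands in. The Region type and describeAtTouch function below are illustrative names only, not the app's actual code.

```typescript
// Illustrative sketch: labelled regions from a vision model,
// hit-tested against the user's touch point.
interface Region {
  label: string;   // e.g. "Red beach umbrella, slightly tilted"
  x: number;       // bounding box origin, normalized 0..1
  y: number;
  width: number;   // bounding box size, normalized 0..1
  height: number;
}

function describeAtTouch(
  regions: Region[],
  touchX: number,
  touchY: number,
  screenW: number,
  screenH: number
): string {
  const nx = touchX / screenW;
  const ny = touchY / screenH;
  // Prefer the smallest region containing the point, so a small detail
  // (the cooler) wins over the big region behind it (the beach).
  const hits = regions
    .filter(r => nx >= r.x && nx <= r.x + r.width &&
                 ny >= r.y && ny <= r.y + r.height)
    .sort((a, b) => a.width * a.height - b.width * b.height);
  return hits.length > 0 ? hits[0].label : "Nothing mapped here yet";
}
```

In a setup like this the map is built once per photo, so going back to a spot is just another local lookup - which is why the information can persist instead of vanishing.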
Let's Get Real:
You upload a photo from your beach vacation.
Touch the top of the screen:
"Bright blue sky with wispy white clouds, crystal clear, no storms visible"
Drag down to the middle:
"Turquoise ocean water with small waves rolling in, foam visible at wave crests, extends to horizon"
Touch the left side:
"Sandy beach, light tan color with visible footprints, a few shells scattered about"
What's that on the right? Touch there:
"Red beach umbrella, slightly tilted, casting dark shadow on sand beneath it"
Wait, what's under the umbrella? Touch that spot:
"Blue and white striped beach chair, appears unoccupied, small cooler beside it"
Go back to those shells - drag your finger back to the beach:
"Sandy beach, light tan color with visible footprints, a few shells scattered..."
See what just happened?
The information didn't vanish. You went back. You explored what YOU wanted. You took your time. You discovered that cooler the AI might never have mentioned on its own.
You're not being told about the photo. You're exploring it.
And here's the kicker: users are spending minutes exploring single photos. Going back to corners. Discovering tiny details. Building complete mental maps.
That's not an accessibility feature. That's an exploration engine.
Live Camera Explorer: Now Touch the Actual World Around You
Okay, that's cool for photos.
But what if you could do that with the real world? Right now? As you're standing there?
Point your camera at any space. The AI analyzes everything in real-time and maps it to your screen.
Drag your finger - the AI tells you what's under your finger:
• Touch left: "Wooden door, 7 feet on your left, slightly open"
• Drag center: "Clear path ahead, hardwood floor, 12 feet visible"
• Touch right: "Bookshelf against wall, 5 feet right, packed with books"
• Bottom of screen: "Coffee table directly ahead, 3 feet, watch your shins"
The world is now touchable.
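A rough sketch of how live analysis like this generally works in a browser, assuming frames are sampled from the camera and sent off for analysis: grab the rear camera with getUserMedia, copy frames onto a canvas, and hand each one to whatever builds the touchable map. The two-second interval and the onFrame callback are illustrative choices, not this app's real parameters.

```typescript
// Illustrative sketch: periodically sample camera frames in the browser
// and pass each one to the analyzer that builds the touchable object map.
async function startLiveCapture(onFrame: (frame: Blob) => void): Promise<void> {
  // Ask for the rear-facing camera; facingMode is a standard constraint.
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { facingMode: "environment" },
  });

  const video = document.createElement("video");
  video.srcObject = stream;
  video.muted = true;        // required for autoplay on most mobile browsers
  video.playsInline = true;  // keep iOS Safari from going fullscreen
  await video.play();

  const canvas = document.createElement("canvas");
  const ctx = canvas.getContext("2d");
  if (!ctx) throw new Error("2D canvas not supported");

  // Sample a frame every couple of seconds (illustrative rate).
  setInterval(() => {
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    ctx.drawImage(video, 0, 0);
    canvas.toBlob(blob => {
      if (blob) onFrame(blob); // e.g. send to the vision model for analysis
    }, "image/jpeg", 0.8);
  }, 2000);
}
```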
Real Scenario: Shopping Mall
You're at a busy mall. Noise everywhere. People walking past. You need to find the restroom and you're not sure which direction to go.
Old way? Ask someone, hope they give good directions, try to remember everything they said.
New way?
Point your camera down the hallway. Give it a few seconds.
Now drag your finger around:
• Touch left: "Store entrance on left, 15 feet, bright lights, appears to be clothing store"
• Drag center: "Wide corridor ahead, tiled floor, people walking, 30 feet visible"
• Touch right: "Information kiosk, 10 feet right, tall digital directory screen"
• Drag up: "Restroom sign, 25 feet ahead on right, blue symbol visible"
You just learned the entire hallway layout in 20 seconds.
Need to remember where that restroom was? Just touch that spot again. The map's still there.
Walk forward 20 feet, confused about where to go next? Point again. Get a new map. Drag your finger around.
But Wait - It Gets Wilder
Object Tracking:
Double-tap any object. The AI locks onto it and tracks it for you.
"Tracked: Restroom entrance. 25 feet straight ahead on right side."
Walk forward. The AI updates:
"Tracked restroom now 12 feet ahead on right."
Lost it? Double-tap again:
"Tracked restroom: About 8 steps ahead. Turn right in 4 steps. Group of people between you - stay left to avoid."
Zoom Into Anything:
Tracking that information kiosk? Swipe left.
BOOM. You're now exploring what's ON the kiosk.
• Touch top: "Mall directory map, large touchscreen, showing floor layout"
• Drag center: "Store listings, alphabetical order, bright white text on blue background"
• Touch bottom: "You are here marker, red dot with arrow, pointing to current location: level 2, near food court"
Swipe right to zoom back out. You're back to the full hallway view.
Read Any Text
Swipe up - the AI switches to text mode and maps every readable thing.
Now drag your finger:
• Touch here: "Restrooms. Arrow pointing right."
• Drag down: "Food Court level 3. Arrow pointing up."
• Touch lower: "Store hours: Monday to Saturday 10 AM to 9 PM, Sunday 11 AM to 6 PM"
Every sign. Every label. Every directory. Touchable. Explorable.
Scene Summary On Demand
Lost? Overwhelmed? Three-finger tap anywhere.
"Shopping mall corridor. Stores on both sides, restroom 25 feet ahead right, information kiosk 10 feet right, people walking in both directions. 18 objects detected."
Instant orientation. Anytime you need it.
Watch Mode (This One's Wild)
Two-finger double-tap.
The AI switches to Watch Mode and starts narrating live actions in real-time:
"Person approaching from left" "Child running ahead toward fountain" "Security guard walking past on right" "Someone exiting store carrying shopping bags"
It's like having someone describe what's happening around you, continuously, as it happens.
The Fundamental Difference
Every other app: AI decides → Describes → Done → Repeat
This app: You explore → Information stays → Go back anytime → You control everything
It's not an improvement.
It's a completely different paradigm.
You're Not a Listener Anymore. You're an Explorer.
Most apps make you passive.
This app makes you active.
• You decide what to explore
• You decide how long to spend there
• You discover what matters to you
• You can go back and check anything again
The AI isn't deciding what's important. You are.
The information doesn't disappear. It stays there.
You're not being helped. You're exploring.
That's what accessibility should actually mean.
Oh Right, There's More
Because sometimes you just need quick answers:
Voice Control: Just speak - "What am I holding?" "Read this." "What color is this shirt?"
Book Reader: Scan pages, explore line-by-line, premium AI voices, auto-saves your spot
Document Reader: Fill forms, read PDFs, accessible field navigation
Why a Web App? Because Speed Matters.
App stores = submit → wait 2 weeks → maybe approved → users update manually → some stuck on an old version for months.
Web app = fix bugs in hours. Ship features instantly. Everyone updated immediately.
Plus it works on literally every smartphone:
• iPhone ✓
• Android ✓
• Samsung ✓
• Google Pixel ✓
• Anything with a browser ✓
Install in 15 seconds:
1. Open browser
2. Visit URL
3. Tap "Add to Home Screen"
4. Done. It's an app now.
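If you're wondering what "Add to Home Screen" actually relies on: it's standard Progressive Web App plumbing - a web app manifest linked from the page plus a registered service worker. The paths /manifest.webmanifest and /sw.js below are illustrative, not this app's actual files.

```typescript
// Generic PWA installability sketch (illustrative paths, not this app's code).
//
// The page links a manifest, e.g. in index.html:
//   <link rel="manifest" href="/manifest.webmanifest">
// The manifest declares the app name, icons, and display: "standalone",
// which is why the installed icon opens without browser chrome.

if ("serviceWorker" in navigator) {
  // A registered service worker enables offline caching and is part of the
  // installability criteria in most browsers.
  navigator.serviceWorker
    .register("/sw.js")
    .then(reg => console.log("Service worker registered at", reg.scope))
    .catch(err => console.error("Service worker registration failed:", err));
}
```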
The Price (Let's Be Direct)
30-day free trial. Everything unlocked. No credit card.
After that: $9.99 CAD/month
Why? Because the AI costs me money every single time you use it. Plus I'm paying for servers. I'm one person building this.
I priced it to keep it affordable while keeping it running and improving.
Safety Warning (Important)
AI makes mistakes.
This is NOT a replacement for your cane, guide dog, or mobility training.
It's supplementary information. Not primary navigation.
Never make safety decisions based solely on what the AI says.
The Real Point of This Whole Thing
For years, every vision AI app has said:
"We'll tell you what you're looking at."
I'm saying something different:
"Explore what you're looking at yourself."
Not one description - touchable objects you can explore for as long as you want.
Not one explanation - a persistent map you can reference anytime.
Not being told - discovering for yourself.
Information that persists. Exploration you control. Discovery on your terms.
People are spending 10-15 minutes exploring single photos.
Going back to corners. Finding hidden details. Building complete mental pictures.
That's not accessibility.
That's exploration.
That's discovery.
That's control.
And I think that's what we should have been building all along.
You can try out the app here:
http://visionaiassistant.com
Comments
ah OK
Ah OK. So what features can I use after the trial ends? And I wish the app had a lifetime subscription that never expires. Now that would be cool!
@ Stephen
Would be great if you can update the original post with the link. For someone who is newly discovering this. It'd be better to find it in the first go rather than having to look through the comments.
@Gokul
Great minds think alike! I was updating this as you made that comment :).
well, I hope we can share the screen and it can tell us the element or the
the subject said it all.
hope it can have a feature that can share the computer screen, then when we press the up and down arrows it will tell us the menu of the game
@ming
Hi Ming,
Thank you for your suggestion! I understand you're looking for a feature that can capture your computer screen and read out game menus when navigating with arrow keys.
This is a challenging request because this app is built as a progressive web app (PWA) that runs in your browser. Web browsers have significant security restrictions around screen capture - they can't directly access another application's screen content or intercept keyboard events from other programs like games. This is by design to protect user privacy and security.
Why we chose a web app approach:
• Universal accessibility: Works on any device (iPhone, Android, Windows, Mac) without separate downloads or app store approvals
• Instant updates: Everyone gets new features immediately without reinstalling
• No storage concerns: Doesn't take up device space like native apps
• Cross-platform: One codebase works everywhere
• Easier to maintain: We can focus on building features rather than managing multiple platform-specific versions
The technical challenges for your use case:
• Web apps are sandboxed and can't access content outside the browser tab for security reasons
• Intercepting arrow key presses from games would require system-level permissions that browsers don't have
• Reading game menus would need the game developer to provide accessibility APIs, which is outside our control
Game accessibility is ultimately dependent on game developers implementing accessible features in their games. Unfortunately, this isn't something a web app can bridge due to browser security limitations.
Watch mode is turning out to be incredible
This near-realtime monitoring of the camera feed is something almost the entire community has been wanting for some time now, and no app or wearable could do it reliably so far - and we have it near perfect here. So that's saying something. It's too much to hope for maybe, but I still hope that someday this same thing gets done by a smartglass camera...
AI Navigator: Because sometimes you need to go hands-free
Live Camera Mode is great - drag your finger around, explore what's in front of you, get detailed descriptions. But what if you just want to walk somewhere without touching your phone?
That's the whole point of AI Navigator.
Three-finger swipe down → just start talking.
No dragging. No tapping. Just conversation.
"Take me to the bathroom." Guides you step-by-step with live distance updates
"What's in front of me?" Describes everything ahead without you lifting a finger
"Where's the exit?" Spots it, tells you exactly where and how far
It reads building directory maps automatically, remembers every location, and keeps track of where you've been. Multi-floor buildings? No problem - it'll guide you to the elevator, tell you which floor to go to, and pick up navigation when you arrive.
It's proactive too. Spots spills, moving people, stairs - warns you before you get there.
Different tools for different moments. Sometimes you want to explore with your hands. Sometimes you just want to get somewhere. Now you can do both.
Testing it now. Updates coming soon.
PS: Goes without saying, but to go completely hands-free you would need some sort of mount :).
@Gokul
I'm building those features now so that when Meta hands out that nice little API I can work out how to implement it. I'm staying 5 steps ahead by getting the infrastructure in place, and we can go from there :).
Can you add an Italian translation too?
Hi, first of all congratulations on the app, very innovative. I noticed that you added several languages but Italian is missing. Could you add it? Thank you.
screensharing
Screen sharing could be used, like Google AI Studio does, to let the web app capture screen contents.
Android, lots of unneeded elements on page
When I install the web app, and open the app, the first four items on the page are "region," "list," "region," "region." Maybe those could be trimmed from the start page?
@ Devin Prater
Screen sharing in a browser is not a system-level capture mechanism, and it cannot be used to directly interpret or interact with the contents of other running applications such as games, as Ming is requesting.
On desktop systems, a web app can indeed request temporary access to capture a specific browser tab, application window, or monitor using the browser's built-in screen sharing permission prompt. That access is strictly limited to what the user manually selects, only remains active during that session, and provides nothing more than a raw video feed. However, it does not expose the actual interface structure, menu hierarchy, UI elements, focus states, or internal game data. In other words, it can show pixels, but it cannot understand what those pixels represent in a reliable, semantic way.
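To make that concrete, here's a minimal generic sketch of the screen-sharing API in question (not code from this app; the function name is illustrative). getDisplayMedia() hands back nothing but a video stream of whatever the user picked:

```typescript
// Minimal sketch of browser screen sharing: you get pixels, nothing more.
async function captureSharedPixels(): Promise<MediaStream> {
  // The browser shows its own picker; the page cannot choose the target,
  // and the permission only lasts for the current session.
  const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
  // Frames from this stream could be drawn to a canvas and analyzed,
  // but the captured app's menus, focus states, and key presses are
  // never exposed to the page.
  return stream;
}
```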
In addition to that, a web app cannot intercept or listen to arrow-key input that is being handled by another application. The browser only receives keyboard input while it is the active, focused window. Once focus shifts to a game, the browser and any web app running inside it immediately lose access to those key events. This makes real-time "menu tracking" based on arrow keys technically impossible in a standard web environment.
On mobile platforms, especially iOS, the restrictions are even more severe. Safari and Progressive Web Apps cannot capture or analyze the screen contents of other apps at all. They also cannot run continuous background capture or monitoring processes. These limitations are enforced at the operating system level to protect user privacy, security, and data isolation between apps.
While some platforms demonstrate controlled screen capture in specific environments, those implementations operate within tightly constrained ecosystems and do not equate to universal, system-wide access for third-party web applications. A typical web app does not gain the ability to observe other programs or interpret their user interfaces simply by using screen sharing.
For a feature that reads game menus dynamically while navigating with arrow keys, the only technically valid approaches are:
• Native system-level accessibility services provided by the operating system
• Direct accessibility APIs implemented by the game developer
• A dedicated desktop application with elevated permissions and OS-level hooks
These capabilities are intentionally blocked from browsers and PWAs to prevent surveillance, input capture, and unauthorized data access across applications.
@ Devin Prater
Thanks for letting me know. I don't have an Android so I'm counting on my Android friends to yell at me if there are any bugs. I'll fix that with the next update :).
@ Ambro
I'll definitely look into this for you.
well, I tried it on my phone and I have some suggestions
well,
when I use my finger to explore around the screen it will tell me a desk, keyboard, screen and so on.
but I hope that it can describe more when I release my finger...
and also I tried it on my PC and it doesn't do too much,
because my computer doesn't have a camera.
@ming
Hi Ming,
What feature were you using? Was it Live Camera Explorer or Photo Explorer?
Did you double-tap on an object and then swipe left to zoom in? Were you able to do the tutorial, or did it pop up for you at all?
I need more details in order to help you better.
Also, yes it does need a camera to work.
ok...
let me try it again...
maybe I didn't explore it deeply enough.