Google AI Guide
About Google AI Guide
Gemini + Gemini Live · Google Cloud Run · Maps + Vision + Translate · Multimodal AI Agent
Contents
  • Architecture
  • Feature Explanations
  • YouTube
  • Translate
  • Meal Coach
  • Sight
  • Plant
  • Tourist
  • AR Scout
  • City TV
  • Open Street View
  • Google Products

Architecture

Quick read: the browser handles interaction and rendering; the Express backend coordinates feature logic; Google services provide maps, vision, translation, and voice capabilities; and Gemini turns requests into structured, UI-ready responses.

Agent Modes
  • Live agent behavior: realtime audio/vision interaction through Gemini Live with interruption-friendly control flow.
  • Creative storyteller behavior: interleaved narrative output combining text, visual media, and audio guidance in one continuous flow.
  • UI navigator behavior: visual/context interpretation that resolves intent and emits executable UI actions such as opening modes, jumping Street View, rendering cards, and running route flows.
System Topology
  • Flow: Browser UI -> backend orchestrator -> Google APIs + Gemini -> structured UI response.
  • Client role: collects user intent, manages popups, map state, and feature controls.
  • Server role: normalizes inputs, calls external services, and returns deterministic payloads.
  • Render result: cards, map markers, Street View state, media actions, and guided outputs.
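The topology above converges on one structured payload that the client renders deterministically. A minimal sketch of that idea follows; every field and function name here is hypothetical, illustrating the shape of the contract rather than the app's actual one:

```javascript
// Hypothetical shape of the structured payload the backend returns.
function buildUiResponse({ summary, places = [], streetView = null, actions = [] }) {
  return {
    cards: summary ? [{ type: 'summary', text: summary }] : [],
    markers: places.map((p) => ({ lat: p.lat, lng: p.lng, label: p.name })),
    streetView,   // e.g. { lat, lng, heading } or null
    actions,      // e.g. [{ kind: 'openMode', mode: 'tourist' }]
  };
}

// Client-side dispatch: each payload key maps to one render responsibility.
function dispatchUiResponse(res, renderers) {
  res.cards.forEach((c) => renderers.renderCard(c));
  res.markers.forEach((m) => renderers.addMarker(m));
  if (res.streetView) renderers.jumpStreetView(res.streetView);
  res.actions.forEach((a) => renderers.runAction(a));
}
```

Keeping the payload flat and typed like this is what lets the frontend render reliably without re-interpreting model output.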
Frontend / Backend
  • Frontend files: `web/index.html` for structure, `web/app.js` for state/mode routing, `web/styles.css` for layout and UI system.
  • Backend file: `backend/server.mjs` as the orchestration hub for chat, tourist, city, AR, sight, plant, meal, translation, and TTS paths.
  • Realtime layer: Gemini Live is used where low-latency conversational behavior is required.
  • Output contract: backend returns structured data so the frontend can render reliably.
Google Platform Stack
  • Gemini: `@google/genai` and `@google/generative-ai` for reasoning, multimodal interpretation, and structured generation.
  • Maps: Maps JS, Places, Geocoding, Directions, Street View, and Street View Static for geospatial context.
  • Vision + Translation + TTS: scan, OCR, multilingual conversion, and narrated playback.
  • YouTube Data API: video search and metadata retrieval for media workflows.
Deployment & Runtime
  • Container: root `Dockerfile` builds one deployable image.
  • Hosting: Google Cloud Run hosts the production service.
  • API enablement: services are enabled in Google Cloud Console under APIs & Services.
  • Runtime config: environment variables and Google credentials wire the backend to enabled services.
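A single-image build of the kind described above might look like the following Dockerfile sketch; the Node version, copy paths, and install flags are assumptions, not the project's actual file:

```dockerfile
# Hypothetical single-image build for the Express backend + static web client.
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY backend ./backend
COPY web ./web
ENV NODE_ENV=production
# Cloud Run injects PORT; the server should listen on it.
CMD ["node", "backend/server.mjs"]
```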
Development workflow note: MCP tooling is used in the development environment for structured tool access, browser-assisted validation, and operational workflows. It supports development and verification, but is not a direct end-user runtime dependency.

Feature Explanations

This page is the feature index. Use the left navigation to open each feature as its own standalone About page with its own title, build summary, API/Gemini usage, and step-by-step guide.

YouTube

  • Build: Popup player with search rail, queue state, speed/captions controls, and pop-out support.
  • APIs: YouTube Data API + YouTube Iframe Player API.
  • Gemini involvement: Optional recommendation/ranking context for map-related topics.
  • Best on: Desktop for multitasking; mobile pop-out is supported.

Step-by-step user guide

  1. Open YouTube from the left feature row.
  2. Type a topic in the search box and run the search.
  3. Select a result card to load the video.
  4. Use speed, captions, and pop-out controls as needed.
  5. Use the result rail to switch videos without leaving the app.
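Step 2 maps to a YouTube Data API v3 search call. A sketch of building that request URL follows; the endpoint and parameters are the public v3 ones, while the function name and key value are placeholders:

```javascript
// Build a YouTube Data API v3 search URL for a topic query.
function buildYouTubeSearchUrl(query, apiKey, maxResults = 10) {
  const params = new URLSearchParams({
    part: 'snippet',          // return title, description, thumbnails
    type: 'video',            // exclude channels and playlists
    q: query,
    maxResults: String(maxResults),
    key: apiKey,
  });
  return `https://www.googleapis.com/youtube/v3/search?${params}`;
}
```

The response's `items[].id.videoId` values are what the Iframe Player API then loads.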

Translate

  • Build: Text input + speech capture + translated speech output in one card.
  • APIs: Cloud Translation API + Web Speech API + Cloud TTS.
  • Gemini involvement: Refines phrasing and contextual translation quality for natural output.
  • Best on: Mobile for travel/on-location conversation.

Step-by-step user guide

  1. Open Translate.
  2. Choose source and target languages.
  3. Type text or tap Listen & Translate.
  4. Run translation and review output.
  5. Tap Speak Translation for audio playback.
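The translate step calls the Cloud Translation v2 REST endpoint. A hedged sketch of that round trip, assuming a simple API-key setup (the helper names and the injectable `fetchFn` are illustrative, not the app's code):

```javascript
// Build the request body for Cloud Translation v2 (REST).
function buildTranslateRequest(text, sourceLang, targetLang) {
  return {
    url: 'https://translation.googleapis.com/language/translate/v2',
    body: { q: text, source: sourceLang, target: targetLang, format: 'text' },
  };
}

// fetchFn is injectable so the flow can be exercised without network access.
async function translate(text, sourceLang, targetLang, apiKey, fetchFn = fetch) {
  const { url, body } = buildTranslateRequest(text, sourceLang, targetLang);
  const res = await fetchFn(`${url}?key=${apiKey}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  const json = await res.json();
  return json.data.translations[0].translatedText;
}
```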

Meal Coach

  • Build: Camera-first nutrition support workflow that analyzes a meal photo and returns portion-aware coaching in a readable breakdown.
  • What it scans for: visible meal components, likely food categories, plate composition, portion size signals, and overall meal balance based on what can be inferred from the image.
  • Practical use: helps users quickly understand whether a meal looks balanced, heavy, light, protein-forward, carb-heavy, or portion-dense before eating.
  • APIs: Vision API + Gemini API + Cloud TTS.
  • Gemini involvement: Gemini converts visual meal hints into practical coaching, portion advice, and user-friendly explanation instead of raw labels alone.
  • Best on: Mobile, optimized for quick camera capture and on-the-go use before or during meals.

How it helps in practice: users can quickly photograph a plate and get guidance such as what looks oversized, what nutrients may be missing, and what to adjust if they want a lighter or more balanced meal.

Step-by-step user guide

  1. Open Meal Coach.
  2. Tap Take Photo and capture or upload a clear image of the meal.
  3. Tap Analyze Meal to send the image for staged analysis.
  4. Read the summary, detail, and coaching tips to understand likely meal composition and portioning.
  5. Use Expand Analysis if you want a deeper breakdown.
  6. Use Speak Tips if you want the meal guidance read aloud.
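The staged analysis in steps 3–5 amounts to sending the same image with different prompts. A sketch of building a Gemini multimodal request body for one stage follows; the `inlineData` parts structure is the standard Gemini content format, while the stage names and prompt texts are invented for illustration:

```javascript
// Build Gemini multimodal contents: one meal photo plus a stage-specific prompt.
// Stage names and prompt wording are illustrative, not the app's actual prompts.
function buildMealAnalysisContents(base64Jpeg, stage = 'summary') {
  const prompts = {
    summary: 'Describe the visible meal components and overall balance.',
    detail: 'Estimate portion sizes and flag anything oversized.',
    tips: 'Give two practical adjustments for a lighter, balanced plate.',
  };
  return [
    {
      role: 'user',
      parts: [
        { inlineData: { mimeType: 'image/jpeg', data: base64Jpeg } },
        { text: prompts[stage] },
      ],
    },
  ];
}
```

Because the image travels as an in-memory base64 part, nothing needs to be written to disk, which matches the "processed in-memory only" behavior.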

Sight

  • Build: Camera-based visual assist mode designed to help users understand what is happening in front of them through clear, text-first scene interpretation.
  • What it scans for: visible objects, people, gestures, signs, screens, written text, labels, menus, room context, and notable visual cues such as expressions or body positioning when they are visually relevant.
  • Accessibility use: especially useful for people who are hard of hearing because it can help describe what is visible in the environment when audio-only cues may be missed.
  • APIs: Vision API + Gemini multimodal analysis.
  • Gemini involvement: Gemini is the primary reasoning layer that turns visual signals into readable explanations, follow-up answers, and more specific scene detail on request.
  • Best on: Both; mobile for live camera access, desktop for longer review and follow-up Q&A.

How it helps in practice: users can point the camera at a room, sign, person, or object and ask questions such as “What is happening here?”, “What does that sign say?”, “What is that person holding?”, or “What should I notice in this scene?”

Step-by-step user guide

  1. Open Sight and start camera.
  2. Point at the scene, person, sign, object, or environment you want explained.
  3. Enter a question in the prompt field, or ask for a general description of what is in view.
  4. Tap Sight/Ask to run analysis.
  5. Read the returned explanation and use follow-up questions to get more detail about people, text, objects, or scene context.
  6. Reframe the camera if you want clearer detail about a specific sign, face, object, or area of the scene.
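Follow-up questions in step 5 only work if earlier turns travel with the next request. One possible way to keep that rolling context, sketched with invented names and an arbitrary turn limit:

```javascript
// Keep a rolling Q&A history so follow-up questions stay grounded in the
// same scene. The turn shape and cap are illustrative, not the app's design.
function createSightSession(maxTurns = 6) {
  const turns = [];
  return {
    record(question, answer) {
      turns.push({ question, answer });
      if (turns.length > maxTurns) turns.shift(); // drop the oldest turn
    },
    context() {
      return turns.map((t) => `Q: ${t.question}\nA: ${t.answer}`).join('\n');
    },
  };
}
```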

Plant

  • Build: Plant safety scan workflow built to identify likely plant type signals and translate them into practical dog-toxicity caution guidance.
  • What it scans for: leaves, stems, flowers, growth pattern, visible plant structure, and other visual cues that help estimate what plant the camera is pointed at.
  • Practical use: especially useful for dog owners who want a quick caution check while walking outdoors, in parks, or around home gardens.
  • APIs: Vision API + Gemini API.
  • Gemini involvement: Gemini summarizes likely risk level, explains why the plant may matter, and provides practical caution language rather than only an identification guess.
  • Best on: Mobile for real-world outdoor scanning and pet-safety checks while moving.

How it helps in practice: users can point the camera at a plant during a walk and quickly understand whether it may be worth keeping a dog away from it, even before formal identification is certain.

Step-by-step user guide

  1. Open Plant and start camera.
  2. Frame the plant clearly with leaves or flowers visible in good lighting.
  3. Tap Scan Plant.
  4. Read the dog-safety caution and plant summary.
  5. If the result looks uncertain, retake from a closer angle with clearer plant detail.
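The caution-first framing described above can be pictured as a small formatting layer over the model's output. A sketch, where the risk levels, confidence threshold, and wording are all invented for illustration:

```javascript
// Map a model-estimated risk level to cautious, dog-owner-facing language.
// Levels, threshold, and phrasing are illustrative; identification is never certain.
function formatPlantCaution(plantGuess, riskLevel, confidence) {
  const lead =
    confidence < 0.5
      ? `This might be ${plantGuess}, but the match is uncertain.`
      : `This looks like ${plantGuess}.`;
  const advice = {
    high: 'Keep your dog well away and do not let it chew any part.',
    moderate: 'Best to steer your dog around it to be safe.',
    low: 'Likely low risk, but discourage chewing unknown plants.',
  }[riskLevel] ?? 'Risk unknown; treat it as off-limits until identified.';
  return `${lead} ${advice}`;
}
```

Note the fallback branch: when the model's risk label is missing or unrecognized, the safest message wins.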

Tourist

  • Build: Guided destination storytelling mode that streams a place-based narrative while synchronizing Street View scene changes across story segments.
  • What it works from: destination input, location lookup, Street View availability, nearby landmark context, and Gemini-authored narrative sequencing.
  • Practical use: gives users a quick guided tour of a destination by combining location context, historical/cultural framing, and visual movement through the place.
  • APIs: Gemini API + Street View + Geocoding + YouTube links + Cloud TTS.
  • Gemini involvement: Gemini generates the story structure, scene progression, and narration tone so the destination feels like a guided mixed-media experience rather than a static map lookup.
  • Best on: Desktop for richer storytelling and easier viewing; mobile also works well for quick guided destination previews.

How it helps in practice: users can enter a city, district, or landmark and get a narrated sequence of places worth knowing, while the app moves through matching Street View scenes.

Step-by-step user guide

  1. Open Tourist.
  2. Enter a destination and choose a story focus type.
  3. Tap Start Story.
  4. Follow the segment cards as Street View updates to the next scene.
  5. Use Pause Story whenever you want to stop and explore a segment at your own pace.
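The segment-by-segment sync in steps 4–5 can be sketched as a small story player that advances narration and moves the panorama together; the segment fields and function names are hypothetical:

```javascript
// Advance an interleaved story: each segment carries narration and an
// optional Street View target. Segment shape is illustrative.
function createStoryPlayer(segments, moveStreetView) {
  let index = -1;
  let paused = false;
  return {
    next() {
      if (paused || index + 1 >= segments.length) return null;
      const seg = segments[++index];
      if (seg.pano) moveStreetView(seg.pano); // e.g. { lat, lng, heading }
      return seg;
    },
    pause() { paused = true; },
    resume() { paused = false; },
  };
}
```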

AR Scout

  • Build: Live camera scout mode that scans what the camera sees and prioritizes readable text translation before scene interpretation when words are detected.
  • What it scans for: signs, labels, menus, storefront text, posters, landmarks, and scene-level visual context in the center of the camera frame.
  • Practical use: useful when traveling or exploring unfamiliar places because it can translate visible text and summarize what the user is pointing at.
  • APIs: MediaDevices + Vision OCR + Translation API + Gemini API.
  • Gemini involvement: Gemini converts OCR signals and scene hints into concise, useful explanations and decides when translation should be prioritized over generic description.
  • Best on: Mobile-first by design because it is built around live camera scanning in the field.

How it helps in practice: users can point the phone at foreign-language signs, menus, or unfamiliar locations and get an immediate translation or explanation of what they are seeing.

Step-by-step user guide

  1. Open AR Scout and start camera.
  2. Aim at the sign, object, menu, poster, or place you want analyzed.
  3. Tap Scan to trigger the scan manually.
  4. Review translation-first output when text is detected, or scene explanation when visual context is dominant.
  5. Use retake guidance if the text or object is too small, blurry, or poorly framed.
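The translation-first decision in step 4 boils down to a routing choice on the OCR result. One possible heuristic, sketched with invented thresholds and a deliberately crude foreign-script check:

```javascript
// Decide whether a scan returns translation-first output or a scene
// description. Thresholds and the non-ASCII heuristic are illustrative.
function chooseScoutMode(ocrText, ocrConfidence) {
  const hasText = ocrText.trim().length >= 2 && ocrConfidence >= 0.6;
  // Crude signal: non-ASCII characters suggest text worth translating.
  const looksForeign = /[^\x00-\x7F]/.test(ocrText);
  if (hasText && looksForeign) return 'translate-first';
  if (hasText) return 'text-plus-scene';
  return 'scene-description';
}
```

A production version would use the OCR engine's detected language rather than a character-range test, but the routing structure is the point here.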

City TV

  • Build: City-focused intelligence mode with filters, scenarios, and map-linked cards.
  • APIs: Maps/Places/Geocoding + Gemini API + news/event feed integration.
  • Gemini involvement: Signal scoring, scenario simulation, and action-ready summaries.
  • Best on: Desktop for wider panel controls; mobile supports focused quick checks.

Step-by-step user guide

  1. Open City TV.
  2. Enter city/state and choose focus + timeframe.
  3. Tap Simulate to generate a scenario plan.
  4. Review cards and city signal insights.
  5. Use Open Street View for location context.

Open Street View

  • Build: Launches immersive panorama tied to city, route, or story context.
  • APIs: Street View Service + Maps JavaScript API.
  • Gemini involvement: Guides where and why to jump, especially in route/tour scenarios.
  • Best on: Desktop for larger scene framing; mobile for location-based quick use.

Step-by-step user guide

  1. Open Street View from City TV, Tourist, or chat command.
  2. Confirm the target location or address is set.
  3. Wait for the panorama to load and its orientation to settle.
  4. Navigate using map-linked controls and feature cards.
  5. Exit Street View when done.
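Under the hood, the jump resolves the nearest panorama through `google.maps.StreetViewService.getPanorama`. Since that service only exists in the browser with the Maps JavaScript API loaded, only the request shape is sketched here; the helper name and defaults are assumptions:

```javascript
// Build the options object passed to StreetViewService.getPanorama.
function buildPanoramaRequest(lat, lng, radiusMeters = 50) {
  return {
    location: { lat, lng },
    radius: radiusMeters,   // search radius (meters) for the nearest panorama
    source: 'outdoor',      // prefer outdoor panoramas over indoor ones
  };
}
```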

Google Products

  • Purpose: direct reference page for the Google products, APIs, and platform services used across Google AI Guide.
  • How to use this page: open any link below to review the official product page, API documentation, or platform overview behind the related feature.
  • Coverage: includes Gemini, Maps, Places, Geocoding, Directions, Street View, Vision, Translation, Text-to-Speech, YouTube Data API, and Cloud Run.

Google product links

  • Gemini API
  • Gemini Live API
  • Google Maps JavaScript API
  • Google Places API
  • Google Geocoding API
  • Google Directions Service
  • Google Street View Service
  • Google Street View Static API
  • Google Cloud Vision API
  • Google Cloud Translation API
  • Google Cloud Text-to-Speech
  • YouTube Data API
  • Google Cloud Run
  • Google Cloud APIs & Services

Feature mapping

  • Chat, Tourist, City TV, AR Scout, Sight, Plant, Meal Coach: Gemini API is the reasoning layer behind summaries, multimodal interpretation, and structured outputs.
  • Map navigation and place context: Maps JavaScript API, Places API, Geocoding, Directions, and Street View power movement, lookups, and immersive map scenes.
  • Image and language features: Vision, Translation, and Text-to-Speech support OCR, multilingual conversion, and narrated playback.
  • Media and hosting: YouTube Data API supports video discovery, and Cloud Run hosts the deployed production application.