Colin Kim
Evaluating Autonomous Execution: Exploring Agentic-Driven Software Development, Algorithmic Trading, and Proprietary Environments
8th Grade Project Week
Three project-week case studies: identity software, LMS research, and trading automation.
Codex is OpenAI's coding agent. Hermes Agent is an open-source agent created by Nous Research; I ran my own setup on an Ubuntu server at home.
I used Codex less like a chatbot and more like a coding teammate that could work inside a real project.
Codex could inspect files and understand the existing project before changing it.
It could update code, run commands, and check whether the result built.
I gave the goals, constraints, and definition of done.
Inspect, plan, edit, test, debug, verify, document.
A coding agent can work across a project: reading code, editing files, running checks, and reporting what changed.
The phone session connected back to my MacBook Air and continued real code work from the same project context.
Hermes Agent is built by Nous Research. My setup made it reachable from Discord, where it could use tools behind the chat interface.
I communicated through the Agent HQ Discord server.
Hermes could run code, inspect logs, manage agents, and report status.
Discord showed gateway restarts, priority mode, reasoning settings, and command approvals.
It connected AI reasoning to a system I could monitor from anywhere.
I saw a Trump/IBM clip on X and wanted to know when the video was actually from, so I asked Hermes from Discord.
My Hermes Agent instance runs on an Ubuntu server at home. With Tailscale and Termius, my phone could securely SSH into it and run updates.
The projects are different, so the proof is different too.
Google sign-in, eligibility records, signed Wallet pass, rate limits, audit logs.
Built application
Route captures, endpoint fixtures, coverage labels, redacted research notes.
Research map
Four bots, paper runs, live canary, SQLite state, dashboard, test counts.
Execution system
Every case study used the same basic loop.
Live site preview: the public entry point is intentionally simple, but it leads into identity checks and pass generation.
The screenshot is the artifact: name, grade, school year, photo, school branding, and scannable barcode.
Apple Wallet passes are not just images. To make and distribute a real pass, I needed Apple-issued certificates from the Apple Developer Program.
The Apple Developer Program is Apple's paid membership for building and distributing apps and Wallet passes.
Wallet passes need a Pass Type ID certificate so the pass can be signed and trusted by Apple Wallet.
$99 per year.
I applied Friday, worried it would miss the deadline, and was approved Sunday.
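To show why the certificate matters, here is a minimal sketch of the unsigned half of a pass bundle. A pass package holds pass.json, its images, and a manifest.json that maps each file name to its SHA-1 hash; the Pass Type ID certificate is then used to sign that manifest. The identifiers, names, and field values below are placeholders, not the ones used in this project, and the signing step itself is left out.

```python
import hashlib
import json


def build_manifest(files: dict) -> str:
    """Return manifest.json text: each file name mapped to its SHA-1 hex digest."""
    digests = {name: hashlib.sha1(data).hexdigest() for name, data in files.items()}
    return json.dumps(digests, indent=2, sort_keys=True)


# Hypothetical identifiers; real values come from the Apple Developer account.
pass_json = json.dumps({
    "formatVersion": 1,
    "passTypeIdentifier": "pass.example.school.id",
    "teamIdentifier": "TEAMID1234",
    "serialNumber": "demo-0001",
    "organizationName": "Example School",
    "description": "Student ID",
    "generic": {
        "primaryFields": [{"key": "name", "label": "Name", "value": "Sample Student"}]
    },
    "barcodes": [{
        "format": "PKBarcodeFormatQR",
        "message": "demo-0001",
        "messageEncoding": "iso-8859-1",
    }],
}).encode("utf-8")

# The icon bytes are a stand-in; a real pass needs actual PNG files.
manifest = build_manifest({"pass.json": pass_json, "icon.png": b"placeholder image bytes"})
print(manifest)
```

Because the signature covers the manifest and the manifest covers every file, changing any field or image after signing makes the pass invalid.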
A user experience that looks simple on the outside depends on several quiet checks inside the system.
The hardest parts were not decorative. They were the parts that make a school identity tool safe enough to take seriously.
Directory exports and profile photos had to be cleaned, limited, and kept out of public storage.
Wallet passes need the right certificates, team IDs, pass identifiers, and deployment secrets.
Apple Wallet is iPhone-specific, so laptop users needed a handoff instead of a broken download.
Rate limits and audit records made repeated pass attempts visible and controllable.
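The rate-limit and audit idea can be sketched in a few lines. This is an illustrative in-memory version, not the project's implementation: the class name, the limit of three attempts, and the one-hour window are all assumptions, and a deployed service would keep this state in a database.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class PassRateLimiter:
    """Sliding-window limiter that records every decision in an audit list."""

    def __init__(self, max_attempts: int = 3, window_seconds: float = 3600.0):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts = defaultdict(deque)  # user id -> timestamps of allowed attempts
        self.audit_log = []                  # every decision, allowed or refused

    def attempt(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        recent = self._attempts[user_id]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        allowed = len(recent) < self.max_attempts
        if allowed:
            recent.append(now)
        self.audit_log.append({"user": user_id, "time": now, "allowed": allowed})
        return allowed


limiter = PassRateLimiter(max_attempts=2, window_seconds=60)
for _ in range(3):
    print(limiter.attempt("student@example.org"))
```

Refused attempts are logged too, which is what makes repeated attempts visible rather than silently dropped.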
Research question: which Blackbaud LMS data can be observed safely and reused with clear coverage limits?
Blackbaud did not behave like one simple school website. It moved across school LMS pages, Blackbaud sign-in, Google OAuth, Blackbaud ID, and back into the LMS session.
The research turned browser traffic into redacted captures, endpoint fixtures, normalized student concepts, and a coverage map.
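A redacted capture can be pictured as the same JSON shape with the sensitive values blanked out. The sketch below is an assumption about how that could be done; the key names are hypothetical and a real capture would need its own reviewed list.

```python
# Hypothetical field names; a real capture needs a reviewed list of its own.
SENSITIVE_KEYS = {"email", "name", "student_id", "token", "cookie", "authorization"}


def redact(value, sensitive=SENSITIVE_KEYS):
    """Return a copy with sensitive values replaced, keeping the structure intact."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in sensitive else redact(item, sensitive)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, sensitive) for item in value]
    return value


capture = {"assignments": [{"title": "Essay draft", "student_id": 42}], "Token": "abc123"}
print(redact(capture))
```

Keeping the structure while removing the values is what lets a capture become a reusable endpoint fixture.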
The app surface was pruned to match what the endpoint evidence could support.
Assignments, calendar lists, resources, groups, directory, and messages had useful evidence.
Grades, attendance, and course details had uneven coverage and sensitive data concerns.
A native app path was not simple because the observed session was browser-based.
Compare discovered needs against official APIs, OneRoster, or a school-approved integration.
The trading bots were not just rule calculators. They were trading programs watched by my Hermes setup, with logs, alerts, dashboards, tests, and safety checks around them.
People buy and sell contracts about future events. The price can be read like a rough probability.
The simplest version is: market probability versus the bot's estimated fair probability.
The bot does not just ask, 'is this a good trade?' It asks, 'is this good enough, safe enough, and liquid enough to be worth money?'
A stronger disagreement can justify a bigger starting size.
Exposure, max order size, confidence, and drawdown can shrink it.
If the market does not show enough real size, the order gets smaller.
If the safe size is too small, the bot skips instead of forcing a trade.
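The sizing rules above can be sketched as one small function. Every number here (base size, minimum edge, caps) is an illustrative assumption, not a setting from the real bots, and the function name is made up for this example.

```python
def order_size(market_prob, fair_prob, *, base_size=10.0, min_edge=0.05,
               max_order=25.0, remaining_exposure=50.0, confidence=1.0,
               drawdown_scale=1.0, visible_size=100.0, min_size=1.0):
    """Return a dollar size for a buy, or 0.0 to skip. All thresholds are illustrative."""
    edge = fair_prob - market_prob                 # bot's estimate minus market price
    if edge < min_edge:
        return 0.0                                 # not enough disagreement to act
    size = base_size * (edge / min_edge)           # stronger disagreement, bigger start
    size *= max(0.0, min(confidence, 1.0))         # low confidence shrinks it
    size *= max(0.0, min(drawdown_scale, 1.0))     # recent losses shrink it
    size = min(size, max_order, remaining_exposure)
    size = min(size, visible_size)                 # never ask for more than the book shows
    return size if size >= min_size else 0.0       # too small: skip instead of forcing


print(order_size(0.50, 0.60))                      # healthy edge, default limits
print(order_size(0.50, 0.60, visible_size=0.5))    # thin book, so the bot skips
```

The order of the steps matters: the starting size comes from the edge, and every later step can only shrink it.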
This chapter adapted the same general architecture across two domains and two exchanges.
Polymarket BTC/ETH crypto bot
Crypto / Polymarket
Kalshi BTC/ETH crypto bot
Crypto / Kalshi
Polymarket temperature bot
Weather / Polymarket
Kalshi high-temperature bot
Weather / Kalshi
The project grew from separate experiments into installable Python apps with shared infrastructure.
Every bot follows the same loop, even though the markets are different.
The crypto bots use Binance BTC/ETH prices as outside evidence, then compare that movement to prediction-market prices.
Watches Polymarket BTC/ETH up-down contracts and fast Binance price movement.
Polymarket
Adapts the crypto strategy to Kalshi, including fees, fills, settlements, and account reconciliation.
Kalshi
Weather markets move slower, but the data itself is uncertain and harder to parse.
Compares NOAA and Open-Meteo forecasts to Polymarket temperature-bucket contracts.
Polymarket weather
Adapts the weather strategy to Kalshi KXHIGH contracts with station and threshold parsing.
Kalshi weather
The simple idea was to buy underpriced contracts. The real work was everything required to make that idea safe and measurable.
REST, websockets, authentication, rate limits
Tickers, dates, outcomes, weather stations, thresholds
Orders, fills, fees, settlements, cash, positions
SQLite, dashboards, logs, tests, alerts
The architecture stayed understandable because each layer had a job.
Binance, NOAA, Open-Meteo, Polymarket, Kalshi
Position caps, drawdown limits, liquidity, confidence
Trades, positions, fills, alerts, equity snapshots
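As a rough picture of the state layer, here is a minimal SQLite sketch using Python's built-in sqlite3 module. The table and column names are assumptions made for illustration; the real schema covers more (positions, fills, alerts) and is not shown in this deck.

```python
import sqlite3

# Hypothetical schema: two of the state tables, simplified.
SCHEMA = """
CREATE TABLE IF NOT EXISTS trades (
    id INTEGER PRIMARY KEY,
    market TEXT NOT NULL,
    side TEXT NOT NULL,
    price REAL NOT NULL,
    size REAL NOT NULL,
    mode TEXT NOT NULL CHECK (mode IN ('paper', 'live')),
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS equity_snapshots (
    id INTEGER PRIMARY KEY,
    equity REAL NOT NULL,
    taken_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_state(path=":memory:"):
    """Open the database and create the tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_trade(conn, market, side, price, size, mode="paper"):
    """Insert one trade; the connection context commits or rolls back."""
    with conn:
        conn.execute(
            "INSERT INTO trades (market, side, price, size, mode) VALUES (?, ?, ?, ?, ?)",
            (market, side, price, size, mode),
        )


conn = open_state()
record_trade(conn, "BTC-UP", "buy", 0.42, 5.0)
print(conn.execute("SELECT market, side, price, size, mode FROM trades").fetchall())
```

Storing the mode on every row keeps paper and live results from being mixed up later in a dashboard.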
A responsible trading bot must know when not to trade.
The bots simulate trades unless live mode is deliberately enabled.
--live --confirm-real-money --confirm-no-guaranteed-edge
Single trades stay below configured account exposure limits.
The bot stops opening trades after losses cross a threshold.
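The live-mode flags listed above can be enforced with Python's argparse. This is a sketch of the gating idea only: the flag names come from the slide, but the function name and the error message are assumptions.

```python
import argparse


def parse_mode(argv):
    """Return 'paper' unless live mode and both confirmations are present."""
    parser = argparse.ArgumentParser(
        description="Paper trading unless live mode is fully confirmed."
    )
    parser.add_argument("--live", action="store_true")
    parser.add_argument("--confirm-real-money", action="store_true")
    parser.add_argument("--confirm-no-guaranteed-edge", action="store_true")
    args = parser.parse_args(argv)
    if not args.live:
        return "paper"  # the safe default
    if not (args.confirm_real_money and args.confirm_no_guaranteed_edge):
        # parser.error prints a usage message and exits with status 2.
        parser.error("--live requires --confirm-real-money and --confirm-no-guaranteed-edge")
    return "live"


print(parse_mode([]))
print(parse_mode(["--live", "--confirm-real-money", "--confirm-no-guaranteed-edge"]))
```

Requiring three separate flags means a single typo or a copied command cannot switch the bot to real money by accident.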
The bots did not trade every apparent edge. They checked whether the situation was safe enough to act.
The displayed price may not be buyable at useful size.
The model may be uncertain or forecasts may disagree.
Open or pending orders may already use the risk budget.
Old quotes or stale markets are not safe inputs.
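The four checks above can be written as a function that returns the reasons for refusing a trade. The field names and thresholds below are illustrative assumptions, not the bots' real configuration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    visible_size: float       # dollars available at the quoted price
    confidence: float         # model confidence between 0 and 1
    open_risk: float          # dollars already committed to open or pending orders
    quote_age_seconds: float  # how old the quote is


def refusal_reasons(c, *, order_size=5.0, min_confidence=0.6,
                    risk_budget=50.0, max_quote_age=10.0):
    """Return a list of reasons to refuse; an empty list means the trade may proceed."""
    reasons = []
    if c.visible_size < order_size:
        reasons.append("thin liquidity")
    if c.confidence < min_confidence:
        reasons.append("low confidence")
    if c.open_risk + order_size > risk_budget:
        reasons.append("risk budget used")
    if c.quote_age_seconds > max_quote_age:
        reasons.append("stale quote")
    return reasons


print(refusal_reasons(Candidate(visible_size=100.0, confidence=0.9,
                                open_risk=0.0, quote_age_seconds=1.0)))
print(refusal_reasons(Candidate(visible_size=1.0, confidence=0.5,
                                open_risk=48.0, quote_age_seconds=30.0)))
```

Returning reasons instead of a plain yes or no makes every skipped trade explainable in the logs.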
The strongest evidence was not one lucky trade. It was the engineering surface around the bots.
Instead of trusting logs by memory, the dashboard turned databases into a view a human could understand.
A lot of the learning came from small problems that would matter in a live system.
Names, dates, tickers, outcomes, buckets, and thresholds were inconsistent.
Some attractive prices had too little visible size to trade safely.
Paper positions still needed correct tie and expiry behavior.
Orders, fills, fees, cash, and settlements had to reconcile with the exchange.
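Reconciliation can be sketched as recomputing cash from local records and comparing it with what the exchange reports. The record shape and the one-cent tolerance are assumptions for illustration; real exchange responses have their own formats.

```python
def reconcile_cash(starting_cash, fills, settlements, reported_cash, tolerance=0.01):
    """Compare locally computed cash with the exchange-reported balance."""
    expected = starting_cash
    for fill in fills:
        cost = fill["price"] * fill["quantity"]
        # Buys spend cash and sells return it; fees always reduce cash.
        expected += -cost if fill["side"] == "buy" else cost
        expected -= fill["fee"]
    expected += sum(settlements)  # payouts from contracts that resolved
    difference = reported_cash - expected
    return {
        "expected": round(expected, 2),
        "difference": round(difference, 2),
        "ok": abs(difference) <= tolerance,
    }


fills = [{"side": "buy", "price": 0.40, "quantity": 10, "fee": 0.07}]
print(reconcile_cash(100.0, fills, settlements=[10.0], reported_cash=105.93))
```

A nonzero difference does not say which record is wrong, only that the local state should not be trusted until it is explained.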
The strongest fixes reduced a specific operational risk.
Verification was tracked as package-level test evidence, not as a general feeling that the code worked.
The dollar results are useful only when the evidence type is labeled clearly.
Amir tested Polymarket BTC/ETH crypto markets from May 11 to June 1, 2026.
Gene adapted the crypto strategy to Kalshi and included explicit fee modeling.
Gene's paper run was profitable, so I ran a small live canary to test real orders, fills, settlements, and fees. In the short live window, Gene LIVE lost $4.36.
The weather strategies were defined as much by refusal as by fills.
The interpretation depends on whether the evidence is paper trading, live canary data, or engineering telemetry.
Paper trading can miss fill priority, fees, slippage, latency, and account constraints.
Amir BTC was strong, while Amir ETH was negative.
Risk filters refused many trades instead of chasing every signal.
The platform could test, record, inspect, and explain behavior.
The goal was not to claim that the bots can automatically make money.
Paper trading can overestimate results because execution is simplified.
Gene LIVE was useful as a canary, not a statistical proof.
Fees, liquidity, APIs, forecasts, and market behavior can shift.
Automation was bounded by configuration, tests, and review.
AI did not build the project by itself. It helped most when it had repository access, tools, tests, logs, and clear success checks.
The workflow did not run on one computer. It was a small personal engineering setup.
Most of the work used tools I already had. The new project costs were tied to Apple Wallet and the live trading test.
My Hermes Agent instance and I created the slide deck, then sent the code to Codex so it could improve the actual user experience.
The Nous Research agent helped shape project notes into a presentation structure.
Codex worked on the Astro route, layout, transitions, responsive fixes, and visual polish.
I chose what was true, what mattered, and what should be understandable to normal people.
Thank you to my mother for supporting me through project week. If you are curious about my work, take a look at my personal website.