Colin Kim

Evaluating Autonomous Execution: Exploring Agent-Driven Software Development, Algorithmic Trading, and Proprietary Environments

June 2026

Can an AI-assisted workflow build real systems?

Three project-week case studies: identity software, LMS research, and trading automation.

What was built or researched?
What evidence shows it worked?
What constraints kept it from overclaiming?

Codex and Hermes were the two AI agents in the workflow.

Codex is OpenAI's coding agent. Hermes Agent is an open-source agent created by Nous Research; I ran my own setup on an Ubuntu server at home.

Codex: Made by OpenAI. A software-engineering agent that can read code, edit files, run commands, and show evidence.

Hermes Agent: Made by Nous Research. A self-improving open-source agent with memory, tools, skills, and multi-platform chat access.

My setup: Self-hosted. I configured and operated Hermes through Discord, Tailscale, Termius, and my home server.

Real difference: Agents can act. They are useful when they can use a computer or codebase, not just write a paragraph.

Codex became my coding workspace.

I used Codex less like a chatbot and more like a coding teammate that could work inside a real project.

Codex is an AI coding agent, not just autocomplete.

A coding agent can work across a project: reading code, editing files, running checks, and reporting what changed.

Codex: OpenAI coding agent. Works across app, terminal, IDE, web, and mobile workflows.

DeepSWE benchmark: 70% vs 58%. GPT-5.5 versus Claude Opus 4.8 on long coding tasks.

Important caveat: Benchmark, not magic. DeepSWE compares models on long coding tasks; real projects still need review.

Why I cared: It finished loops. The useful part was not one answer; it was repeated edit-test-debug cycles.

Codex also worked when I was away from the laptop.

The phone session connected back to my MacBook Air and continued real code work from the same project context.

Hermes let me run engineering tasks from Discord.

Hermes Agent is built by Nous Research. My setup made it reachable from Discord, where it could use tools behind the chat interface.

Hermes answered a real research question.

I saw a Trump/IBM clip on X and wanted to know when the video was actually from, so I asked Hermes from Discord.

Updating Hermes from six miles away.

My Hermes Agent instance runs on an Ubuntu server at home. With Tailscale and Termius, my phone could securely SSH into it and run updates.

Three systems, three evidence types.

The projects are different, so the proof is different too.

The work cycle stayed consistent.

Every case study used the same basic loop.

Panther ID starts with one action.

Live site preview: the public entry point is intentionally simple, but it leads into identity checks and pass generation.

The output is a real Wallet pass.

The screenshot is the artifact: name, grade, school year, photo, school branding, and scannable barcode.

The pass needed Apple approval before it could work.

Apple Wallet passes are not just images. To make and distribute a real pass, I needed Apple-issued certificates from the Apple Developer Program.

The pass only appears after every check passes.

A user experience that looks simple on the outside depends on several quiet checks inside the system.

Identity software has to be boring in the right places.

The hardest parts were not decorative. They were the parts that make a school identity tool safe enough to take seriously.

Private data

Directory exports and profile photos had to be cleaned, limited, and kept out of public storage.

Apple signing

Wallet passes need the right certificates, team IDs, pass identifiers, and deployment secrets.

Device reality

Apple Wallet is iPhone-specific, so laptop users needed a handoff instead of a broken download.

Abuse limits

Rate limits and audit records made repeated pass attempts visible and controllable.
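
The abuse-limit idea can be sketched as a sliding-window limiter that records every attempt, including denied ones. This is a minimal illustration under stated assumptions, not the Panther ID implementation: the class name `PassRateLimiter`, the limit of three attempts per hour, and the in-memory audit list are all hypothetical, and a deployed version would likely keep this state in a database.

```python
import time
from collections import defaultdict, deque


class PassRateLimiter:
    """Sliding-window limiter that also records every attempt for auditing."""

    def __init__(self, max_attempts=3, window_s=3600.0, clock=time.monotonic):
        self.max_attempts = max_attempts  # hypothetical limit
        self.window_s = window_s          # hypothetical window length
        self._clock = clock               # injectable so tests can control time
        self._attempts = defaultdict(deque)  # user_id -> times of allowed attempts
        self.audit_log = []               # (timestamp, user_id, allowed) tuples

    def allow(self, user_id):
        now = self._clock()
        recent = self._attempts[user_id]
        # Drop attempts that have aged out of the window.
        while recent and now - recent[0] >= self.window_s:
            recent.popleft()
        allowed = len(recent) < self.max_attempts
        if allowed:
            recent.append(now)
        # Denied attempts are logged too, so repeated tries stay visible.
        self.audit_log.append((now, user_id, allowed))
        return allowed
```

Logging denials alongside successes is the design choice that makes repeated attempts visible rather than silently dropped.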

Orbit was a feasibility study.

Research question: which Blackbaud LMS data can be observed safely and reused with clear coverage limits?

Internal LMS endpoints exist but depend on browser session state.
Captured fixtures made the research repeatable.
Coverage labels kept unsupported features out of the product claim.

The login flow was part of the problem.

Blackbaud did not behave like one simple school website. It moved across school LMS pages, Blackbaud sign-in, Google OAuth, Blackbaud ID, and back into the LMS session.

I mapped the data before claiming the product.

The research turned browser traffic into redacted captures, endpoint fixtures, normalized student concepts, and a coverage map.
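
The redaction step can be sketched as a recursive pass over a parsed JSON capture. This is an illustrative sketch only: the field names in `SENSITIVE_KEYS` are hypothetical and are not taken from actual Blackbaud responses, and the real capture pipeline may have worked differently.

```python
REDACTED = "[REDACTED]"

# Hypothetical field names; a real list would come from reviewing the captures.
SENSITIVE_KEYS = frozenset({"email", "firstname", "lastname", "photourl", "token"})


def redact(value, sensitive_keys=SENSITIVE_KEYS):
    """Return a copy of a parsed JSON value with sensitive fields masked."""
    if isinstance(value, dict):
        return {
            key: REDACTED if key.lower() in sensitive_keys
            else redact(item, sensitive_keys)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, sensitive_keys) for item in value]
    # Scalars (strings, numbers, booleans, None) pass through unchanged.
    return value
```

Because the function builds new containers instead of editing in place, the raw capture is never modified, and only the redacted copy needs to be saved as a fixture.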

Coverage changed the product scope.

The app surface was pruned to match what the endpoint evidence could support.

Works better

Assignments, calendar lists, resources, groups, directory, and messages had useful evidence.

Needs caution

Grades, attendance, and course details had uneven coverage and sensitive data concerns.

Not solved

A native app path was not simple because the observed session was browser-based.

Best next step

Compare discovered needs against official APIs, OneRoster, or a school-approved integration.

The bots were monitored execution systems.

They were not just rule calculators. They were trading programs watched by my Hermes setup, with logs, alerts, dashboards, tests, and safety checks around them.

Decision inputs: market price plus outside data.
Safety layer: paper mode, risk caps, drawdown limits, liquidity checks.
Monitoring layer: Hermes Agent setup, Discord alerts, SQLite state, dashboards, and tests.

What is a prediction market?

People buy and sell contracts about future events. The price can be read like a rough probability.

A YES contract pays $1 if the event happens, so a 60-cent price roughly means the market thinks the event has a 60% chance.
The bots compare that market price to outside evidence.
If the disagreement is large enough, there may be an edge.

The bot is not guessing. It is comparing.

The simplest version is: market probability versus the bot's estimated fair probability.

Market says 40%.
Outside data suggests 55%.
After fees and risk checks, the bot may trade or skip.
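
The comparison above can be written as a few lines of arithmetic. This is a minimal sketch, assuming a YES contract that pays $1 and a flat per-contract fee; the function names, the 1-cent fee, and the 5-cent minimum edge are illustrative assumptions, not the bots' real settings.

```python
def edge_after_fees(market_price, fair_prob, fee_per_contract=0.0):
    """Expected profit in dollars from buying one $1-payout YES contract.

    Win: gain (1 - price) with probability fair_prob.
    Lose: lose price with probability (1 - fair_prob).
    That simplifies to fair_prob - price, minus the fee.
    """
    return fair_prob - market_price - fee_per_contract


def decide(market_price, fair_prob, fee_per_contract=0.01, min_edge=0.05):
    """Return "trade" only when the edge clears a minimum threshold."""
    edge = edge_after_fees(market_price, fair_prob, fee_per_contract)
    return "trade" if edge >= min_edge else "skip"


# Market says 40%, outside data suggests 55%.
print(decide(0.40, 0.55))  # edge is about 14 cents per contract
print(decide(0.40, 0.42))  # edge is about 1 cent, below the threshold
```

In this sketch the first call trades and the second skips, because a 2-point disagreement does not survive the assumed fee and threshold.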

Sizing answers: how much should the bot risk?

The bot does not just ask, 'is this a good trade?' It asks, 'is this good enough, safe enough, and liquid enough to be worth money?'

Estimate edge after fees
Start with a conservative math-based size (half-Kelly)
Apply position cap
Apply liquidity cap
Check stale data and drawdown
Trade or skip

Bigger edge

A stronger disagreement can justify a bigger starting size.

Safety caps

Exposure, max order size, confidence, and drawdown can shrink it.

Liquidity cap

If the market does not show enough real size, the order gets smaller.

Zero size

If the safe size is too small, the bot skips instead of forcing a trade.
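
The sizing steps can be sketched end to end. This is a simplified illustration under stated assumptions, not the bots' actual code: the `Limits` values, the function names, and the flat fee are hypothetical. The half-Kelly fraction for buying a $1-payout YES contract at cost c with fair probability q is 0.5 * (q - c) / (1 - c).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # All values are illustrative assumptions, not the project's settings.
    max_position_usd: float = 25.0
    min_order_usd: float = 1.0
    max_drawdown: float = 0.10
    max_data_age_s: float = 5.0


def half_kelly_fraction(price, fair_prob, fee=0.0):
    """Half the Kelly fraction of bankroll for a $1-payout YES contract."""
    cost = price + fee
    if not 0.0 < cost < 1.0:
        return 0.0
    full_kelly = (fair_prob - cost) / (1.0 - cost)
    return max(0.0, 0.5 * full_kelly)  # never size a negative edge


def size_order(bankroll, price, fair_prob, visible_liquidity_usd,
               data_age_s, drawdown, fee=0.0, limits=Limits()):
    """Return the dollars to risk, or 0.0 to skip the trade."""
    size = bankroll * half_kelly_fraction(price, fair_prob, fee)
    size = min(size, limits.max_position_usd)  # position cap
    size = min(size, visible_liquidity_usd)    # liquidity cap
    # Stale data or an exceeded drawdown limit vetoes the trade outright.
    if data_age_s > limits.max_data_age_s or drawdown >= limits.max_drawdown:
        return 0.0
    # Zero size: skip instead of forcing a trade that is too small.
    return size if size >= limits.min_order_usd else 0.0
```

With a $100 bankroll, a 40-cent price, and a 55% fair probability, half-Kelly suggests about $12.50; if only $8 of liquidity is visible, the order shrinks to $8, and stale data reduces it to zero.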

The trading testbed: four specialists.

This chapter adapted the same general architecture across two domains and two exchanges.

The bot project became a small trading platform.

The project grew from separate experiments into installable Python apps with shared infrastructure.

What the bots actually do

Every bot follows the same loop, even though the markets are different.

Crypto bots: Amir and Gene

The crypto bots use Binance BTC/ETH prices as outside evidence, then compare that movement to prediction-market prices.

Weather bots: Tara and Vikram

Weather markets move slower, but the data itself is uncertain and harder to parse.

The formula was the smallest part.

The simple idea was to buy underpriced contracts. The real work was everything required to make that idea safe and measurable.

How the system was structured

The architecture stayed understandable because each layer had a job.

The first answer should be no.

A responsible trading bot must know when not to trade.

Skipping is part of the strategy.

The bots did not trade every apparent edge. They checked whether the situation was safe enough to act.

What made it real software?

The strongest evidence was not one lucky trade. It was the engineering surface around the bots.

The dashboard made the bots inspectable.

Instead of relying on my memory of the logs, the dashboard turned the bots' databases into a view a human could understand.

The useful failures were engineering failures.

A lot of the learning came from small problems that would matter in a live system.

Changes were tied to failure modes.

The strongest fixes reduced a specific operational risk.

Test counts by package.

Verification was tracked as package-level test evidence, not as a general feeling that the code worked.

Paper P&L versus live canary.

The dollar results are useful only when the evidence type is labeled clearly.

Paper trading

Amir produced the largest paper dataset.

Amir tested Polymarket BTC/ETH crypto markets from May 11 to June 1, 2026.

Paper trading

Gene had the strongest paper result.

Gene adapted the crypto strategy to Kalshi and included explicit fee modeling.

Live canary

Gene LIVE tested paper results with real money.

Gene's paper run was profitable, so I ran a small live canary to test real orders, fills, settlements, and fees. In the short live window, Gene LIVE lost $4.36.

Why the live test went negative.

The short answer is: this was a tiny real-money canary, not a long enough experiment to judge the strategy.

Only about an hour

Gene LIVE had roughly an hour to trade, so the sample was too small to know how it would perform over time.

Real fills are harder

Paper trading can be optimistic because it does not fully capture order priority, partial fills, fees, slippage, and account constraints.

Latency can matter

Crypto-market edges can be short-lived. If data, decisions, or orders arrive late, an apparent edge may already be gone.

Still useful

The canary showed that live orders, fills, settlements, reconciliation, watchdog monitoring, and dashboard reporting could all be inspected.

Weather bots mostly skipped.

The weather strategies were defined as much by refusal as by fills.

What the results really mean

The interpretation depends on whether the evidence is paper trading, live canary data, or engineering telemetry.

A credible project explains its limits.

The goal was not to claim that the bots can automatically make money.

So how useful was AI really?

AI did not build the project by itself. It helped most when it had repository access, tools, tests, logs, and clear success checks.

Most useful: When it could act. Reading files, editing code, running tests, checking logs, and explaining failures.

Least useful: When prompts were vague. The weaker moments came from unclear goals, missing evidence, or overconfident wording.

Human role: Direction and judgment. I still chose the goals, reviewed outputs, set safety boundaries, and decided what counted as proof.

Scale: ~800M tokens. Roughly 600M words of AI reading and writing across the project workflow, assuming about 0.75 words per token.

The project ran across four devices.

The workflow did not run on one computer. It was a small personal engineering setup.

The counted project costs were small and specific.

Most of the work used tools I already had. The new project costs were tied to Apple Wallet and the live trading test.

If you thought this slide deck was created by AI, you are not far off.

My Hermes Agent instance and I created the slide deck, then sent the code to Codex so it could improve the actual user experience.

Thanks for listening.

Thank you to my mother for supporting me through project week. If you are curious about my work, take a look at my personal website.

Scan the QR code for colinkim.dev.