GenAI Radar -- Tuesday, April 28, 2026

📡 Industry Signals

What's happening?

ships Artificial Analysis / OpenAI 4 min

GPT-5.5 shifts the enterprise pitch from benchmark scores to autonomous work completion

The case enterprises make against expanding AI budgets points to a consistent gap: benchmark scores do not predict reliable task completion. OpenAI released GPT-5.5 on April 23 as an agentic runtime, not a conversational model; it reclaimed the top position on the Artificial Analysis Intelligence Index with a score of 60.

Three artefacts need revision: vendor evaluation criteria should add task-completion rates alongside benchmark scores; the Master Services Agreement with OpenAI should be reviewed for conversational-scope clauses, including the data-processing addendum and liability cap on automated decisions; and the internal eval harness needs an agentic task-completion suite. Ask your AI platform lead: which contracted use cases are now re-priceable as agentic subscriptions, and does the current contract expose the organisation to unexpected cost on autonomous loops?

Why it matters: The shift from token-based to task-based pricing will flow into contract renegotiations across 2026. The Technology Committee should see updated total cost of ownership numbers by end of Q2, using agentic task volumes rather than token budgets as the baseline. Pull the current data-processing addendum and liability cap before any renewal conversation with OpenAI begins.

risk Fisher Phillips / NY Governor 4 min

New York's revised RAISE Act creates safety obligations for frontier model developers

State-level AI safety mandates are creating indirect compliance obligations for enterprise buyers, not just vendors. New York Governor Kathy Hochul signed a revised RAISE Act in April 2026, establishing safety obligations for frontier model developers effective January 1, 2027.

Three artefacts need updating before year-end: the vendor risk review checklist must add a RAISE Act compliance tier for New York-exposed frontier model providers; the Master Services Agreement with those vendors should require a regulatory compliance warranty and a right-to-audit on safety documentation; and the AI governance policy must identify which deployed models fall within the new reporting regime. Ask your procurement lead and General Counsel: do existing contracts require AI vendors to notify the organisation when they become subject to new state safety mandates?

Why it matters: The Technology Committee briefing note due before December 2026 should map each contracted AI vendor against the RAISE Act coverage criteria. Adding a RAISE Act compliance row to the vendor risk review questionnaire now costs a day of legal time; the same gap discovered during an audit costs considerably more. Brief the General Counsel on which vendors will carry the new reporting obligation before the next contract renewal cycle begins.

risk Cooley / Colorado Legislature 3 min

Colorado's AI Act targets deployers for discrimination liability and lands in 63 days

Most enterprise AI compliance programs focus on model-developer obligations. Colorado's AI Act puts equivalent accountability on the deployer, with enforcement starting June 30, 2026. The Colorado AI Act requires companies deploying high-risk AI systems to take reasonable care against algorithmic discrimination and to conduct documented impact assessments.

Three artefacts need attention before June 30: the compliance review must screen every production AI-driven decision in hiring, lending, insurance, or benefits against the Colorado high-risk definition; the AI governance policy needs a Colorado-specific algorithmic-discrimination section; and each Master Services Agreement with a downstream deploying entity should carry a compliance warranty. Ask your Chief Data Officer and General Counsel: which production AI systems qualify as high-risk under Colorado's definition, and do current impact assessments meet the evidentiary standard a state regulator would accept?

Why it matters: Add the Colorado AI Act definition of high-risk AI to the next governance board agenda before May 15. The gap between "we have an AI inventory" and "we have signed-off impact assessments for every high-risk deployment" is where enforcement will focus first. Pull the impact assessment template from the legal team and run it against the three AI-driven decision processes with the widest population exposure in Colorado.
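
The countdown in the headline can be checked with simple date arithmetic. This is a quick sketch that assumes the date at the top of this issue (April 28, 2026) as the starting point; the variable names are illustrative only.

```python
from datetime import date

issue_date = date(2026, 4, 28)         # date of this issue (assumed starting point)
enforcement_start = date(2026, 6, 30)  # Colorado AI Act enforcement date cited above
board_target = date(2026, 5, 15)       # governance board agenda target cited above

# Subtracting two dates gives a timedelta; .days is the whole-day difference.
days_to_enforcement = (enforcement_start - issue_date).days
days_to_board_target = (board_target - issue_date).days

print(f"Days until enforcement: {days_to_enforcement}")
print(f"Days until board agenda target: {days_to_board_target}")
```

The same two-line subtraction works for any other deadline in a compliance calendar; only the dates change.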

🧠 Models & Tools

What's new?

Google / GitHub 3 min

Google's Agent Development Kit puts vendor-neutral multi-agent composition into open source

Google's Agent Development Kit (ADK) reached 8,200 GitHub stars in April 2026, making it one of the fastest-growing agent orchestration projects in the current wave. The kit lets teams compose multi-agent pipelines (assigning roles, managing tool access, routing work between agents, and handling inter-agent communication) without requiring a Google Cloud account or a Vertex commitment. It supports any model backend that speaks standard interfaces, including local Ollama deployments. The practical implication is that teams can prototype a multi-agent architecture against a local model, validate it, and then swap in a frontier model for production without rewriting the orchestration layer. Google released the kit under the Apache 2.0 licence, which permits royalty-free commercial use provided the licence text and existing attribution notices are retained in redistributed copies.

What it enables: For engineering leads evaluating agent orchestration frameworks, ADK is worth a structured comparison against LangGraph and CrewAI on the specific dimensions that matter for enterprise use: how it handles tool authorisation, whether inter-agent communication is logged for audit, and how the framework behaves when one agent in a chain fails. Run a three-task prototype, measure the error-handling behaviour, and check whether the logging output satisfies your internal audit trail requirements before committing to any framework.
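
The prototype-then-swap idea, and the failure-handling question, can be sketched without any framework. The following is a minimal illustration and not ADK's API: `ModelBackend`, `EchoBackend`, `FailingBackend`, `Step`, and `run_pipeline` are hypothetical names, and the stand-in backends return canned text instead of calling a model. The orchestration code depends only on an interface, so a backend can be replaced without touching it, and a failing agent is recorded in the log instead of being skipped silently.

```python
from dataclasses import dataclass
from typing import Protocol


class ModelBackend(Protocol):
    """Hypothetical interface: anything that turns a prompt into text."""

    def complete(self, prompt: str) -> str: ...


class EchoBackend:
    """Stand-in for a local model; returns a canned, labelled response."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class FailingBackend:
    """Stand-in that always fails, to exercise error handling."""

    def complete(self, prompt: str) -> str:
        raise RuntimeError("backend unavailable")


@dataclass
class Step:
    role: str
    backend: ModelBackend


def run_pipeline(steps: list[Step], task: str) -> tuple[str, list[dict]]:
    """Pass the task through each step in order, logging every hand-off.

    Stops at the first failure and records it in the log.
    """
    log: list[dict] = []
    current = task
    for step in steps:
        try:
            current = step.backend.complete(f"{step.role}: {current}")
            log.append({"role": step.role, "status": "ok"})
        except Exception as exc:
            log.append({"role": step.role, "status": "failed", "error": str(exc)})
            break
    return current, log


local = EchoBackend("local")
result, log = run_pipeline([Step("planner", local), Step("writer", local)], "draft memo")
print(result)
print(log)
```

Swapping `EchoBackend("local")` for another object with a `complete` method changes the model without changing `run_pipeline`. Whether a given framework offers an equivalent seam, and what it logs on failure, is what the three-task prototype should establish.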

OpenAI / GitHub 3 min

OpenAI Agents SDK 0.4 lets agents consume any MCP server as a native tool

The OpenAI Agents Software Development Kit (SDK) version 0.4, released April 5, added native Model Context Protocol (MCP) tool-use support and streaming handoffs between agents. The MCP integration means any agent built with the SDK can now consume existing MCP servers as first-class tools without a custom adapter layer. Streaming handoffs allow a long-running task to transfer mid-execution between specialised agents without losing accumulated state. For teams that have already invested in MCP server infrastructure, this update means that same tool surface is now available inside OpenAI-SDK-based agents at no additional integration cost. Combined with the 2,000-plus MCP servers now on GitHub, the effective tool library available to an OpenAI-SDK agent grew significantly in a single release.

Try this: If your team has an internal MCP server for any data source or business system, the fastest way to evaluate version 0.4 is to point an SDK-based agent at that server and run three real queries against production data. The integration test to care about is not whether the tool call succeeds but whether the agent's actions are logged in a form your internal audit trail can consume. Check the SDK's logging hooks before wiring it to any system that holds regulated data.
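
To make "a form your audit trail can consume" concrete, here is a framework-agnostic sketch of what a per-call audit record might look like. It does not use the SDK's logging hooks, whose shape should be checked in the SDK documentation; `audited`, `audit_log`, and `lookup_customer` are hypothetical names, and the record fields are an assumption about what an auditor would want.

```python
import json
import time
from functools import wraps
from typing import Any, Callable


def audited(tool_name: str, sink: list[str]) -> Callable:
    """Wrap a tool function so every call appends one JSON record to `sink`."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            record = {
                "ts": time.time(),
                "tool": tool_name,
                "args": repr(args),
                "kwargs": repr(kwargs),
            }
            try:
                result = fn(*args, **kwargs)
                record["status"] = "ok"
                return result
            except Exception as exc:
                record["status"] = "error"
                record["error"] = str(exc)
                raise
            finally:
                # Runs on both success and failure, so no call goes unrecorded.
                sink.append(json.dumps(record))

        return wrapper

    return decorator


audit_log: list[str] = []


@audited("lookup_customer", audit_log)
def lookup_customer(customer_id: str) -> dict:
    # Hypothetical tool body; a real one would query the MCP-backed system.
    if not customer_id:
        raise ValueError("customer_id is required")
    return {"id": customer_id, "tier": "standard"}


lookup_customer("c-42")
print(audit_log[0])
```

The design point is that the record is written in the `finally` branch, so failed calls are logged as well as successful ones. An evaluation of any SDK's built-in hooks can ask the same question: is there one structured record per tool call, including the failures?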

🚀 Applications

What's working?

Enterprise FifthRow / EY / Salesforce 4 min

EY, Salesforce, Oracle, and Microsoft April launches make agentic AI the 2026 enterprise baseline

Four major enterprise platform vendors shipped agentic AI features in April 2026 that materially raise the operational baseline for large organisations. EY launched an AI-driven tax and regulatory compliance workflow suite; Salesforce extended Agentforce to handle multi-step customer escalation resolution without human handoff; Oracle embedded agentic procurement assistants directly into its Fusion ERP (Enterprise Resource Planning) modules; and Microsoft released Copilot Studio orchestration capabilities that allow IT teams to compose multi-agent workflows across the Microsoft 365 suite without writing code. Taken together, the four releases mean that a large enterprise not running at least one production agentic AI workflow by the end of Q2 2026 is now behind its peer group on a capability that major system vendors have made standard. The question has shifted from whether to adopt to which workflows to prioritise and how to govern the ones already running.

What it proves: Chief Operating Officers reviewing the Q2 AI investment case should note that the competitive comparison now includes platform capabilities shipped at no incremental licence cost. Before approving a bespoke agentic build, verify whether the same outcome is achievable through an existing ERP, CRM (Customer Relationship Management), or productivity platform already under contract. The due diligence question for the Technology Committee: what are we paying to build that a platform vendor has already shipped?

Personal arXiv 2604.22875 / sketchvlm.github.io 3 min

SketchVLM lets AI models draw on images to explain their own reasoning

SketchVLM is a training-free framework that enables vision-language models (VLMs) such as Gemini 3 Pro and GPT-5 to produce non-destructive SVG overlays on input images to explain their answers visually: pointing, labelling, circling, or connecting objects. Across seven benchmarks spanning maze navigation, object counting, and part labelling, SketchVLM improves visual reasoning accuracy by up to 28.5 percentage points and annotation quality by up to 1.48x compared to image-editing baselines. The framework is model-agnostic and requires no fine-tuning; it works by prompting the model to express its reasoning as SVG drawing commands on top of the original image, which the system then renders non-destructively. For anyone working with image analysis in research, design, or document review, the interactive demo at sketchvlm.github.io provides a zero-cost way to see whether visual explanation materially changes how quickly you can verify or correct a model's interpretation.

Try this: Open the interactive demo at sketchvlm.github.io with an image from a domain where you regularly use AI analysis: a floor plan, a product diagram, a data chart, or a document page. Ask a spatial or counting question and compare the annotated output against a text-only answer. The relevant question is not accuracy alone but whether the annotation makes it faster to spot errors before acting on the model's response.
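
For readers unfamiliar with the idea, a non-destructive SVG overlay can be sketched as follows. This is not SketchVLM's code: `build_overlay`, the mark dictionaries, and the file name `floor_plan.png` are hypothetical. In the real framework the drawing commands come from the model, whereas here they are written by hand. The sketch only shows why the original image stays untouched: the SVG references it and draws on a separate layer.

```python
import xml.etree.ElementTree as ET


def build_overlay(image_href: str, width: int, height: int, marks: list[dict]) -> str:
    """Return an SVG that references the original image and draws marks on top.

    The source image file is never modified; removing the overlay group
    restores the original view.
    """
    size = {"width": str(width), "height": str(height)}
    svg = ET.Element("svg", {"xmlns": "http://www.w3.org/2000/svg", **size})
    ET.SubElement(svg, "image", {"href": image_href, **size})
    overlay = ET.SubElement(svg, "g", {"id": "overlay", "fill": "none", "stroke": "red"})
    for mark in marks:
        if mark["kind"] == "circle":
            ET.SubElement(
                overlay,
                "circle",
                {"cx": str(mark["x"]), "cy": str(mark["y"]), "r": str(mark["r"])},
            )
        elif mark["kind"] == "label":
            text = ET.SubElement(
                overlay,
                "text",
                {"x": str(mark["x"]), "y": str(mark["y"]), "fill": "red", "stroke": "none"},
            )
            text.text = mark["text"]
    return ET.tostring(svg, encoding="unicode")


# Hand-written marks standing in for model output.
marks = [
    {"kind": "circle", "x": 120, "y": 80, "r": 25},
    {"kind": "label", "x": 150, "y": 75, "text": "object 1"},
]
svg_text = build_overlay("floor_plan.png", 640, 480, marks)
print(svg_text)
```

Because the annotation lives in its own `g` element, a reviewer can toggle it on and off to compare the model's claim against the unmarked image.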

Developer OpenAI / GitHub 3 min

OpenAI Codex CLI runs a full coding agent inside the terminal with no IDE required

OpenAI's Codex command-line interface (CLI), released in April 2026, is a terminal-native coding agent with 5,800 GitHub stars in its first weeks. It runs directly in a shell, reads the local codebase, writes and executes code, runs tests, and commits changes against a task description typed in plain language. The tool operates without a visual editor, making it practical for server-side or remote-development workflows where an IDE is not available. It connects to OpenAI's API using the user's existing key and respects the project's existing linting, testing, and build configuration by reading the files in the working directory. For engineering teams that have standardised on terminal-based workflows or that need a coding agent inside automated pipelines, Codex CLI slots in without requiring a change to the development environment setup.

Try this: Run Codex CLI against a single isolated task in a project with good test coverage: a bug fix, a refactor of one module, or a documentation update. Measure the cycle from task description to passing tests and compare it against the same task performed manually. The evaluation that matters for production adoption is whether the agent's changes pass your existing test suite without human review on the first attempt, and how often it requests clarification versus making an incorrect assumption.

💡 Term of the Day

What does it actually mean?

Compliance Debt

Governance · Risk

Compliance debt is the governance obligation that accumulates when an AI system is deployed into production before the documentation, audit trails, impact assessments, and control mechanisms needed to demonstrate regulatory compliance are in place. The term borrows the structure of technical debt: like code that works but cannot be maintained, a deployed AI system that lacks governance scaffolding functions until a regulator, auditor, or incident makes the debt due. Compliance debt compounds because each quarter a system runs without adequate governance, the cost of remediation rises: logs that were not captured cannot be reconstructed; decisions that were not documented cannot be explained retrospectively; and a production system woven into operational workflows cannot be cleanly shut down while controls are retrofitted. The term is particularly relevant in 2026 as EU (European Union) AI Act enforcement, the Colorado AI Act, and the revised New York Responsible AI Safety and Education (RAISE) Act all carry enforcement dates for deployed systems, not just for new ones. Organisations that moved fast on deployment and slow on governance now hold compliance debt that is measurable in months, not quarters.

Often mistaken for:

Compliance debt is often treated as a documentation backlog: a set of missing policies and records that a project team can write up in a sprint. That framing understates the structural problem. The most expensive compliance debt is not missing documents but missing data: the system never logged the right information in the first place, so the documentation cannot be written accurately even if someone sits down to write it today. A second misreading is assuming that compliance debt only matters for regulated industries such as financial services or healthcare. The Colorado AI Act applies to hiring, lending, insurance, and benefits decisions across every industry. A tech company that used AI to screen job applicants without a documented impact assessment has compliance debt regardless of its sector. The test is not "are we regulated?" but "would we be able to show a regulator, on demand, exactly how this AI system makes decisions and what controls exist to prevent discriminatory outcomes?"

⚠️ Safety & Policy

What's being governed?

Safety arXiv 2604.23775 4 min

New safety survey maps attack and defence timings for AI-controlled physical systems

As AI-controlled robots and physical systems enter enterprise environments, the attack surface expands beyond software to include actions with irreversible physical consequences. A new academic survey consolidates the safety literature on vision-language-action (VLA) models across four dimensions: attacks (data poisoning, adversarial patches, semantic instructions that bypass safety filters), defences (both training-time and runtime), evaluation benchmarks, and deployment-domain considerations across six sectors including logistics, manufacturing, and healthcare. The survey's organising contribution is a timing taxonomy: mapping each class of threat to the stage at which it can realistically be mitigated, which clarifies where runtime monitoring stops being sufficient and training-time controls become necessary. For enterprise security leads deploying AI in any environment with physical actuators, the survey provides the first structured framework for translating software security controls to the VLA context.

What it signals: Chief Information Security Officers (CISOs) approving AI deployments that control physical equipment (warehouse robots, automated quality-control systems, AI-driven building management) should request a VLA-specific threat model before sign-off. The relevant question is not "does this system have a firewall?" but "what class of adversarial input could cause a physical action the organisation cannot reverse, and what control exists at the training and runtime layers to detect it?" Add that question to the architecture review checklist before the next VLA-adjacent pilot goes to production.

Policy California Governor / Cooley 3 min

California's Newsom executive order routes AI safety requirements through state procurement

On March 30, 2026, California Governor Gavin Newsom signed Executive Order N-5-26, directing all state agencies to draft AI safety requirements, including standards on illegal content, bias, civil rights, and free speech, for companies doing business with the state. The order works through procurement rather than legislation: any vendor seeking a California state contract must meet the safety criteria that agencies define under the order. For enterprise AI vendors with material California government revenue, this creates a set of compliance requirements that will be operationally equivalent to regulation even without a statute. The procurement vector matters beyond California because state purchasing rules frequently propagate into vendor qualification requirements at county and municipal levels, and because other states have historically adopted California procurement standards as reference frameworks when writing their own. The order does not yet carry an enforcement date; agency draft requirements are expected by Q3 2026.

The compliance angle: Legal and government affairs teams at enterprise AI vendors with California public-sector exposure should begin tracking the agency rulemaking process now, before draft requirements are published. The standard procurement play is to engage during the comment period, not after. Pull the current California state contract terms and note which AI-related clauses already exist; the EO requirements will layer on top of whatever is there. Brief the Chief Legal Officer before the Q3 draft drops.

📄 Research Papers

What's being researched?

arXiv 2604.23781 4 min

ClawMark benchmark: frontier agents score well on partial progress but complete only 20% of multi-day tasks end to end

ClawMark is a new benchmark for evaluating AI coworker agents in persistent, multi-day professional workflows. It tests 100 tasks across 13 professional scenarios, using five stateful sandboxed services: a filesystem, email, calendar, knowledge base, and spreadsheet. Scoring is done by 1,537 deterministic Python checkers against post-execution service state, with no language model judge involved. The best-performing frontier model achieves a weighted score of 75.8 but a strict end-to-end task success rate of only 20.0%, indicating that partial progress on individual steps is common while full workflow completion remains rare. Critically, agent performance drops sharply after the first exogenous environment update within a task (a new email arriving, a calendar entry shifting), which ClawMark identifies as the primary open challenge: adapting to a changing environment mid-task rather than completing a static checklist.

If this holds: For teams currently piloting AI agents on multi-step professional workflows, strict end-to-end task success, on which ClawMark's best model reaches only 20%, is the right metric to track internally, not weighted partial-credit scores. Before expanding any agent pilot, measure strict end-to-end completion on a sample of real tasks from your deployment environment, including tasks where an external event changes the context mid-run. That number, not the vendor's benchmark result, is the production readiness signal.
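
The gap between the two metrics is easy to reproduce on a handful of tasks. The sketch below uses made-up checker outcomes and an assumed weighting scheme; ClawMark's actual weights and task names are not reproduced here, and `weighted_score` and `strict_success` are hypothetical helper names.

```python
def weighted_score(checks: list[tuple[float, bool]]) -> float:
    """Partial credit: share of checker weight that passed, on a 0-100 scale."""
    total = sum(weight for weight, _ in checks)
    passed = sum(weight for weight, ok in checks if ok)
    return 100.0 * passed / total


def strict_success(checks: list[tuple[float, bool]]) -> bool:
    """End-to-end success: every checker for the task must pass."""
    return all(ok for _, ok in checks)


# Hypothetical results: each task is a list of (weight, passed) checker outcomes.
tasks = {
    "expense-report": [(1.0, True), (1.0, True), (2.0, True)],
    "meeting-reschedule": [(1.0, True), (1.0, True), (2.0, False)],
    "vendor-followup": [(1.0, True), (3.0, True), (1.0, False)],
    "quarterly-summary": [(2.0, True), (1.0, False), (1.0, True)],
}

mean_weighted = sum(weighted_score(c) for c in tasks.values()) / len(tasks)
strict_rate = 100.0 * sum(strict_success(c) for c in tasks.values()) / len(tasks)

print(f"Mean weighted score: {mean_weighted:.2f}")
print(f"Strict success rate: {strict_rate:.1f}%")
```

With these illustrative outcomes the mean weighted score is 76.25 while only one task in four (25%) completes end to end. The pattern is the same as the benchmark's headline result: an agent can pass most individual checks and still rarely finish the whole job.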

arXiv 2604.24198 / Zhejiang University 3 min

DataPRM catches the silent errors that standard data-analysis agents miss and improves task accuracy by up to 11%

DataPRM is an environment-aware process reward model (PRM) built specifically for data-analysis agents, addressing a failure mode that standard outcome-based evaluation misses: silent errors, logical flaws that produce incorrect results without triggering interpreter exceptions. General-domain PRMs fail on data analysis tasks because they penalise necessary exploratory steps, treating trial-and-error data exploration as a grounding failure. DataPRM instead interacts with the data environment at each intermediate step to probe whether execution state is correct before proceeding. Trained on 8,000 high-quality annotated trajectories, DataPRM improves downstream agent task accuracy by 7.21% on ScienceAgentBench and 11.28% on DABStep, with a 4B-parameter model outperforming much larger baselines. For enterprise teams deploying AI agents on internal data analysis, the silent error problem is directly relevant: an agent that returns a plausible-looking number without triggering an error is more dangerous than one that fails visibly.

If this holds: Before expanding AI-driven data analysis to any decision that feeds a board report, a regulatory filing, or a financial model, audit whether the agent's intermediate steps are being validated or only the final output. The silent error class DataPRM targets, a computation that looks correct but is not, is the exact failure mode that creates restatement risk in AI-assisted financial reporting. Ask the engineering lead: what is the coverage of our intermediate-step checks relative to the checks on final output?
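
A small example shows what a silent error looks like and why checking only the final output misses it. This is a hand-written rule, not DataPRM, which is a learned reward model; the data, the `-1` sentinel convention, and the function names are all hypothetical.

```python
from statistics import mean

# Hypothetical data: -1 is a sentinel meaning "value not recorded".
rows = [
    {"region": "north", "revenue": 120.0},
    {"region": "south", "revenue": -1.0},
    {"region": "east", "revenue": 80.0},
    {"region": "west", "revenue": 100.0},
]


def naive_average(rows: list[dict]) -> float:
    """Runs without error but silently includes the sentinel."""
    return mean(r["revenue"] for r in rows)


def check_non_negative(rows: list[dict], column: str) -> list[dict]:
    """Intermediate-step check: return the rows that violate the expectation."""
    return [r for r in rows if r[column] < 0]


def checked_average(rows: list[dict]) -> float:
    """Validate the intermediate state before computing the final number."""
    bad = check_non_negative(rows, "revenue")
    if bad:
        raise ValueError(f"{len(bad)} row(s) have negative revenue; resolve before averaging")
    return mean(r["revenue"] for r in rows)


silent_result = naive_average(rows)
clean_result = checked_average([r for r in rows if r["revenue"] >= 0])

print(f"Naive average (sentinel included): {silent_result}")
print(f"Average after intermediate check: {clean_result}")
```

The naive path returns 74.75, a plausible figure that raises no exception, while the validated path returns 100.0. An output-only check would accept either number. Only a check on the intermediate state exposes the difference.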
