Structure of effective regex patterns
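An effective anti-SQLi regex balances sensitivity (catching attacks) and specificity (leaving legitimate input alone). The (?i) flag (case-insensitive) is critical because attackers use case variations like UnIoN or SeLeCt to evade basic filters. A bounded wildcard such as .{1,100} between keywords tolerates whatever separates them, whether spaces, tabs or comments, so UNION/**/SELECT is caught as well as UNION SELECT. The upper bound limits backtracking, but the wider the gap, the more legitimate sentences the pattern matches.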
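Common mistake: overly specific patterns like union\s+select. This fails against union/**/select, and against union+select when the input is inspected before form decoding turns + into a space. Better: decode the input and replace comments with a space first, then match (union\s+(all\s+)?select), which also covers UNION ALL SELECT. For blind attacks, the pattern and\s+1\s*=\s*1 must allow optional spaces around the operator, and it only catches the literal 1=1: variants such as 2=2 or 'a'='a' need a more general comparison pattern.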
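One way to make the union pattern above survive comment and + separators is to normalize the input before matching. The following is a minimal sketch, not a complete filter; the function name looks_like_union_select is hypothetical, and it assumes the value arrives still form-encoded.

```python
import re
from urllib.parse import unquote_plus

# A /* ... */ comment. Replacing it with a space mirrors how the database
# treats it as a separator between tokens.
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

# Matched only after normalization, so plain \s+ is enough here.
UNION_SELECT = re.compile(r"\bunion\s+(?:all\s+)?select\b", re.IGNORECASE)


def looks_like_union_select(raw: str) -> bool:
    """Return True if the decoded, comment-free value contains UNION [ALL] SELECT."""
    decoded = unquote_plus(raw)             # '+' and %20 both become spaces
    normalized = COMMENT.sub(" ", decoded)  # comments collapse to one space
    return UNION_SELECT.search(normalized) is not None
```

Normalizing first keeps the pattern itself simple. Folding the comment alternative into a repeated group inside the regex is also possible, but nested quantifiers of that kind are easy to get wrong and can backtrack badly on long runs of comments.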
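Word boundaries (\b) prevent false positives: without them, union matches inside 'communion'. But beware: \b behaves differently around non-ASCII characters depending on the engine and its flags. Some engines treat only [A-Za-z0-9_] as word characters, so test with accented and non-Latin input. In email fields, the pattern should be less strict because legitimate addresses can contain characters such as + or . that also appear in SQL expressions.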
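A short sketch of the boundary behaviour described above, using Python's re module as the example engine (other engines may differ). The helper name matches is hypothetical. In Python 3, \b is Unicode-aware for str patterns unless re.ASCII is set, which is enough to show how the same pattern can give two answers on accented input.

```python
import re

NAIVE = re.compile(r"union", re.IGNORECASE)
BOUNDED = re.compile(r"\bunion\b", re.IGNORECASE)
# Same pattern, but only [A-Za-z0-9_] count as word characters.
ASCII_BOUNDED = re.compile(r"\bunion\b", re.IGNORECASE | re.ASCII)


def matches(pattern: "re.Pattern[str]", text: str) -> bool:
    return pattern.search(text) is not None


# Substring match: 'union' is found inside an innocent word.
naive_hit = matches(NAIVE, "communion")
# Word boundaries remove that false positive.
bounded_hit = matches(BOUNDED, "communion")

# '\u00e9' (e with acute accent) is a word character under Unicode rules
# but not under ASCII rules, so the two patterns disagree on this input.
accented = "\u00e9union"
unicode_hit = matches(BOUNDED, accented)
ascii_hit = matches(ASCII_BOUNDED, accented)
```

Whichever behaviour you want, the point is to pick it deliberately and cover it in tests rather than inherit the engine default.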
Implementing validation in multiple layers
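Backend regex alone is insufficient. Sophisticated attacks use double encoding (%2527 decodes to %27, which decodes to '), Unicode normalization tricks (ᴜɴɪᴏɴ) or HPP (HTTP Parameter Pollution). Your stack should include: 1) a perimeter WAF such as ModSecurity with the OWASP Core Rule Set, 2) regex validation in the API layer, 3) prepared statements in the DAL. Of the three, prepared statements are the control that actually prevents injection; the other layers detect and reduce attack traffic.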
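The double-encoding and Unicode cases above both come down to matching against a canonical form rather than the raw bytes. A minimal sketch under stated assumptions: the function name canonicalize and the three-round limit are illustrative choices, and the Unicode example uses fullwidth letters because NFKC is known to fold those to ASCII.

```python
import unicodedata
from urllib.parse import unquote


def canonicalize(raw: str, max_rounds: int = 3) -> str:
    """Percent-decode until the value stops changing, then apply NFKC."""
    value = raw
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:   # nothing left to decode
            break
        value = decoded
    # NFKC folds compatibility characters (for example fullwidth letters)
    # to their plain equivalents.
    return unicodedata.normalize("NFKC", value)


double_encoded_quote = canonicalize("%2527")
fullwidth_union = canonicalize("\uff35\uff2e\uff29\uff2f\uff2e")
```

Use the canonical form only for detection and keep the original value for storage: repeated decoding changes legitimate data too, for example a literal "100%25" becomes "100%".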
For forms, combine regex with whitelisting. If the 'username' field only accepts [a-zA-Z0-9_-], apply that first. The SQLi regex then acts as a second line against payloads that pass the whitelist. In free-text fields (bio, comments), where no whitelist is possible, the regex should be more aggressive: blocking select, union or drop even in a legitimate context is preferable to letting an unknown payload through, as long as you monitor the false positives this causes.
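The two-step order described above (whitelist first, SQLi regex second) might look like this. It is a sketch only: validate_username and the three-keyword list are hypothetical, and a real field would also enforce a length limit.

```python
import re

# First line: the field's own character whitelist.
USERNAME_ALLOWED = re.compile(r"[a-zA-Z0-9_-]+")

# Second line: keywords that can still be spelled with whitelisted characters.
SQLI_HINT = re.compile(r"\b(?:union|select|drop)\b", re.IGNORECASE)


def validate_username(value: str) -> bool:
    """Accept only whitelisted characters, then reject SQL keywords."""
    if not USERNAME_ALLOWED.fullmatch(value):
        return False
    return SQLI_HINT.search(value) is None
```

With this whitelist the second check rarely fires, because quotes, spaces and comment markers are already rejected. Its value is mostly in logging attempts that got that far.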
Don't trust client-side JavaScript for security. Frontend regex improves UX (immediate feedback), but an attacker simply bypasses it with curl. Every request must be validated server-side. Log the matches: if one client triggers the SQLi regex 10 times in 1 minute, it's automated scanning, not a typo.
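The "10 matches in 1 minute" rule above can be sketched as a sliding-window counter. The class name MatchTracker and the client_id key (an IP address, session or API key) are assumptions; a production version would live in shared storage rather than process memory.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class MatchTracker:
    """Counts SQLi regex matches per client inside a sliding time window."""

    def __init__(self, threshold: int = 10, window_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, client_id: str, now: Optional[float] = None) -> bool:
        """Record one match; return True once the client reaches the threshold."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        hits.append(now)
        # Drop matches that have fallen out of the window.
        while now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) >= self.threshold
```

Call record() each time the server-side regex matches and treat a True result as a signal to alert or throttle, not as proof of an attack.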
Detecting advanced evasion techniques
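Attackers use inline comments to break up the text a filter looks for: UNION/**/SELECT replaces the space with a comment, and fragments such as SEL/**/ECT or UN/*comment*/ION reassemble into keywords whenever some layer strips comments before the query reaches the database. Your regex, or a normalization step in front of it, must account for /\*.*?\*/ between and inside keywords. In MySQL, /*!50000 SELECT */ is a versioned comment: its content runs as SQL when the server version is at least the number given (here 5.0.0) and is treated as an ordinary comment by other databases; the pattern needs /\*!\d+.*?\*/.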
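A hedged sketch of the comment handling described above. Because a comment can either separate tokens or be stripped so that fragments join, the code checks both readings. The names comment_variants and has_sql_keyword, and the two-keyword list, are hypothetical.

```python
import re
from typing import List

# MySQL versioned comment: /*!50000 ... */ (the body may run as SQL).
VERSIONED = re.compile(r"/\*!\d*(.*?)\*/", re.DOTALL)
# Ordinary block comment.
PLAIN = re.compile(r"/\*.*?\*/", re.DOTALL)

KEYWORDS = re.compile(r"\b(?:union|select)\b", re.IGNORECASE)


def comment_variants(value: str) -> List[str]:
    """Return the value with comments read as separators and as nothing."""
    unwrapped = VERSIONED.sub(r" \1 ", value)   # keep the executable body
    return [PLAIN.sub(" ", unwrapped), PLAIN.sub("", unwrapped)]


def has_sql_keyword(value: str) -> bool:
    return any(KEYWORDS.search(variant) for variant in comment_variants(value))
```

Checking two variants doubles the work per value but avoids guessing which layer, if any, strips comments on the way to the database.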
Alternate encoding is a common vector: CHAR(117,110,105,111,110) constructs 'union' dynamically. Patterns like char\s*\(\s*\d+ detect this. In PostgreSQL, CHR() combined with CONCAT() or the || operator serves the same purpose. Hex literals (0x756e696f6e, also 'union') require 0x[0-9a-f]+ with the case-insensitive flag.
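The encoding patterns above can be combined into one detector, sketched here together with a small decoder that confirms what the hex literal spells. The names has_encoded_string and decode_hex_literal are hypothetical, and the two-digit minimum on the hex pattern is an illustrative choice.

```python
import re

ENCODED_STRING = re.compile(
    r"""
      \bcha?r\s*\(\s*\d+      # CHAR(117, ...) or CHR(117)
    | \b0x[0-9a-f]{2,}\b      # hex literal such as 0x756e696f6e
    """,
    re.IGNORECASE | re.VERBOSE,
)


def has_encoded_string(value: str) -> bool:
    return ENCODED_STRING.search(value) is not None


def decode_hex_literal(literal: str) -> str:
    """Decode a 0x... literal to text, to see what a flagged value hides."""
    return bytes.fromhex(literal[2:]).decode("ascii")


# The two encodings from the text spell the same word.
from_codes = "".join(chr(code) for code in (117, 110, 105, 111, 110))
from_hex = decode_hex_literal("0x756e696f6e")
```

Expect false positives from the hex branch in fields that legitimately carry values like 0xFF, so scope it to fields where hex has no business appearing.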
Time-based blind injection is not limited to SLEEP(): BENCHMARK(10000000, MD5('a')) in MySQL and WAITFOR DELAY '00:00:05' in MSSQL produce the same kind of delay. Your regex must cover entire families of delay functions, not just the obvious ones. And watch for RLIKE or REGEXP: they let an attacker run a catastrophic-backtracking pattern inside the database, a ReDoS (regex denial of service) that ties up the server.
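A sketch of a pattern covering the delay family named above. pg_sleep (PostgreSQL) does not appear in the text and is added here as an assumption about what such a family would include; the list is illustrative, not exhaustive.

```python
import re

DELAY_FUNCTIONS = re.compile(
    r"\b(?:sleep|benchmark|pg_sleep)\s*\("   # function-call style delays
    r"|\bwaitfor\s+delay\b",                  # MSSQL statement style
    re.IGNORECASE,
)


def has_delay_call(value: str) -> bool:
    return DELAY_FUNCTIONS.search(value) is not None
```

Requiring the opening parenthesis keeps ordinary uses of the word "sleep" out, though prose such as "sleep (badly)" would still match, which is the kind of case a legitimate-input corpus should contain.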
Testing and maintaining rules
Build a test suite with 100+ real payloads from sqlmap, PayloadsAllTheThings and recent CVEs, plus a corpus of legitimate input taken from your own fields. Each regex should reach a >95% detection rate on the payloads without false positives on the legitimate data. Payload lists such as SecLists are useful for validation.
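The detection-rate and false-positive targets above are straightforward to measure. A minimal harness, assuming the two corpora are plain lists of strings; the function name evaluate is hypothetical and the three-line samples are placeholders for illustration, not real corpus data.

```python
import re
from typing import Iterable, Pattern, Tuple


def evaluate(
    pattern: Pattern[str],
    payloads: Iterable[str],
    legitimate: Iterable[str],
) -> Tuple[float, float]:
    """Return (detection rate on payloads, false-positive rate on legitimate input)."""
    payloads = list(payloads)
    legitimate = list(legitimate)
    if not payloads or not legitimate:
        raise ValueError("both corpora must be non-empty")
    detected = sum(1 for p in payloads if pattern.search(p))
    false_positives = sum(1 for t in legitimate if pattern.search(t))
    return detected / len(payloads), false_positives / len(legitimate)


# Placeholder samples; a real suite would load 100+ entries per corpus.
SAMPLE_PAYLOADS = [
    "1 UNION SELECT null, null",
    "1 union all select 1,2",
    "1 union/**/select 1",
]
SAMPLE_LEGITIMATE = ["trade union membership", "please select a plan"]

CANDIDATE = re.compile(r"\bunion\s+(?:all\s+)?select\b", re.IGNORECASE)
detection_rate, false_positive_rate = evaluate(
    CANDIDATE, SAMPLE_PAYLOADS, SAMPLE_LEGITIMATE
)
```

On these samples the candidate misses the comment-separated payload, which is exactly the kind of gap the suite exists to expose. Running the harness in CI turns every pattern change into a measured change in both rates.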
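False positives are critical: if your regex blocks searches for "union membership" or names like "Shelby Andersen" (which contains "and"), you will be handling bug reports from users instead of investigating attacks. Word boundaries remove the substring matches. For the remaining cases, negative lookaheads such as (?!.*membership)union can exclude safe contexts, but use them sparingly: an attacker who appends the excluded word to a payload, for example in a trailing comment, is excluded too.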
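To make the trade-off concrete, the sketch below compares the lookahead exclusion from the paragraph above with a pattern that instead requires attack context (union followed by select). The helper name flagged and both sample strings are illustrative.

```python
import re

# Exclusion by lookahead: skip 'union' when 'membership' appears later.
LOOKAHEAD = re.compile(r"(?!.*membership)union", re.IGNORECASE)

# Alternative: only flag 'union' when it is followed by 'select'.
CONTEXTUAL = re.compile(r"\bunion\s+(?:all\s+)?select\b", re.IGNORECASE)


def flagged(pattern: "re.Pattern[str]", text: str) -> bool:
    return pattern.search(text) is not None


benign = "union membership"
attack = "1 union select password from users -- membership"

lookahead_on_benign = flagged(LOOKAHEAD, benign)
lookahead_on_attack = flagged(LOOKAHEAD, attack)     # the exclusion word hides the payload
contextual_on_benign = flagged(CONTEXTUAL, benign)
contextual_on_attack = flagged(CONTEXTUAL, attack)
```

Both patterns leave the benign search alone, but only the contextual one still flags the payload once "membership" is appended, which is why requiring attack context tends to age better than listing safe words.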
Update patterns quarterly. Follow OWASP's SQL injection guidance and PortSwigger Research for new techniques. When SQLite releases a function like JSONB_EXTRACT(), evaluate whether it needs coverage. In production, log all blocks and analyze them monthly for patterns that no longer match (the attack evolved) or that match too much (false positives). Security regex is not set-and-forget.