Handling URLs with diacritics in XAMPP application
27.06.2018
I came across (or rather, created for myself) a tricky case in an Apache + MySQL + PHP application. A user visits a path that contains “weird” (namely: Polish) characters. This path is then used to fetch a unique row from the DB.
While finding my way from the Apache 404 error page to a working solution, I researched each layer online and compiled the following recipe just for you.
Our example URL will be
http://example.com/trudność. “Trudność” means “difficulty” in Polish, and is a perfect summary of my little development story.
I’m using the Front Controller design pattern, which means I route every request to index.php (using Apache mod_rewrite). The following configuration gave me a 404 error page when I tried to access the example URL:
    RewriteEngine On
    # ...other rules
    RewriteRule ^.*$ index.php
In the above example I omitted some other RewriteRules for clarity. We wouldn’t want index.php to handle serving virtually all requests including styles, images or static views. So the presented RewriteRule goes last as a kind of “fallback” for anything not matched by previous rules.
Back to the explanation: I expected that the . (dot) metacharacter, which “matches any single character except newline” (as per the specifications), would “catch” my Polish letters with diacritics. But it didn’t.
What worked was applying the \S escape sequence. It matches all non-whitespace characters (covering all non-ASCII ones), which in my setup made it effectively broader than the popular dot.
    RewriteRule ^\S*$ index.php
Note: If you want to be more specific about the characters supported in the RewriteRule, you can list them all explicitly in a bracketed character class. I went with the more general style.
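The explicit character class idea from the note can also be tried on the PHP side, once the path has been decoded. The sketch below is only an illustration: the helper name isAllowedSlug and the exact set of allowed characters are my assumptions, the source file is assumed to be saved as UTF-8, and PHP's preg_match with the /u modifier will not necessarily behave the same way as mod_rewrite does.

```php
<?php
// Hypothetical helper (not part of the original application): checks a
// decoded path segment against an explicit whitelist of lowercase ASCII
// letters, digits, the hyphen and the nine Polish diacritic letters.
function isAllowedSlug(string $segment): bool
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8,
    // so each Polish letter counts as one character, not as two bytes.
    return preg_match('/^[a-z0-9ąćęłńóśźż-]+$/u', $segment) === 1;
}

var_dump(isAllowedSlug('trudność'));  // bool(true)
var_dump(isAllowedSlug('trudno ść')); // bool(false), contains a space
```

Listing the characters makes the accepted alphabet visible at a glance, at the cost of having to extend the class for every new language.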
The request lands in our application. Let’s examine the value of our path.
    <?php
    $path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH); // '/trudno%C5%9B%C4%87'
    $decoded_path = urldecode(substr($path, 1)); // 'trudność'
    /* Now use $decoded_path in SQL query */
Seeing the value of $path was an “A-ha!” moment for me. Although the address bar was showing the localized string, the browser actually sent the request with a percent-encoded path. This is because the standard requires URLs to use a limited set of characters and to escape non-ASCII or unsafe ones. Look how each of the “ść” characters is represented by a pair of percent-encoded bytes (its two-byte UTF-8 encoding). I decoded it back using the PHP urldecode() function.
The website graphemica.com provides a complete list of escape sequences for URLs and for different programming languages (check what it says about the letter “ś”).
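The percent-encoding described above can be reproduced in a few lines of PHP. This is a small sketch rather than part of the application: the variable names are made up for illustration, and it uses rawurlencode()/rawurldecode() instead of the urldecode() call from the snippet above.

```php
<?php
// "\u{...}" escapes (PHP 7+) keep the example independent of file encoding.
$word = "trudno\u{015B}\u{0107}"; // "trudność"

// rawurlencode() percent-encodes every byte outside the unreserved set.
$encoded = rawurlencode($word);

// bin2hex() exposes the raw UTF-8 bytes behind each diacritic.
$bytesOfS = strtoupper(bin2hex("\u{015B}")); // bytes of "ś"
$bytesOfC = strtoupper(bin2hex("\u{0107}")); // bytes of "ć"

// Decoding restores the original UTF-8 string.
$decoded = rawurldecode($encoded);

echo $encoded, PHP_EOL;                      // trudno%C5%9B%C4%87
echo $bytesOfS, ' ', $bytesOfC, PHP_EOL;     // C59B C487
```

The only difference between urldecode() and rawurldecode() is that the former also turns a plus sign into a space, which matters for query strings more than for paths.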
I ran a little manual test to check what happens if I open http://example.com/trudnosc. The application returned the same result from the DB as if the path contained Polish letters! I’d rather have it look up by exact value, so that I can have two distinct resources under different paths.
Here is the query:
    SELECT * FROM `pages` WHERE `title` = 'trudnosc';
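In the application the title comes from the request path, so it would normally reach this query as a bound parameter rather than as a literal. Here is a minimal sketch, assuming PDO is the database API (the article does not say which one is used); the helper name findPageByTitle is hypothetical, while the pages table and title column come from the query above.

```php
<?php
// Hypothetical helper: looks up a single page by its title.
// Returns the row as an associative array, or null when nothing matches.
function findPageByTitle(PDO $pdo, string $title): ?array
{
    // The placeholder keeps user-supplied path data out of the SQL text.
    $stmt = $pdo->prepare('SELECT * FROM pages WHERE title = :title');
    $stmt->execute([':title' => $title]);

    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    return $row === false ? null : $row;
}
```

Whether 'trudnosc' and 'trudność' find the same row is still decided by the column's collation, not by this code.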
What influenced the returned result was the collation. I had it set to utf8_unicode_ci, and in this configuration it simply treated the “ś” and “s” characters as equal. As the MySQL documentation puts it:
A collation is a set of rules for comparing characters in a character set.
Switching the column to utf8_polish_ci, which treats Polish letters as distinct from their base letters, fixed it:
    ALTER TABLE `pages` CHANGE `title` `title` VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_polish_ci NOT NULL;
Note: You might want to keep this behavior. Let’s say your users can’t always copy-paste the URL or enter it using a localized keyboard, so it’s reasonable to provide “ASCIIfied” paths. In that case you could let the database engine help instead of re-adding the “translation” logic to your app.
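For comparison, the app-side “translation” logic mentioned in the note might look roughly like the sketch below. The function name asciifyPolish and the character map are my own illustration (with the source file assumed to be saved as UTF-8); the map does not replicate the rules of any MySQL collation.

```php
<?php
// Hypothetical helper: folds Polish diacritics to their closest ASCII letters.
function asciifyPolish(string $text): string
{
    static $map = [
        'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
        'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
        'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L', 'Ń' => 'N',
        'Ó' => 'O', 'Ś' => 'S', 'Ź' => 'Z', 'Ż' => 'Z',
    ];

    // strtr() with an array replaces whole substrings, so multi-byte
    // UTF-8 characters are swapped safely.
    return strtr($text, $map);
}

echo asciifyPolish('trudność'), PHP_EOL; // trudnosc
```

Keeping such a map in the app means maintaining it by hand, which is the extra work the note suggests handing over to the database.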
Most important findings
- Web addresses that contain out-of-ASCII-range characters are called IRIs. The browser translates them to URLs using percent-encoding when making a request to the server.
- Apache mod_rewrite can handle Unicode characters in a few ways. I suggest using the ^\S*$ regular expression as the most compact one.
- MySQL uses collations to compare strings. Collation settings affect queries that contain Unicode characters.
- (Slightly off topic) Domain names and TLDs themselves can also contain international characters (even emojis!). The encoding method is different and is called Punycode.