[owncloud-devel] Hi! (and Improving Cross-Platform International/Unicode Support)

Lee Thompson thompsonl at logh.net
Fri Nov 14 00:56:50 GMT 2014


Hi Everyone,

Like many of you, I'm sure, I've been programming for a long time on 
various projects.

My typical platform is Windows although I do work on Linux machines as 
well (usually Debian flavor).

At my work, we've recently deployed ownCloud (7.0.2, community edition) 
and it's working well, although we've run into problems with Unicode 
characters in filenames causing failures when inserting the rows into 
the database.

The root issue appears to be the way PHP communicates with Windows; I 
don't know whether this affects other operating systems as well.

(I have reported this as a bug at 
https://github.com/owncloud/core/issues/12112 but I'm perfectly happy to 
help with contributing to this project.   Some of that information will 
be repeated here.)


Basically, when PHP's functions read the Windows file system and a name 
contains a non-ASCII character, the name appears to PHP's mbstring to 
be encoded as UTF-8, but it is actually encoded (on en-US anyway) as 
Windows-1252.

Now with this, we can get the correct codepage...

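$target_encoding  = 'UTF-8';
$default_codepage = 'UTF-8';

// On Windows the locale string looks like "English_United States.1252";
// the part after the dot is the ANSI codepage number.
$codepage = $default_codepage;
if ( 'WIN' === strtoupper( substr( PHP_OS, 0, 3 ) ) ) {
	$suffix = strrchr( (string) setlocale( LC_CTYPE, '' ), '.' );
	// Keep the default when the locale has no numeric codepage (e.g. "C").
	if ( false !== $suffix && ctype_digit( substr( $suffix, 1 ) ) ) {
		$codepage = 'Windows-' . substr( $suffix, 1 );
	}
}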

... and then convert it

$encoded_filename = mb_convert_encoding( $filename, $target_encoding, $codepage );
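
One caveat with converting unconditionally: a name that is already valid 
UTF-8 would be encoded a second time.  Below is a minimal sketch of a 
guard.  The helper name to_utf8_filename() is made up for illustration 
and is not an existing ownCloud function, and the check is only a 
heuristic, since some codepage byte sequences also happen to be valid 
UTF-8.

```php
<?php
// Hypothetical helper, not part of ownCloud: normalise a filename read from
// the file system to UTF-8, using $codepage as the assumed source encoding.
function to_utf8_filename( $filename, $codepage, $target_encoding = 'UTF-8' ) {
	// Names that are already valid UTF-8 (including plain ASCII) pass
	// through unchanged, so they are not converted a second time.
	if ( mb_check_encoding( $filename, $target_encoding ) ) {
		return $filename;
	}
	return mb_convert_encoding( $filename, $target_encoding, $codepage );
}

// "café.txt" as Windows-1252 bytes: the é is the single byte 0xE9.
$raw = "caf\xE9.txt";
echo bin2hex( to_utf8_filename( $raw, 'Windows-1252' ) ), "\n";
```

Printing the result as hex makes the byte-level change visible 
regardless of the console's own encoding.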



So my thought is to add a default codepage setting to config.php, 
initially filled in by the installer as UTF-8 or, on Windows, by the 
routine above.
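
As a rough illustration of what that entry might look like (the key name 
'filesystem_codepage' is purely an assumption on my part; no such option 
exists in ownCloud today):

```php
<?php
// Sketch of a config.php entry. 'filesystem_codepage' is a hypothetical key
// used only for illustration; it is not an existing ownCloud option.
$CONFIG = array(
	// ... existing settings ...
	// Written by the installer: 'UTF-8' by default, or the detected
	// Windows codepage, e.g. 'Windows-1252' on an en-US system.
	'filesystem_codepage' => 'Windows-1252',
);
```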

There should also be codepage settings for each of the external storages 
(defaulting to the 'system one' for local/smb) in case other file 
systems are in play.  This would allow the admin to account for special 
or mixed environments.

The sync clients should probably communicate their local codepage as 
well just to ensure that it all translates properly (if needed).  (I'll 
confess I haven't done any programming of WebDAV and don't know if any 
codepage translation occurs.)


Other notes and potential gotchas:

1. Folders should probably be codepage encoded too.

2. Unfortunately we will likely need to decode back to the codepage to 
open the file within PHP, e.g.

   $decoded_filename = mb_convert_encoding( $encoded_filename, $codepage, $target_encoding );

3. It's also worth noting that MySQL's utf8/utf8_bin only stores 
characters of up to three bytes, so it is not sufficient for true UTF-8 
support.   On MySQL 5.5.3+ it needs to be utf8mb4 with the utf8mb4_bin 
or utf8mb4_unicode_ci collation.   (ref: 
http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html)
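
For note 2, one thing to watch is that a UTF-8 name may contain 
characters the local codepage cannot represent, in which case mbstring 
substitutes a placeholder character.  Here is a minimal sketch of a 
decode step that detects this by converting back and comparing.  The 
helper name to_local_filename() is hypothetical, not an existing 
ownCloud function.

```php
<?php
// Hypothetical helper for note 2: turn a UTF-8 name (as stored in the
// database) back into the bytes the local file system expects.
// Returns false when the name cannot be represented in $codepage.
function to_local_filename( $utf8_name, $codepage ) {
	$local = mb_convert_encoding( $utf8_name, $codepage, 'UTF-8' );
	// mbstring replaces unrepresentable characters with a substitute
	// character, so a failed round trip signals a lossy conversion.
	if ( mb_convert_encoding( $local, 'UTF-8', $codepage ) !== $utf8_name ) {
		return false;
	}
	return $local;
}

// "café.txt" fits in Windows-1252; the two CJK characters below do not.
$local = to_local_filename( "caf\xC3\xA9.txt", 'Windows-1252' );
echo bin2hex( $local ), "\n";
var_dump( to_local_filename( "\xE6\x97\xA5\xE6\x9C\xAC.txt", 'Windows-1252' ) );
```

Returning false instead of a mangled name lets the caller decide whether 
to skip the file or report it, rather than opening the wrong path.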


-- 
Lee Thompson
thompsonl at logh.net


