I hit a pretty nasty error while parsing table rows (all contents were stored as UTF-8) from the database for a dictionary project. The idea was to fetch all the rows from the first table (a table with a Bulgarian phrase in the first field and its English, French and German translations in the following fields). I needed to index every Bulgarian word found in the table to build an intelligent search. And that is where my headache started.
First of all, even with mb_strtolower() a lot of Cyrillic characters got corrupted (e.g. 'т', 'ъ', 'у', 'ф', 'б', 'г', 'з', 'ж', etc.). After an hour of different attempts I arrived at this solution:
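<?php
// Make the mbstring string and regex functions treat input as UTF-8.
mb_internal_encoding("UTF-8");
mb_regex_encoding("UTF-8");

// $db and $commonWords (an array of words to skip) are defined elsewhere.
$rows = $db->getRows();
$contents = array();
foreach ($rows as $eachRow)
{
    // Lowercase the Bulgarian phrase and blank out the common words.
    $cleared = str_replace($commonWords, ' ', mb_strtolower(stripslashes($eachRow['bulgarian']), 'UTF-8'));
    if (trim($cleared) != '') $contents[] = trim($cleared);
}

$list = array();
foreach ($contents as $eachRow)
{
    $exploded = explode(' ', $eachRow);
    foreach ($exploded as $eachExpl)
    {
        // Blank out everything except lowercase Cyrillic letters, then trim
        // before the uniqueness check so stored and compared values match.
        $eachExpl = trim(mb_ereg_replace('[^а-я ]', ' ', $eachExpl));
        if ($eachExpl != '' && !in_array($eachExpl, $list, true)) $list[] = $eachExpl;
    }
}
?>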
For this to work properly I had to set all the internal encoding settings to UTF-8. Otherwise the default Latin-1 left half of my database with missing characters.
I am posting this solution in case someone else runs into a similar problem. I hope it helps.
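The same indexing idea can be sketched without a database, which makes it easier to test. This is only a minimal sketch: `extract_cyrillic_words()` and its `$stopWords` parameter are hypothetical names, not part of the original code, and unlike the str_replace() call above it drops stop words only when they match a whole token.

```php
<?php
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// Hypothetical helper: returns the unique lowercase Cyrillic words of a phrase.
function extract_cyrillic_words($phrase, array $stopWords = array())
{
    $lower = mb_strtolower($phrase, 'UTF-8');

    // Split on every run of characters outside the range а-я.
    $tokens = mb_split('[^а-я]+', $lower);
    if ($tokens === false) {
        return array();
    }

    $words = array();
    foreach ($tokens as $token) {
        // Skip empty pieces, stop words and words already collected.
        if ($token !== ''
            && !in_array($token, $stopWords, true)
            && !in_array($token, $words, true)) {
            $words[] = $token;
        }
    }
    return $words;
}

$words = extract_cyrillic_words('Добър ден, ДОБЪР свят и ти!', array('и'));
// $words now holds: добър, ден, свят, ти
```

Splitting directly on non-Cyrillic runs avoids the separate explode() and trim() passes, at the cost of treating hyphenated words as two entries.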
mb_ereg_replace
(PHP 4 >= 4.2.0, PHP 5)
mb_ereg_replace — Replace regular expression with multibyte support
Description
string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )
Scans string for matches to pattern, then replaces the matched text with replacement.
Parameters
- pattern
  The regular expression pattern.
  Multibyte characters may be used in pattern.
- replacement
  The replacement text.
- string
  The string being checked.
- option
  The matching condition can be set with the option parameter. If i is specified, case is ignored. If x is specified, white space is ignored. If m is specified, the match is executed in multiline mode and line breaks are included in '.'. If p is specified, the match is executed in POSIX mode and a line break is treated as a normal character. If e is specified, the replacement string is evaluated as a PHP expression.
Return Values
The resultant string on success, or FALSE on error.
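A minimal usage sketch of the signature and return value described above. The sample address string is made up for illustration; the explicit `=== false` check follows the documented error return.

```php
<?php
// Tell the multibyte regex functions that patterns and subjects are UTF-8.
mb_regex_encoding('UTF-8');

$address = 'гр. София 1000, ул. 15';

// Replace every run of ASCII digits with a single '#'.
$masked = mb_ereg_replace('[0-9]+', '#', $address);

if ($masked === false) {
    // The function reports failure with FALSE rather than a string.
    $masked = $address;
}
// $masked is now: гр. София #, ул. #
```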
Notes
Note: The internal encoding or the character encoding specified by mb_regex_encoding() will be used as the character encoding for this function.
Warning
Never use the e modifier on untrusted input. No automatic escaping is performed (the same as with preg_replace()). If you are not careful, this creates a remote code execution vulnerability.
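One way to avoid the e modifier altogether is mb_ereg_replace_callback(), which as far as I know is available from PHP 5.4.1 onward. The sketch below passes each match to a closure instead of evaluating a replacement string as code; the sample input is made up.

```php
<?php
mb_regex_encoding('UTF-8');

// The callback receives the match array: index 0 is the whole match,
// higher indexes are the capture groups.
$result = mb_ereg_replace_callback(
    '[а-я]+',
    function (array $matches) {
        return mb_strtoupper($matches[0], 'UTF-8');
    },
    'цена: 5 лева'
);
// $result is now: ЦЕНА: 5 ЛЕВА
```

Because the callback is ordinary PHP code that you wrote, nothing from the input string is ever executed.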
See Also
- mb_regex_encoding() - Returns current encoding for multibyte regex as string
- mb_eregi_replace() - Replace regular expression with multibyte support ignoring case
mb_ereg_replace
daemoneye at gmail dot com
03-Feb-2009 07:53
keizo at gomo dot jp
24-Jul-2008 01:32
<?php
$pattern = "([あ-ん]+)[0-9]+";
$string = mb_ereg_replace($pattern, '「\\1」:\\0', $string);
?>
You can use \\n in the replacement to refer to the n-th capture group; \\0 stands for the whole match.
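To make the back-reference behaviour concrete, here is the same pattern applied to a sample string. The input 'りんご123' is my own example, not from the note above.

```php
<?php
mb_regex_encoding('UTF-8');

// Group 1 captures the run of hiragana; the digits follow outside the group.
$pattern = '([あ-ん]+)[0-9]+';
$input   = 'りんご123';

// \\1 inserts the first capture group, \\0 inserts the whole match.
$output = mb_ereg_replace($pattern, '「\\1」:\\0', $input);
// $output is now: 「りんご」:りんご123
```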
gmx dot net at ulrich dot mierendorff
01-Jul-2008 11:39
If you want to replace characters like "ä" or "ø" you can use mb_ereg_replace, but it is very slow. str_replace is much faster and also works with characters like "ä" or "ø"!
I think this has to do with the fact that str_replace works at the byte level and does not care about characters.
I hope that helps.
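A small check of this claim, assuming the subject and the search string are both UTF-8 and use the same normalization form. In UTF-8 the byte sequence of one character cannot appear inside the encoding of another, which is why a byte-level literal replacement is safe here.

```php
<?php
mb_regex_encoding('UTF-8');

$text = 'Bär und Käse';

// Literal, byte-level replacement: no pattern syntax involved.
$viaStrReplace = str_replace('ä', 'ae', $text);

// Regex-based replacement: the search text is treated as a pattern,
// so metacharacters such as '.' or '+' would need escaping.
$viaMbEreg = mb_ereg_replace('ä', 'ae', $text);

// Both give: Baer und Kaese
```

The two calls are only interchangeable for literal search strings; once you need character classes or back-references, the regex function is still required.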
05-Dec-2006 01:36
The 'i' option does not work correctly with multibyte characters. The function fails to find or replace a multibyte string when its case differs from the case of the multibyte pattern.
squeegee
02-Nov-2006 12:41
Well, if you calculated the lengths of the find and replace strings just once instead of on every loop iteration, it would likely speed it up a lot.
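A rough sketch of that suggestion applied to the mb_str_ireplace() function posted in these notes. The name `mb_str_ireplace_cached()` and the `$encoding` parameter are hypothetical, and I have not benchmarked it. It assumes lowercasing does not change the number of characters in the string, which holds for most scripts but not for every Unicode character.

```php
<?php
// Hypothetical variant: lengths and the lowercase needle are computed once.
function mb_str_ireplace_cached($search, $replace, $subject, $encoding = 'UTF-8')
{
    if ($search === '') {
        return $subject; // nothing to look for
    }

    $searchLower  = mb_strtolower($search, $encoding);
    $searchLen    = mb_strlen($search, $encoding);
    $replaceLen   = mb_strlen($replace, $encoding);
    $subjectLower = mb_strtolower($subject, $encoding);
    $offset       = 0;

    while (($pos = mb_strpos($subjectLower, $searchLower, $offset, $encoding)) !== false) {
        // Patch the original and the lowercase copy in the same way,
        // so positions stay aligned without re-lowercasing everything.
        $subject = mb_substr($subject, 0, $pos, $encoding) . $replace
                 . mb_substr($subject, $pos + $searchLen, null, $encoding);
        $subjectLower = mb_substr($subjectLower, 0, $pos, $encoding) . $replace
                      . mb_substr($subjectLower, $pos + $searchLen, null, $encoding);

        // Resume the search right after the inserted replacement.
        $offset = $pos + $replaceLen;
    }

    return $subject;
}

$replaced = mb_str_ireplace_cached('ÄB', 'x', 'aäbÄBc');
// $replaced is now: axxc
```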
mpnicholas [@t] gmail (dot) com
10-Jul-2006 07:09
Regarding the mb_str_ireplace() function: I benchmarked it against mb_eregi_replace() for single-character substitution, and it was significantly slower. Despite avoiding the ereg call, I think the while loop ends up slowing you down too much for this to be practical.
vondrej(at)gmail(dot)com
27-Feb-2006 08:47
Are you looking for htmlentities() for multibyte strings? This might help you. It only replaces <, >, " and '.
<?php
/**
 * Multibyte equivalent for htmlentities() [lite version :)]
 *
 * @param string $str
 * @param string $encoding
 * @return string
 **/
function mb_htmlentities($str, $encoding = 'utf-8') {
    mb_regex_encoding($encoding);
    // Each character in $pattern is replaced by the entity at the same index.
    $pattern = array('<', '>', '"', '\'');
    $replacement = array('&lt;', '&gt;', '&quot;', '&#039;');
    for ($i = 0; $i < sizeof($pattern); $i++) {
        $str = mb_ereg_replace($pattern[$i], $replacement[$i], $str);
    }
    return $str;
}
?>
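For comparison, the built-in htmlspecialchars() covers the same four characters plus the ampersand when it is given ENT_QUOTES and the encoding name, so a hand-written loop may not be needed. The sample markup below is made up for illustration.

```php
<?php
$raw = "<b>\"Ü\" & 'ö'</b>";

// ENT_QUOTES converts both double and single quotes; the third argument
// names the input encoding so multibyte characters pass through intact.
$escaped = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
// $escaped is now: &lt;b&gt;&quot;Ü&quot; &amp; &#039;ö&#039;&lt;/b&gt;
```

Unlike the lite version above, this also escapes '&', which matters if the input can already contain entity-like text.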
faxe at neostrada dot pl
10-Aug-2005 07:52
A simple mb_str_ireplace() implementation - a faster (?) replacement for non-regexp multi-byte string replacement:
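<?php
// Case-insensitive multibyte replace.
// $co = search string, $naCo = replacement, $wCzym = subject.
function mb_str_ireplace($co, $naCo, $wCzym)
{
    // Search in lowercased copies so the comparison ignores case.
    $wCzymM = mb_strtolower($wCzym);
    $coM = mb_strtolower($co);
    $offset = 0;

    while (!is_bool($poz = mb_strpos($wCzymM, $coM, $offset)))
    {
        // Continue searching after the inserted replacement.
        $offset = $poz + mb_strlen($naCo);
        $wCzym = mb_substr($wCzym, 0, $poz) . $naCo . mb_substr($wCzym, $poz + mb_strlen($co));
        $wCzymM = mb_strtolower($wCzym);
    }

    return $wCzym;
}
?>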
[thiago - EDITOR NOTE: This function has improvements from d-okumura [aat] fi{dot}kyd[dot]co.jp]
