| Unicode::Normalize(3pm) | Perl Programmers Reference Guide | Unicode::Normalize(3pm) | 
Unicode::Normalize - Unicode Normalization Forms
(1) using function names exported by default:
  use Unicode::Normalize;
  $NFD_string  = NFD($string);  # Normalization Form D
  $NFC_string  = NFC($string);  # Normalization Form C
  $NFKD_string = NFKD($string); # Normalization Form KD
  $NFKC_string = NFKC($string); # Normalization Form KC
(2) using function names exported on request:
  use Unicode::Normalize 'normalize';
  $NFD_string  = normalize('D',  $string);  # Normalization Form D
  $NFC_string  = normalize('C',  $string);  # Normalization Form C
  $NFKD_string = normalize('KD', $string);  # Normalization Form KD
  $NFKC_string = normalize('KC', $string);  # Normalization Form KC
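As a concrete illustration of the two calling styles above, the following minimal sketch normalizes a two-character string and prints the resulting code points. The sample string and the helper "code_points()" are assumptions made for this example only; they are not part of the module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC normalize);

# U+00E9 LATIN SMALL LETTER E WITH ACUTE, then U+FB01 LATIN SMALL LIGATURE FI
my $string = "\x{E9}\x{FB01}";

my $nfd  = NFD($string);    # canonical decomposition: the ligature is kept
my $nfkd = NFKD($string);   # compatibility decomposition: the ligature becomes "fi"

# Hypothetical helper: render a string as a list of code points,
# so the output is readable on any terminal.
sub code_points { join ' ', map { sprintf 'U+%04X', ord } split //, shift }

print 'NFD:  ', code_points($nfd),  "\n";
print 'NFKD: ', code_points($nfkd), "\n";
print "NFC(NFD) restores the original\n"  if NFC($nfd) eq $string;
print "normalize('KC') agrees with NFKC\n" if normalize('KC', $string) eq NFKC($string);
```

Printing code points rather than the characters themselves avoids "Wide character" warnings on output handles that have no encoding layer.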
Parameters:
$string is used as a string under character semantics (see perlunicode).
$code_point should be an unsigned integer representing a Unicode code point.
Note: XSUB and pure Perl interpret a $code_point given as a decimal number differently. XSUB converts $code_point to an unsigned integer, but pure Perl does not. Do not use a floating-point number or a negative sign in $code_point.
Note: FCD is not always unique, so several distinct forms may be equivalent to each other. "FCD()" will return one of these equivalent forms.
Note: FCC is unique, as are the four normalization forms (NF*).
As $form_name, one of the following names must be given.
  'C'  or 'NFC'  for Normalization Form C  (UAX #15)
  'D'  or 'NFD'  for Normalization Form D  (UAX #15)
  'KC' or 'NFKC' for Normalization Form KC (UAX #15)
  'KD' or 'NFKD' for Normalization Form KD (UAX #15)
  'FCD'          for "Fast C or D" Form  (UTN #5)
  'FCC'          for "Fast C Contiguous" (UTN #5)
    
  If the second parameter of "decompose()" (a boolean) is omitted or false, canonical decomposition is performed; if it is true, compatibility decomposition is performed.
The string returned is not always in NFD/NFKD. Reordering may be required.
 $NFD_string  = reorder(decompose($string));    # eq. to NFD()
 $NFKD_string = reorder(decompose($string, 1)); # eq. to NFKD()
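The need for reordering can be seen with a string whose combining marks are not in canonical order. This is a small sketch under the assumption that "decompose" and "reorder" are imported on request; the sample string is made up for the illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD decompose reorder);

# "q" + COMBINING DOT ABOVE (class 230) + COMBINING DOT BELOW (class 220),
# followed by U+1E0B LATIN SMALL LETTER D WITH DOT ABOVE.
my $string = "q\x{307}\x{323}\x{1E0B}";

my $decomposed = decompose($string);     # U+1E0B is split, marks keep their order
my $reordered  = reorder($decomposed);   # marks sorted by combining class

print "decompose() alone is not NFD here\n" if $decomposed ne NFD($string);
print "reorder(decompose()) equals NFD\n"   if $reordered  eq NFD($string);
print "with a true second argument it equals NFKD\n"
    if reorder(decompose($string, 1)) eq NFKD($string);
```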
    
  For example, when you have a list of NFD/NFKD strings, you can get the concatenated NFD/NFKD string from them with:
    $concat_NFD  = reorder(join '', @NFD_strings);
    $concat_NFKD = reorder(join '', @NFKD_strings);
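A plain join of NFD strings is not necessarily in NFD, because combining marks on either side of a boundary may end up out of canonical order. The sketch below uses two made-up strings that are each in NFD to show the effect.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD reorder);

# Each element is already in NFD; the second begins with a combining mark.
my @NFD_strings = ("a\x{307}", "\x{323}b");

my $joined     = join '', @NFD_strings;   # a, U+0307, U+0323, b
my $concat_NFD = reorder($joined);        # a, U+0323, U+0307, b

print "the plain join is not in NFD\n" if $joined     ne NFD($joined);
print "the reordered join is in NFD\n" if $concat_NFD eq NFD($joined);
```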
    
  For example, when you have an NFD/NFKD string, you can get its NFC/NFKC string with:
    $NFC_string  = compose($NFD_string);
    $NFKC_string = compose($NFKD_string);
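A short check of this relationship, assuming "compose" is imported on request and using a sample string chosen for the illustration:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD NFC NFKC compose);

# U+FB01 LATIN SMALL LIGATURE FI + "anc" + U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $string = "\x{FB01}anc\x{E9}";

my $NFC_string  = compose(NFD($string));    # ligature kept, e-acute recomposed
my $NFKC_string = compose(NFKD($string));   # ligature expanded to "fi"

print "compose(NFD)  equals NFC\n"  if $NFC_string  eq NFC($string);
print "compose(NFKD) equals NFKC\n" if $NFKC_string eq NFKC($string);
```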
    
  Note that $processed may be empty (when $normalized contains no starter or starts with the last starter), in which case $unprocessed is equal to the entire $normalized.
When you have a $normalized string and an $unnormalized string following it, a simple concatenation is wrong:
 $concat = $normalized . normalize($form, $unnormalized); # wrong!
    
    Instead, do this:
 ($processed, $unprocessed) = splitOnLastStarter($normalized);
 $concat = $processed . normalize($form,$unprocessed.$unnormalized);
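To make the difference visible, here is a minimal sketch with NFC as the target form. The sample strings are assumptions chosen so that the unnormalized part begins with a combining mark that composes with the last character of the normalized part.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(normalize splitOnLastStarter);

my $form         = 'NFC';
my $normalized   = "abcA";         # already in NFC
my $unnormalized = "\x{301}xyz";   # begins with COMBINING ACUTE ACCENT

# Wrong: "A" and U+0301 are left uncomposed at the boundary.
my $wrong = $normalized . normalize($form, $unnormalized);

# Right: re-normalize from the last starter onward.
my ($processed, $unprocessed) = splitOnLastStarter($normalized);
my $concat = $processed . normalize($form, $unprocessed . $unnormalized);

my $expected = normalize($form, $normalized . $unnormalized);
print "simple concatenation is not in NFC\n" if $wrong  ne $expected;
print "split-based concatenation is in NFC\n" if $concat eq $expected;
```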
    
    "splitOnLastStarter()" should be called with a pre-normalized parameter $normalized, which is in the same form as the $form you want.
If you have an array of @string that should be concatenated and then normalized, you can do it like this:
    my $result = "";
    my $unproc = "";
    foreach my $str (@string) {
        $unproc .= $str;
        my $n = normalize($form, $unproc);
        my($p, $u) = splitOnLastStarter($n);
        $result .= $p;
        $unproc  = $u;
    }
    $result .= $unproc;
    # instead of normalize($form, join('', @string))
    
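  If you have an array of @string that should be concatenated and then normalized, you can also do it with "normalize_partial()", which returns the processed part and leaves the unprocessed remainder in its second argument: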
    my $result = "";
    my $unproc = "";
    foreach my $str (@string) {
        $unproc .= $str;
        $result .= normalize_partial($form, $unproc);
    }
    $result .= $unproc;
    # instead of normalize($form, join('', @string))
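The loop above can be wrapped in a subroutine and compared against normalizing the joined string in one step. In this sketch, "normalize_chunks()" is a hypothetical helper name and the chunk boundaries are chosen deliberately so that combining marks are separated from their base letters.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(normalize normalize_partial);

# Hypothetical helper: normalize a list of chunks incrementally.
sub normalize_chunks {
    my ($form, @chunks) = @_;
    my $result = '';
    my $unproc = '';
    for my $chunk (@chunks) {
        $unproc .= $chunk;
        # Returns the processed part; $unproc is modified as a side effect
        # and afterwards holds only the unprocessed remainder.
        $result .= normalize_partial($form, $unproc);
    }
    return $result . $unproc;
}

# "cafe" + acute, " creme" + grave, with marks split from their base letters.
my @string = ("caf", "e", "\x{301} cre", "\x{300}me");

my $incremental = normalize_chunks('NFC', @string);
my $all_at_once = normalize('NFC', join('', @string));

print "both approaches agree\n" if $incremental eq $all_at_once;
```

The incremental variant is mainly useful when the chunks arrive one at a time, for example when reading a stream, and joining everything first is not practical.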
    
  (see Annex 8, UAX #15; and DerivedNormalizationProps.txt)
The following functions check whether the string is in that normalization form.
The result returned will be one of the following:
    YES     The string is in that normalization form.
    NO      The string is not in that normalization form.
    MAYBE   Dubious. Maybe yes, maybe no.
Note: If a string is not in FCD, it cannot be in FCC. So "checkFCC($not_FCD_string)" should return "NO".
As $form_name, one of the following names must be given.
  'C'  or 'NFC'  for Normalization Form C  (UAX #15)
  'D'  or 'NFD'  for Normalization Form D  (UAX #15)
  'KC' or 'NFKC' for Normalization Form KC (UAX #15)
  'KD' or 'NFKD' for Normalization Form KD (UAX #15)
  'FCD'          for "Fast C or D" Form  (UTN #5)
  'FCC'          for "Fast C Contiguous" (UTN #5)
    
Note: In the cases of NFD, NFKD, and FCD, the answer must be either "YES" or "NO". The answer "MAYBE" may be returned in the cases of NFC, NFKC, and FCC.
A "MAYBE" string should contain at least one combining character or the like. For example, "COMBINING ACUTE ACCENT" has the MAYBE_NFC/MAYBE_NFKC property.
Both "checkNFC("A\N{COMBINING ACUTE ACCENT}")" and "checkNFC("B\N{COMBINING ACUTE ACCENT}")" will return "MAYBE". "A\N{COMBINING ACUTE ACCENT}" is not in NFC (its NFC is "\N{LATIN CAPITAL LETTER A WITH ACUTE}"), while "B\N{COMBINING ACUTE ACCENT}" is in NFC.
If you want to check exactly, compare the string with its NFC/NFKC/FCC.
    if ($string eq NFC($string)) {
        # $string is exactly normalized in NFC;
    } else {
        # $string is not normalized in NFC;
    }
    if ($string eq NFKC($string)) {
        # $string is exactly normalized in NFKC;
    } else {
        # $string is not normalized in NFKC;
    }
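The quick check and the exact comparison can be combined, so that the full normalization is computed only for "MAYBE" answers. The sketch below assumes the return convention documented by the module for "checkNFC()": true (1) for YES, false (an empty string) for NO, and "undef" for MAYBE. The helper name "is_nfc()" is hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC checkNFC);

# Hypothetical helper: quick check first, exact comparison only on MAYBE.
sub is_nfc {
    my ($string) = @_;
    my $quick = checkNFC($string);            # 1, '' or undef (assumed convention)
    return $quick ? 1 : 0 if defined $quick;  # definite YES or NO
    return $string eq NFC($string) ? 1 : 0;   # MAYBE: compare exactly
}

for my $s ("A\x{301}", "B\x{301}", "\x{C1}") {
    my $label = join ' ', map { sprintf 'U+%04X', ord } split //, $s;
    printf "%-14s %s\n", $label, is_nfc($s) ? 'in NFC' : 'not in NFC';
}
```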
These functions are interfaces to character data used internally. If you only want to get Unicode normalization forms, you do not need to call them yourself.
Note: According to the Unicode standard, the canonical decomposition of a character that is not canonically decomposable is the same as the character itself.
Note: According to the Unicode standard, the compatibility decomposition of a character that is not compatibility decomposable is the same as the character itself.
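A brief sketch of the character data functions follows. The function names "getCanon()", "getCompat()" and "getComposite()" do not appear in this section; they are taken from the module's documented on-request interface, under the assumption that the decomposition functions return "undef" for a character that is not decomposable. The helper "canon_or_self()" is hypothetical and shows one way to obtain the behavior described by the Unicode standard in the notes above.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCanon getCompat getComposite);

my $canon  = getCanon(0x00C1);    # LATIN CAPITAL LETTER A WITH ACUTE
my $compat = getCompat(0xFB01);   # LATIN SMALL LIGATURE FI

# Hypothetical helper: fall back to the character itself when it is
# not canonically decomposable (requires perl 5.10 for "//").
sub canon_or_self {
    my ($code_point) = @_;
    return getCanon($code_point) // chr($code_point);
}

my $composite = getComposite(0x41, 0x301);   # A + COMBINING ACUTE ACCENT
my $none      = getComposite(0x42, 0x301);   # B + COMBINING ACUTE ACCENT

printf "canonical decomposition of U+00C1 has %d characters\n", length $canon;
printf "compatibility decomposition of U+FB01 is '%s'\n", $compat;
printf "composite of U+0041 and U+0301 is U+%04X\n", $composite;
print  "U+0042 and U+0301 have no composite\n" unless defined $none;
```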
If the two code points are not composable, "undef" is returned.
"NFC", "NFD", "NFKC", "NFKD": by default.
"normalize" and some other functions: on request.
    perl's version     implemented Unicode version
       5.6.1              3.0.1
       5.7.2              3.1.0
       5.7.3              3.1.1 (normalization is same as 3.1.0)
       5.8.0              3.2.0
       5.8.1-5.8.3        4.0.0
       5.8.4-5.8.6        4.0.1 (normalization is same as 4.0.0)
       5.8.7-5.8.8        4.1.0
       5.10.0             5.0.0
       5.8.9, 5.10.1      5.1.0
       5.12.x             5.2.0
       5.14.x             6.0.0
       5.16.x             6.1.0
       5.18.x             6.2.0
       5.20.x             6.3.0
       5.22.x             7.0.0
    
  SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
Currently maintained by <perl5-porters@perl.org>
Copyright(C) 2001-2012, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
| 2022-02-19 | perl v5.34.1 |