Boundaries
- Breaks. Demonstrates different boundaries within text.
  - Enter the sample text.
  - Pick the kind of boundaries, or hit Test.
- Regex. Shows the transformation of a (Java) regex pattern to support Unicode.
  - Enter the regex pattern.
  - Change the sample text if desired.
  - Click Show Modified Regex Pattern.
  You'll then see the modified pattern. It will often be much larger, but any reasonable regex engine should compile the expanded character classes efficiently.
  Below that, you'll see a sample of how the expression works: it is used to find substrings of the sample text, which are then underlined.
- Unicode Property Demo window
  - Enter a character code on the right side, and hit Show. You'll see the properties for that character (where they have non-default values).
  - If you click on any property (like Age), you'll see a list of all the properties and their values in the Unicode Property List window.
  - If you click on any property value in either of these two windows, like 4.0.0.0 for Age, you'll see the characters with that property in the UnicodeSet Demo window.
- UnicodeSet Demo window
  - You can put in arbitrary UnicodeSets, allowing boolean combinations of any of the property+value combinations in the Unicode Property List window.
  - If you click on Compare at the top, you can compare any two UnicodeSets.
Transforms
- Transform. Demonstrates the effect of transform rules on sample text.
  - Enter the Transform Rules.
  - Enter Sample Text.
  - Hit Show Transform.
- Examples:
The rules can be either IDs (simple or compound) or general rules. To see a list of all the IDs, see ID List.
The sample can be either a piece of text or a UnicodeSet. In the latter case, only characters that are affected by the transform are shown. They are listed alphabetically by the result of the transform, with multiple entries shown in a UnicodeSet.
UnicodeSets use regular-expression syntax to allow for arbitrary set operations (Union, Intersection, Difference) on sets of Unicode characters. The base sets can be specified explicitly, such as [a-m w-z], or using Unicode Properties like [[:script=arabic:]&[:decompositiontype=canonical:]]. The latter set gets the Arabic script characters that have a canonical decomposition. The properties can be specified either with Perl-style notation (\p{script=arabic}) or with POSIX-style notation ([:script=arabic:]). For more information, see ICU UnicodeSet Documentation.
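As a rough illustration of the set operations described above, the following Python sketch models UnicodeSets as plain Python sets of characters. It is not the ICU implementation: the helper names are hypothetical, and because Python's standard library exposes no Script property, the Arabic block range U+0600..U+06FF stands in for [:script=arabic:].

```python
import unicodedata

def char_range(first, last):
    """Return the set of characters from first to last, inclusive."""
    return {chr(cp) for cp in range(ord(first), ord(last) + 1)}

def has_canonical_decomposition(ch):
    """True if ch has a decomposition mapping with no <compat>-style tag."""
    mapping = unicodedata.decomposition(ch)
    return bool(mapping) and not mapping.startswith("<")

# Analogue of [a-m w-z]: the union of two explicit ranges.
explicit = char_range("a", "m") | char_range("w", "z")

# ASSUMPTION: the Arabic block stands in for [:script=arabic:].
arabic_block = char_range("\u0600", "\u06FF")
canonical = {ch for ch in arabic_block if has_canonical_decomposition(ch)}

# Analogues of the intersection (&) and difference (-) operators.
arabic_with_decomposition = arabic_block & canonical
arabic_without_decomposition = arabic_block - canonical
```

Here U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE lands in the intersection, while U+0627 ARABIC LETTER ALEF, which has no decomposition, lands in the difference.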
In the online demo, the implementation of UnicodeSet is customized in the following ways.
- Query Use. The UnicodeSet can be typed in, or used as a URL query parameter, such as
the following. Note that in that case, "&" needs to be replaced by "%26".
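The escaping mentioned above can be sketched in Python as follows. The base URL and the parameter name are placeholders chosen for illustration, not the demo's actual address; only the percent-encoding step reflects the text.

```python
from urllib.parse import quote

# ASSUMPTION: placeholder base URL and parameter name, for illustration only.
BASE_URL = "https://example.org/list-unicodeset"

def unicodeset_url(unicode_set, base_url=BASE_URL, param="a"):
    """Build a URL with the UnicodeSet percent-encoded in the query string."""
    # safe="" makes quote() escape every reserved character, including "&".
    return f"{base_url}?{param}={quote(unicode_set, safe='')}"

url = unicodeset_url("[[:script=arabic:]&[:decompositiontype=canonical:]]")
```

This escapes more than the "&" (brackets, colons, and "=" are encoded too), which is harmless in a query value.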
- Regular Expressions. For the name property, regular expressions can be used for
the value, enclosed in /.../. For example in the following expression, the first term will select
all those Unicode characters whose names contain "CJK". The rest of the expression will then subtract
the ideographic characters, showing that these can be used in arbitrary combinations.
Some particularly useful regex features are:
- \b means a word break, ^ means front of the string, and $ means end. So /^DOT\b/ means
the word DOT at the start.
- (?i) means case-insensitive matching.
Caveats:
- The regex uses the standard Java Pattern. In particular, it does not have the extended functions in UnicodeSet, nor is it up-to-date with the latest Unicode. So be aware that you shouldn't depend on properties inside of the /.../ pattern.
- The Unassigned, Surrogate, and Private Use code points are skipped in the Regex comparison,
so [:Block=/Aegean_Numbers/:] returns a different number of characters than [:Block=Aegean_Numbers:],
because it skips Unassigned code points.
- None of the normal "loose matching" is enabled. So [:Block=aegeannumbers:] works, but
[:Block=/aegeannumbers/:] fails -- you have to use [:Block=/Aegean_Numbers/:] or [:Block=/(?i)aegean_numbers/:].
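To give a feel for how name matching with a regular expression behaves, here is a minimal Python sketch. It uses Python's re module and unicodedata as stand-ins for the Java Pattern and the demo's property data, so it is an analogue rather than the demo's implementation; the function name and the default range (the Basic Multilingual Plane) are assumptions.

```python
import re
import unicodedata

def chars_with_name_matching(pattern, start=0x0000, end=0xFFFF):
    """Return the characters in [start, end] whose Unicode name matches pattern."""
    regex = re.compile(pattern)
    matches = []
    for cp in range(start, end + 1):
        # unicodedata has no name for unassigned, surrogate, private-use,
        # or control code points, so those are skipped here.
        name = unicodedata.name(chr(cp), "")
        if name and regex.search(name):
            matches.append(chr(cp))
    return matches

# The word DOT at the start of the name: matches DOT ABOVE, not DOTTED CIRCLE.
dot_first = chars_with_name_matching(r"^DOT\b")

# Case-insensitive matching with (?i).
dot_above = chars_with_name_matching(r"(?i)^dot above$")
```

As in the caveats above, no loose matching is applied: the pattern is compared against the name exactly as it is spelled unless (?i) is given.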
- Casing Properties. Unicode defines a number of string casing functions in Section
3.13 Default Case Algorithms. These string functions can also be applied to single characters.
Warning: the first three sets may be somewhat misleading: isLowercase means that
the character is the same as its lowercase version, which includes all uncased characters. To
get those characters that are cased characters and lowercase, use
[[:isLowercase:]&[:isCased:]]
- The binary testing operations take no argument:
- The string functions are also provided, and require an argument. For example:
Note: The Unassigned, Surrogate, and Private Use code points are skipped in generation of the
sets.
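The warning about isLowercase can be illustrated with a small Python sketch. The functions below are hypothetical approximations modeled on the Section 3.13 definitions, using Python's built-in case mappings and omitting the NFD step that the Unicode definitions apply first; they are not the demo's code.

```python
def is_lowercase(ch):
    """Approximate isLowercase: lowercasing leaves ch unchanged."""
    return ch.lower() == ch

def is_cased(ch):
    """Approximate isCased: at least one case mapping changes ch."""
    return not (ch.lower() == ch and ch.upper() == ch and ch.title() == ch)

def is_cased_lowercase(ch):
    """Analogue of [[:isLowercase:]&[:isCased:]]."""
    return is_lowercase(ch) and is_cased(ch)
```

With these definitions, is_lowercase("1") is true because the digit is uncased, while is_cased_lowercase("1") is false and is_cased_lowercase("a") is true.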
- Normalization Properties. Unicode defines a number of string normalization functions in UAX #15. These string functions can also be applied to single characters.
- The binary testing operations have somewhat odd constructions:
- The string functions are also provided, and require an argument. For example:
Note: The Unassigned, Surrogate, and Private Use code points are skipped in the generation of
the sets.
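A minimal Python sketch of the two kinds of normalization operation, applied to single characters, might look like this. It relies on unicodedata.normalize from the standard library; the wrapper names are hypothetical and do not correspond to the demo's property names.

```python
import unicodedata

def to_nfc(text):
    """String function: return the NFC form of text."""
    return unicodedata.normalize("NFC", text)

def is_normalized(form, text):
    """Binary test: True if text is already in the given normalization form."""
    return unicodedata.normalize(form, text) == text

# U+212B ANGSTROM SIGN normalizes to U+00C5 under NFC.
angstrom_nfc = to_nfc("\u212B")
```

For example, is_normalized("NFC", "\u00C5") is true, but is_normalized("NFD", "\u00C5") is false because the NFD form is "A" followed by U+030A COMBINING RING ABOVE.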
- IDNA Properties. The status of characters with respect to IDNA (internationalized domain
names) can also be determined. The available properties are listed below.
[:idna=output:]
The set of all characters allowed in the output of IDNA. An example is U+00E0 ( à ) LATIN SMALL LETTER A WITH GRAVE.
[:idna=ignored:]
The set of all characters ignored by IDNA on input. That is, these characters are mapped to nothing -- removed -- by NamePrep. An example is:
[:idna=remapped:]
The set of characters remapped to other characters by IDNA (NamePrep). Examples are:
U+00C0 ( À ) LATIN CAPITAL LETTER A WITH GRAVE (remapped to the lowercase version).
U+FF21 ( Ａ ) FULLWIDTH LATIN CAPITAL LETTER A
[:idna=disallowed:]
These are characters disallowed (on the registry side) by IDNA. An example is:
Note: The client side adds characters unassigned in Unicode 3.2, for compatibility. To see just the characters disallowed in Unicode 3.2, you can use [[:idna=disallowed:]&[:age=3.2:]].
To also remove private-use, unassigned, surrogates, and controls, use [[:idna=disallowed:]&[:age=3.2:]-[:c:]].
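The four categories can be roughly approximated in Python with the nameprep function from the standard library's encodings.idna module, which implements the IDNA 2003 NamePrep profile. This is only a sketch: the classifier name is hypothetical, it does not apply the demo's registry-side rules or the Unicode 3.2 unassigned check, and its results will not match the demo's sets in every case.

```python
from encodings.idna import nameprep

def nameprep_status(ch):
    """Roughly classify one character by what NamePrep does to it."""
    try:
        mapped = nameprep(ch)
    except UnicodeError:
        # NamePrep prohibits this character.
        return "disallowed"
    if mapped == "":
        # Mapped to nothing, that is, removed.
        return "ignored"
    if mapped != ch:
        return "remapped"
    return "output"

# U+00C0 and U+FF21 are the remapped examples from the text above.
# U+00AD SOFT HYPHEN and U+0080 are illustrative choices, not taken from the text.
statuses = {ch: nameprep_status(ch) for ch in "\u00E0\u00C0\uFF21\u00AD\u0080"}
```

In this sketch U+00C0 is remapped to U+00E0 and U+FF21 to the ASCII letter "a", in line with the examples listed above.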
Fonts and Display. If you don't have a good set of Unicode fonts (and a modern browser), you may not be able to read some of the characters.
Some suggested fonts that you can add for coverage are:
Noto Fonts site,
Unicode Fonts for Ancient Scripts,
Large, multi-script Unicode fonts.
See also: Unicode Display Problems.
Version 3.9; ICU version: 72.0; Unicode/Emoji version: 15.0;