1.0.32.15: update Unicode data files to Unicode 5.2

We still need to update a small bit of code as well, but at least the explanatory comment now makes it obvious which bits.
gsorbier · Nov 11, 2009 · 9b2b4bc
1 parent 3eae72c
Showing 6 changed files with 2,954 additions and 457 deletions.
diff --git a/NEWS b/NEWS
@@ -7,7 +7,11 @@ changes relative to sbcl-1.0.32:
  * new feature: SB-INTROSPECT:WHO-SPECIALIZES-GENERALLY to get a list of
  definitions for methods specializing on the passed class itself, or on
  subclasses of it.
- * fixes and improvements related to external formats:
+ * fixes and improvements related to Unicode and external formats:
+ ** the Unicode character database has been upgraded to the
+ Unicode 5.2 standard, giving names and properties to a number of new
+ characters, and providing a few extra characters with case
+ transformations.
  ** fix a typo preventing conversion of strings into octet vectors
  in the latin-2 encoding. (reported by Attila Lendvai; launchpad bug
  #471689)

diff --git a/src/code/target-char.lisp b/src/code/target-char.lisp
@@ -162,7 +162,7 @@
 
 ;;;; UCD accessor functions
 
-;;; The first (* 8 206) => 1648 entries in **CHARACTER-DATABASE**
+;;; The first (* 8 215) => 1720 entries in **CHARACTER-DATABASE**
 ;;; contain entries for the distinct character attributes:
 ;;; specifically, indexes into the GC kinds, Bidi kinds, CCC kinds,
 ;;; the decimal digit property, the digit property and the
@@ -189,12 +189,12 @@
 ;;;
 ;;; To look up information about a character, take the high 13 bits of
 ;;; its code point, and index the character database with that and a
-;;; base of 1648 (going past the miscellaneous information[*], so
+;;; base of 1720 (going past the miscellaneous information[*], so
 ;;; treating (a) as the start of the array). This, labelled A, gives
 ;;; us another index into the detailed pages[-], which we can use to
 ;;; look up the details for the character in question: we add the low
 ;;; 8 bits of the character, shifted twice (because we have four-byte
-;;; table entries) to 1024 times the `page' index, with a base of 6000
+;;; table entries) to 1024 times the `page' index, with a base of 6072
 ;;; to skip over everything else. This gets us to point B. If we're
 ;;; after a transformed code point (i.e. an upcase or downcase
 ;;; operation), we can simply read it off now, beginning with an
@@ -208,8 +208,8 @@
 (defun ucd-index (char)
  (let* ((cp (char-code char))
  (cp-high (ash cp -8))
- (page (aref **character-database** (+ 1648 cp-high))))
- (+ 6000 (ash page 10) (ash (ldb (byte 8 0) cp) 2))))
+ (page (aref **character-database** (+ 1720 cp-high))))
+ (+ 6072 (ash page 10) (ash (ldb (byte 8 0) cp) 2))))
 
 (declaim (ftype (sfunction (t) (unsigned-byte 8)) ucd-value-0))
 (defun ucd-value-0 (char)
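
The offsets changed in this hunk follow from the layout described in the comment, and the lookup can be modelled outside SBCL. What follows is a minimal sketch, not SBCL's implementation: the names `*misc-entries*`, `*page-table-entries*`, `*detail-base*` and `sketch-ucd-index` are hypothetical, and the reading of 4352 as one page-table entry per 256-code-point block of the range up to #x10FFFF is an inference from 6072 - 1720 = 4352 (equally 6000 - 1648 = 4352 before this commit). Only the constants 1720 and 6072 and the shift and mask arithmetic come from the diff.

```lisp
;;; Toy model of the index arithmetic in UCD-INDEX.
;;; All names here are hypothetical; they do not exist in SBCL.

;; Miscellaneous attribute entries at the start of the database:
;; (* 8 215) => 1720 after this commit (it was (* 8 206) => 1648).
(defparameter *misc-entries* (* 8 215))

;; Assumption: the page table holds one entry per block of 256 code
;; points, i.e. (ash #x110000 -8) => 4352 entries.
(defparameter *page-table-entries* (ash #x110000 -8))

;; Start of the detailed pages: 1720 + 4352 => 6072.
(defparameter *detail-base* (+ *misc-entries* *page-table-entries*))

(defun sketch-ucd-index (database code-point)
  "Return the offset of CODE-POINT's four-byte entry in DATABASE."
  (let* ((cp-high (ash code-point -8))                    ; high 13 bits
         (page (aref database (+ *misc-entries* cp-high)))) ; point A
    (+ *detail-base*
       (ash page 10)                                 ; 1024 bytes per page
       (ash (ldb (byte 8 0) code-point) 2))))        ; low 8 bits, 4 bytes each
```

With a toy database in which block 0 maps to page 0 and block 1 maps to page 1, the two lookups below land 1024 bytes apart, because both code points share the low byte #x41:

```lisp
(let ((db (make-array (+ *detail-base* (* 2 1024))
                      :element-type '(unsigned-byte 8)
                      :initial-element 0)))
  (setf (aref db (+ *misc-entries* 1)) 1)
  (list (sketch-ucd-index db #x41)
        (sketch-ucd-index db #x141)))
;; => (6332 7356)
```

Deriving the second base from the first, as the sketch does, means a future Unicode update would only need to change the attribute count; whether that suits the real build process is not something this diff establishes.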

diff --git a/tools-for-build/Jamo.txt b/tools-for-build/Jamo.txt
@@ -1,14 +1,14 @@
-# Jamo-5.1.0.txt
-# Date: 2008-03-20, 17:59:00 PDT [KW]
+# Jamo-5.2.0.txt
+# Date: 2009-05-22, 13:02:00 PDT [KW]
 #
 # Unicode Character Database
-# Copyright (c) 1991-2008 Unicode, Inc.
+# Copyright (c) 1991-2009 Unicode, Inc.
 # For terms of use, see https://www.unicode.org/terms_of_use.html
-# For documentation, see UCD.html
+# For documentation, see https://www.unicode.org/reports/tr44/
 #
 # This file defines the Jamo Short Name property.
 #
-# See Section 3.12 of The Unicode Standard, Version 5.0
+# See Section 3.12 of The Unicode Standard, Version 5.2
 # for more information.
 #
 # Each line contains two fields, separated by a semicolon.