lucenenet-dev mailing list archives

From "Van Den Berghe, Vincent" <Vincent.VanDenBer...@bvdinfo.com>
Subject Proposal: UnicodeUtils.ToCharArray implementation
Date Wed, 22 Mar 2017 14:58:48 GMT
When the dust settles, I propose to rewrite UnicodeUtils.ToCharArray as follows:

        public static char[] ToCharArray(int[] codePoints, int offset, int count)
        {
            if (count < 0)
            {
                throw new System.ArgumentException();
            }
            // as a first approximation, assume each codepoint is 1 character
            char[] chars = new char[count];
            int w = 0;
            for (int r = offset, e = offset + count; r < e; ++r)
            {
                int cp = codePoints[r];
                if (cp < 0 || cp > 0x10ffff)
                {
                    throw new System.ArgumentException();
                }
                if (cp < 0x010000)
                {
                    chars[w++] = (char)cp;
                }
                else
                {
                    // Until the first 2-character codepoint, exactly (e - r) slots are
                    // free, which is one too few. Grow before writing, with enough room
                    // to hold 2-character codepoints for the rest of the string. After
                    // that, at least 2 * (e - r) slots are always free, so this resize
                    // happens at most once.
                    if (chars.Length - w <= e - r)
                    {
                        Array.Resize(ref chars, w + (e - r) * 2);
                    }
                    chars[w++] = (char)(LEAD_SURROGATE_OFFSET_ + (cp >> LEAD_SURROGATE_SHIFT_));
                    chars[w++] = (char)(TRAIL_SURROGATE_MIN_VALUE + (cp & TRAIL_SURROGATE_MASK_));
                }
            }
            // resize to the exact length: it's slightly faster to check if the resize is needed
            if (w != chars.Length)
            {
                Array.Resize(ref chars, w);
            }
            return chars;
        }

This avoids exception overhead and at least one heap allocation. If no code point generates
2 characters, the method allocates nothing beyond the result array.
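
The constants used above are not shown in this message. As a minimal sketch, assuming they
carry the standard UTF-16 values (the names and values below are illustrative, not copied
from the Lucene.NET source), the lead/trail arithmetic for a single code point can be
checked against the framework's own char.ConvertFromUtf32:

```csharp
using System;

public static class SurrogateSketch
{
    // Assumed values, chosen to match the standard UTF-16 encoding.
    private const int LeadSurrogateShift = 10;
    private const int TrailSurrogateMask = 0x3FF;
    private const int TrailSurrogateMin = 0xDC00;
    private const int LeadSurrogateMin = 0xD800;
    private const int SupplementaryMin = 0x10000;

    // Folding the "subtract 0x10000" step into the offset gives 0xD7C0.
    private const int LeadSurrogateOffset =
        LeadSurrogateMin - (SupplementaryMin >> LeadSurrogateShift);

    // Both helpers expect a supplementary code point (0x10000..0x10FFFF).
    public static char Lead(int cp) =>
        (char)(LeadSurrogateOffset + (cp >> LeadSurrogateShift));

    public static char Trail(int cp) =>
        (char)(TrailSurrogateMin + (cp & TrailSurrogateMask));

    public static void Main()
    {
        int cp = 0x1F600;
        string expected = char.ConvertFromUtf32(cp);

        // Compare the hand-computed pair with the framework's encoding.
        Console.WriteLine(Lead(cp) == expected[0] && Trail(cp) == expected[1]); // True
        Console.WriteLine("{0:X4} {1:X4}", (int)Lead(cp), (int)Trail(cp));      // D83D DE00
    }
}
```

The trail half can mask cp directly because 0x10000 has its low 10 bits clear, so
subtracting it does not change those bits.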


Vincent

