Wednesday, May 29, 2013

Hunting Firefox Bugs: Hang On Focus

I am just putting this here so that this might show up a bit easier in a Google search. I recently figured out a solution (sort of) to one of these nagging issues that I have been facing for a long time. The issue is that sometimes Firefox gets in a state where it hangs for a few seconds, pegging the CPU, whenever the window comes into focus (i.e. you click on it). This means it is very difficult to switch back and forth between Firefox and another window. This was a mystery for a long while. However, I found these two bug reports today. The earlier bug was reported around two years ago, and they supposedly pushed a fix around 8 months ago, but it is still biting (me) today.

This seems to actually be a limitation of the current implementation of the global menu in Ubuntu. This works via DBUS, which apparently isn't very efficient at populating large menus. Firefox has a bookmarks menu which means that having many entries in your bookmark database will cause Firefox to hang as that menu is populated. This was actually pretty difficult to figure out as the behavior is intermittent. Often Firefox would work as expected, but sometimes this hanging would start and it would persist until reboot of the entire system (restarting Firefox is not enough). I now suspect that this state is triggered by setting a bookmark, but I am unsure.

Anyway, the fix here is to not use bookmarks too heavily. If you delete all of your bookmarks this problem will go away, and this is just what I did (after I saved them to disk). That said, this really puts me, and people who browse the Web like I do, in a predicament.

When I find an article or web page that I am interested in, I cannot usually read it when I find it. So, I like to store it for future consumption. For this purpose, I used to use tabs, but as everybody knows, this will bog your computer down as most browsers attempt to keep them in RAM. Firefox, fairly recently, started releasing them from memory after they haven't been used, which helps matters (unfortunately they made the, IMHO, stupid decision to completely reload on restore rather than cache to disk). But, even with this memory friendly behavior from Firefox, it will still routinely bloat to greater than 1GB of resident RAM and needs to be restarted (presumably so that the tabs will be flushed out of RAM). And before the "Firefox is just making efficient use of memory" people show up, Firefox is not making good use of memory if it causes my computer to hang for 30+ seconds when switching windows that have nothing to do with Firefox (which it does in certain extreme cases), or if it causes Compiz, Xorg, or my entire computer to crash (which happens every once in a while and correlates so strongly with Firefox memory consumption that I have to suspect it as a culprit).

To combat this, a while ago I started categorizing pages I find as I want to read this within a week or I want to read sometime later in the future. The ones I want to read within a week I keep as tabs and the ones that I want to read sometime in the future I set a bookmark to find it easily later. I figured that bookmarks should take practically no resources to hold other than a bit of disk space. With this bug, however, setting bookmarks on interesting pages isn't really a good option either. This means that I now have my bookmarks in a separate HTML and JSON file while I try to come up with a better solution.

I suppose that this isn't an issue with Chromium or Chrome as those don't use the global menu for bookmarks (they have a separate menu right of the address bar).

Thursday, May 23, 2013

Quickly Searching For a String in a Buffer

So this is a bit embarrassing that I have never seen this solution before, but I recently saw this problem in a Coursera lecture.

Given a buffer sequence of length \(N\) and a query sequence of length \(m\), find all occurrences of the query in the buffer.

The most transparent way to solve this problem is to search each starting position in the buffer for the query string. This is clearly \(O(N*m)\) (for small \(m\), that is. It is really around \(O((N-m)*m)\). However, in big \(O\) fashion, we normally say \(O(N)\) and leave it at that).

(defun get-string (i length buffer)
  "Extract the string without any copying or memory use."
  (make-array length
              :element-type 'character
              :displaced-to buffer
              :displaced-index-offset i))

(defun naive-search (query buffer)
  "Find all occurrences of query in buffer."
  (iter (for i :to (- (length buffer) (length query)))
    (when (equal query (get-string i (length query) buffer))
      (collect i))))

Although this all pretty silly, as Common Lisp has a sequence search built in called search. In addition, it also has third party libraries that can do string searches and much more:

;; Can't get much simpler than...
(defun search-search (query buffer)
  "Find all occurrences of query in buffer."
  (iter
    (for i :initially (search query buffer :start2 0)
           :then (search query buffer :start2 (1+ i)))
    (while i)
    (collect i)))

;; If you are using strings, regular expressions are pretty versatile.
(ql:quickload :cl-ppcre)
(defun regex-search (query buffer)
  "Find all occurrences of query in buffer."
  (let (acc)
    (ppcre:do-matches (start end
                       query buffer
                       (reverse acc))
      (push start acc))))

The interesting part of these is that they are probably much smarter than my naive implementation. Besides being highly tuned, they also most likely use parts of previous failed matches in subsequent matches, which means that they might run in \(O(N)\) time (i.e. no \(m\) factor).

The point is that either way I do this it would take \(O(N)\) time to simply search the text for the value in the most naive way possible. In the lecture, however, they stated that you could sort the data and then perform a bisection search. I was taken aback as it seemed to me that you cannot sort a body of text without screwing up the order, which is needed because we are actually interested in matching a sequence of characters in a larger sequence.

The way we can do this, "sort the sequence" and keep the sequence in order, is to sort a different buffer that holds the indexes into the sequence. Sorting takes \(O(N\log N)\), so a single search should actually take longer to compute, but once you have the indexing you can perform subsequent searches (with the same length or shorter queries) in \(O(\log N)\). In order to accommodate this usage pattern, I have included an ad-hoc, manual, partial memoization mechanism. I return the computed sorted indexes (along with the maximum length of a valid query) as a second return value and take them as an optional argument (I use a class here so that the printed value doesn't flood your terminal). If you pass invalid cache parameters, the function detects this and recomputes.

;; Define a container for our cache
(defclass cache ()
  ((max-length :initarg :max-length :accessor max-length)
   (buffer :initarg :buffer :accessor buffer)
   (indexes :initarg :indexes :accessor indexes)))

(defun indexed-search (query buffer &optional cache)
  "Find all occurrences of query in buffer."
  ;; Extract (or compute) the cache
  (destructuring-bind (max-query-length sorted-indexes)
      (if (and cache
               (eq buffer (buffer cache))
               (<= (length query) (max-length cache)))
          (list (max-length cache) (indexes cache))
          (list (length query)
                (stable-sort (iter (for i :to (- (length buffer) (length query)))
                               (collect i :result-type vector))
                             #'string<
                             :key (lambda (i) (get-string i (length query) buffer)))))
    (labels ((bisection (lo hi)
               ;; A simple bisection search
               (let ((mid (floor (+ lo hi) 2)))
                 (cond ((= lo hi) lo) ; we have either found the first match or
                                      ; there are no matches
                       ((string<= query (get-string (aref sorted-indexes mid)
                                                    (length query) buffer))
                        (bisection lo (min mid (- hi 1))))
                       (t (bisection (max mid (+ lo 1)) hi))))))
      ;; Perform the search and return the index and cache
      (values (iter (for i :from (bisection 0 (length sorted-indexes)))
                (while (string= query (get-string (aref sorted-indexes i)
                                                  (length query) buffer)))
                (collect (aref sorted-indexes i)))
              (make-instance 'cache :max-length max-query-length
                                    :buffer buffer
                                    :indexes sorted-indexes)))))

And indeed, after the \(O(N\log N)\) preprocessing, this runs in \(O(\log N)\) time. The timings work out that the initial sort takes a very long time (presumably due to poor performance in string<= for lexical order), but the subsequent queries are basically instantaneous (too fast to measure even for large buffer sequences). So, while you probably don't want to use this for normal searches, you might want to do this if you need to make many queries (of a limited length) on a immutable sequence, and it could result in a huge performance increase.

Well, I guess this just serves as a reminder that I still have a lot of basic stuff to learn. However, after writing this up, I doubt that I will forget it…

Thursday, May 16, 2013

ICFP Contest 2013 Anyone?

As you may have seen on Reddit this morning, Microsoft Research has put up an ICFP placeholder page.  This might seem a bit premature, but I would like to have some Common Lisp friends (new and old) to partake in ICFP this year, so I am getting started early.

The contest seems to be starting August 8th sometime and should run for the next 72 hours.  If you are at all interested in programming Common Lisp on a team ICFP attempt, please comment below, or email, or whatever your favorite method of communication.

This year I have a few new tools that should work better than last year, including a way that spectators can follow along without video streams that are susceptible to crappy resolutions (anyway, I now know someone that works on the YouTube team at Google, so he could probably get the old setup to work better).  I plan to post on those tools later to try and drum up interest as the contest approaches.

Wednesday, May 15, 2013

What is wrong with Python strings?

This is a micro-rant about Python.  If this isn't your thing, please excuse this post.  I have been trying to use Python in order to become more employable (well more employable for the sort of places I would like to work).  Here are a few grievances I have found.  These complaints are, I suspect, actually about the default Python behavior.  That doesn't actually diminish the criticisms much, however, as they are definitely barriers to entry.  For some of these things I have still not been able to find a way around them, which makes them pretty serious.
  1. Python actually only allows ASCII characters?  Really?!?  If you include some odd (or not so odd) character like a lambda, it will choke.  If this doesn't seem like a big deal to you, you have clearly never dealt with any language other than English, nor have you dealt with the various fancy forms of punctuation.  You can instruct Python to accept a different encoding by putting a magic comment at the beginning of your source file that looks like "# coding: <encoding>".  As a reference, the last time this was an issue in any Lisp I tried, was around 5 years ago with CLISP (a fairly out of date Lisp at the time) and it was remedied shortly after.
  2. What is going on with your string type?  If I index into the string type, I can land in the middle of a multi-byte character.  The very fact that this is possible is evidence that this string type is fundamentally screwed up.
  3. Why is the printed representation of a Unicode string unreadable?  Seriously, we have terminals that can handle pretty much anything that you can throw at them, why print a bunch of line noise instead of the actual character?
  4. What is wrong with the way you output in Python?  Why is it that I get an encoding error when I try to pipe or redirect output from my program to another program and a non-ASCII character is encountered?  I have no idea what is going on here, or how to get around it.  This is, quite possibly the biggest hiccup in terms of productivity.  I cannot even send my output to a file or to less in order to carefully review the output.  How hard is it to dump the output of a program into a file?  In the case of pipes, you might think that this is a shortcoming of the pipe itself, or of the program on the other side of the pipe, but this is not the case.  I routinely pipe all sorts of crazy text through pipes and to shell commands without any problems; whatever the issue is, it is rooted in Python.
In Lisp (my usual language of choice), a string is a vector of characters, not ASCII bytes or a vector of data that might be interpreted as characters, which I imagine is the primary shortcoming in Python here.  Including encodings other than ASCII seems to have been an afterthought, an ugly thing that was bolted on to make the language work with text that wasn't ASCII.  Why Python didn't decide at conception to do away with the idea of that a string is a sequence of bytes, I have no idea.  It seems like Python wanted to be a high level language, but stopped short of actually making it there, at least in the way strings are handled.  They seem to have the basic building blocks of a proper string type, but stupidly set the default behavior to only deal with ASCII strings, which is basically backwards, IMO.

People say (and I believe) that Common Lisp is simultaneously the highest level and lowest level programming language you will ever use.  In Lisp you can write code that is so abstracted you have have no idea of what code is actually being passed to the CPU, much less the data-structures or  memory usage.  But, in the same program, you can write code that translates directly to machine instructions in a predictable way (or even put inline C or assembly).  Because of its high level nature, I have never had to worry about strings with odd characters in them; they just work.  And, because of its low level nature, it is simple to convert strings down to ASCII when you have to, which is a very rare need in my experience.  You would think that Python, whose earliest versions where a full 5 years after the most recent standardization of Lisp (coming up on 20 years since then), would have done this better.

I could actually go on and on about things that I perceive Python sucking at, for instance the fact that there is a REPL but no iterative development, or the fact that the error messages I get are basically garbage (however, Lisp is not that much better in this regard), or the fact that you must explicitly have a "return" statement on your functions, and of course, of course, the accursed whitespace dependence (which isn't actually annoying because of the whitespace, but because I actually have to tab around to get the correct indentation level in Emacs).  But those things are probably just matters of taste, something that usually works out with time.  The string handling in Python, however, is basically inexcusable.