<?xml version="1.0" encoding="utf-8" ?>

<rss version="2.0" 
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:admin="http://webns.net/mvcb/"
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
   xmlns:wfw="http://wellformedweb.org/CommentAPI/"
   xmlns:content="http://purl.org/rss/1.0/modules/content/"
   >
<channel>
    <title>Lemma's blog - Other</title>
    <link>http://www.confuego.org/</link>
    <description></description>
    <dc:language>en</dc:language>
    <generator>Serendipity 1.4.1 - http://www.s9y.org/</generator>
    <pubDate>Fri, 16 Jan 2009 18:05:22 GMT</pubDate>

    <image>
        <url>http://www.confuego.org/templates/default/img/s9y_banner_small.png</url>
        <title>RSS: Lemma's blog - Other - </title>
        <link>http://www.confuego.org/</link>
        <width>100</width>
        <height>21</height>
    </image>

<item>
    <title>Improving Bugzilla search speeds</title>
    <link>http://www.confuego.org/archives/13-Improving-Bugzilla-search-speeds.html</link>
            <category>Other</category>
    
    <comments>http://www.confuego.org/archives/13-Improving-Bugzilla-search-speeds.html#comments</comments>
    <wfw:comment>http://www.confuego.org/wfwcomment.php?cid=13</wfw:comment>

    <slash:comments>9</slash:comments>
    <wfw:commentRss>http://www.confuego.org/rss.php?version=2.0&amp;type=comments&amp;cid=13</wfw:commentRss>
    

    <author>nospam@example.com (Michael Leupold)</author>
    <content:encoded>
    &lt;p&gt;The major tool we use in Bugsquad is our beloved &lt;a href=&quot;http://bugs.kde.org&quot;&gt;bugtracker&lt;/a&gt;, currently running a tailored version of &lt;a href=&quot;http://www.bugzilla.org&quot;&gt;Bugzilla&lt;/a&gt; 3.0. It usually works pretty well but has had some problems lately. One of them is that Bugzilla&#039;s performance when searching through the bugs&#039; comments is pretty poor and degrades with the total number of comments stored. This total doesn&#039;t only encompass open bugs but closed ones as well, so performance is steadily declining (we currently have &amp;gt;180000 bugs).&lt;/p&gt;&lt;p&gt;My journey started when I heard someone asking if switching from MySQL to PostgreSQL as the underlying storage layer would help the speed. I decided to dive into it and started doing some simple benchmarks. Unfortunately the results indicated that PostgreSQL would only marginally increase performance. As there are other sites (e.g. &lt;a href=&quot;http://techbase.kde.org&quot;&gt;techbase&lt;/a&gt;) using the same database we&#039;d either have to migrate all of them (if that&#039;s even possible) or live with two database servers running at once. While I don&#039;t have a problem with different sites being handled by different database systems, memory consumption certainly would have been an issue, so I scrapped the idea.&lt;/p&gt;
&lt;h3&gt;The Sphinx full-text search engine&lt;/h3&gt;
&lt;p&gt;While searching for ways to speed things up I discovered a small full-text search engine named &lt;a href=&quot;http://www.sphinxsearch.com&quot;&gt;Sphinx&lt;/a&gt; and decided to give it a try. Sphinx has several features which are pretty useful for our purpose:&lt;br /&gt;
&lt;em&gt;(DISCLAIMER: I&#039;m not an expert on full-text search engines. Some of the features described might be pretty common, some might not be. I don&#039;t know how Sphinx compares to other fts systems.)&lt;/em&gt;
&lt;ul&gt;&lt;li&gt;It&#039;s a complete daemon-based search engine with various language bindings ready to use.&lt;/li&gt;&lt;li&gt;It comes with an indexer that can be used to pull data from MySQL tables into its index.&lt;/li&gt;&lt;li&gt;It allows substring searches with arbitrary infix length.&lt;/li&gt;&lt;li&gt;It has built-in word stemming and also provides other text preprocessors like phoneme conversion.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The first results after constructing my first full-text index were pretty promising, so I started on a proof-of-concept implementation of a Bugzilla extension that uses Sphinx.&lt;/p&gt;
&lt;h3&gt;Extending Bugzilla&lt;/h3&gt;
&lt;p&gt;Unfortunately developing for Bugzilla means that you have to code Perl - a language I had so far always avoided learning because it&#039;s said to be hard to read. While this generally may or may not be true, Bugzilla&#039;s code turned out to be pretty understandable and not too hard to work with. It even provides nice hooks to extend its functionality with custom extensions without even having to touch the core code - not for extending or refining the search mechanism though, so I had to devise my own little hook which I hope will be reviewed and make it upstream one day.&lt;/p&gt;&lt;p&gt;Searching using my extension works like this:
&lt;ul&gt;&lt;li&gt;Bugzilla asks the extension to handle a tuple ($field,$operator,$value) where $field is basically the database field to comb through, $operator is the match operator to apply (e.g. &lt;i&gt;equals&lt;/i&gt; or &lt;i&gt;contains all the words&lt;/i&gt;) and $value is the value the user is looking for.&lt;/li&gt;&lt;li&gt;Additionally the extension is told what Bugzilla itself would do to perform the search (I&#039;ll call this the &lt;i&gt;original query&lt;/i&gt;).&lt;/li&gt;&lt;li&gt;The extension then decides if it can handle the search and - if so - queries the Sphinx daemon for bugs matching what the user searches for (otherwise it just falls back to the original query).&lt;/li&gt;&lt;li&gt;The original query is then either replaced or enriched with Sphinx&#039;s results and sent back to Bugzilla.&lt;/li&gt;&lt;/ul&gt;&lt;/p&gt;
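&lt;p&gt;The steps above can be sketched roughly as follows. This is only an illustration in Python, not the Perl code of the extension: the function names, the sets of supported fields and operators, the stubbed Sphinx lookup and the generated SQL fragment are all assumptions made for the example. It shows the &lt;i&gt;replace&lt;/i&gt; case only.&lt;/p&gt;

```python
# Hypothetical sketch of the dispatch logic described above.
# None of these names come from Bugzilla or Sphinx.
SPHINX_FIELDS = {"short_desc", "longdesc"}
SPHINX_OPERATORS = {"substring", "anywordsubstr", "allwordssubstr"}


def fake_sphinx_search(field, operator, value):
    """Stand-in for a query to the Sphinx daemon; returns matching bug ids."""
    return [42, 13]


def handle_search(field, operator, value, original_query, search=fake_sphinx_search):
    """Return the query fragment to use for one (field, operator, value) tuple."""
    if field not in SPHINX_FIELDS or operator not in SPHINX_OPERATORS:
        # The extension cannot handle this search: fall back to the original query.
        return original_query
    bug_ids = search(field, operator, value)
    if not bug_ids:
        # No hits: replace the original query with a condition that matches nothing.
        return "1 = 0"
    id_list = ", ".join(str(bug_id) for bug_id in sorted(bug_ids))
    # Replace the original query with a lookup on the ids the search returned.
    return "bugs.bug_id IN ({})".format(id_list)


handled = handle_search("short_desc", "substring", "crash", "ORIGINAL")
fallback = handle_search("priority", "equals", "P1", "ORIGINAL")
```

&lt;p&gt;Enriching instead of replacing would mean combining the id condition with the original query rather than returning it on its own.&lt;/p&gt;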
&lt;h3&gt;Benchmarking&lt;/h3&gt;
&lt;p&gt;With this basic implementation in place it was time to do some benchmarking. While reading this, please keep the following in mind:
&lt;ul&gt;&lt;li&gt;&lt;em&gt;I&#039;m not an expert on tuning MySQL. I used &lt;a href=&quot;http://mysqltuner.com&quot;&gt;mysqltuner&lt;/a&gt; to get results that are as good and fair as possible. I ended up giving around 5G of memory to MySQL, distributed between innodb_buffer_pool_size, the query cache and the key_buffer_size.&lt;/em&gt;&lt;/li&gt;&lt;li&gt;&lt;em&gt;I know MySQL has its own FULLTEXT index - I didn&#039;t use it though.&lt;/em&gt;&lt;/li&gt;&lt;li&gt;&lt;em&gt;I certainly don&#039;t want to claim MySQL is slow.&lt;/em&gt;&lt;/li&gt;&lt;li&gt;&lt;em&gt;Tests were performed on my Q6600 desktop machine WHILE I was running the desktop. While I tried to keep my resource usage down, the benchmark&#039;s results might not be representative.&lt;/em&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/p&gt;
&lt;h4&gt;The sample set&lt;/h4&gt;
&lt;p&gt;As I didn&#039;t have exact numbers about the amount of data I constructed a sample set consisting of:
&lt;ul&gt;&lt;li&gt;200 products, 5 components and 5 versions each&lt;/li&gt;&lt;li&gt;200.000 bugs&lt;/li&gt;&lt;li&gt;2.200.189 comments uniformly distributed among them&lt;/li&gt;&lt;li&gt;Short descriptions randomly generated from /usr/share/dict/british-english-huge with a uniformly distributed length of 5-20 words&lt;/li&gt;&lt;li&gt;Comments also randomly generated with a uniformly distributed length of 20-200 words&lt;/li&gt;&lt;/ul&gt;&lt;/p&gt;
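&lt;p&gt;Generating such random texts can be sketched like this. It is a minimal Python illustration, not the script I used: the tiny word list stands in for the dictionary file and the seed and function name are made up.&lt;/p&gt;

```python
import random


def random_text(words, min_words, max_words, rng):
    """Join a uniformly distributed number of randomly chosen words."""
    length = rng.randint(min_words, max_words)  # both bounds are inclusive
    return " ".join(rng.choice(words) for _ in range(length))


# Assumption: a short stand-in list instead of the real dictionary file.
words = ["crash", "plasma", "konqueror", "toolbar", "widget", "session"]
rng = random.Random(2009)  # arbitrary fixed seed so the data set is reproducible

summary = random_text(words, 5, 20, rng)    # short description: 5-20 words
comment = random_text(words, 20, 200, rng)  # comment: 20-200 words
```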
&lt;h4&gt;The benchmark method&lt;/h4&gt;
&lt;p&gt;I performed tests on the following search types: &lt;ul&gt;&lt;li&gt;&lt;i&gt;substring&lt;/i&gt; - Simple substring search&lt;/li&gt;&lt;li&gt;&lt;i&gt;anywordsubstr&lt;/i&gt; - Substring search trying to match any of several search strings&lt;/li&gt;&lt;li&gt;&lt;i&gt;allwordssubstr&lt;/i&gt; - Substring search trying to match all of several search strings&lt;/li&gt;&lt;/ul&gt;I performed each test 20 times in a row, using 1-3 random words as search terms (using the same random seed for bare Bugzilla and Bugzilla/Sphinx searches). Searches were performed by sending the queries using HTTP. To strip template processing times I introduced a Bugzilla template that returns nothing but the bare number of hits.&lt;/p&gt;&lt;p&gt;&lt;em&gt;Please keep in mind that the results are only meant to indicate the difference in the overall level of performance. As the search terms were completely random (and usually had a lot of search results) they won&#039;t tell you how searches you usually do on Bugzilla will perform. Furthermore even though no list of bugs is generated, search speeds are related to the number of rows found and returned by the database. The benchmark doesn&#039;t take this into account.&lt;/em&gt;&lt;/p&gt;
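&lt;p&gt;The method boils down to something like the following sketch. It is a simplified Python illustration under a few assumptions: the search is passed in as a plain function instead of being sent over HTTP, and the word list, seed and names are invented for the example.&lt;/p&gt;

```python
import random
import statistics
import time


def pick_terms(words, rng):
    """Pick 1-3 distinct random words as search terms."""
    return rng.sample(words, rng.randint(1, 3))


def benchmark(search, words, seed, runs=20):
    """Time `search` on `runs` random queries; return (mean, median) in ms."""
    rng = random.Random(seed)  # same seed means the same sequence of queries
    timings_ms = []
    for _ in range(runs):
        terms = pick_terms(words, rng)
        start = time.perf_counter()
        search(terms)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings_ms), statistics.median(timings_ms)


words = ["crash", "plasma", "konqueror", "toolbar", "widget", "session"]
queries_plain = []
queries_sphinx = []
# The two "searches" here only record their queries; real ones would hit Bugzilla.
benchmark(queries_plain.append, words, seed=42)
benchmark(queries_sphinx.append, words, seed=42)
```

&lt;p&gt;Because both runs use the same seed, both setups are measured on exactly the same sequence of search terms.&lt;/p&gt;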
&lt;h4&gt;Results: Comparing plain Bugzilla and Bugzilla/Sphinx&lt;/h4&gt;
&lt;p&gt;The comparison was done using one single index per table.&lt;/p&gt;
&lt;!-- s9ymdb:4 --&gt;&lt;img class=&quot;serendipity_image_center&quot; width=&quot;1039&quot; height=&quot;621&quot; style=&quot;border: 0px; padding-left: 5px; padding-right: 5px;&quot; src=&quot;http://www.confuego.org/uploads/charts/comparision-summary.png&quot; alt=&quot;&quot; /&gt;
&lt;p&gt;This chart shows the times a single search on the bugs&#039; summary took (in ms). &lt;i&gt;med_std&lt;/i&gt; shows the average time a regular Bugzilla search took, &lt;i&gt;med_sphinx&lt;/i&gt; shows the average time a search using the Sphinx extension took. Overall searching summaries using Sphinx took around half the time of a regular Bugzilla search (47%) with 1 word substring searches showing the smallest drop (71%) and 3 word anywordsubstr searches showing the biggest (35%).&lt;/p&gt;
&lt;!-- s9ymdb:5 --&gt;&lt;img class=&quot;serendipity_image_center&quot; width=&quot;1033&quot; height=&quot;623&quot; style=&quot;border: 0px; padding-left: 5px; padding-right: 5px;&quot; src=&quot;http://www.confuego.org/uploads/charts/comparison-comment.png&quot; alt=&quot;&quot; /&gt;
&lt;p&gt;This chart shows the times searching the bugs&#039; comments took (in ms). Not surprisingly the drop is quite big here with Sphinx only taking around 3% of the time regular searches do. Again the smallest drop can be witnessed on 1 word substring searches (7%), the other searches rank between 2% and 4%.&lt;/p&gt;
&lt;p&gt;To sum up, Sphinx combined with Bugzilla seems to be a good way to increase performance. Please note that I didn&#039;t measure the resources taken. That&#039;s probably hard to achieve on a regular desktop PC as Sphinx seems to rely on disk/system caches instead of caching data itself. A quick look suggests that runtime resource consumption is a fair bit lower with Sphinx though.&lt;/p&gt;
&lt;h4&gt;Results: Scalability&lt;/h4&gt;
&lt;p&gt;To see how well my extension scales with the number of concurrent searches performed I did another benchmark. I ran the same queries as before but started 1-10 of them at the same time, doing 20 test runs per setup. The numbers represent the average time taken per query (please keep in mind that due to the parallel nature of the test the time taken to perform all &lt;i&gt;n&lt;/i&gt; searches of a run doesn&#039;t equal &lt;i&gt;n &amp;lowast; average&lt;/i&gt;).&lt;/p&gt;
&lt;!-- s9ymdb:6 --&gt;&lt;img class=&quot;serendipity_image_center&quot; width=&quot;1029&quot; height=&quot;713&quot; style=&quot;border: 0px; padding-left: 5px; padding-right: 5px;&quot; src=&quot;http://www.confuego.org/uploads/charts/scalability.png&quot; alt=&quot;&quot; /&gt;
&lt;p&gt;As you can see the search extension seems to scale well. The time increment seems to be linear but I didn&#039;t do any tests to confirm this. The small spikes you see now and then are mostly due to one single search taking a little longer - as I said I didn&#039;t stay away from my machine all of the time.&lt;/p&gt;&lt;p&gt;When running a lot of searches in parallel, the Sphinx search daemon performs best if you partition your index into several parts. It will then run one search thread per index (indexes can also be distributed to several servers to improve performance). That&#039;s exactly what I did for these scalability tests, using 4 indexes on 4 CPU cores.&lt;/p&gt;
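&lt;p&gt;The idea behind partitioning can be illustrated with a toy example. This is not how Sphinx works internally, just a Python sketch of the principle: the documents are split into several parts, each part is searched by its own worker and the hits are merged. All names and the sample data are made up.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def partition(documents, parts):
    """Split a {doc_id: text} mapping into `parts` shards by id."""
    shards = [dict() for _ in range(parts)]
    for doc_id, text in documents.items():
        shards[doc_id % parts][doc_id] = text
    return shards


def search_shard(shard, needle):
    """Naive substring search within one shard."""
    return [doc_id for doc_id, text in shard.items() if needle in text]


def search_all(shards, needle):
    """Search every shard with its own worker thread and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = list(pool.map(search_shard, shards, [needle] * len(shards)))
    return sorted(doc_id for hits in partial for doc_id in hits)


documents = {
    1: "konqueror crashes on start",
    2: "plasma widget missing",
    3: "crash when opening toolbar",
    4: "session not restored",
}
shards = partition(documents, 4)  # 4 parts, like the 4 indexes used in the test
hits = search_all(shards, "crash")
```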
&lt;h3&gt;Peculiarities, additional benefits and drawbacks&lt;/h3&gt;
&lt;p&gt;In addition to being faster, Sphinx has some peculiarities and extra advantages:&lt;/p&gt;&lt;p&gt;It&#039;s important to note that the Sphinx extension doesn&#039;t always return the same number of search results as a vanilla Bugzilla would. This is due to stripping of certain characters and treating accented characters the same way. In most situations this is beneficial because it ignores common spelling mistakes like inserted apostrophes (&amp;acute;) for plural forms.&lt;/p&gt;&lt;p&gt;Unlike in MySQL there&#039;s currently no way to push data into the fulltext index when it changes. Reindexing has to be scheduled. Fortunately the reindexing time can be cut by introducing multi-level delta indexes. However there will always be a lag between the time a summary or comment is entered and its availability for search.&lt;/p&gt;&lt;p&gt;Sphinx indexes (especially if indexed for substring search) take up a considerable amount of disk space. The sample implementation uses one to three indexes (depending on which searches you want to enable):&lt;ul&gt;&lt;li&gt;bugs.short_desc - 521MB&lt;/li&gt;&lt;li&gt;longdescs.thetext - 74297MB&lt;/li&gt;&lt;li&gt;bugs_fulltext.&amp;lowast; - 6095MB&lt;/li&gt;&lt;/ul&gt;However trading disk space for performance might be fair under many circumstances.&lt;/p&gt;
&lt;p&gt;To sum up, integrating an external full-text search engine into Bugzilla seems to be quite worthwhile. I hope I&#039;ll get this fully working and that our Bugzilla&#039;s admins still have a little disk space left &lt;img src=&quot;http://www.confuego.org/templates/default/img/emoticons/smile.png&quot; alt=&quot;:-)&quot; style=&quot;display: inline; vertical-align: bottom;&quot; class=&quot;emoticon&quot; /&gt;&lt;/p&gt; 
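&lt;p&gt;The delta index idea and the lag it leaves can be shown with a toy model. This is a Python sketch of the principle only, not Sphinx configuration or code: a big main index is rebuilt rarely, a small delta index covering everything added since then is rebuilt often, and a search consults both. The class and method names are invented.&lt;/p&gt;

```python
class MainPlusDeltaIndex:
    """Toy model of a rarely rebuilt main index plus a frequently rebuilt delta."""

    def __init__(self):
        self.main = {}
        self.delta = {}

    def rebuild_main(self, table):
        """Expensive: index every row and start with an empty delta."""
        self.main = dict(table)
        self.delta = {}

    def rebuild_delta(self, table):
        """Cheap: index only the rows the main index does not know yet."""
        self.delta = {
            doc_id: text for doc_id, text in table.items() if doc_id not in self.main
        }

    def search(self, needle):
        """Substring search over both indexes."""
        hits = [doc_id for doc_id, text in self.main.items() if needle in text]
        hits += [doc_id for doc_id, text in self.delta.items() if needle in text]
        return sorted(hits)


table = {1: "konqueror crashes on start", 2: "plasma widget missing"}
index = MainPlusDeltaIndex()
index.rebuild_main(table)

table[3] = "kmail crashes when sending"  # a new comment arrives
before = index.search("crashes")         # the new comment is not searchable yet
index.rebuild_delta(table)               # the scheduled delta reindex runs
after = index.search("crashes")          # now it is found
```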
    </content:encoded>

    <pubDate>Thu, 15 Jan 2009 23:59:08 +0100</pubDate>
    <guid isPermaLink="false">http://www.confuego.org/archives/13-guid.html</guid>
    
</item>

</channel>
</rss>