Magento - Apache Solr Integration - Part III (indexing custom data)

In our previous post, we showed how to make Solr handle the default Magento searches (catalog, products, etc.).
Now we'll go one step further: what if we want to have that search power with custom things? 

Indexing custom information in Solr from Magento

Let's clarify what we are talking about here, putting it in the context of one specific example: 

In this Magento store, we have a custom entity called Charity. Every time a customer buys something, a percentage of the sale goes to one specific charity (the user decides to which one).
The list of charities is huge: about 2 million. I kid you not.

The charity data is kept in a new table in our Magento schema. We have implemented an extension to manage the CRUD operations for it, that is, adding new charities, updating them, deleting them, etc. (of course, having 2M charities there required an initial mass import!).

[Image: Menu_014_0.png]

We are not going to talk about how we created the custom extension to handle the charities CRUD. That's the usual Magento stuff. What we are going to show is how to put this data in Solr: how to add these elements to the Solr index, so we can craft a great search mechanism for our customers.

We want the charity selection process to be easy and quick. The search must guide customers and help them find the charity they want to donate to.

Understanding what we need to do

In a nutshell, we have to put the charity data in Solr, and we need to keep that data in sync. The scenarios are:

  • Whenever a charity is added to Magento, we need to add it to the Solr index as well.
  • Whenever a charity is updated in Magento, it must be updated in Solr.
  • When a charity is deleted in Magento, we need to get it deleted from Solr too.
  • Finally, we also need a way to index all charities together (a mass indexing feature). 

Disclaimer: to keep this post short, we'll take a few shortcuts. First, we won't show the solutions for all these scenarios. We're not holding anything back: the techniques we are going to show here are exactly the ones needed for the other cases. We'll pick the final scenario for this post, because it is the one that involves the most pieces. For the others, I'll just briefly comment on what we have done.

Solr Documents

Solr stores the information in documents. Everything we send from Magento ends up being an XML document with the information. The things that you can include in such documents are defined in the schema.xml file.
If you open the one that the Magento team prepared for us and look for the <fields> section, you should find something like this (it's below a bunch of comments; if you didn't modify the file, the line numbers should help you locate the exact position):

[Screenshot Selection_015_0.png: the <fields> section of schema.xml]

These are the fields that our Solr document could have. Check out some magentisms there, like 'store_id', 'categories', 'in_stock', 'sku'. You may also notice that not all fields are required, and that they have different types (string, tfloat, textTight_en, etc). Types are very important: they define how the fields will be processed. Before we take a look into that, let's check a set of very interesting fields:

[Screenshot Selection_016.png: the dynamic field definitions in schema.xml]

Dynamic fields allow us to add, well, dynamic fields to our documents :) (you'll find a better explanation here).
This means that, for example, if we add a field called "charity_name_en" to our Solr document, Solr will treat it as an indexed value of type "text_en". OK, let's see what this type thing is about.
In this very same schema.xml file, you'll find the type definitions. Let's take a look at the "text_en" one:

[Screenshot Selection_017.png: the "text_en" type definition in schema.xml]

As you can see, the "text_en" fields have quite some processing associated to them. Similar configurations are done for each type definition.
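Going back to the dynamic fields for a moment, here is a minimal sketch in plain PHP of how a name such as "charity_name_en" could be resolved to a type. Solr does this resolution by itself; the code only illustrates the idea. The pattern table is an assumption: only the "*_en" to "text_en" mapping comes from the example above, and resolveDynamicFieldType is a hypothetical helper, not part of Magento or Solr.

```php
<?php
// Hypothetical pattern table. Only "*_en" => "text_en" is taken from this post;
// check the dynamic field entries in your own schema.xml for the real list.
$dynamicFields = array(
    '*_en' => 'text_en',
);

/**
 * Return the type of the first dynamic field pattern matching $fieldName,
 * or null when no pattern matches.
 */
function resolveDynamicFieldType($fieldName, array $dynamicFields) {
    foreach ($dynamicFields as $pattern => $type) {
        // Turn the glob-style pattern into an anchored regular expression.
        $regex = '/^' . str_replace('\*', '.*', preg_quote($pattern, '/')) . '$/';
        if (preg_match($regex, $fieldName)) {
            return $type;
        }
    }
    return null;
}

var_dump(resolveDynamicFieldType('charity_name_en', $dynamicFields)); // string(7) "text_en"
var_dump(resolveDynamicFieldType('charity_name', $dynamicFields));    // NULL
```

A field name that matches no pattern and is not declared explicitly in the schema has no type to fall back on, which is why we stick to the "_en" suffix for our charity fields.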

OK, we need to send our charity data from Magento to Solr in the form of documents. How do we do that? Do we need to send the data in XML form?
Solr needs the data in XML. But no, we don't need to assemble XML documents ourselves in order to send the data... lucky us, the Magento guys have prepared some good services to make our life simpler.
We can use the tools they have prepared for us. Let's start doing so.
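For the curious, this is roughly what ends up being posted. It is a minimal sketch in plain PHP, assuming Solr's standard XML update format (an <add> element wrapping <doc> elements, each made of <field> entries). buildAddXml is a hypothetical helper written only for this illustration, and the sample charity is made up; the Magento client does this work for us.

```php
<?php
/**
 * Build a Solr <add> message from a list of documents, where each
 * document is an associative array of field name => value.
 * Hypothetical helper, for illustration only.
 */
function buildAddXml(array $documents) {
    $xml = '<add>';
    foreach ($documents as $fields) {
        $xml .= '<doc>';
        foreach ($fields as $name => $value) {
            // Escape names and values so characters like & and < stay valid XML.
            $xml .= '<field name="' . htmlspecialchars((string) $name, ENT_QUOTES, 'UTF-8') . '">'
                  . htmlspecialchars((string) $value, ENT_NOQUOTES, 'UTF-8')
                  . '</field>';
        }
        $xml .= '</doc>';
    }
    return $xml . '</add>';
}

// Made-up sample data.
echo buildAddXml(array(
    array('id' => 1, 'charity_name_en' => 'Cats & Dogs Shelter'),
));
// <add><doc><field name="id">1</field><field name="charity_name_en">Cats &amp; Dogs Shelter</field></doc></add>
```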

Charities Mass reindex implementation

Let's make this mass reindex straightforward: I'm putting it into a controller action, so we can invoke it directly from the browser. Several things in this post are simplified to keep them short and didactic (our point here is to show how to communicate with Solr).

This is the action:

    /**
     * Create the index with the full list of charities.
     */
    public function reindexAllCharitiesAction() {

        $result = $this->getAllCharities();

        $charitiesDocs = array();
        $solrClient = $this->getSolrClient();

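        // microtime(true) returns a float in seconds; plain microtime() returns a string.
        $initTime = microtime(true);

        while ($charity = $result->fetch()) {

            $charitiesDocs[] = $this->getCharitySolrDocument($charity);

            // Send the documents in batches of 10K to keep memory usage bounded.
            if (count($charitiesDocs) == 10000) {
                $this->addDocumentsToSolr($solrClient, $charitiesDocs);
                $charitiesDocs = array();
            }
        }

        // Send whatever is left in the last, partially filled batch.
        if (count($charitiesDocs) > 0) {
            $this->addDocumentsToSolr($solrClient, $charitiesDocs);
        }

        $finalTime = microtime(true);
        $totalTime = round(($finalTime - $initTime) * 1000); // seconds to milliseconds

        echo "Total indexing time: " . $totalTime . "ms";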
    }

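Not a big function, is it? The interesting parts are the calls to getAllCharities, getSolrClient, getCharitySolrDocument and addDocumentsToSolr. Let's go through them one by one.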

Getting all charities

We get the information about charities through the getAllCharities function, which is not done 'The Magento Way' (that is, using proper collections), again to keep it plain for this post:

    /**
     * Get the full list of charities in the DB.
     * @return Zend_Db_Statement_Interface Statement to fetch the rows from, one at a time
     */
    protected function getAllCharities() {

        $db = Mage::getSingleton('core/resource')->getConnection('core_read');

        $query = 'select * from gtg_charities';
        return $db->query($query);
    }

Dirty and simple. We got the handle to our entire Charity information, and that's it.

Getting Solr client

Things start getting interesting. We get an instance of the Magento Solr Client this way:

    /**
     * Connect to Solr Client by specified options that will be merged with default
     *
     * @param array $options
     * @return Apache_Solr_Service
     */
    protected function getSolrClient($options = array()) {
        $helper = Mage::helper('enterprise_search');
        $def_options = array(
            'hostname' => $helper->getSolrConfigData('server_hostname'),
            'login' => $helper->getSolrConfigData('server_username'),
            'password' => $helper->getSolrConfigData('server_password'),
            'port' => $helper->getSolrConfigData('server_port'),
            'timeout' => $helper->getSolrConfigData('server_timeout'),
            'path' => $helper->getSolrConfigData('server_path')
        );
        $options = array_merge($def_options, $options);

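        // Default to null so the return below is defined even if the client cannot be created.
        $client = null;
        try {
            $client = Mage::getSingleton('enterprise_search/client_solr', $options);
        } catch (Exception $e) {
            Mage::logException($e);
        }

        return $client;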
    }

This is the key to the interaction between Magento and Solr. It provides us with all the plumbing to manage our documents in the index. Both the helper and the client objects are part of what the Magento team has prepared for us.
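The array_merge call is what lets a caller override individual connection settings while keeping the rest. A small sketch of that behaviour in plain PHP, using made-up values in place of what getSolrConfigData would return:

```php
<?php
// Made-up defaults standing in for the values read from the Magento config.
$def_options = array(
    'hostname' => 'localhost',
    'port'     => 8983,
    'timeout'  => 15,
);

// For string keys, array_merge lets the later array win, so the caller's
// options replace the matching defaults and everything else is kept.
$options = array_merge($def_options, array('timeout' => 60));

print_r($options); // hostname and port unchanged, timeout is now 60
```

This would allow, for instance, a longer timeout for the mass reindex without touching the store configuration.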

Creating Solr documents

The next interesting thing in our main action is the call to the getCharitySolrDocument function. We call it for each charity in our DB result and get a proper Solr document in return. How do we create the Solr documents? This way:

    /**
     * Prepare the Solr document from the data we fetch from
     * the DB.
     * @param array $info One row of the charities table
     * @return Apache_Solr_Document
     */
    protected function getCharitySolrDocument($info) {

        $doc = new Apache_Solr_Document();
        $doc->addField("id", $info["id"]);
        $doc->addField("unique", $info["id"]);
        $doc->addField("charity_name_en", $info["name"]);
        $doc->addField("charity_category_en", $info["category_code"]);
        $doc->addField("charity_state_en", $info["state"]);

        return $doc;
    }

We are creating an instance of the Apache_Solr_Document class (that is also part of the Magento-Solr integration), and adding the fields we want to it:

  • id
  • unique
  • charity_name_en
  • charity_category_en
  • charity_state_en

These fields will be later used in our queries. 

Adding documents to Solr and committing changes

We are putting all those documents together, in groups of 10K. Every time a group is completed, we are calling the addDocumentsToSolr function, and starting a new batch.
This is it:

    /**
     * Add the documents to the Solr index and commit them.
     * @param Apache_Solr_Service $solrClient
     * @param Apache_Solr_Document[] $documents
     */
    protected function addDocumentsToSolr($solrClient, $documents) {
        $solrClient->addDocuments($documents);
        $solrClient->commit();
    }

In that function, we send the documents to Solr, and execute a commit.

Before moving away from the reindexAllCharitiesAction function, I would like to comment on that 10K magic number: you can send the documents to Solr either one by one or in batches. Batches are faster than individual posts, but consume more memory. You will have to find the right balance for your conditions.
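One way to experiment with that balance is to make the batch size a parameter. Below is a minimal sketch in plain PHP, with no Magento or Solr involved: processInBatches is a hypothetical helper, and the callback stands in for addDocumentsToSolr.

```php
<?php
/**
 * Feed $items to $flush in groups of at most $batchSize and
 * return how many times $flush was called.
 * Hypothetical helper, for illustration only.
 */
function processInBatches($items, $batchSize, $flush) {
    $batch = array();
    $flushes = 0;
    foreach ($items as $item) {
        $batch[] = $item;
        if (count($batch) >= $batchSize) {
            call_user_func($flush, $batch);
            $flushes++;
            $batch = array();
        }
    }
    // Do not forget the last, partially filled batch.
    if (count($batch) > 0) {
        call_user_func($flush, $batch);
        $flushes++;
    }
    return $flushes;
}

$sizes = array();
$flushes = processInBatches(range(1, 25), 10, function ($batch) use (&$sizes) {
    $sizes[] = count($batch); // a real callback would send the batch to Solr
});

echo $flushes . ' flushes, sizes: ' . implode(', ', $sizes); // 3 flushes, sizes: 10, 10, 5
```

With the size exposed like this, you can time a few runs with different values and keep the one that fits your memory limit.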

What about the other scenarios?

If you followed the post in detail, you probably identified that the other scenarios are easy to implement. I'm providing a few hints here:

  • Magento Solr Client provides the addDocuments and addDocument functions (the latter receives just one document, instead of an array of documents).
  • Adding and updating documents in Solr is done the same way. From Magento, that means that we can use the same addDocument (or addDocuments) method for both cases.
  • Magento Solr Client provides several ways of deleting documents from the index: deleteById, deleteByMultipleIds and deleteByQuery. In our case, deleteById was precisely what we needed.
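Put together, the single-charity scenarios could look like the sketch below. It is written against a small stand-in class so that it runs on its own: FakeSolrClient, indexCharity and removeCharity are hypothetical names. The stand-in only mimics the addDocument, deleteById and commit methods, and a plain array takes the place of the Apache_Solr_Document.

```php
<?php
// Stand-in for the Magento Solr client: keeps documents in an array keyed by id.
class FakeSolrClient {
    public $index = array();
    private $pending = array();

    public function addDocument(array $doc) {
        $this->pending[] = array('add', $doc);
    }

    public function deleteById($id) {
        $this->pending[] = array('delete', $id);
    }

    // Apply the pending operations, in order, to the visible index.
    public function commit() {
        foreach ($this->pending as $op) {
            if ($op[0] === 'add') {
                // A document with an existing id replaces the old one (add == update).
                $this->index[$op[1]['id']] = $op[1];
            } else {
                unset($this->index[$op[1]]);
            }
        }
        $this->pending = array();
    }
}

// To be called after a charity is created or updated in Magento.
function indexCharity($client, array $charity) {
    $client->addDocument(array(
        'id' => $charity['id'],
        'charity_name_en' => $charity['name'],
    ));
    $client->commit();
}

// To be called after a charity is deleted in Magento.
function removeCharity($client, $charityId) {
    $client->deleteById($charityId);
    $client->commit();
}

$client = new FakeSolrClient();
indexCharity($client, array('id' => 42, 'name' => 'Old name'));
indexCharity($client, array('id' => 42, 'name' => 'New name')); // update
echo $client->index[42]['charity_name_en'] . "\n"; // New name
removeCharity($client, 42);
echo count($client->index) . "\n"; // 0
```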

I hope this post was useful for you!

Don't miss the next post in the series: Magento - Apache Solr Integration - Part IV (Ajax Search Form). We put the cherry on top of the cake there, actually letting the user enjoy our new search super powers. :)

Author

Managing Partner
Aldo works as a general mentor for the development teams, keeping in direct contact with programming and design.

Comments

I plan on trying out this approach with some upcoming Solr work for a project. Thanks for the info!

Everything appears to run as expected, but doesn't update the index, does something else need to happen after the commit?
