Category Archives: Lessons Learned

cm:name – Limits of sorting

We have implemented a compliance management system on the basis of Alfresco Share 3.2.2 for one of our customers. In addition to contract and document templates, organisational structures and complex workflows to comply with review, approval and documentation processes, this system also manages more than 500.000 base data record sets. The latter are modelled as an abstract content type with aspects grouping subsets of properties, and are regularly imported from / synchronized with an external data source. The records reside in the default ContentStore “workspace://SpacesStore” as 500.000 objects with about 20 properties are too few to expect a noticable impact on the performance of the platform as a whole.

Despite our expectations, a problem based on the Alfresco-specifics of sorting Lucene-searches was observed. Since the content type uses cm:cmobject as the parent type, every object inherits the property cm:name which we map to the unique key of the associated record. The first import added more than 500.000 entries to the Lucene index with fully distinct values for the field @cm:name, causing a noticable drop in the performance of the Share document library navigation. We observed a base overhead of 3-5 s for every search sorting on @cm:name even before the actual Lucene search startet processing. As eery navigation within the Share document library executes a sorting search in doclist.get.js the entire application is affected beyond the users tolerance levels.

What causes Alfresco to perform this badly considering a (presumably) small data set? From a technical point of view two main reasons can be identified:

  1. All values of the field to be sorted will be loaded from Lucene into memory for pre-sorting (for us this means more than 800.000 distinct values for an usually less than 50 results to sort).
  2. The internal Lucene FieldCache cannot be used to optimise repeated queries. Each search makes use of a unique IndexReader wrapper-instance due to multi-layered faceting – the FieldCache on the other hand is contractually obliged to only return previously loaded field values for the identical instance. This means that field values are always loaded directly from the index. (Those with time and curiosity at hand may inspect the cache using a Java debugger and will notice that the necessary data would be available several times over but can not be accessed.)

The magnitude of the performance impact sclaes with the I/O performance of the data volume used for the index. My personal development laptop which includes a solid state drive usually offers better performance than customers are willing to pay for in their productive machines. Thus I only have to suffer 1 – 2 s degradation, but intensive use of the navigation will swiftly lead to a bad impression on users.

What solutions / concepts are there to addres these performance problems for sorting searches?

  • Large amounts of base data should be stored in separate ContentStores, which automatically use a separate index. This is possible only if there are either no or just simple hierarchial relationships with other data sets to consider.
  • Metadata for sorting should be mapped to individual, business specific properties if at possible. When standard properties are used, sorting performance side effects may be incurred involuntarily when large record sets reuse the same property.
  • Searching over smaller subsets may in extreme cases be faster using sorting (and paging) implemented using JavaScript or Java instead of relying on Lucene. (In our case this would be possible for the navigation within the documen tlibrary since only 5 to 15 elements are managed on any one hiearchy level.)
  • Migration to Alfresco 4.0 which uses SOLR / canned queries.

This was a rather unexpected realisation for me as this means that only a few hundred thousand of documents can be managed in Alfresco Share before the document library as its core component reacts noticiably slower. Previous experiences with managing millions of objects in a single Alfresco instance are in a rather strong contrast to this …

The problems relating to sorting have been known to Alfresco for a time. Combined with similar problems with PATH-based queries and permission checking of large result sets, this was the reason for / a reinforcement of the switch to SOLR and moving core queries to the datbase layer in Alfresco 4.0 Expecially canned queries guarantee that sorting queries are affected only by the properties of the objects in the hierarchy being queried.