Tag Archives: Performance

cm:name – An enforced property

In the last post I described a performance problem which could be traced back to the use of cm:name (cm:cmobject as parent type) when modelling / instantiating 500,000+ record sets in the default content store. Using one of the listed concepts to work around this issue, I have been setting up a small migration aiming to remove the redundant property cm:name by switching to the parent type sys:base. I have since come to realize that cm:name is, independently of my model type definition in the data dictionary, enforced on all public interfaces and always indexed. Only the integrity checks for mandatory and constrained properties respect the actual type definition.

This of course defeats the purpose of my entire approach to combating our performance problems. If it is impossible to have a node in the database which is not indexed with a cm:name property, side effects on the performance of the sorting navigation scripts for Share that use that same property are unavoidable.

How does this behaviour manifest itself?

  • When a node is created without a cm:name value, no value for that property is persisted to the database. During reads of the node's properties, the UUID of the NodeRef is transparently returned as a fake cm:name value (see e.g. DBNodeServiceImpl.getProperty(NodeRef, QName) or ReferenceablePropertiesEntity.addReferenceableProperties(Node, Map<QName,Serializable>)).
  • Only the properties defined in the type definition are validated during node creation / modification. Since cm:name is only defined for cm:cmobject and its subtypes, it is only validated for these types. Any evaluation of the mandatory constraint is suppressed as the property is being faked to have the value of the UUID if not set explicitly.
  • During indexing, the type definition of the node being indexed is not respected as far as properties are concerned. All properties present on the node are indexed according to their property definition, regardless of whether they should even be present on the node or not. This means that even if nodes do not inherit from cm:cmobject, a cm:name value is being indexed because a) the property is transparently set to the UUID if not present and b) a property definition for cm:name exists which specifies that it must be indexed.
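The read-side fallback described in the first bullet can be illustrated with a minimal, self-contained sketch. This is not Alfresco code: SketchNode and its methods are hypothetical stand-ins, and property names are plain strings instead of QName instances. It only models the observable effect that a node without a persisted cm:name still reports one.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical, simplified stand-in for a repository node (not an Alfresco class).
final class SketchNode {
    static final String PROP_NAME = "cm:name";

    private final String uuid;
    // Only values that were explicitly set are "persisted" here.
    private final Map<String, Serializable> persisted = new HashMap<>();

    SketchNode(String uuid) {
        this.uuid = uuid;
    }

    void setProperty(String qname, Serializable value) {
        persisted.put(qname, value);
    }

    boolean isPersisted(String qname) {
        return persisted.containsKey(qname);
    }

    // Read path: if no cm:name was persisted, the UUID is returned in its place.
    Serializable getProperty(String qname) {
        Serializable value = persisted.get(qname);
        if (value == null && PROP_NAME.equals(qname)) {
            return uuid;
        }
        return value;
    }
}

final class CmNameFallbackDemo {
    public static void main(String[] args) {
        SketchNode node = new SketchNode(UUID.randomUUID().toString());
        // Nothing was stored for cm:name, yet a read still yields a value.
        System.out.println("persisted: " + node.isPersisted(SketchNode.PROP_NAME));
        System.out.println("read:      " + node.getProperty(SketchNode.PROP_NAME));
    }
}
```

Because every read yields a value, an indexer that simply walks the returned properties will always find something to index for cm:name, which matches the third bullet.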

Based on my investigations into the Alfresco SVN, this behaviour has remained essentially unchanged since 3.2 and is still in place in the current 4.0 trunk. I was unable to identify which Alfresco feature might require this enforcement of the property, overriding the configuration of my data model. Regarding the question “bug or feature” I am currently leaning towards “bug”. Since this was discovered in a project of an Enterprise customer, I have referred this question to Alfresco Support. In case this is a consciously implemented behaviour, it would be better / more appropriate to model cm:name as a property of sys:base, similar to how sys:referenceable defines the other common set of properties (store protocol, identifier and UUID).

cm:name – Limits of sorting

We have implemented a compliance management system on the basis of Alfresco Share 3.2.2 for one of our customers. In addition to contract and document templates, organisational structures and complex workflows to comply with review, approval and documentation processes, this system also manages more than 500,000 base data record sets. The latter are modelled as an abstract content type with aspects grouping subsets of properties, and are regularly imported from / synchronized with an external data source. The records reside in the default ContentStore “workspace://SpacesStore”, since 500,000 objects with about 20 properties each should be too few to have a noticeable impact on the performance of the platform as a whole.

Contrary to our expectations, we observed a problem rooted in the Alfresco-specific handling of sorted Lucene searches. Since the content type uses cm:cmobject as the parent type, every object inherits the property cm:name, which we map to the unique key of the associated record. The first import added more than 500,000 entries to the Lucene index with fully distinct values for the field @cm:name, causing a noticeable drop in the performance of the Share document library navigation. We observed a base overhead of 3-5 s for every search sorting on @cm:name even before the actual Lucene search started processing. As every navigation within the Share document library executes a sorting search in doclist.get.js, the entire application is affected beyond the users' tolerance levels.

What causes Alfresco to perform this badly considering a (presumably) small data set? From a technical point of view two main reasons can be identified:

  1. All values of the field to be sorted on are loaded from Lucene into memory for pre-sorting (for us this means more than 800.000 distinct values for usually fewer than 50 results to sort).
  2. The internal Lucene FieldCache cannot be used to optimise repeated queries. Each search makes use of a unique IndexReader wrapper instance due to multi-layered faceting, while the FieldCache is contractually obliged to return previously loaded field values only for the identical instance. This means that field values are always loaded directly from the index. (Those with time and curiosity at hand may inspect the cache using a Java debugger and will notice that the necessary data would be available several times over but cannot be accessed.)
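The second point can be made concrete with a small simulation. The classes below are hypothetical stand-ins, not Lucene's FieldCache or IndexReader; the sketch assumes only what is stated above, namely that cached values are handed out solely for the identical reader instance. An identity-keyed map models that contract.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical stand-in for an index reader (or a wrapper around one).
final class SketchReader {
    final String[] nameValues;

    SketchReader(String[] nameValues) {
        this.nameValues = nameValues;
    }
}

// Hypothetical cache keyed by reader identity, mimicking the contract described above.
final class SketchFieldCache {
    private final Map<SketchReader, String[]> cache = new IdentityHashMap<>();
    int loads = 0; // counts how often values had to be read "from the index"

    String[] getStrings(SketchReader reader) {
        String[] values = cache.get(reader);
        if (values == null) {
            loads++;
            values = reader.nameValues.clone(); // simulates the expensive index read
            cache.put(reader, values);
        }
        return values;
    }
}

final class FieldCacheIdentityDemo {
    public static void main(String[] args) {
        String[] indexValues = {"rec-001", "rec-002", "rec-003"};
        SketchFieldCache cache = new SketchFieldCache();

        // Reusing one reader instance: the second lookup is served from the cache.
        SketchReader shared = new SketchReader(indexValues);
        cache.getStrings(shared);
        cache.getStrings(shared);
        System.out.println("loads with shared reader: " + cache.loads);

        // A fresh wrapper per search: every lookup misses, although the data is identical.
        cache.getStrings(new SketchReader(indexValues));
        cache.getStrings(new SketchReader(indexValues));
        System.out.println("loads after two fresh wrappers: " + cache.loads);
    }
}
```

In the simulation the cache also ends up holding the same values several times over, once per wrapper, which mirrors what the debugger shows.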

The magnitude of the performance impact scales with the I/O performance of the data volume used for the index. My personal development laptop, which includes a solid state drive, usually offers better performance than customers are willing to pay for in their production machines. Thus I only have to suffer 1 – 2 s degradation, but intensive use of the navigation will swiftly leave a bad impression on users.

What solutions / concepts are there to address these performance problems for sorting searches?

  • Large amounts of base data should be stored in separate ContentStores, which automatically use a separate index. This is possible only if there are either no or just simple hierarchical relationships with other data sets to consider.
  • Metadata for sorting should be mapped to individual, business-specific properties if at all possible. When standard properties are used, sorting performance side effects may be incurred involuntarily when large record sets reuse the same property.
  • Searching over smaller subsets may in extreme cases be faster with sorting (and paging) implemented in JavaScript or Java instead of relying on Lucene. (In our case this would be possible for the navigation within the document library since only 5 to 15 elements are managed on any one hierarchy level.)
  • Migration to Alfresco 4.0 which uses SOLR / canned queries.
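As a rough illustration of the third option, sorting and paging a small result set in memory takes only a few lines of Java. This is a sketch under assumptions: the class and method names are hypothetical, the elements are reduced to their cm:name values as plain strings, and fetching the children of a folder is left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: sorts a small list of names in memory and returns one page.
final class InMemoryPaging {
    static List<String> sortedPage(List<String> names, int skip, int max) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(String.CASE_INSENSITIVE_ORDER);

        // Clamp the requested window to the available elements.
        int from = Math.min(Math.max(skip, 0), sorted.size());
        int to = (int) Math.min((long) from + Math.max(max, 0), sorted.size());
        return new ArrayList<>(sorted.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("delta", "Alpha", "charlie", "Bravo");
        // Second page with a page size of two.
        System.out.println(sortedPage(children, 2, 2));
    }
}
```

With 5 to 15 elements per hierarchy level the cost of this sort is negligible, and it does not depend on how many distinct cm:name values exist elsewhere in the index.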

This was a rather unexpected realisation for me, as it means that only a few hundred thousand documents can be managed in Alfresco Share before the document library, its core component, becomes noticeably slower. Previous experiences with managing millions of objects in a single Alfresco instance stand in rather strong contrast to this …

The problems relating to sorting have been known to Alfresco for some time. Combined with similar problems with PATH-based queries and permission checking of large result sets, this was the reason for / a reinforcement of the switch to SOLR and the move of core queries to the database layer in Alfresco 4.0. Canned queries in particular guarantee that sorting queries are affected only by the properties of the objects in the hierarchy being queried.