Tag Archives: Alfresco

Script imports with a cleaner API

Update: The patch was transferred to the Alfresco JIRA and can be tracked as ALF-13631.

In my attempts to remote debug Alfresco JavaScript using Eclipse JSDT, the way script imports are handled in Alfresco proved to be one of the key obstacles. The current approach merges scripts in a pre-processor style stage just before they are actually executed. The directive used by this mechanism in both the Repository and Share can be used as follows:

<import resource="classpath:/alfresco/templates/org/alfresco/import/alfresco-util.js">
/**
 * Main entrypoint
 */
function main()
{
   var activityFeed = getActivities();
   var activities = [], activity, item, summary, fullName, date, sites = {}, siteTitles = {};
   var dateFilter = args.dateFilter, oldestDate = getOldestDate(dateFilter);
   ...
}
main()

Prior to execution, a script is scanned for import tags starting at the very first line, collecting dependencies – even transitive ones – for the merging step. The scan process stops at the first non-whitespace character that cannot be matched to an import directive. This approach leads to some restrictions that apply to scripts:

  1. Imports are only possible in the head segment of a script before any actual processing logic.
  2. Scripts cannot be dynamically imported as the import tag's resource information cannot be altered by JavaScript processing logic.
  3. Syntax checks performed in IDEs rightfully complain about the syntactically invalid import directive.
  4. Rhino exceptions show a line number that does not match the source code – developers are often forced to manually calculate the line offsets of imports to find the affected line in the original source file.
  5. Breakpoints for debugging cannot be reliably set prior to the first execution and thus merging of a script. This affects my preferred choice of the Eclipse JSDT more than the embedded Rhino Debugger UI, since the former currently does not allow interaction with the merged scripts unlike the latter.
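
The scan described above can be sketched in a few lines of Java. This is only an illustration of the stated behaviour (collect leading import directives, stop at the first non-whitespace text that is not an import); it is not Alfresco's actual pre-processor code, and the class name, method name and regular expression are my own assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Alfresco implementation.
final class ImportScanSketch {

    // Leading whitespace followed by one <import resource="..."> directive.
    private static final Pattern IMPORT =
            Pattern.compile("\\s*<import\\s+resource\\s*=\\s*\"([^\"]*)\"\\s*>");

    // Returns the resources of all import directives found in the head of the script.
    static List<String> leadingImports(String script) {
        List<String> resources = new ArrayList<>();
        Matcher matcher = IMPORT.matcher(script);
        int pos = 0;
        while (pos < script.length()) {
            matcher.region(pos, script.length());
            if (!matcher.lookingAt()) {
                break; // first non-import content ends the scan
            }
            resources.add(matcher.group(1));
            pos = matcher.end();
        }
        return resources;
    }
}
```

A directive placed after the first statement is never seen by such a scan, which is restriction 1 in a nutshell.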

I had planned for a while to find an alternative solution to script imports – not only to allow for remote debugging using JSDT but also to gain a more flexible means of using and reusing scripts within Alfresco. This weekend I finally had the time to work on this. My goal was to provide a small extension to the JavaScript API that allows importing of scripts in arbitrary places within a script. Additionally I wanted to allow for extensibility of the lookup mechanism involved without requiring extenders to dive into the actual JavaScript API.

Alfresco itself already provides a means to add additional root objects in the JavaScript API through its javaScriptExtension beans. This feature was not sufficient for what I had in mind – the Java-based services that could be provided that way do not have the necessary level of access to the execution context of the Rhino engine. A patch of the RhinoScriptProcessor on the other hand easily allowed for adding a native JavaScript function in the global context, that – as a side effect of implementing it in the script processor – also has access to important processor internals like the script cache. The final import function can be used in JavaScript in the following manner:

importScript("legacy", "classpath:/alfresco/templates/webscripts/org/alfresco/repository/forms/pickerresults.lib.js", true);

The three parameters of the function are defined as follows:

  1. The unique identifier of the lookup component to be used for the import. Using “legacy” reuses the current lookup concept that is used by the pre-processor merging and allows developers to easily adapt existing code by running a simple RegEx search & replace. Additional lookup components I’ve implemented are for dedicated “classpath” and “xpath” based resolution.
  2. A text-based reference to a script that can be resolved by the chosen lookup component.
  3. A boolean parameter that specifies whether the import should fail with a ScriptException if the script reference cannot be resolved.

The function resolves and executes the imported script in the same context and scope as the caller. The imported script has access to variables and functions defined by the importing script and can interact with those. The boolean return value of the function can be used to check for a successful resolution of the import in use cases where a failed resolution does not automatically raise an exception. As a JavaScript function it can be called at any time during the execution of a script and can be used with variables as parameters to dynamically import arbitrary scripts.

Additional lookup components can be added by implementing a trivial Java interface and linking it with the RhinoScriptProcessor via Spring. The interface defines a single method for the resolution of the text-based script reference. A context parameter providing the resolved script location of the script performing the import is provided – if available – to allow for resolution of relative references.

public interface ScriptLocator {
 
	/**
	 * Resolves a string-based script location to a wrapper instance of the
	 * {@link ScriptLocation} interface usable by the repository's script
	 * processor. Implementations may support relative script resolution - a
	 * reference location is provided in instances an already running script
	 * attempts to import another.
	 *
	 * @param referenceLocation
	 *            a reference script location if a script currently in execution
	 *            attempts to import another, or {@code null} if either no
	 *            script is currently being executed or the script being
	 *            executed is not associated with a script location (e.g. a
	 *            simple script string)
	 * @param locationValue
	 *            the simple location to be resolved to a proper script location
	 * @return the resolved script location or {@code null} if it could not be resolved
	 *
	 */
	ScriptLocation resolveLocation(ScriptLocation referenceLocation,
			String locationValue);
}

    <bean id="javaScriptProcessor" class="org.alfresco.repo.jscript.RhinoScriptProcessor" init-method="register">
        <!--...-->
        <property name="scriptLocators">
            <map>
                <entry key="classpath">
                    <ref bean="javaScriptProcessor.classpathScriptLocator"/>
                </entry>
                <entry key="xpath">
                    <ref bean="javaScriptProcessor.xPathScriptLocator"/>
                </entry>
                <entry key="legacy">
                    <ref bean="javaScriptProcessor.legacyScriptLocator"/>
                </entry>
            </map>
        </property>
    </bean>
 
    <bean id="javaScriptProcessor.classpathScriptLocator" class="org.alfresco.repo.jscript.ClasspathScriptLocator" />
 
    <bean id="javaScriptProcessor.xPathScriptLocator" class="org.alfresco.repo.jscript.XPathScriptLocator">
        <property name="serviceRegistry" ref="ServiceRegistry"/>
    </bean>
 
    <bean id="javaScriptProcessor.legacyScriptLocator" class="org.alfresco.repo.jscript.LegacyScriptLocator">
        <property name="services" ref="ServiceRegistry"/>
        <property name="storeUrl">
            <value>${spaces.store}</value>
        </property>
        <property name="storePath">
            <value>${spaces.company_home.childname}</value>
        </property>
    </bean>
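
To give an impression of how small a custom lookup component can be, here is a minimal sketch of a locator that resolves references relative to the importing script. The two interfaces are reduced stand-ins declared only to keep the example self-contained (the real ScriptLocation has more methods than getPath()), and the class name as well as the injected existence check are hypothetical.

```java
import java.net.URI;
import java.util.function.Predicate;

// Reduced stand-ins for this example only, not the Alfresco types.
interface ScriptLocation { String getPath(); }

interface ScriptLocator {
    ScriptLocation resolveLocation(ScriptLocation referenceLocation, String locationValue);
}

// Hypothetical locator that resolves references relative to the importing script.
final class RelativeScriptLocatorSketch implements ScriptLocator {

    private final Predicate<String> exists; // assumed lookup, e.g. backed by the classpath

    RelativeScriptLocatorSketch(Predicate<String> exists) {
        this.exists = exists;
    }

    @Override
    public ScriptLocation resolveLocation(ScriptLocation referenceLocation, String locationValue) {
        if (locationValue == null || locationValue.isEmpty()) {
            return null;
        }
        String path = locationValue;
        if (!path.startsWith("/")) {
            // relative reference: resolve against the importing script if there is one
            path = referenceLocation != null
                    ? URI.create(referenceLocation.getPath()).resolve(locationValue).getPath()
                    : "/" + locationValue;
        }
        if (!exists.test(path)) {
            return null; // contract: null if the reference cannot be resolved
        }
        final String resolved = path;
        return () -> resolved;
    }
}
```

Registered under an additional key in the scriptLocators map (say “relative”, again just an assumed name), such a component would be selected by passing that key as the first argument of importScript.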

So far we have dealt with improving script imports for the Repository tier. I originally planned to use the same concept for Share / Spring Surf, but soon realized that Spring Surf / Web Scripts already provide a utility for loading of dependencies from different sources. This utility is already being used in the pre-processor stage when scripts are merged. Using a so-called “Store” allows for the resolution of abstract document paths within the classpath of an application or a remote store, such as the Alfresco Repository. This mechanism is sufficiently extensible that it would make no sense to add another.

I have provided a reduced JavaScript API within Share / Spring Surf which makes use of the existing mechanism – the “legacy” mode of the Repository can be seen as always / implicitly enforced.

importScript("classpath:/alfresco/templates/org/alfresco/import/alfresco-util.js", true);

The function is available for web scripts and template controllers, and supports both explicit classpath resolution – as in the example above – as well as abstract document and relative paths. Relative resolution is only supported when the importing script is loaded from the classpath. In such a case relative resolution is attempted first, falling back to resolving an abstract path via the Store concept (relative paths cannot in all instances be distinguished from abstract paths).
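
The resolution order just described can be summarised in a short Java sketch. It models the classpath and the Store as plain sets of known paths, which is of course a simplification; the class, the method and the “store:” marker in the result are assumptions made for illustration and not part of Spring Surf.

```java
import java.net.URI;
import java.util.Set;

// Hypothetical sketch of the lookup order, not Spring Surf code.
final class ShareImportResolutionSketch {

    private static final String CLASSPATH_PREFIX = "classpath:";

    // importerClasspathPath is null when the importing script was not loaded from the classpath.
    static String resolve(String reference, String importerClasspathPath,
                          Set<String> classpath, Set<String> store) {
        // 1. an explicit classpath reference is taken literally
        if (reference.startsWith(CLASSPATH_PREFIX)) {
            String path = reference.substring(CLASSPATH_PREFIX.length());
            return classpath.contains(path) ? CLASSPATH_PREFIX + path : null;
        }
        // 2. relative resolution, only possible for importers loaded from the classpath
        if (importerClasspathPath != null) {
            String relative = URI.create(importerClasspathPath).resolve(reference).getPath();
            if (classpath.contains(relative)) {
                return CLASSPATH_PREFIX + relative;
            }
        }
        // 3. fall back to treating the reference as an abstract Store path
        return store.contains(reference) ? "store:" + reference : null;
    }
}
```

The fallback in step 3 is what makes the ambiguity between relative and abstract paths harmless: a reference that does not exist next to the importing script is simply tried as an abstract path.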

To test the new API I have applied search & replace to my local Alfresco 4.0 Enterprise installation and replaced ALL instances of the old import directive with the new function. The migration has worked without any issues so far. The Rhino Debugger UI shows all scripts formerly merged as individual scripts, and breakpoints can now be set prior to the execution of any script. Line numbers in JavaScript exceptions are finally correct in all cases I was able to test.

Debugging Alfresco #1 – Eclipse JavaScript Debugger and Alfresco Repository

Debugging Alfresco is not always a simple undertaking. Remote Debugger features of common IDEs allow remote debugging of Java-based components, but for JavaScript and FreeMarker templates things are not as simple as they could be.

While the Rhino engine embedded in Alfresco comes with its own integrated debugger, there is no such thing for FreeMarker as far as I know. But even the embedded Rhino Debugger is anything but feature complete. On the one hand it represents another break in the already extensive tool chain and on the other it can only be used on servers that come with a graphical user interface. Debugging of development or test environments on headless servers or VMs is not possible. I’ve recently taken the time to check out the new JavaScript Debugger features of the Eclipse JavaScript Development Tools (JSDT) project.

Starting in version 3.7 of the Eclipse IDE the essential components of the JSDT are part of every distribution which includes the Web Standard Tools sub project. After some initial problems with the not fully matured Debugger component I switched to milestone 4 of the upcoming Juno release for my tests. The project’s wiki has a rather useful guide for using the Rhino Debugger support as well as our special use case of integrating with an embedded Rhino Engine. A small FAQ for the most common problems is available as well.

In order to remote debug the JavaScript code of web scripts and the like using Eclipse, a special debug component has to run within the Alfresco server and listen on a TCP port for incoming debug communication (see the Java Platform Debugger Architecture). The JSDT provides the necessary JARs as part of its plugins, so we only have to copy them into <tomcat>/webapps/alfresco/WEB-INF/lib (due to a class dependency on the Rhino engine, <tomcat>/shared/lib is not an option). Those libraries are:

  • org.eclipse.wst.jsdt.debug.rhino.debugger_<version>.jar
  • org.eclipse.wst.jsdt.debug.transport_<version>.jar

Based on the guide for debugging an embedded script engine, the server component has to be bound to a Rhino runtime context and activated. This requires implementing a simple bootstrap bean and including it in the web application startup via Spring.

package com.prodyna.debug.rhino;
 
import java.text.MessageFormat;
 
import org.eclipse.wst.jsdt.debug.rhino.debugger.RhinoDebugger;
import org.mozilla.javascript.ContextFactory;
import org.springframework.beans.factory.InitializingBean;
 
public class RemoteJSDebugInitiator implements InitializingBean {
 
	private static final int DEFAULT_PORT = 9000;
	private static final String DEFAULT_TRANSPORT = "socket";
 
	private boolean suspend = false; // suspend until debugger attaches itself
	private boolean trace = false; // trace-log the debug agent
	private int port = DEFAULT_PORT;
	private String transport = DEFAULT_TRANSPORT;
 
	// the global context factory used by Alfresco
	private ContextFactory contextFactory = ContextFactory.getGlobal();
 
	public void afterPropertiesSet() throws Exception {
		// setup debugger based on configuration
		final String configString = MessageFormat.format(
			"transport={0},suspend={1},address={2},trace={3}",
			new Object[] { this.transport, this.suspend ? "y" : "n",
				String.valueOf(this.port), this.trace ? "y" : "n" });
		final RhinoDebugger debugger = new RhinoDebugger(configString);
		this.contextFactory.addListener(debugger);
		debugger.start();
	}
 
	public void setSuspend(boolean suspend) { this.suspend = suspend; }
	public void setTrace(boolean trace) { this.trace = trace; }
	public void setPort(int port) { this.port = port; }
	public void setTransport(String transport) { this.transport = transport; }
	public void setContextFactory(ContextFactory contextFactory) { this.contextFactory = contextFactory; }
}

The following bean declaration in <tomcat>/shared/classes/alfresco/extension/dev-context.xml activates the bean.

<bean id="pd.jsRemoveDebugger" class="com.prodyna.debug.rhino.RemoteJSDebugInitiator">
	<property name="port"><value>8000</value></property>
	<property name="trace"><value>true</value></property>
</bean>

After restarting the Alfresco Repository server, Eclipse can connect to the Rhino engine using the JavaScript debugger. The parameters used in the activation bean – whether default or customized – need to be provided in the debug configuration, using the Mozilla Rhino Attaching Connector.

Unfortunately that is not yet enough to successfully debug server side JavaScript from within Eclipse. Similar to the classpath for Java source code, JavaScript files need to reside in a specific structure for Eclipse to be able to associate them with scripts being executed by the server engine. Only if this association can be made are breakpoints set in JavaScript source code actually being transmitted to and evaluated by the server side debugger component.

The expected source code structure for remote debuggable scripts depends on the source name used when executing scripts with the Rhino engine. Alfresco refers to the file URI of the main script, i.e. in a repository server set up under “D:\Applications\Swift\tomcat” the URI for the web script controller sites.get.js is “file://D:/Applications/Swift/tomcat/webapps/alfresco/WEB-INF/classes/alfresco/templates/webscripts/org/alfresco/repository/sites/sites.get.js”. According to the JSDT FAQ, such a URI is mapped without the “file://D:/” prefix to an automatically created source project “External JavaScript Source”. That did not work for me, and after studying the source code of the JSDT plugin I found a working alternative: with the first path fragment referring to a specific source code project, the remainder of the path is used for code lookup relative to that project. In order to debug a JavaScript web script controller of my Swift repository server, those web scripts had to be made available in a project called “Applications” and a subfolder structure “Swift/tomcat/webapps/alfresco/WEB-INF/classes/alfresco/templates/webscripts/”. The simplest way to do this is linking the source code of the Remote API project into such a structure instead of duplicating it.
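
The mapping I ended up relying on can be expressed as a tiny Java helper: drop the scheme and drive prefix, take the first path fragment as the project name and the remainder as the path inside that project. This only restates the behaviour I observed; it is not code from the JSDT plugin, and the class and method names are made up for this example.

```java
// Hypothetical sketch of the observed lookup, not JSDT code.
final class SourceLookupSketch {

    // Returns { projectName, pathWithinProject } for a script source URI.
    static String[] projectAndPath(String sourceUri) {
        String path = sourceUri.replaceFirst("^file:/+", ""); // drop the scheme
        path = path.replaceFirst("^[A-Za-z]:/", "");           // drop a Windows drive prefix
        int slash = path.indexOf('/');
        if (slash < 0) {
            return new String[] { path, "" };
        }
        return new String[] { path.substring(0, slash), path.substring(slash + 1) };
    }
}
```

For the sites.get.js URI above this yields the project “Applications” and a project-relative path starting with “Swift/tomcat/webapps/”.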

Having completed the last piece of configuration, breakpoints set in Alfresco web scripts like sites.get.js will now be properly transmitted to and activated on the server. On the next execution of a site search from within Alfresco Share, the debugger will pause at the specified code line. Standard features like step over / into, variables and expression views are available to investigate the behavior of the selected script. The expressions view in particular is currently of utmost importance as the debugger is not yet able to handle Java objects as variable values unless they are transformed into native JavaScript instances via an expression.

Mid-term review: The Eclipse JSDT allows debugging of JavaScript scripts that are part of the Alfresco application – i.e. lying in its classpath – from within the familiar IDE used by a majority of Alfresco developers. This eliminates the previous restrictions imposed by the Rhino Debugger which only allowed debugging on servers that were either local or sported a graphical user interface. Setting up the JSDT remote debugger takes some getting used to but should be easy to handle with the tools provided by the IDE, such as source linking. Currently there are some functional limitations and peculiarities due to the not yet mature debugger and the way the Rhino engine is embedded within Alfresco. I will address some of these issues in upcoming posts of this new blog series and provide solutions where possible.

Managing and using custom classifications

Classifications in Alfresco allow for associations of content elements with specific categories from a hierarchical structure. The standard product provides a generic out-of-the-box tree of categories relating to languages, regions and types of software documentation classifications. Lucene queries may be formulated that select or aggregate content based on the associated categories and even sub-trees of categories, e.g. selecting all documents associated with the English language independent of the actual dialect, i.e. American, British or any other English variant.

The Alfresco wiki has pretty good documentation of classifications and categories. Unfortunately, the documentation does not reflect how classifications are used in an apparent majority of projects. Instead of defining custom classifications similar to the example provided in the wiki, I have seen numerous instances where the out-of-the-box hierarchy was simply extended. This is understandable considering the amount of functionality provided for the out-of-the-box hierarchy and the effort saved by reusing it. But this approach means that categories do not serve in their intended function of providing semantically separate classifications – any arbitrary category may be assigned to the cm:categories property instead of choosing from business-oriented, separate value sets with the odd connection between individual categories.

My colleagues and I have all participated in several projects that use custom classifications to organize content and – in some instances – provide virtual navigation structures based on content classification. Apart from the technical architecture, Alfresco does not provide much in the way of support for using custom classifications. The category manager included in the Web Client only handles the out-of-the-box hierarchy as does its Share counterpart introduced in Alfresco 4.0 (by Jan Pfitzner). In order to save us – and other developers of the community – the trouble of having to reinvent the wheel any more than necessary, I recently set out to enhance Jan's component and submit it as a contribution to Alfresco.

In short I have modified the following four aspects based on Alfresco 4.0c:

  • Added the ability to manage multiple classifications
  • Added the ability to create / modify categories that use a business-specific subtype
  • Patched the Forms API to allow creating new content objects using a child-association other than cm:contains
  • Patched the object-finder form component to support usage of business-specific category subtypes

Category Manager - Multiple Classifications

The Share category manager only allows for managing the out-of-the-box hierarchy of cm:generalclassifiable – as does its Web Client predecessor. In order to simplify usage of custom classifications, they need to be manageable without requiring additional development effort. Minimal adaptations made to the tree construction code and the introduction of a new data web script on the repository tier now allow any classification to be administered. In order to hide certain technical classifications that are used and managed differently (e.g. cm:taggable and cm:classifiable) I have introduced a configuration option to ignore these classification aspects.

Category Manager - Form-based Management

Categories are standard nodes – much like almost anything else in Alfresco – that are defined by the type cm:category. This type may suffice for most uses, but sometimes there is the requirement of associating business metadata with categories. The data dictionary and modelling of Alfresco allow subtypes of cm:category to be defined or aspects to be used to enhance categories with the additional data. The Share category manager only supported simple categories with a name as the sole property. I have based the component on forms to provide the necessary flexibility in managing category types and metadata. New categories will always be created using a form dialog, while existing categories may be edited using either the in-situ editor or a form dialog, depending on another configuration option I have introduced. The former is the default for simple categories of type cm:category.

The form-based management of categories required extensions of the Forms API. In order to create root categories in the correct location it was necessary to provide a form filter that resolves the virtual node reference alfresco://category/root to the correct reference for the specific classification aspect. The creation of sub-categories via forms in addition required the ability of specifying the correct child-association to use (cm:subcategories) instead of the default cm:contains – a feature marked with a TODO in the code but not implemented since at least Alfresco 3.2.

Assigning values from a custom classification

Using a custom classification for editing the metadata of a content item only worked if the categories used were of the type cm:category. For any subtype, navigating into a sub-level of the hierarchy was not possible. Supporting subtypes required an adaptation of the object finder providing the selection dialog within forms and of the supporting data web script on the repository tier. Type-specific checks were replaced by proper type hierarchy evaluations. A small Spring Surf Extension may be used to associate any business category type with either the generic or a custom icon.
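
The difference between a type-specific check and a type hierarchy evaluation is easy to show in isolation. The sketch below walks a map from each type to its parent type; the map stands in for the data dictionary, my:businessCategory is an invented subtype, and the helper is not the code used in the patch.

```java
import java.util.Map;

// Hypothetical sketch; the parent map stands in for the data dictionary.
final class TypeHierarchySketch {

    // True if type equals ofType or inherits from it (assumes an acyclic parent map).
    static boolean isSubClass(String type, String ofType, Map<String, String> parentOf) {
        for (String current = type; current != null; current = parentOf.get(current)) {
            if (current.equals(ofType)) {
                return true;
            }
        }
        return false;
    }
}
```

A plain equality check against cm:category rejects my:businessCategory, whereas the walk up the hierarchy accepts it.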

cm:name – An enforced property

In the last post I described a performance problem which could be traced back to the usage of cm:name (cm:cmobject as parent type) in modelling / instantiating 500,000+ record sets in the default content store. Using one of the listed concepts to work around this issue, I have been setting up a small migration aiming to remove the redundant property cm:name by switching to the parent type sys:base. I have since come to realize that cm:name is – independently of my model type definition in the data dictionary – enforced on all public interfaces and always indexed. Only the integrity checks for mandatory and constrained properties respect the actual type definition.

This of course negates the purpose of my entire approach to combatting our performance problems. If it is impossible to have a node in the database which is not indexed with a cm:name property, side effects on the performance of sorting navigation scripts for Share using that same property are unavoidable.

How does this behaviour manifest itself?

  • When a node is created without a cm:name value, no value for that property is persisted to the database. During reads on the node's properties, the UUID of the NodeRef is transparently returned as a fake cm:name value (see e.g. DBNodeServiceImpl.getProperty(NodeRef, QName) or ReferenceablePropertiesEntity.addReferenceableProperties(Node, Map<QName,Serializable>)).
  • Only the properties defined in the type definition are validated during node creation / modification. Since cm:name is only defined for cm:cmobject and its subtypes, it is only validated for these types. Any evaluation of the mandatory constraint is suppressed as the property is being faked to have the value of the UUID if not set explicitly.
  • During indexing the type definition of the node being indexed is not respected as far as properties are concerned. All properties present on the node are indexed according to their property definition, regardless of whether they should even be present on the node or not. This means that even if nodes do not inherit from cm:cmobject, a cm:name value is being indexed because a) the property is transparently set to the UUID if not present and b) a property definition for cm:name exists which specifies that it must be indexed.
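
Reduced to its essence, the combination of these effects looks like the following Java sketch. It only models the behaviour described in the list (a read view that substitutes the UUID for a missing cm:name, and an indexer that indexes every property it is handed); the class and its methods are invented for illustration and do not mirror the Alfresco classes mentioned above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical model of the described behaviour, not Alfresco code.
final class NameFallbackSketch {

    static final String PROP_NAME = "cm:name";

    // Read view of a node: the UUID stands in for cm:name if none was persisted.
    static Map<String, String> readProperties(String uuid, Map<String, String> persisted) {
        Map<String, String> view = new HashMap<>(persisted);
        view.putIfAbsent(PROP_NAME, uuid);
        return view;
    }

    // The indexer works on the read view and ignores the node's type definition.
    static SortedSet<String> indexedFields(Map<String, String> view) {
        SortedSet<String> fields = new TreeSet<>();
        for (String property : view.keySet()) {
            fields.add("@" + property);
        }
        return fields;
    }
}
```

Whatever the node's type says, the field @cm:name ends up in the index, either with the explicit value or with the UUID.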

This behaviour has essentially remained unchanged since 3.2 based on my investigations into the Alfresco SVN and remains in place in the current 4.0 trunk. I was unable to identify which Alfresco feature might require this enforcement of the property, overriding the configuration of my data model. Regarding the question “bug or feature” I am currently leaning towards “bug”. Since this was discovered in a project of an Enterprise customer, I have referred this question to Alfresco Support. In case this is a consciously implemented behaviour it would be more appropriate to model cm:name as a property of sys:base, similar to how sys:referenceable defines the other common set of properties (store protocol, identifier and UUID).

cm:name – Limits of sorting

We have implemented a compliance management system on the basis of Alfresco Share 3.2.2 for one of our customers. In addition to contract and document templates, organisational structures and complex workflows to comply with review, approval and documentation processes, this system also manages more than 500,000 base data record sets. The latter are modelled as an abstract content type with aspects grouping subsets of properties, and are regularly imported from / synchronized with an external data source. The records reside in the default ContentStore “workspace://SpacesStore”, as 500,000 objects with about 20 properties are too few to expect a noticeable impact on the performance of the platform as a whole.

Despite our expectations, a problem based on the Alfresco specifics of sorting Lucene searches was observed. Since the content type uses cm:cmobject as the parent type, every object inherits the property cm:name which we map to the unique key of the associated record. The first import added more than 500,000 entries to the Lucene index with fully distinct values for the field @cm:name, causing a noticeable drop in the performance of the Share document library navigation. We observed a base overhead of 3-5 s for every search sorting on @cm:name even before the actual Lucene search started processing. As every navigation within the Share document library executes a sorting search in doclist.get.js, the entire application is affected beyond the users' tolerance levels.

What causes Alfresco to perform this badly considering a (presumably) small data set? From a technical point of view two main reasons can be identified:

  1. All values of the field to be sorted will be loaded from Lucene into memory for pre-sorting (for us this means more than 800,000 distinct values for usually fewer than 50 results to sort).
  2. The internal Lucene FieldCache cannot be used to optimise repeated queries. Each search makes use of a unique IndexReader wrapper instance due to multi-layered faceting – the FieldCache on the other hand is contractually obliged to only return previously loaded field values for the identical instance. This means that field values are always loaded directly from the index. (Those with time and curiosity at hand may inspect the cache using a Java debugger and will notice that the necessary data would be available several times over but cannot be accessed.)
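
The second point is easier to grasp with a stripped-down model of a cache that is keyed by the identity of the reader instance. The sketch below is not Lucene code; the class, the counter and the dummy field values are assumptions that merely reproduce the effect of handing the cache a fresh wrapper object on every search.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical model of an instance-keyed field cache, not Lucene's FieldCache.
final class ReaderKeyedCacheSketch {

    private final Map<Object, String[]> cache = new IdentityHashMap<>();
    int loads = 0; // counts how often values had to be read from the "index"

    String[] fieldValues(Object reader) {
        return cache.computeIfAbsent(reader, key -> {
            loads++; // the expensive step: load all values of the field
            return new String[] { "record-1", "record-2" };
        });
    }
}
```

Calling fieldValues twice with the same reader object loads the values once. Calling it with a newly created wrapper each time loads them on every call, even though equivalent data is already cached under the earlier keys.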

The magnitude of the performance impact scales with the I/O performance of the data volume used for the index. My personal development laptop which includes a solid state drive usually offers better performance than customers are willing to pay for in their productive machines. Thus I only have to suffer 1 – 2 s degradation, but intensive use of the navigation will swiftly lead to a bad impression on users.

What solutions / concepts are there to address these performance problems for sorting searches?

  • Large amounts of base data should be stored in separate ContentStores, which automatically use a separate index. This is possible only if there are either no or just simple hierarchical relationships with other data sets to consider.
  • Metadata for sorting should be mapped to individual, business specific properties if at all possible. When standard properties are used, sorting performance side effects may be incurred involuntarily when large record sets reuse the same property.
  • Searching over smaller subsets may in extreme cases be faster using sorting (and paging) implemented using JavaScript or Java instead of relying on Lucene. (In our case this would be possible for the navigation within the document library since only 5 to 15 elements are managed on any one hierarchy level.)
  • Migration to Alfresco 4.0 which uses SOLR / canned queries.
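
For the third option, sorting and paging a handful of already retrieved elements in memory takes only a few lines of Java. The sketch assumes that the unsorted names of one hierarchy level are at hand as a list; the class, the method and its parameters are illustrative choices, not part of any Alfresco API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for sorting and paging a small result set in memory.
final class InMemoryPagingSketch {

    // Returns at most max names, sorted case-insensitively, after skipping skip entries.
    static List<String> sortedPage(List<String> names, int skip, int max) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(String.CASE_INSENSITIVE_ORDER);
        int from = Math.min(Math.max(skip, 0), sorted.size());
        int to = (int) Math.min((long) from + Math.max(max, 0), sorted.size());
        return new ArrayList<>(sorted.subList(from, to));
    }
}
```

With 5 to 15 elements per level the cost of this is negligible compared to the overhead described above, but it obviously does not scale to large result sets.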

This was a rather unexpected realisation for me as this means that only a few hundred thousand documents can be managed in Alfresco Share before the document library as its core component reacts noticeably slower. Previous experiences with managing millions of objects in a single Alfresco instance are in a rather strong contrast to this …

The problems relating to sorting have been known to Alfresco for some time. Combined with similar problems with PATH-based queries and permission checking of large result sets, this was the reason for / a reinforcement of the switch to SOLR and moving core queries to the database layer in Alfresco 4.0. Canned queries in particular guarantee that sorting queries are affected only by the properties of the objects in the hierarchy being queried.