Follow the Hacking SharePoint 2013 Search series: http://www.portalsolutions.net/WhatWeKnow/pslabs/Lists/Posts/Post.aspx?ID=2
Code Samples for this article: http://code.msdn.microsoft.com/SharePoint-Search-2013-d47a613d
Seeing the "Making Synonyms Visible in SharePoint 2013 Search Results" post by Christoffer Vig reminded me of earlier research I did during the SharePoint 2013 TAP on the FSIS/Juno bits included with the pre-RTM binaries. What follows is the first entry in what will hopefully turn out to be a series of posts on discovering customization options in the SharePoint Search 2013 content processing engine. The configuration steps and code I am releasing with this and future posts show that it is possible, albeit currently unsupported by Microsoft, to run custom stages in-process of the content processing component, similar to how it was done on good old FAST ESP.
To get a major caveat out of the way: the topic I will be exploring represents early-stage research into the art of the possible, is not presently intended for production use, and is not supported by Microsoft. Although somewhat limited and performance-sapping, the Content Enrichment Web Service (CEWS) is the primary documented and supported mechanism for implementing custom content processing logic in SharePoint Search 2013.
Down to the brass tacks…
For this first post we are going to concentrate on deploying an operator similar in functionality to ESP's SPY module. The code of the module is inspired by the Chatter operator, which ships with the 2013 search platform and is undocumented. To deploy the custom Spy operator, perform the following steps:
- Identify the server(s) in your farm running the content processing component. Launch the SharePoint 2013 Management Shell and run the following command:

Get-SPEnterpriseSearchServiceApplication | Get-SPEnterpriseSearchStatus -Text
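If you prefer to filter the raw objects instead of scanning the -Text dump, something along these lines should work (a sketch; the "ContentProcessingComponentN" naming convention is an assumption based on the standard component names):

# Sketch: list only the content processing components and their state.
# The Details property carries, among other data, the host server name.
Get-SPEnterpriseSearchServiceApplication |
    Get-SPEnterpriseSearchStatus |
    Where-Object { $_.Name -like "ContentProcessingComponent*" } |
    Select-Object Name, State, Details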
Perform the following sub-steps on all servers listed as running ContentProcessing components.
- From the attachment (see the companion assets link below), extract the PS.Ceres.Extensions.dll file and copy it into %PROGRAMFILES%\Microsoft Office Servers\15.0\Search\Runtime\1.0\Assemblies.
- Recycle the "SharePoint Search Host Controller" (SPSearchHostController) service to pick up the new binary changes.
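The recycle can be scripted from the same elevated shell (note that restarting the host controller also recycles the other search components hosted on that server):

# Restart the host controller so the new assembly is picked up.
Restart-Service SPSearchHostController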
- Connect to the Admin component CTS engine by running the following commands in the SP2013 Management Shell:
& "C:\Program Files\Microsoft Office Servers\15.0\Search\Scripts\ceresshell.ps1"
Connect-System -Uri (Get-SPEnterpriseSearchServiceApplication).SystemManagerLocations[0] -ServiceIdentity (Get-SPEnterpriseSearchService).ProcessIdentity
Connect-Engine

It's important to specify an explicit -Uri argument to the Connect-System cmdlet, as the default port-based listener has been removed in the RTM. If everything goes right, you should see a message confirming that you connected to AdminComponent1.
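A quick way to confirm the session is live is to fetch one of the stock flows (we use Get-Flow for the backup in the next step anyway):

# Sketch: if this returns the flow XML, the engine connection is working.
Get-Flow Microsoft.CrawlerFlow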
- A CTS content processing flow is roughly equivalent to a single FAST ESP pipeline. It is also modular: the parent flow (Microsoft.CrawlerFlow) sets up the overall processing sequence, which is implemented within individual, more specialized flows. For this step we are going to modify Microsoft.CrawlerContentEnrichmentSubFlow, the flow that implements the Content Enrichment Web Service logic. But first, we take a backup of the flow:

Get-Flow Microsoft.CrawlerContentEnrichmentSubFlow | Set-Content Microsoft.CrawlerContentEnrichmentSubFlow_original.xml
- Extract the Microsoft.CrawlerContentEnrichmentSubFlow_modified.xml file from the attached archive, or make a copy of the ..._original.xml file and apply the changes to it as described below.
Insert the following XML node after the closing element of the Operator[@name="ContentEnrichment"] node. As far as I can tell, the document order of Operator nodes is not important, since execution order is governed by the Operator/Targets/Target/operatorMoniker/@name attribute, but it helps to have the document presentation order mimic the overall flow.
<Operator name="Spy" type="PS.Ceres.Extensions.Operators.Spy">
  <Targets>
    <Target>
      <operatorMoniker name="/Microsoft.CrawlerContentEnrichmentSubFlow/ContentEnrichmentCleanup" />
    </Target>
  </Targets>
  <Properties>
    <Property name="outputFile" value="&quot;c:\\temp\\spy.txt&quot;" />
    <Property name="verbosity" value="2" />
  </Properties>
</Operator>

Here, confirm that the outputFile property has a correct value pointing to an existing directory on the file system and that the process identity of the "SharePoint Search Host Controller" service has write permissions on that folder.
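A quick sanity check of the folder and its permissions (a sketch; c:\temp must match the directory in the outputFile value, and which account needs Write access depends on your farm's service identities):

# Sketch: ensure the spy output directory exists and review who can write to it.
$spyDir = "c:\temp"
if (-not (Test-Path $spyDir)) { New-Item -ItemType Directory -Path $spyDir | Out-Null }
# The "SharePoint Search Host Controller" process identity needs Write access here.
(Get-Acl $spyDir).Access | Format-Table IdentityReference, FileSystemRights -AutoSize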
The overall order of operator evaluation is captured by following the chain from each operator's output into the input of the subsequent operator, encoded in the operatorMoniker/@name values within each operator. Each subflow begins with the built-in SubFlowInput operator and terminates with the SubFlowOutput operator. The value that goes into operatorMoniker/@name consists of the subflow name, derived from the /OperatorGraph/@name attribute, a "/" separator, and the in-flow operator name (Operator/@name) of the target operator. Consequently, to inject one more operator into the sequence, modify the Operator[@name="ContentEnrichment"]/Targets/Target/operatorMoniker/@name attribute to target the Spy operator. Make sure the snippet looks as follows (a quick way to verify the full chain is sketched after the snippet):
<Operator name="ContentEnrichment" type="Microsoft.Office.Server.Search.ContentProcessing.Operators.ContentEnrichmentClient">
  <Targets>
    <Target>
      <operatorMoniker name="/Microsoft.CrawlerContentEnrichmentSubFlow/Spy" />
    </Target>
  </Targets>
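To double-check the resulting operator chain without eyeballing the XML, you can walk the monikers in the modified file (a minimal sketch, assuming the file is saved locally and the root element is OperatorGraph, as described above):

# Sketch: print each operator and its target(s) to verify the
# ... -> ContentEnrichment -> Spy -> ContentEnrichmentCleanup sequence.
[xml]$flow = Get-Content .\Microsoft.CrawlerContentEnrichmentSubFlow_modified.xml
foreach ($op in @($flow.OperatorGraph.Operator)) {
    $targets = @($op.Targets.Target.operatorMoniker) |
        Where-Object { $_ } |
        ForEach-Object { $_.GetAttribute("name") }
    "{0} -> {1}" -f $op.GetAttribute("name"), ($targets -join ", ")
}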
Once the updated flow is ready, register it with the CTS engine:
Remove-Flow Microsoft.CrawlerContentEnrichmentSubFlow
gc .\Microsoft.CrawlerContentEnrichmentSubFlow_modified.xml | Out-String | Add-Flow Microsoft.CrawlerContentEnrichmentSubFlow
If at any point you need to revert your changes, rerun the same commands using the ..._original.xml file.
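For reference, the revert is the same pattern pointed at the backup taken earlier:

Remove-Flow Microsoft.CrawlerContentEnrichmentSubFlow
gc .\Microsoft.CrawlerContentEnrichmentSubFlow_original.xml | Out-String | Add-Flow Microsoft.CrawlerContentEnrichmentSubFlow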
- Start an incremental crawl. If everything goes right, you should see spy_**.txt files being created in the specified output directory about two minutes into the crawl. I will go deeper into the data model of the record sets and records passing through the flow in a future post. One thing to note here is that the processing model has remained similar to ESP: crawled properties and managed properties are grouped into bucket-type fields called content and ManagedPropertiesBucket, with the rest of the fields (body, summary, etc.) being siblings of the bucket fields.
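While the crawl runs, a simple poll of the output directory shows the files as they appear (assuming the c:\temp location from the flow XML):

# Sketch: list spy output files, newest last, to confirm the operator is firing.
Get-ChildItem c:\temp\spy_*.txt | Sort-Object LastWriteTime | Select-Object Name, Length, LastWriteTime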
What now?
Hopefully, the above proves that the task of running in-process custom flow operators, previously known as pipeline extensions, is possible in SharePoint Search 2013. As an additional bonus, if you got to the end of this post and have a working customized flow, the solution contains the Microsoft.CrawlerContentEnrichmentSubFlow_modifiedregex.xml file with the configuration for a Regex operator I've written that matches and rewrites text patterns in the tokens contained in the Body field. In my next post, I am going to document the runtime model of custom operators, evaluators, and record producers, using that same custom regex operator as the example.
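Deploying the regex variant follows the same Remove-Flow/Add-Flow pattern shown above, just with the other file from the companion download:

Remove-Flow Microsoft.CrawlerContentEnrichmentSubFlow
gc .\Microsoft.CrawlerContentEnrichmentSubFlow_modifiedregex.xml | Out-String | Add-Flow Microsoft.CrawlerContentEnrichmentSubFlow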
My research into the OTB code led me to believe that the CTS team developed the Ceres framework fairly cleanly and robustly, using a private-label dependency injection framework and related code development best practices, such as developing to interfaces, making integration with, and unit testing of, the framework client code a great pleasure. As an aside: one would wish that the core SharePoint bits had received a similar rewrite at some point in the past, when developing farm solutions and unit testing SharePoint code was still relevant. Unfortunately for companies building search-driven applications, Microsoft chose to keep this part of the SharePoint platform closed off, concentrating instead on APIs as the overall integration story for the broader platform (think the 2013 RESTful API surface) and for search (searchable BCS improvements and CEWS, a revamp of the disaster known as the Custom Pipeline Extensibility stage on FS4SP).
When it comes to search and custom content processing, I am of the opinion that CEWS is fine for tasks involving algorithms with high CPU and memory requirements, e.g. content categorization, or for actual content enrichment scenarios involving lookups and matching of in-flow (pipeline) attributes against external data stores. In those cases, the algorithms are probably better off run with at least process isolation on the same box, or scaled out across servers whose hardware does not overlap with your search topology. However, a whole family of processing algorithms that generally complete in under 250 ms and mostly process local pipeline data without external lookups is poorly served by CEWS, which imposes too much invocation and data transfer overhead. Hopefully this research will provide a more performant environment for such algorithms, even if in an experimental, non-production setting.
Companion assets: http://code.msdn.microsoft.com/SharePoint-Search-2013-d47a613d