
v1.1.5

· 2 min read
John Renfrew
Programmer and data architect

Version 1.1.5 internal test release

  • Testing .docx and .pdf text extraction

Following a conversation about StudentTasks, a technology test was completed to create a service that asynchronously retrieves a base64-encoded document from a FileMaker record and extracts its text. The service is triggered after a submit action finalises a student task submission. There is no UI requirement, so status information is key to confirming completion, failure or stalled extractions. Parameters are {taskID, fileType}. A sample result:

{
"text": "extracted plain text content",
"charCount": 17961,
"durationMs": 209,
"library": "mammoth",
"version": "1.8.0",
"filename": "report.docx",
"lineBreaks": "single",
"hash": "c99b32954d120bf62ad818944660275f"
}
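Since there is no UI, a caller has to work out completion, failure or a stalled extraction from the status fields alone. The following is a minimal sketch of that check, not the service's actual logic. Only the 'processing' value of extractionStatus appears in this post; the 'done' and 'failed' values, the startedAt field and the 60-second threshold are hypothetical names chosen for illustration.

```typescript
// Status values: only "processing" is taken from the playbook;
// "done" and "failed" are assumed names for illustration.
type ExtractionStatus = "processing" | "done" | "failed";

interface StatusFields {
  extractionStatus: ExtractionStatus;
  startedAt: number; // hypothetical field: epoch milliseconds when processing began
}

type Outcome = "completed" | "failed" | "stalled" | "in-progress";

// Classify a record by its status and by how long it has been processing.
function classifyExtraction(
  fields: StatusFields,
  now: number,
  stallAfterMs: number = 60_000, // assumed threshold, tune to real workloads
): Outcome {
  switch (fields.extractionStatus) {
    case "done":
      return "completed";
    case "failed":
      return "failed";
    case "processing":
      // Still "processing" long after the start time suggests a stalled job.
      return now - fields.startedAt > stallAfterMs ? "stalled" : "in-progress";
  }
}
```

A scheduled check could run this over recently submitted StudentTask records and flag any that come back as stalled.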

This has been extended to provide a micro-service that takes fileType and b64 (the base64-encoded file) as parameters, along with fileName and returnHash. As this is an open endpoint, we shall be adding a state or session parameter to reduce fake attempts.

info

curl https://server/extraction/api/extract/direct -H "Content-Type: application/json" -d "{\"b64\": \"${B64}\", \"fileType\": \"pdf\", \"fileName\": \"test4.pdf\", \"returnHash\": true}"

{
"text": "extracted text",
"charCount": 2115,
"durationMs": 241,
"library": "mammoth",
"version": "1.12.0",
"filename": "test.docx",
"hash": "c99b32954d120bf62ad818944660275f"
}
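The direct call above could be made from TypeScript roughly as follows. This is a sketch under assumptions: the field names b64, fileType, fileName and returnHash come from the example request, while the POST method, the JSON content type and the helper names buildDirectRequest and extractDirect are assumptions rather than the service's documented contract.

```typescript
// Request body, using the field names from the cURL example.
interface DirectExtractRequest {
  b64: string;
  fileType: "docx" | "pdf";
  fileName: string;
  returnHash: boolean;
}

// Build the JSON payload from the raw file bytes (hypothetical helper).
function buildDirectRequest(
  binary: Uint8Array,
  fileType: "docx" | "pdf",
  fileName: string,
  returnHash: boolean = true,
): DirectExtractRequest {
  return {
    b64: Buffer.from(binary).toString("base64"),
    fileType,
    fileName,
    returnHash,
  };
}

// Send the payload to the endpoint; POST with a JSON body is assumed here.
async function extractDirect(
  endpoint: string,
  request: DirectExtractRequest,
): Promise<unknown> {
  const response = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(request),
  });
  if (!response.ok) {
    throw new Error(`Extraction failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Keeping the payload builder separate from the network call makes it easy to add the planned state or session parameter in one place later.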

The response is fast: under 300 ms for a 4-page test .docx file. That figure covers:

  • TLS handshake
  • nginx proxy overhead and routing
  • base64 decode
  • mammoth parsing the DOCX XML
  • JSON serialisation of the response
  • network latency both ways

A playbook has been written and fully tested:

  • Upload → stores b64 of doc on StudentTask record (may repeat; each upload overwrites the previous b64)
  • Commit → OData PATCH (answers + timestamp + locked) → fire-and-forget POST to extraction service (POST /api/extract)

Node Extraction Service (Express):

  1. PATCH StudentTask → extractionStatus: 'processing'
  2. Fetch b64 from StudentTask via OData
  3. Decode → Buffer
  4. Branch: DOCX → mammoth | PDF → pdfjs-dist
  5. Compute MD5 hash of original binary
  6. PATCH StudentTask → extracted text + metadata
  7. POST to FileMaker script → archive b64 to Documents

Result in FileMaker:

  • StudentTask record: extracted text, status fields
  • Documents record: reconstituted original binary linked to StudentTask
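Steps 3 to 5 of the playbook (decode, branch on file type, hash the original binary) can be sketched as below. This is an illustration rather than the service's code: the extractors are passed in as plain functions so the sketch stays self-contained, and the names extractFromBase64 and Extractor are hypothetical. In the real service the docx branch would call mammoth and the pdf branch pdfjs-dist.

```typescript
import { createHash } from "node:crypto";

type FileType = "docx" | "pdf";

// An extractor turns the decoded binary into plain text.
// Hypothetical shape; the real ones would wrap mammoth and pdfjs-dist.
type Extractor = (binary: Buffer) => Promise<string>;

interface ExtractionResult {
  text: string;
  charCount: number;
  durationMs: number;
  filename: string;
  hash: string;
}

async function extractFromBase64(
  b64: string,
  fileType: FileType,
  filename: string,
  extractors: Record<FileType, Extractor>,
): Promise<ExtractionResult> {
  const started = Date.now();

  // Step 3: decode the base64 payload into a Buffer.
  const binary = Buffer.from(b64, "base64");

  // Step 4: branch on file type.
  const extract = extractors[fileType];
  if (!extract) {
    throw new Error(`Unsupported fileType: ${fileType}`);
  }
  const text = await extract(binary);

  // Step 5: MD5 of the original binary, not of the extracted text.
  const hash = createHash("md5").update(binary).digest("hex");

  return {
    text,
    charCount: text.length,
    durationMs: Date.now() - started,
    filename,
    hash,
  };
}
```

Hashing the decoded binary rather than the base64 string means the hash can be compared against the reconstituted file stored on the Documents record.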

If questions are prefixed with a known character (§), the text can be extracted with single carriage returns, and extra lines can then be substituted in before that character for presentation purposes. The extraction is written as a service, so it could be called from other places in the FileMaker ecosphere.
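The § substitution described above could look something like this. It is a sketch that assumes each question marker sits at the start of a line; the function name addSpacingBeforeQuestions is hypothetical.

```typescript
// Insert a blank line before every line that starts with the marker.
// Handles \r\n, \r (FileMaker's usual return) and \n line endings.
function addSpacingBeforeQuestions(text: string, marker: string = "§"): string {
  // Escape the marker in case it is a regex metacharacter.
  const escaped = marker.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Match one or more line breaks directly before the marker and
  // replace them with exactly two, so repeated runs do not add more.
  const pattern = new RegExp(`(\\r\\n|\\r|\\n)+(?=${escaped})`, "g");
  return text.replace(pattern, "$1$1");
}
```

For example, `addSpacingBeforeQuestions("Intro\n§ Q1\nAnswer\n§ Q2")` yields `"Intro\n\n§ Q1\nAnswer\n\n§ Q2"`, and applying it a second time leaves the text unchanged.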