43 commits
3b18738
commit initial working archiver (needs splitting)
cooperka Mar 27, 2026
04c8804
export individual user csvs
cooperka Mar 30, 2026
0821dc6
idx isn't needed anymore
cooperka Mar 30, 2026
312c624
upload all without tags
cooperka Mar 30, 2026
b68417b
forked tags works
cooperka Mar 30, 2026
8e8f411
basic dynamic metadata works
cooperka Apr 1, 2026
37ba6ad
allow user-mission has-many tagging
cooperka Apr 1, 2026
a2858a9
more explicit relations + add todos
cooperka Apr 1, 2026
95429e4
include forms xlsx
cooperka Apr 1, 2026
32a7fb4
get response odata export working
cooperka Apr 23, 2026
c8db250
track all warnings
cooperka May 14, 2026
bddd583
clean up comment
cooperka May 21, 2026
ac9f078
export ODK XML alongside XLSX
cooperka Apr 29, 2026
a8f23e4
export media attachments and rework metadata for dynamic keys
cooperka May 11, 2026
6a254b8
big mediaprompt performance improvement
cooperka May 14, 2026
b71ea3c
memoize to improve performance of to_xls
cooperka May 22, 2026
6d69ad3
get response attachments parsed and exported
cooperka May 15, 2026
f86cd33
joins attached_odk_xml
cooperka May 27, 2026
4cda320
csv performance optimization trial
cooperka May 27, 2026
4578288
Revert "csv performance optimization trial"
cooperka May 27, 2026
20c8f72
stream the zip instead of buffering in memory
cooperka May 27, 2026
c823737
track manifest of uploaded files
cooperka May 21, 2026
ae647c0
fix response attachment resolver
cooperka May 27, 2026
791a5db
much more verbose logs
cooperka May 27, 2026
d50d06e
silence overwhelming logs by default
cooperka May 28, 2026
8038431
add benchmark + steps to export
cooperka May 28, 2026
353f0fe
also benchmark upload step
cooperka May 28, 2026
c382271
fix lints
cooperka May 28, 2026
35eca35
enable advanced debugging for upload headers!
cooperka May 28, 2026
2431145
read container from env
cooperka May 28, 2026
bc9ab83
log counts for tracking purposes
cooperka May 28, 2026
fe269da
recursively vanillify to handle nils in array
cooperka May 28, 2026
8969591
handle missing attachments as warnings not errors
cooperka May 29, 2026
17bcd89
allow skipping individual steps
cooperka May 29, 2026
c4368f3
output upload indexes
cooperka May 30, 2026
73c476b
handle crashes while reporting elapsed time
cooperka May 30, 2026
388cef3
cache dirty responses on export
cooperka May 30, 2026
0aff2ca
remove some unnecessary PII from users
cooperka May 31, 2026
6bd4f31
allow forms and hints to be lost (with warnings) too
cooperka May 31, 2026
e9859d3
be a bit more verbose
cooperka May 31, 2026
2549f0e
starting docs
cooperka Jun 11, 2026
f240aae
more code docs
cooperka Jun 16, 2026
90be642
fix generic metadata positioning with fields
cooperka Jun 16, 2026
21 changes: 21 additions & 0 deletions .idea/runConfigurations/Rails_runner__Archive.xml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions .idea/runConfigurations/Rails_runner__CacheODataJob.xml

2 changes: 1 addition & 1 deletion Gemfile
@@ -71,7 +71,7 @@ gem "reverse_markdown", "~> 2.0"
 # Storage
 gem "active_storage_validations", "~> 0.9.3"
 gem "aws-sdk-s3", "~> 1.208", require: false
-gem "azure-storage-blob", "~> 2.0", require: false
+gem "azure-storage-blob", github: "sassafrastech/azure-storage-blob", tag: "v2.0.3-withTags"
 gem "image_processing", "~> 1.12"
 gem "sys-filesystem", "~> 1.4"

14 changes: 10 additions & 4 deletions Gemfile.lock
@@ -8,6 +8,15 @@ GIT
       jquery-rails
       railties

+GIT
+  remote: https://github.com/sassafrastech/azure-storage-blob.git
+  revision: 99cd6f7876c3814fb62716b37f5124bef2690bd3
+  tag: v2.0.3-withTags
+  specs:
+    azure-storage-blob (2.0.3)
+      azure-storage-common (~> 2.0)
+      nokogiri (~> 1, >= 1.10.8)
+
 GIT
   remote: https://github.com/sassafrastech/closure_tree.git
   revision: 854f9292333ea0ae209376d69e4f4c18a192b22c
@@ -155,9 +164,6 @@ GEM
       aws-sigv4 (~> 1.5)
     aws-sigv4 (1.12.1)
       aws-eventstream (~> 1, >= 1.0.2)
-    azure-storage-blob (2.0.3)
-      azure-storage-common (~> 2.0)
-      nokogiri (~> 1, >= 1.10.8)
     azure-storage-common (2.0.4)
       faraday (~> 1.0)
       faraday_middleware (~> 1.0, >= 1.0.0.rc1)
@@ -706,7 +712,7 @@ DEPENDENCIES
   authlogic (~> 6.1)
   awesome_print (~> 1.6)
   aws-sdk-s3 (~> 1.208)
-  azure-storage-blob (~> 2.0)
+  azure-storage-blob!
   binding_of_caller (~> 1.0.0)
   bluecloth (~> 2.2)
   blueprinter (~> 0.25.1)
185 changes: 185 additions & 0 deletions app/models/archiving/exporter.rb
@@ -0,0 +1,185 @@
# frozen_string_literal: true

require "fileutils"

# This script outputs a zip file archive of all useful NEMO data in human-readable format.
#
# rubocop:disable Rails/Output
module Archiving
  # Outputs data to CSV ZIP bundle.
  class Exporter
    attr_accessor :relations, :dont_implicitly_expand

    # Optionally accepts a list of relations to export and a list of model classes
    # that should not be implicitly expanded.
    # Hard-coded defaults generally shouldn't need to be overridden for archival.
    def initialize(relations: nil, dont_implicitly_expand: nil)
      self.relations = relations || [
        Mission.all,
        User.all,
        Assignment.all,
      ]

      self.dont_implicitly_expand = dont_implicitly_expand || [
        Setting,
        UserGroup,
        UserGroupAssignment,
      ]
    end

    # `verbose: true` also logs SQL queries.
    # `skip` is an array of steps to skip ("relations", "forms", "hints", "responses", "attachments").
    def export(verbose: false, skip: [])
      skip = skip.map(&:to_s)
      start = Process.clock_gettime(Process::CLOCK_MONOTONIC)

      begin
        if verbose
          perform_export(skip: skip)
        else
          silence_verbose_logs { perform_export(skip: skip) }
        end
      ensure
        elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
        puts "Export took #{elapsed.round(2)}s"
      end
    end

    def perform_export(skip: [])
      warnings = []

      FileUtils.mkdir_p(export_dir)
      Zip::OutputStream.open(zipfile_path) do |out|
        self.relations = [] if skip.include?("relations")
        puts "Exporting relations: #{relations.map(&:klass).join(', ')}..."
        expander = RelationExpander.new(relations, dont_implicitly_expand: dont_implicitly_expand)
        expander.expanded.each do |klass, relations|
          puts " Exporting #{klass.count} #{klass.name.pluralize}..."
          col_names = klass.column_names - %w[standard_copy last_mission_id
                                              birth_year crypted_password gender gender_custom
                                              password_salt perishable_token persistence_token]
          relations.each do |relation|
            relation = relation.select(col_names.join(", ")) unless col_names == klass.column_names
            relation.each do |entry|
              # Filename must be in the format `ClassName 123-456.csv` with a space followed by the ID of the item,
              # since this is used in the uploader script to process files.
              out.put_next_entry("#{klass.name.tr(':', '_')} #{entry.id}.csv")
              # Pick out this single entry (but keep the ActiveRecord relation to be able to use `copy_to`)
              # and save each to disk.
              relation.where(id: entry.id).copy_to { |line| out.write(line) }
            end
          end
        end

        items = skip.include?("forms") ? Form.none : Form.with_attached_odk_xml
        total_count = items.count
        curr_count = 0
        items.find_each do |form|
          puts "Exporting form #{curr_count += 1}/#{total_count}: #{form.name}..."

          # XLSX version of the form
          out.put_next_entry("Form #{form.id}.xlsx")
          out.write(Forms::Export.new(form).to_xls)

          # Equivalent ODK XML version of the form
          begin
            data = form.odk_xml.download
            out.put_next_entry("Form #{form.id}.xml")
            out.write(data)
          rescue ActiveStorage::FileNotFoundError => e
            warn(warnings, "Form #{form.id} XML not found (it was likely never published): #{e.message}")
          end
        end

        items = skip.include?("hints") ? Question.none : Question.joins(:media_prompt_attachment).with_attached_media_prompt
        total_count = items.count
        curr_count = 0
        items.find_each do |question|
          puts "Exporting media_prompt #{curr_count += 1}/#{total_count}: #{question.code}..."
          mp = question.media_prompt

          begin
            data = mp.download
            # Convert the filename from e.g. "123_media_prompt.jpg" to "MediaPrompt 123.jpg"
            out.put_next_entry("MediaPrompt #{question.id}.#{mp.filename.extension}")
            out.write(data)
          rescue ActiveStorage::FileNotFoundError => e
            warn(warnings, "MediaPrompt #{question.code} not found: #{e.message}")
          end
        end

        items = skip.include?("responses") ? Response.none : Response.all
        total_count = items.count
        curr_count = 0
        items.find_each do |response|
          # Responses to unpublished forms are generally not cached, and/or server jobs may be behind.
          if response.dirty_json?
            puts "Caching response #{response.shortcode}..."
            CacheODataJob.cache_response(response)
          end

          puts "Exporting response #{curr_count += 1}/#{total_count}: #{response.shortcode}..."
          out.put_next_entry("Response #{response.id}.json")
          out.write(response.cached_json.to_json) # This will be the string "null" if it's not yet cached.
          warn(warnings, "Response #{response.id} was dirty") if response.dirty_json
        end

        items = skip.include?("attachments") ? Media::Object.none : Media::Object.joins(:item_attachment).with_attached_item
        total_count = items.count
        curr_count = 0
        items.find_each do |obj|
          attachment = obj.item
          puts "Exporting response attachment #{curr_count += 1}/#{total_count}: #{attachment.filename}..."

          response_id = obj.answer.response_id
          code = attachment.filename.base.split("-").last

          begin
            data = attachment.download
            # Convert the filename from e.g. "nemo-foo-bar-baz-ImageQ1.jpg" to "ResponseAttachment 123 ImageQ1.jpg"
            # Where foo-bar-baz represents [mission_code]-[form_code]-[response_code] and 123 is the response ID.
            out.put_next_entry("ResponseAttachment #{response_id} #{code}.#{attachment.filename.extension}")
            out.write(data)
          rescue ActiveStorage::FileNotFoundError => e
            warn(warnings, "ResponseAttachment #{response_id} #{code} not found: #{e.message}")
          end
        end
      end

      puts
      Rails.logger.warn("Warnings:\n#{warnings.join("\n")}\n") unless warnings.empty?
      puts "Exported #{zipfile_path}"
      puts "Encountered #{warnings.count} warnings."
    end

    private

    # Log each warning immediately, but also collect them all so they can be repeated at the end,
    # since individual warnings will likely get lost in the output.
    def warn(warnings, msg)
      warnings.push(msg)
      Rails.logger.warn(msg)
    end

    def export_dir
      @export_dir ||= Rails.root.join("tmp/archives")
    end

    def zipfile_path
      @zipfile_path ||= export_dir.join("#{Time.zone.now.to_fs(:filename_datetime)}.zip")
    end

    def silence_verbose_logs
      # We could simply `Rails.logger.silence` but that would hide warnings too.
      loggers = [
        ActiveRecord::Base.logger,
      ].compact.uniq

      old_levels = loggers.index_with(&:level)
      loggers.each { |logger| logger.level = Logger::WARN }

      yield
    ensure
      # old_levels is still nil if an error was raised before it was assigned.
      old_levels&.each { |logger, level| logger.level = level }
    end
  end
end
# rubocop:enable Rails/Output
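
Note on `silence_verbose_logs`: the save, raise, and restore pattern for logger levels can be tried outside Rails with the standard library alone. The following is a minimal sketch under stated assumptions, not the PR's implementation: `with_quiet_loggers` is a hypothetical name, and plain `Logger` instances stand in for `ActiveRecord::Base.logger`.

```ruby
require "logger"
require "stringio"

# Temporarily raises the given loggers to WARN, then restores their previous levels.
# `with_quiet_loggers` is a hypothetical helper name used only for this sketch.
def with_quiet_loggers(loggers)
  loggers = loggers.compact.uniq
  old_levels = loggers.to_h { |logger| [logger, logger.level] }
  loggers.each { |logger| logger.level = Logger::WARN }
  yield
ensure
  # old_levels is still nil if something failed before it was assigned.
  (old_levels || {}).each { |logger, level| logger.level = level }
end

sql_logger = Logger.new(StringIO.new)
sql_logger.level = Logger::DEBUG

begin
  with_quiet_loggers([sql_logger, nil]) do
    sql_logger.debug("SELECT 1") # suppressed while the block runs
    raise "simulated export crash"
  end
rescue RuntimeError
  # The ensure clause has already restored the level by the time we get here.
end

puts sql_logger.level == Logger::DEBUG
```

Because the restore happens in `ensure`, the original level comes back even when the block raises, which matters for a long-running export that may crash partway through.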
44 changes: 44 additions & 0 deletions app/models/archiving/relation_expander.rb
@@ -0,0 +1,44 @@
# frozen_string_literal: true

# Note: Documented in "Server Archiving" wiki page.
module Archiving
  # Expands a given set of relations to include all necessary related objects.
  class RelationExpander
    attr_accessor :initial_relations, :relations_by_class, :options

    def initialize(relations, **options)
      self.options = options
      options[:dont_implicitly_expand] ||= []
      self.initial_relations = relations
      self.relations_by_class = relations.group_by(&:klass)
    end

    # Returns a hash of form {ModelClass => [Relation, Relation, ...], ...}, mapping model classes
    # to arrays of Relations.
    def expanded
      initial_relations.each { |r| expand(r) }
      relations_by_class
    end

    private

    def expand(relation)
      (relation.klass.clone_options[:follow] || []).each do |assn_name|
        assn = relation.klass.reflect_on_association(assn_name)

        # dont_implicitly_expand is provided if the caller wants to indicate that one of the initial_relations
        # should cover all relevant rows and therefore implicit expansion is not necessary. This improves
        # performance by simplifying the eventual SQL queries.
        next if options[:dont_implicitly_expand].include?(assn.klass)

        new_rel = if assn.belongs_to?
                    assn.klass.where("id IN (#{relation.select(assn.foreign_key).to_sql})")
                  else
                    assn.klass.where("#{assn.foreign_key} IN (#{relation.select(:id).to_sql})")
                  end
        (relations_by_class[assn.klass] ||= []) << new_rel
        expand(new_rel)
      end
    end
  end
end
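
Note on `RelationExpander#expand`: the recursion can be illustrated without ActiveRecord. The sketch below is a simplification under stated assumptions: the `FOLLOW` map and the model names in it are hypothetical stand-ins for whatever `clone_options[:follow]` returns in the app, and plain strings replace relations and SQL subqueries.

```ruby
# Hypothetical association map: model name => names of the models to follow from it.
FOLLOW = {
  "Mission" => %w[Form Setting],
  "Form" => %w[Question],
  "Question" => [],
  "Setting" => [],
}.freeze

# Walks the followed associations depth-first, counting how many times each model
# is reached, and ignores anything listed in `skip` (like dont_implicitly_expand).
def expand(name, skip:, found: Hash.new(0))
  FOLLOW.fetch(name, []).each do |child|
    next if skip.include?(child)

    found[child] += 1
    expand(child, skip: skip, found: found)
  end
  found
end

p expand("Mission", skip: ["Setting"])
```

Skipping a model prunes its whole subtree, which is why excluding a few classes can noticeably reduce the number of generated queries. Like the diff above, this sketch has no cycle guard, so it assumes the followed associations form an acyclic graph.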
29 changes: 19 additions & 10 deletions app/models/forms/export.rb
@@ -58,7 +58,7 @@ def initialize(form)
     def to_csv
       CSV.generate do |csv|
         csv << COLUMNS
-        @form.preordered_items.each do |q|
+        preordered_items.each do |q|
           csv << row(q)
         end
       end
@@ -74,7 +74,7 @@ def to_xls
       settings = book.create_worksheet(name: "settings")

       # Get languages
-      locales = @form.mission.setting.preferred_locales
+      locales = Setting.for_mission(@form.mission).preferred_locales

       # Write sheet headings at row index 0
       questions.row(0).push(
@@ -102,7 +102,9 @@ def to_xls
       # Hence, we push to the row (i + index_mod)
       index_mod = 1 # start at row index 1

-      @form.preordered_items.each_with_index do |q, i|
+      items = preordered_items
+
+      items.each_with_index do |q, i|
         # this variable keeps track of the spreadsheet row to be written during this loop iteration
         row_index = i + index_mod

@@ -346,7 +348,7 @@ def to_xls
         end

         # are we at the end of the form?
-        if i == @form.preordered_items.size - 1
+        if i == items.size - 1
           row_index += 1

           # do we still have unclosed groups in the tracker array?
@@ -390,7 +392,7 @@ def to_xls
       ## Settings
       settings.row(0).push("form_title", "form_id", "version", "default_language", "allow_choice_duplicates")

-      lang = @form.mission.setting.preferred_locales[0].to_s
+      lang = Setting.for_mission(@form.mission).preferred_locales[0].to_s
       version = if @form.current_version.present?
                   @form.current_version.decorate.name
                 else
@@ -448,6 +450,12 @@ def name(qing)
       end
     end

+    # Memoize slow method
+    def preordered_items
+      # TODO: This can be further improved with eager_load: ...
+      @preordered_items ||= @form.preordered_items
+    end
+
     # Given a header like `"label"`, return an array of localized headers like `["label::English (en)"]`
     def local_headers(header, locales)
       locales.map do |locale|
@@ -592,12 +600,13 @@ def unique_level_name(os_name, level_name)
     # recursively remove pesky characters and replace spaces with underscores
     # for XLSForm compatibility
     def vanillify(input)
-      return "" if input.nil?
-
-      if input.instance_of?(String)
+      case input
+      when String
         input.vanilla.tr(" ", "_")
-      elsif input.instance_of?(Array)
-        input.map { |n| n.vanilla.tr(" ", "_") }
+      when Array
+        input.map { |n| vanillify(n) }
+      when nil
+        ""
       else
         raise "Unallowed type passed to vanillify: #{input.class}"
       end
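
Note on the `vanillify` change: the point of recursing on array elements is that a `nil` (or a nested array) inside an array no longer raises `NoMethodError`. A standalone sketch of that behavior follows. `String#vanilla` is an app extension not shown in this diff, so `strip_specials` is a hypothetical stand-in that simply drops everything except word characters and spaces.

```ruby
# Hypothetical stand-in for the app's String#vanilla, which is not part of this diff.
def strip_specials(str)
  str.gsub(/[^\w ]/, "")
end

# Recursive version, mirroring the case/when structure introduced by the PR.
def vanillify(input)
  case input
  when String
    strip_specials(input).tr(" ", "_")
  when Array
    input.map { |n| vanillify(n) }
  when nil
    ""
  else
    raise ArgumentError, "Unallowed type passed to vanillify: #{input.class}"
  end
end

p vanillify(["a b", nil, ["c d!"]])
```

With the previous implementation, the `nil` element would have been sent `vanilla` directly and crashed. Here each element goes back through the same type dispatch.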