Open Parquet files • arcMS

How to open converted files?

Parquet files created by arcMS can be easily opened with R or Python. There are also functions included in the arcMS package to recreate sample_dataset objects from Parquet files. This class of object is associated with generic functions to quickly get data, metadata and simple plots like TIC.

1. Load converted Parquet files in R without arcMS

Files created by arcMS in the Parquet format can be opened in R with the Arrow library.

Data can be loaded in RAM with the following commands:

library(arrow)
data = read_parquet("converted_file.parquet")

head(data)
#> # A tibble: 6 × 7
#>        rt scanid mslevel    mz intensity   bin    dt
#>     <dbl>  <int> <fct>   <dbl>     <dbl> <int> <dbl>
#> 1 0.00740      1 1        556.         0    19  1.35
#> 2 0.00740      1 1        556.         5    19  1.35
#> 3 0.00740      1 1        556.         7    19  1.35
#> 4 0.00740      1 1        556.         0    19  1.35
#> 5 0.00740      1 1        556.         0    23  1.63
#> 6 0.00740      1 1        556.         1    23  1.63

Data can then be filtered and quickly aggregated, e.g. to obtain TIC plot:

library(data.table)
data = as.data.table(data)
datalow = data[mslevel == 1, ]
TIC = datalow[, list(intensity = sum(intensity)), by=list(rt)]
plot(TIC$rt, TIC$intensity, type = "l")

To save RAM, data can also be manipulated directly on-disk thanks to the Arrow library (open_dataset() function):

data = open_dataset("converted_file.parquet")

Data can be filtered, rearranged, sorted and aggregated with the dplyr syntax, and only the resulting data will be loaded in RAM (with the collect() function):

library(dplyr)

TIC = data |>
  filter(mslevel == "1") |>
  arrange(rt) |>
  group_by(rt) |> 
  summarise(int = sum(intensity)) |> 
  collect()

2. Load converted Parquet files in Python

Parquet files can be opened as a DataFrame in Python with the pandas library:

import pandas as pd
data = pd.read_parquet("converted_file.parquet")
ms1 = data[data['mslevel'] == "1"]
print(ms1)

It can also be loaded as an Arrow object (ParquetDataset) with the pyarrow library:

import pyarrow.parquet as pq
data = pq.ParquetDataset("converted_file.parquet")

3. Load converted Parquet files in R with arcMS

The methods above only retrieve the main data from the Parquet file, but not metadata. Other functions must be used to retrieve metadata.

To simplify opening both data and metadata, functions are available in arcMS to load a Parquet file as a sample_dataset object, also allowing easy manipulation with some generic functions.

library(arcMS)
dataset = create_sample_dataset("converted_file.parquet")

Retrieve main data:

data = get_sample_data(dataset)

head(data)
#> # A tibble: 6 × 7
#>        rt scanid mslevel    mz intensity   bin    dt
#>     <dbl>  <int> <fct>   <dbl>     <dbl> <int> <dbl>
#> 1 0.00740      1 1        556.         0    19  1.35
#> 2 0.00740      1 1        556.         5    19  1.35
#> 3 0.00740      1 1        556.         7    19  1.35
#> 4 0.00740      1 1        556.         0    19  1.35
#> 5 0.00740      1 1        556.         0    23  1.63
#> 6 0.00740      1 1        556.         1    23  1.63

Retrieve sample metadata:

sample_metadata = get_sample_metadata(dataset)

str(sample_metadata)
#> Classes 'data.table' and 'data.frame':   1 obs. of  45 variables:
#>  $ id                                    : chr "0134efbf-c75a-411b-842a-4f35e2b76347"
#>  $ name                                  : chr "cal 1"
#>  $ description                           : chr "NaN"
#>  $ sample.description                    : chr "NaN"
#>  $ sample.originalSampleId               : chr "NaN"
#>  $ sample.id                             : chr "8c6f3993-a791-4396-a299-b4dee758a5e7"
#>  $ sample.gender                         : chr "NaN"
#>  $ sample.name                           : chr "cal 1"
#>  $ sample.assayConditions                : chr "NaN"
#>  $ sample.bracketGroup                   : chr "NaN"
#>  $ sample.dose                           : chr "NaN"
#>  $ sample.dosingRoute                    : chr "NaN"
#>  $ sample.day                            : chr "NaN"
#>  $ sample.eCordId                        : chr "NaN"
#>  $ sample.experimentalConcentration      : chr "NaN"
#>  $ sample.groupId                        : chr "NaN"
#>  $ sample.injectionId                    : chr "NaN"
#>  $ sample.matrix                         : chr "NaN"
#>  $ sample.solventDelay                   : chr "NaN"
#>  $ sample.species                        : chr "NaN"
#>  $ sample.studyId                        : chr "NaN"
#>  $ sample.subjectId                      : chr "NaN"
#>  $ sample.molForm                        : chr "NaN"
#>  $ sample.preparation                    : chr "NaN"
#>  $ sample.sampleType                     : chr "Standard"
#>  $ sample.batchId                        : chr "NaN"
#>  $ sample.studyName                      : chr "NaN"
#>  $ sample.sampleLevel                    : chr "Unspecified"
#>  $ sample.sampleWeight                   : num 1
#>  $ sample.dilution                       : num 1
#>  $ sample.replicateNumber                : int 1
#>  $ sample.wellPosition                   : chr "1:A,7"
#>  $ sample.injectionVolume                : num 10
#>  $ sample.acquisitionRunTime             : num 34
#>  $ sample.acquisitionStartTime           : chr "2021-11-17T16:14:36.1683419+01:00"
#>  $ sample.time                           : chr "NaN"
#>  $ sample.processingOptions              : chr "QuantitationStd"
#>  $ sample.processingFunction             : chr "NaN"
#>  $ sample.processingSequenceNumber       : int 0
#>  $ components.odata.navigationLink       : chr "http://localhost:50034/unifi/v1/sampleresults(0134efbf-c75a-411b-842a-4f35e2b76347)/components"
#>  $ spectra.odata.navigationLink          : chr "http://localhost:50034/unifi/v1/sampleresults(0134efbf-c75a-411b-842a-4f35e2b76347)/spectra/mass.mse"
#>  $ spectrumInfos.odata.navigationLink    : chr "http://localhost:50034/unifi/v1/sampleresults(0134efbf-c75a-411b-842a-4f35e2b76347)/spectrumInfos"
#>  $ chromatogramInfos.odata.navigationLink: chr "http://localhost:50034/unifi/v1/sampleresults(0134efbf-c75a-411b-842a-4f35e2b76347)/chromatogramInfos"
#>  $ sampleName                            : chr "cal 1_replicate_1"
#>  $ analysisName                          : chr "20211117_NORMAN_2e EIL_semi-quanti_Parent-TP products-testsAPI"
#>  - attr(*, ".internal.selfref")=<externalptr>

Retrieve spectrum metadata:

spectrum_metadata = get_spectrum_metadata(dataset)

str(spectrum_metadata)
#> Classes 'data.table' and 'data.frame':   3 obs. of  32 variables:
#>  $ id                                                      : chr  "a8d5230c-aaa3-4f34-9a20-738ae8484612" "896256ff-aa7e-4e45-b76c-5f4fb6275fd3" "b83dffcc-5817-4ff6-b535-f2ec0869e5e3"
#>  $ name                                                    : chr  "1: MS LockSpray Reference Data (546.2766-566.2766) 6V ESI+" "2: HD TOF MSe (50-1000) 6V ESI+" "3: HD TOF MSe (50-1000) 20-56V ESI+"
#>  $ isCentroidData                                          : logi  FALSE FALSE FALSE
#>  $ isRetentionData                                         : logi  TRUE TRUE TRUE
#>  $ isIonMobilityData                                       : logi  FALSE TRUE TRUE
#>  $ hasCCSCalibration                                       : logi  FALSE TRUE TRUE
#>  $ detectorType                                            : chr  "MS" "MS" "MS"
#>  $ analyticalTechnique.X.odata.type                        : chr  "#Waters.WebApi.Common.Models.MSTechnique" "#Waters.WebApi.Common.Models.MSTechnique" "#Waters.WebApi.Common.Models.MSTechnique"
#>  $ analyticalTechnique.hardwareName                        : chr  "NaN" "NaN" "NaN"
#>  $ analyticalTechnique.scanningMethod                      : chr  "MS" "MS" "MS"
#>  $ analyticalTechnique.massAnalyser                        : chr  "QTOF" "QTOF" "QTOF"
#>  $ analyticalTechnique.ionisationMode                      : chr  "+" "+" "+"
#>  $ analyticalTechnique.ionisationType                      : chr  "ESI" "ESI" "ESI"
#>  $ analyticalTechnique.lowMass                             : num  546 50 50
#>  $ analyticalTechnique.highMass                            : num  566 1000 1000
#>  $ analyticalTechnique.adcGroup.acquisitionMode            : chr  "ADC_PD" "ADC_PD" "ADC_PD"
#>  $ analyticalTechnique.adcGroup.acquisitionFrequency       : chr  "NaN" "NaN" "NaN"
#>  $ analyticalTechnique.tofGroup.nominalResolution          : num  30000 30000 30000
#>  $ analyticalTechnique.tofGroup.mseLevel                   : chr  "Unknown" "Low" "High"
#>  $ analyticalTechnique.tofGroup.pusherFrequency            : num  14085 14085 14085
#>  $ analyticalTechnique.quadGroup                           : chr  "NaN" "NaN" "NaN"
#>  $ axisX.label                                             : chr  "Observed mass" "Observed mass" "Observed mass"
#>  $ axisX.unit                                              : chr  "m/z" "m/z" "m/z"
#>  $ axisX.lowerBound                                        : num  546 50 50
#>  $ axisX.upperBound                                        : num  566 1000 1000
#>  $ axisY.label                                             : chr  "Intensity" "Intensity" "Intensity"
#>  $ axisY.unit                                              : chr  "Counts" "Counts" "Counts"
#>  $ axisY.lowerBound                                        : chr  "NaN" "NaN" "NaN"
#>  $ axisY.upperBound                                        : chr  "NaN" "NaN" "NaN"
#>  $ analyticalTechnique.adcGroup.ionResponses.ionType       : chr  "PEPTIDE" "PEPTIDE" "PEPTIDE"
#>  $ analyticalTechnique.adcGroup.ionResponses.charge        : int  1 1 1
#>  $ analyticalTechnique.adcGroup.ionResponses.averageIonArea: num  5.92 5.92 5.92
#>  - attr(*, ".internal.selfref")=<externalptr>