Looking for answers without outsourcing to an LLM
Some examples of where I get limited by a specific programming language's or framework's syntax or particular way of doing things, and how I navigate it: I try to solve it myself from what I already know, or with a good old internet search (i.e. manually looking at links and posts), and I learn a bit more in the process.
I know there is research now showing the impact of using LLMs on our brain, but to me, this is just how I want to do my work. I want to be involved.
1
I want to get access to the subject_id within my PyTorch dataset. I know that
it is present in a dictionary for each row of my data.
(Pdb) eval_dataset
<__main__.Cardiomegaly object at 0x00000298DF1C8BC0>
(Pdb) dir(eval_dataset)
['__add__', '__class__', '__class_getitem__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__orig_bases__', '__parameters__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'df', 'img_dir', 'transform']
(Pdb) eval_dataset[0]
(tensor([....]]), tensor([0.]), {'subject_id': '10000032', 'study_id': np.int64(50414267)})
(Pdb) type(eval_dataset[0])
<class 'tuple'>
(Pdb) eval_dataset[0][0]
tensor([[[ 0.9988, 0.5878, 0.1939, ..., -0.0972, -0.1143, -0.1486],
[ 0.9817, 0.6392, 0.2796, ..., -0.1143, -0.1314, -0.0801],
[ 1.0331, 0.6906, 0.4166, ..., 0.0569, 0.0398, 0.0056],
...,
...,
[-1.5430, -1.6650, -1.7522, ..., -1.7696, -1.7696, -1.7696],
[-1.5430, -1.6302, -1.7522, ..., -1.8044, -1.8044, -1.8044],
[-1.5430, -1.6476, -1.7347, ..., -1.8044, -1.8044, -1.8044]]])
(Pdb) eval_dataset[0][1]
tensor([0.])
(Pdb) eval_dataset[0][2]
{'subject_id': '10000032', 'study_id': np.int64(50414267)}
And there I have it, so now collecting the subject_ids is a matter of:
(Pdb) subject_ids = [row[2]["subject_id"] for row in eval_dataset]
Python makes it easy to inspect objects and dig things out. In other languages it may be a bit trickier, but it is still possible: we have good old printing!
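Here is a minimal sketch of the same pattern without torch, so it runs anywhere. ToyDataset and its rows attribute are hypothetical stand-ins for my Cardiomegaly class and its df, the placeholder lists stand in for the tensors, and the second row is made-up data. The only assumption carried over is that each item is an (image, label, metadata) tuple, as seen in the pdb session.

```python
class ToyDataset:
    """Hypothetical stand-in for the Cardiomegaly dataset; no torch needed."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts, playing the role of self.df

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]  # IndexError past the end is what stops iteration
        image = [0.0, 0.0]    # placeholder for the image tensor
        label = [0.0]         # placeholder for the label tensor
        meta = {"subject_id": row["subject_id"], "study_id": row["study_id"]}
        return image, label, meta


toy = ToyDataset([
    {"subject_id": "10000032", "study_id": 50414267},
    {"subject_id": "10000033", "study_id": 50414268},  # made-up second row
])

# Same pattern as in the pdb session: position 2 of each item is the metadata dict.
subject_ids = [meta["subject_id"] for _, _, meta in toy]

# Alternative: read the ids from the underlying table without building any items.
subject_ids_direct = [row["subject_id"] for row in toy.rows]

print(subject_ids)
```

Iterating the dataset builds every item, which for a real image dataset presumably means loading and transforming every image. If the underlying df has a subject_id column (an assumption about my own class here), reading that column directly should give the same list much faster.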
2
I have an integer and I just want to run a loop that many times.
- I search for “for loop javascript”
- I pick the MDN link
I picked MDN over w3schools since I am aware of MDN's authority. w3schools is similar for me, of course, but I just went with MDN.
3
I have a dataframe with a MultiIndex and I want to access the data for each unique combination of the index columns:
(Pdb) insurance_metrics
                                            n      tp       tn       fp      fn  accuracy  precision    recall        f1       auc    pr_auc
dataset                    insurance
eval_1_cardiomegaly.csv.gz Medicaid   14877.0  1816.0   9472.0   2897.0   692.0  0.758755   0.385317  0.724083  0.502977  0.826120  0.481929
                           Medicare   45393.0  8678.0  21264.0  13126.0  2325.0  0.659617   0.398000  0.788694  0.529033  0.772568  0.493489
...
I want to get access to the data for each unique combination of dataset and insurance:
insurance_metrics.index
MultiIndex([('eval_1_cardiomegaly.csv.gz', 'Medicaid'),
('eval_1_cardiomegaly.csv.gz', 'Medicare'),
('eval_1_cardiomegaly.csv.gz', 'No charge'),
Search query: “pandas selecting by index value”
Reading, https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-label
The .loc attribute is the primary access method. The following are valid inputs:
- A single label, e.g. 5 or 'a' (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).
- A list or array of labels ['a', 'b', 'c'].
- A slice object with labels 'a':'f'. Note that contrary to usual Python slices, both the start and the stop are included, when present in the index! See Slicing with labels.
- A boolean array.
My brain goes: omg.
So, I start hacking around:
(Pdb) insurance_metrics.get(('eval_1_cardiomegaly.csv.gz', 'Medicaid'))
No dice. (get looks up column names, not index labels, so it quietly returns None here.)
Okay, I know about loc, and I know that I want to specify a tuple, since I have a MultiIndex:
(Pdb) insurance_metrics.loc(('eval_1_cardiomegaly.csv.gz', 'Medicaid'))
*** ValueError: No axis named ('eval_1_cardiomegaly.csv.gz', 'Medicaid') for object type DataFrame
(Pdb) insurance_metrics.loc[("eval_1_cardiomegaly.csv.gz", "Medicare")]
n 45393.000000
tp 8678.000000
tn 21264.000000
fp 13126.000000
fn 2325.000000
accuracy 0.659617
precision 0.398000
recall 0.788694
f1 0.529033
auc 0.772568
pr_auc 0.493489
Name: (eval_1_cardiomegaly.csv.gz, Medicare), dtype: float64
Okay, so I have what I needed. To automate it, I am just gonna use groupby:
(Pdb) items=[(d, i) for d, i in insurance_metrics.groupby(by=['dataset', 'insurance'])]
(Pdb) type(items[0])
<class 'tuple'>
(Pdb) items[0][0]
('eval_1_cardiomegaly.csv.gz', 'Medicaid')
(Pdb) items[0][1]
                                            n      tp      tn      fp     fn  accuracy  precision    recall        f1      auc    pr_auc
dataset                    insurance
eval_1_cardiomegaly.csv.gz Medicaid   14877.0  1816.0  9472.0  2897.0  692.0  0.758755   0.385317  0.724083  0.502977  0.82612  0.481929
Okay, so apparently, groupby works with index columns too.
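To pin down what worked, here is a small self-contained sketch. The metrics frame is a toy rebuilt from two of the columns shown above, not my real insurance_metrics. It shows the square-bracket lookup with a tuple, and two ways to walk every index combination. As far as I can tell, the round-bracket call failed because loc(...) treats its argument as an axis name, which is where the "No axis named" error comes from.

```python
import pandas as pd

# Toy frame with the same index layout; numbers copied from the rows shown above.
metrics = pd.DataFrame(
    {
        "dataset": ["eval_1_cardiomegaly.csv.gz", "eval_1_cardiomegaly.csv.gz"],
        "insurance": ["Medicaid", "Medicare"],
        "n": [14877.0, 45393.0],
        "accuracy": [0.758755, 0.659617],
    }
).set_index(["dataset", "insurance"])

# Square brackets plus a tuple key: one row comes back as a Series.
medicare = metrics.loc[("eval_1_cardiomegaly.csv.gz", "Medicare")]

# iterrows() yields (index_tuple, row_series) pairs, so groupby is not strictly
# needed when every index combination appears exactly once.
accuracy_by_key = {key: row["accuracy"] for key, row in metrics.iterrows()}

# groupby accepts index level names too; each group is a one-row DataFrame here.
groups = {key: frame for key, frame in metrics.groupby(["dataset", "insurance"])}

print(medicare)
print(accuracy_by_key)
```

With one row per combination, iterrows hands back a Series per key, which is often easier to work with than the one-row DataFrame that groupby returns. groupby becomes the better fit once a combination can repeat.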
4
I have two dataframes, new_results and baseline_results.
I want to copy all the rows of baseline_results that have a specific column value into new_results. In this case,
the value is baseline_cardiomegaly.csv.gz in the dataset column.
My trial-and-error attempts and how I got there:
(Pdb) new_results
Unnamed: 0 y_true y_pred prob_1 ... dataset gender race insurance
0 0 1 1 0.649867 ... eval_1_cardiomegaly.csv.gz M WHITE Medicare
... ... ... ... ... ... ... ... ... ...
[142698 rows x 11 columns]
```python
(Pdb) baseline_results
Unnamed: 0 y_true y_pred prob_1 ... dataset gender race insurance
0 0 0 0 0.057635 ... baseline_cardiomegaly.csv.gz M RECORD_NOT_FOUND RECORD_NOT_FOUND
[362281 rows x 12 columns]
Awfully bad attempts, without even thinking, one might say, hands typing, brain listening to music (perhaps from the muscle memory of yesterday):
(Pdb) new_results["baseline_cardiomegaly.csv.gz"] = baseline_results["baseline_cardiomegaly.csv.gz"]
*** KeyError: 'baseline_cardiomegaly.csv.gz'
(Pdb) new_results["baseline_cardiomegaly.csv.gz"] = baseline_results.loc("baseline_cardiomegaly.csv.gz")
*** ValueError: No axis named baseline_cardiomegaly.csv.gz for object type DataFrame
(Pdb) new_results["baseline_cardiomegaly.csv.gz"] = baseline_results.loc(["baseline_cardiomegaly.csv.gz"])
*** TypeError: unhashable type: 'list'
(Pdb) new_results["baseline_cardiomegaly.csv.gz"] = pd.Series(baseline_results.loc(["baseline_cardiomegaly.csv.gz"]))
*** TypeError: unhashable type: 'list'
(Pdb) baseline_results.loc(["baseline_cardiomegaly.csv.gz"])
*** TypeError: unhashable type: 'list'
(Pdb) baseline_results.loc(("baseline_cardiomegaly.csv.gz"))
*** ValueError: No axis named baseline_cardiomegaly.csv.gz for object type DataFrame
In all of the above attempts, my brain is ignoring the fact that I am selecting by a value within a specific column, and that value is neither an index label nor a column name. (Calling loc with round brackets does not help either: it treats the argument as an axis name, and label lookup needs square brackets.)
dataset is a column, and that's the column I must look up (not an index).
Once the brain has that updated context, I struggle a bit with the exact syntax for filtering:
(Pdb) new_results["baseline_cardiomegaly.csv.gz"] = baseline_results[baseline_results[dataset == "baseline_cardiomegaly..csv.gz"]]
*** NameError: name 'dataset' is not defined
(Pdb) new_results["baseline_cardiomegaly.csv.gz"] = baseline_results[baseline_results["dataset" == "baseline_cardiomegaly..csv.gz"]]
*** KeyError: False
(Pdb) new_results["baseline_cardiomegaly.csv.gz"] = baseline_results[baseline_results["dataset"] == "baseline_cardiomegaly.csv.gz"]
*** ValueError: Cannot set a DataFrame with multiple columns to the single column baseline_cardiomegaly.csv.gz
At this point I realize what I am doing wrong: I have the selection correct, but I am trying to assign multiple columns to a single column.
So I need concat, and once again I struggle with the right syntax:
(Pdb) new_results = new_results.concat(baseline_results[baseline_results["dataset"] == "baseline_cardiomegaly.csv.gz"])
*** AttributeError: 'DataFrame' object has no attribute 'concat'
(Pdb) new_results = pd.concat(new_results, baseline_results[baseline_results["dataset"] == "baseline_cardiomegaly.csv.gz"])
*** TypeError: concat() takes 1 positional argument but 2 were given
(Pdb) new_results = pd.concat([baseline_results[baseline_results["dataset"] == "baseline_cardiomegaly.csv.gz"], new_results])
Okay, finally I have it! Trial and error is my favorite way to learn: the brain needs to walk the paths to the solution, and I derive satisfaction from that process:
(Pdb) new_results
Unnamed: 0 y_true y_pred prob_1 ... dataset gender race insurance
0 0 0 0 0.057635 ... baseline_cardiomegaly.csv.gz M RECORD_NOT_FOUND RECORD_NOT_FOUND
[362281 rows x 12 columns]
Okay, so I am training myself, I think. I am my favorite agent.
5
I want the equivalent of the ternary operator ?: (in C or JavaScript) in Python. I have forgotten exactly how, but I know that
I can use if..else in a list comprehension, so perhaps it works outside one too?
Let’s try:
>>> import random
>>> value = 1 if random.random() > 0.5 else 0
>>> value
0
>>> value = 1 if random.random() > 0.5 else 0
>>> value
0
>>> value = 1 if random.random() > 0.5 else 0
>>> value
1
Yeah, it does.
Of course, random.random() > 0.5 is just an example of a condition.
6
I want to access the feature-prep directory, which is at the same level as utils, from a file inside utils. Basically:
- Traverse one directory up from utils
- Go down into another directory, feature-prep
(Pdb) Path(__file__) / "..//feature-prep"
WindowsPath('C:/Users/amits/work/github.com/amitsaha/ml-fairness-health/mywork/experiments/mimic-cxr/utils/common_experiment.py/../feature-prep')
(Pdb) import os
(Pdb) os.path.exists(Path(__file__) / "..//feature-prep")
False
(Pdb) os.path.exists(Path(__file__))
True
(Pdb) os.path.basename(Path(__file__))
'common_experiment.py'
(Pdb) os.path.dirname(Path(__file__))
'C:\\Users\\amits\\work\\github.com\\amitsaha\\ml-fairness-health\\mywork\\experiments\\mimic-cxr\\utils'
(Pdb) os.path.dirname(os.path.dirname(Path(__file__)))
'C:\\Users\\amits\\work\\github.com\\amitsaha\\ml-fairness-health\\mywork\\experiments\\mimic-cxr'
(Pdb) os.path.exists(os.path.dirname(os.path.dirname(Path(__file__)))
*** SyntaxError: '(' was never closed
(Pdb) os.path.exists(os.path.dirname(os.path.dirname(Path(__file__))))
True
(Pdb) os.path.exists(Path(os.path.dirname(os.path.dirname(Path(__file__)))) / "feature-prep")
True
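The first attempt failed because Path(__file__) is the file itself, so going up one level from it lands in utils at best, not in the parent of utils. Two levels up, then down into feature-prep, is what works.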
7
> const userQueries = ["foo", "bar"];
undefined
> console.log(userQueries)
[ 'foo', 'bar' ]
undefined
> for (const item in userQueries) {
... console.log(item)
... }
0
1
undefined
> for (const idx, item in userQueries) {
for (const idx, item in userQueries) {
^^^
Uncaught SyntaxError: Missing initializer in const declaration
> for (const {idx, item} in userQueries) {
... console.log(item)
... {
... }
... }
undefined
undefined
undefined
> for (const item in Object.entries(userQueries)) {
... console.log(item)
... }
0
1
undefined
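Then, when I search again, I come across https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration#for...of_statement (it also shows up in the Google AI summary).
So, we have a for...of!? (Why???) It turns out for...in iterates over the keys, which for an array are the indices (hence the 0 and 1 above), while for...of iterates over the values: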
> for (const item of userQueries) {
... console.log(item)
... }
foo
bar
undefined
```